Course overview

Syllabus, schedule, and policies

Who's who

  • Instructors: Prof. Alan Mislove and Prof. Christo Wilson
  • Contact: amislove@ccs.neu.edu and cbw@ccs.neu.edu
    • Use cs5750f13-staff@ccs.neu.edu for everything other than personal issues
  • Office: 250 WVH (Mislove) and 348 WVH (Wilson)
  • Office Hours: 4:30-6:00 Mondays (Mislove) and TBA (Wilson)
  • Teaching Assistant: Arash Molavi Kakhki
  • Contact: cs5750f13-staff@ccs.neu.edu
  • Office: 266 WVH
  • Office Hours: TBA

Course information

Paper Presentations and Summaries

  • Course is primarily research-oriented; you must be prepared and participate
  • Require 500-word summary of readings by midnight before class
    • Represents 10% of your grade, in aggregate
    • Must be ASCII text
  • Students will be presenting the papers we discuss
  • High-quality presentation, approx. 15-20 minutes
  • Also bring a list of discussion questions
    • You are in charge of the class!
    • Represents 10% of your grade, in aggregate
  • Sign-up sheet linked from Piazza

Homeworks

  • There are four homeworks throughout the semester
    • Practical implementations of concepts in class
    • Hands-on experience with real data, systems, etc.
  • Will be completed in teams of two
    • You will pick your teammate
    • You can switch teammates between projects
    • Both teammates must put in equal effort
  • You are given four slip days to use on homeworks
    • Can turn in late with no penalty
    • All team members must have a slip day to use it
  • First homework handed out today!

Course Project

  • There is a course research project you must complete with a partner
    • Topic of your own choosing, in consultation with course staff
    • Research project, not a programming project
    • 40% of your overall grade!
  • Example ideas of course projects:
    • Building an iOS/Android app to conduct an experiment
    • Using Amazon Mechanical Turk to solve a problem
    • Analysis of large-scale Twitter data
  • Upcoming deadlines for course project:
    • Meet with instructors: Sept. 30
    • Project proposal (~5 pages): Oct. 9
    • Interim report (~10 pages): Nov. 6
    • Project presentation (~20 min): Nov. 20 - Dec. 4
    • Final report (~20 pages): TBA

Exams

There are no exams.

Turnin system

  • All summaries and homeworks will be submitted using our turnin system
  • You must have a CCIS Linux account
    • If not, register for one immediately
  • Emailed submissions will get no credit
  • You must first register with the system:
  • bash$ /course/cs5750f13/bin/register-student ID
    About to register user 'USER' with student ID 'ID'.  Is this correct? [yn]
    
  • Then, you can submit assignments:
  • bash$ /course/cs5750f13/bin/turnin summary02 ~/cs5750/summary02.txt
      Added file summary02.txt (28392 bytes)
    Successfully submitted summary02 for user amislove (confirmation ZiwKE5).
    Submitted a total of 1 files (28392 bytes) in 0 directories.
    

On cheating

  • Do not cheat.
  • Ask the course staff if you are unsure.

Lecture 1

Basic Network Theory

Why study networks?

  • Networks represent interaction among entities
    • Entities are nodes in the network
    • Edges represent interaction between two entities
  • Examples include
    • Information transmission
    • Credit and financial flows
    • Friendship
    • Trust and distrust
    • Disease and epidemics
  • Interactions (therefore networks) usually have structure

Citations between blogs during 2004 US election

Source: Adamic and Glance, IWLD'05

Friendship between members of a karate club

Source: Zachary, Anth. Res. 1977

Co-authorship networks of scientists

Source: Porter et al., ArXiv, 2009

Connections between routers on the Internet

Source: Bill Cheswick, Lumeta Corp

Connections between friends

Nodes are colored by body weight

Source: Christakis and Fowler, NEJM 2007

Interactions between proteins in human cells

Source: Simmonis and Vidal, J. Biology 2009

Transportation networks


7 bridges of Königsberg


  • Graph theory born with Euler's bridges problem
  • No way to cross all bridges exactly once and return to starting point
    • Convert city map into graph; prove no complete cycle exists
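
The parity argument behind Euler's result can be checked mechanically. The sketch below assumes the usual encoding of the city as four land masses joined by seven bridges; the labels `north`, `south`, `island`, `east` and the helper names are illustrative, not from the slides. A connected multigraph can only have a closed walk using every edge exactly once if every node has even degree.

```python
from collections import Counter

# Seven bridges as a multigraph edge list (node labels are illustrative).
BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("north", "east"), ("south", "east"), ("island", "east"),
]

def degrees(edges):
    """Count how many edge endpoints touch each node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def all_degrees_even(edges):
    """Necessary condition for a closed walk that uses every edge once."""
    return all(d % 2 == 0 for d in degrees(edges).values())

print(dict(degrees(BRIDGES)))      # every land mass has odd degree
print(all_degrees_even(BRIDGES))   # so the condition fails
```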

How do we represent networks?

  • A network (graph) is a set of vertices (nodes) connected by edges (links)
  • Typically represented as \(G=(V,E)\)
    • \(V=\{v_1, v_2, ... v_n\}\) is the set of vertices
    • \(E=\{e_1, e_2, ... e_m\}\) is the set of edges, with \(e_i=(v_{from}, v_{to})\)
  • Graphs can be directed or undirected
    • Links in directed graphs have a direction:
    • Links in undirected graphs have no direction:
    • When are directed/undirected graphs appropriate?

Network features

  • Edges may have a weight attached to them

    • Can represent distance, current usage, cost, strength
    • Often used to find "low-cost" paths in the network
  • Multiple edges may be allowed between nodes
  • An edge may go from a node to itself
    • This is known as a self-loop

Walks, paths, and cycles

  • We often reason about sequences of edges on graphs
  • We can define the following for \(G=(V,E)\):
    • A walk is a sequence of edges \(\{v_i,v_j\},\{v_j,v_k\},...\{v_m,v_n\}\)
    • A trail is a walk with distinct edges
    • A path is a walk with distinct nodes
      • The sequence \(\{v_i, v_j, ... v_n\}\) contains no duplicates
    • A cycle is a walk that starts and ends at the same node, with no other node repeated
    • A geodesic between two nodes is a path containing the minimum number of edges
  • The length of a walk is the number of edges on the walk
  • For directed graphs, can only follow edges in intended direction
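
These definitions translate directly into checks on a node sequence. A minimal sketch for undirected simple graphs follows; the five-node edge list and the function names are assumptions made for illustration.

```python
# Hypothetical undirected example graph.
EDGES = [("a", "b"), ("b", "c"), ("b", "e"), ("c", "d"), ("d", "e")]

def is_walk(nodes, edges=EDGES):
    """True if every pair of consecutive nodes is joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((u, v)) in edge_set for u, v in zip(nodes, nodes[1:]))

def is_trail(nodes, edges=EDGES):
    """A walk that never reuses an edge."""
    steps = [frozenset(p) for p in zip(nodes, nodes[1:])]
    return is_walk(nodes, edges) and len(steps) == len(set(steps))

def is_path(nodes, edges=EDGES):
    """A walk that never revisits a node."""
    return is_walk(nodes, edges) and len(nodes) == len(set(nodes))

def is_cycle(nodes, edges=EDGES):
    """A closed walk whose nodes are distinct apart from the shared endpoint."""
    return (len(nodes) > 3 and nodes[0] == nodes[-1]
            and is_walk(nodes, edges) and is_path(nodes[:-1], edges))

def walk_length(nodes):
    """Number of edges on the walk."""
    return len(nodes) - 1

print(is_trail(["a", "b", "c", "b"]))        # walk, but edge b-c is reused
print(is_cycle(["b", "c", "d", "e", "b"]))
```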

Random walks

  • Recall a walk is a series of sequential edges \(\{v_i,v_j\},\{v_j,v_k\},...\{v_m,v_n\}\)
  • Often use random walks to understand graph structure
    • Walk formed by starting at a vertex \(v_i\)
    • At each step, select random outgoing edge from current vertex
  • Probability of selecting each outgoing edge is $$\frac{1}{{\rm outdegree}(v)}$$
  • Used in many algorithms (e.g., PageRank)
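
A random walk as described above can be sketched in a few lines. The directed example graph and the names `OUT_NEIGHBORS` and `random_walk` are assumptions for illustration; a seeded `random.Random` makes the walk repeatable.

```python
import random

# Hypothetical directed graph: node -> list of out-neighbors.
OUT_NEIGHBORS = {
    "a": ["b"],
    "b": ["c", "e"],
    "c": ["d"],
    "d": ["e"],
    "e": ["b"],
}

def random_walk(out_neighbors, start, steps, rng=random):
    """Return the visited nodes; stops early at a node with no out-edges."""
    walk = [start]
    for _ in range(steps):
        choices = out_neighbors.get(walk[-1], [])
        if not choices:
            break
        # Uniform choice: each out-edge is taken with probability 1/outdegree.
        walk.append(rng.choice(choices))
    return walk

print(random_walk(OUT_NEIGHBORS, "a", 10, random.Random(0)))
```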

Stationary distribution and mixing time

  • One application of random walks: stationary distribution
  • Suppose we take longer and longer random walks...
    • Eventually probability of being at any node stabilizes
    • This distribution is called the stationary distribution
  • Can be used to measure mixing time: how long to reach stationary distribution?
    • Example of a graph with high mixing time?
    • Low mixing time?
  • Can also define:
    • Hit time \(h(u,v)\): expected random walk length from \(u\) to \(v\)
    • Commute time \(c(u,v) = h(u,v) + h(v,u)\)
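
One way to see the stationary distribution emerge is to push a probability distribution through the walk one step at a time until it stops changing. The sketch below assumes a small connected, non-bipartite undirected graph (a triangle with a tail); the graph and function names are illustrative. For such graphs the stationary probability of a node is its degree divided by twice the number of edges, which gives a value to compare against. The step count returned is only a crude stand-in for mixing time.

```python
# Hypothetical undirected graph: triangle a-b-c with a tail c-d.
ADJ = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def step(dist, adj):
    """Advance a probability distribution over nodes by one random-walk step."""
    nxt = {v: 0.0 for v in adj}
    for u, p in dist.items():
        for v in adj[u]:
            nxt[v] += p / len(adj[u])
    return nxt

def stationary(adj, start, tol=1e-12, max_steps=10_000):
    """Iterate from a walk started at `start`; return (distribution, steps)."""
    dist = {v: 1.0 if v == start else 0.0 for v in adj}
    for t in range(1, max_steps + 1):
        nxt = step(dist, adj)
        if max(abs(nxt[v] - dist[v]) for v in adj) < tol:
            return nxt, t
        dist = nxt
    raise RuntimeError("no convergence (graph may be bipartite or disconnected)")

pi, steps = stationary(ADJ, "d")
two_m = sum(len(nbrs) for nbrs in ADJ.values())  # twice the edge count
for v in ADJ:
    print(v, round(pi[v], 6), len(ADJ[v]) / two_m)
```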

Shortest paths

  • A geodesic is also known as a shortest path
    • Path with minimal hops from \(A\) to \(B\)
  • Often, we want to find such a path
    • Dijkstra's algorithm is useful for this
    • Gives shortest paths to all destinations from single source
  • Can also measure average path length of a graph: $$\frac{\sum_{u,v\in V; u\neq v} {\rm path\_length}(u,v)}{N * (N-1)}$$ (pairs of nodes with no path between them are ignored)
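
A minimal sketch of Dijkstra's algorithm and the average path length, assuming an adjacency dict of `(neighbor, weight)` pairs with non-negative weights. The example graph and all function names are illustrative. The average here divides by the number of connected ordered pairs, which equals \(N(N-1)\) whenever the graph is connected.

```python
import heapq

def unit_weight_graph(edges):
    """Build an undirected adjacency dict where every edge has weight 1."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append((v, 1))
        adj.setdefault(v, []).append((u, 1))
    return adj

def dijkstra(adj, source):
    """Shortest distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, a shorter route was already found
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def average_path_length(adj):
    """Mean distance over ordered pairs (u, v), u != v, that are connected."""
    total = pairs = 0
    for u in adj:
        for v, d in dijkstra(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")

G = unit_weight_graph([("a", "b"), ("b", "c"), ("b", "e"), ("c", "d"), ("d", "e")])
print(dijkstra(G, "a"))
print(average_path_length(G))
```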

Radius, diameter, and eccentricity

  • Often interested in "how far" nodes are in a graph
    • Average path length alone doesn't say much about worst off/best off nodes
  • Define eccentricity of a node to be $$e(v) = \max_{w\in V, w\neq v} {\rm path\_length}(v,w)$$
  • Then can define:
    • The radius of a graph is $$\min_{v\in V} e(v)$$
    • The diameter of a graph is $$\max_{v\in V} e(v)$$
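
For an unweighted graph these quantities follow from one breadth-first search per node. A minimal sketch, assuming a connected undirected graph with at least two nodes; the example graph and function names are illustrative.

```python
from collections import deque

# Hypothetical connected undirected graph.
ADJ = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["b", "d"],
}

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def eccentricity(adj, v):
    """Largest hop distance from v to any other node."""
    return max(d for w, d in bfs_distances(adj, v).items() if w != v)

def radius(adj):
    return min(eccentricity(adj, v) for v in adj)

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)

print({v: eccentricity(ADJ, v) for v in ADJ})
print(radius(ADJ), diameter(ADJ))
```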

Connectivity and components

  • An undirected graph is connected if $$\forall_{v_i, v_j} \exists\ {\rm path}(v_i, v_j)$$
  • If not, a graph can be decomposed into connected components
  • A directed graph can be
    • Weakly connected if the graph is connected viewed undirected
    • Strongly connected if $$\forall_{v_i, v_j} \exists\ {\rm directedpath}(v_i, v_j)$$
  • Example of a weakly but not strongly connected graph:
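
The connectivity definitions can be tested with a simple reachability search. In this sketch the adjacency dict must list every node as a key, and the graph is assumed non-empty; the directed chain `CHAIN` is one illustrative graph that is weakly but not strongly connected. All names are assumptions.

```python
def reachable(adj, source):
    """Set of nodes reachable from source by following edges forward."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def undirected_view(adj):
    """Copy of adj with every edge made bidirectional."""
    sym = {u: set(nbrs) for u, nbrs in adj.items()}
    for u, nbrs in adj.items():
        for v in nbrs:
            sym[v].add(u)
    return sym

def is_strongly_connected(adj):
    return all(reachable(adj, u) == set(adj) for u in adj)

def is_weakly_connected(adj):
    sym = undirected_view(adj)
    return reachable(sym, next(iter(sym))) == set(sym)

def components(adj):
    """Connected components of an undirected (symmetric) adjacency dict."""
    left, parts = set(adj), []
    while left:
        part = reachable(adj, next(iter(left)))
        parts.append(part)
        left -= part
    return parts

CHAIN = {"a": ["b"], "b": ["c"], "c": []}  # a -> b -> c
print(is_weakly_connected(CHAIN), is_strongly_connected(CHAIN))
```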

Special graph structures

Complete graph
Ring
Star
 
Tree
Bipartite
Planar

Subgraphs

  • \(S\) is a subgraph of \(G\) if $$S=(V_S,E_S)\ \ \ {\rm and}\ \ \ V_S\subseteq V_G, E_S\subseteq E_G$$ (if so, \(G\) is called a supergraph of \(S\))
  • \(S\) is a spanning subgraph if \(V_S=V_G\)
    • \(S\) is a spanning tree if \(S\) is a spanning subgraph and a tree
  • The subgraph induced by \(V^* \subset V\) has the edge set $$\{(u,v)\in E\ \ {\rm s.t.}\ \ u\in V^*, v\in V^*\}$$
  • A subgraph is a clique if it is complete
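
Induced subgraphs and cliques are short to express over an edge list. A minimal sketch for undirected simple graphs; the example edges and function names are assumptions.

```python
from itertools import combinations

# Hypothetical undirected graph: triangle a-b-c with a tail c-d.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def induced_edges(edges, nodes):
    """Edges whose two endpoints both lie in the chosen node set."""
    keep = set(nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def is_clique(edges, nodes):
    """True if every pair of the chosen nodes is joined by an edge."""
    present = {frozenset(e) for e in edges}
    return all(frozenset(p) in present for p in combinations(sorted(nodes), 2))

print(induced_edges(EDGES, {"a", "b", "c"}))
print(is_clique(EDGES, {"a", "b", "c"}), is_clique(EDGES, {"b", "c", "d"}))
```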

Isomorphism

  • Two graphs \(G\) and \(H\) are isomorphic if there exists a bijection $$f : V_G \rightarrow V_H$$ such that $$(v_i, v_j) \in E_G \iff (f(v_i), f(v_j)) \in E_H$$
  • Graph isomorphism problem is determining whether such a bijection exists
    • Does it sound hard or easy?
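
To get a feel for the problem, here is a deliberately naive check that tries every bijection. It is a sketch for small undirected simple graphs only; it examines up to \(n!\) candidate mappings, so it says nothing about how hard the problem is in general. The function name and test graphs are assumptions.

```python
from itertools import permutations

def find_isomorphism(nodes_g, edges_g, nodes_h, edges_h):
    """Return a dict f: V_G -> V_H that preserves adjacency, or None."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(nodes_g) != len(nodes_h) or len(eg) != len(eh):
        return None
    for image in permutations(nodes_h):
        f = dict(zip(nodes_g, image))
        # With equal edge counts, mapping every G-edge onto an H-edge
        # is enough to make the two edge sets coincide.
        if all(frozenset((f[u], f[v])) in eh for u, v in edges_g):
            return f
    return None

path_g = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_h = ([1, 2, 3], [(1, 3), (3, 2)])
print(find_isomorphism(*path_g, *path_h))
```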

Isomorphism example

  • \(f(a)=1, f(b)=6, f(c)=8, f(d)=3, f(g)=5, f(h)=2, f(i)=4, f(j)=7\)

Neighborhood and degree

  • The neighborhood of node \(i\) is the set of nodes \(i\) is connected to $$N(i) = \{v\ \ {\rm s.t.}\ \ (i, v)\in E\}$$
  • The degree of \(i\) is the size of the neighborhood \(|N(i)|\)
  • For directed graphs, slightly more complicated
    • Define outdegree as the number of outgoing edges from \(i\)
    • Define indegree as the number of incoming edges to \(i\)
  • Define average [out,in]degree as the average across all nodes
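
For a directed edge list, both degree counts fall out of a single pass. In this sketch (example edges and names are assumptions) note that the average outdegree and average indegree are always equal, since each edge contributes one to each total.

```python
from collections import Counter

# Hypothetical directed edge list of (from, to) pairs.
DIRECTED_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

def degree_counts(edges):
    """Return (outdegree, indegree) counters keyed by node."""
    outdeg, indeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return outdeg, indeg

def average_degree(edges, nodes):
    """Average outdegree, which equals average indegree: |E| / |V|."""
    return len(edges) / len(nodes)

outdeg, indeg = degree_counts(DIRECTED_EDGES)
print(dict(outdeg), dict(indeg))
print(average_degree(DIRECTED_EDGES, {"a", "b", "c"}))
```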

Degree distributions

  • The degree distribution \(p(d)\) is the probability distribution of degrees
    • \(p(1) = \frac{1}{6}\), \(p(2) = \frac{1}{2}\), \(p(3) = \frac{1}{6}\), \(p(d>4) = 0\)
  • The \(k_{nn}\) distribution gives the average degree of a node's neighbors as a function of the node's own degree
    • \(k_{nn}(1) = 3\)
    • \(k_{nn}(2) = \frac{\frac{2+3}{2} + \frac{2+3}{2} + \frac{2+2}{2}}{3} = \frac{7}{3}\)
    • \(k_{nn}(3) = \frac{2+2+1}{3} = \frac{5}{3}\)
    • \(k_{nn}(d>4)\ {\rm undefined}\)
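
Both distributions can be computed directly from an adjacency dict. The slide's figure is not available here, so the six-node graph below (one node isolated) is an assumption, chosen only because it reproduces the values listed above; the function names are also illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Assumed undirected graph; "f" is isolated.
ADJ = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["b", "d"],
    "f": [],
}

def degree_distribution(adj):
    """p(d): fraction of nodes that have degree d."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {d: Fraction(c, len(adj)) for d, c in counts.items()}

def knn(adj):
    """k_nn(d): mean over degree-d nodes of their neighbors' average degree."""
    per_degree = defaultdict(list)
    for v, nbrs in adj.items():
        if nbrs:  # undefined for isolated nodes
            avg = Fraction(sum(len(adj[w]) for w in nbrs), len(nbrs))
            per_degree[len(nbrs)].append(avg)
    return {d: sum(vals) / len(vals) for d, vals in per_degree.items()}

print(degree_distribution(ADJ))
print(knn(ADJ))
```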

Assortativity

  • Captures linking behavior of nodes
    • Can be used to measure "homophily" of any node property
    • Typically used to measure degree correlation (i.e., do high degree nodes link to other high degree nodes?)
  • Defined as Pearson correlation coefficient of the degrees of the two nodes at either end of each edge
    • Range is \(r\in [-1,1]\), 0 represents no correlation
    • Graph with \(r<0\) is disassortative, \(r>0\) is assortative
  • For directed networks, can define \(r(in,out)\), \(r(in,in)\), etc.
  • Why would assortativity be useful to understand?
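
Degree assortativity for an undirected graph can be sketched as a plain Pearson correlation over edge endpoints. This version assumes each undirected edge contributes both orderings of its endpoint degrees, and it returns NaN when all degrees are equal (the correlation is then undefined). Function name and test graphs are assumptions.

```python
from math import sqrt

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each edge."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        # Count the edge in both directions so the measure is symmetric.
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)

star = [("hub", "x"), ("hub", "y"), ("hub", "z")]
print(degree_assortativity(star))  # hub links only to low-degree leaves
```

A star is the extreme disassortative case: its single high-degree node links only to degree-1 nodes.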

Clustering

  • Often interested in the level of node clustering
    • Informally, how often are my friends also friends?
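  • First, define the clustering coefficient of a node \(i\) as $$c(i) = {n \over d_i (d_i-1)}$$ where \(d_i\) is the degree of \(i\) and \(n\) is the number of links between \(i\)'s neighbors, counted per direction (an undirected edge between two neighbors counts twice)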
  • Then, average clustering coefficient is $$C(G) = {\sum_{v\in V} c(v) \over |V|}$$
  • What is the avg. clustering coefficient of the graph shown?
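
A minimal sketch of the clustering coefficient, under two assumptions that the slide does not state: \(n\) counts ordered pairs of linked neighbors (so an undirected edge between two neighbors counts twice), and nodes with fewer than two neighbors get a coefficient of 0. The example graph and function names are illustrative.

```python
# Hypothetical undirected graph: triangle a-b-c with a tail c-d.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def clustering_coefficient(adj, i):
    """Fraction of ordered neighbor pairs of i that are themselves linked."""
    nbrs = adj[i]
    d = len(nbrs)
    if d < 2:
        return 0.0  # convention assumed here
    linked = sum(1 for u in nbrs for v in nbrs if u != v and v in adj[u])
    return linked / (d * (d - 1))

def average_clustering(adj):
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

print({v: clustering_coefficient(ADJ, v) for v in sorted(ADJ)})
print(average_clustering(ADJ))
```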

Centrality

  • Often interested in "importance" of nodes; referred to as centrality
    • Many ways to measure; most accepted is betweenness centrality
    • Essentially, how much does this node connect others?
  • Defined as $$g(v) = \frac{\sum_{s\neq v\neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}}{(N-1)(N-2)}$$ where
    • \(N\) is the number of nodes
    • \(\sigma_{st}\) is number of shortest paths from \(s\) to \(t\)
    • \(\sigma_{st}(v)\) is the number of these that pass through \(v\)

Betweenness centrality example

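Shortest paths from each source (row) to each destination (column); where two shortest paths tie, both are listed:
$$\begin{array}{c|lllll} & a & b & c & d & e\\ \hline a & - & \text{a-b} & \text{a-b-c} & \text{a-b-c-d, a-b-e-d} & \text{a-b-e}\\ b & \text{b-a} & - & \text{b-c} & \text{b-c-d, b-e-d} & \text{b-e}\\ c & \text{c-b-a} & \text{c-b} & - & \text{c-d} & \text{c-b-e, c-d-e}\\ d & \text{d-c-b-a, d-e-b-a} & \text{d-c-b, d-e-b} & \text{d-c} & - & \text{d-e}\\ e & \text{e-b-a} & \text{e-b} & \text{e-b-c, e-d-c} & \text{e-d} & -\\ \end{array}$$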
$$\begin{array}{rcl} g(c) & = & \frac{\frac{\sigma_{ab}(c)}{\sigma_{ab}} + \frac{\sigma_{ad}(c)}{\sigma_{ad}} + \frac{\sigma_{ae}(c)}{\sigma_{ae}} +\ ...}{4 * 3}\\ & = & \frac{0 + 0.5 + 0 +\ ...}{12}\\ & = & \frac{1}{6}\\ \end{array}$$
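
The same calculation can be reproduced in code. In this sketch the edge set (a-b, b-c, b-e, c-d, d-e) is inferred from the shortest paths listed in the example; the function names are assumptions. It counts shortest paths with one breadth-first search per node, then uses the fact that a shortest \(s\)-\(t\) path passes through \(v\) exactly when the distances \(s \to v\) and \(v \to t\) add up to the distance \(s \to t\).

```python
from collections import deque
from fractions import Fraction

# Undirected graph inferred from the example's shortest paths.
ADJ = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["b", "d"],
}

def count_shortest_paths(adj, s):
    """BFS from s: hop distance and number of shortest paths to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]  # extend every shortest path ending at u
    return dist, sigma

def betweenness(adj, v):
    """Normalized betweenness g(v) over ordered pairs (s, t), s != v != t."""
    n = len(adj)
    info = {s: count_shortest_paths(adj, s) for s in adj}
    dist_v, sigma_v = info[v]
    total = Fraction(0)
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if len({s, t, v}) < 3 or t not in dist_s:
                continue
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                total += Fraction(sigma_s[v] * sigma_v[t], sigma_s[t])
    return total / ((n - 1) * (n - 2))

print({v: betweenness(ADJ, v) for v in ADJ})
```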

Degeneracy and \(k\)-cores

  • Often interested in how graphs break down (i.e., how resilient is a graph?)
  • Can define \(k\)-core of a graph as
    • A maximal connected subgraph (i.e., one that cannot be extended by adding more vertices)
    • Where all vertices have degree at least \(k\) within the subgraph
  • How to determine if a \(k\)-core exists? (for a fixed \(k\))
    • Recursively remove all vertices with degree < \(k\)
    • If you are left with no vertices, no \(k\)-core exists
  • What kind of graph would have a large \(k\)-core? A small \(k\)-core?
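
The removal procedure above can be sketched as a loop that keeps peeling low-degree vertices until nothing changes. The example graph and function name are assumptions. The surviving node set may be disconnected; under the definition above, each connected piece of it is a \(k\)-core.

```python
# Hypothetical undirected graph: 4-cycle b-c-d-e with a pendant node a.
ADJ = {
    "a": ["b"],
    "b": ["a", "c", "e"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["b", "d"],
}

def k_core_nodes(adj, k):
    """Nodes that survive repeated removal of vertices with degree < k."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            # Degree is measured only among the vertices still alive.
            if sum(1 for w in adj[v] if w in alive) < k:
                alive.remove(v)
                changed = True
    return alive

for k in (1, 2, 3):
    print(k, sorted(k_core_nodes(ADJ, k)))
```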

\(k\)-core example


Representing graphs

  • Adjacency matrix $$\left[ \begin{array}{ccccc} 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1\\ \end{array} \right]$$
  • Adjacency list $$\begin{array}{lllll} a & b\\b & a & c & d & e\\c & b & e\\ ...\\\end{array}$$
  • Edge list $$\begin{array}{ll} a & b\\b & c \\ b & d\\ ...\\\end{array}$$
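
The three representations carry the same information and are easy to convert between. The sketch below assumes an undirected graph without self-loops and uses only the three edges written out explicitly in the edge list above; the function names are illustrative.

```python
# Only the three edges listed explicitly above; the rest is elided.
EDGE_LIST = [("a", "b"), ("b", "c"), ("b", "d")]

def to_adjacency_list(edges):
    """Undirected edge list -> dict of sorted neighbor lists."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return {u: sorted(nbrs) for u, nbrs in sorted(adj.items())}

def to_adjacency_matrix(edges, order):
    """Undirected edge list -> symmetric 0/1 matrix with rows in `order`."""
    index = {v: i for i, v in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix

print(to_adjacency_list(EDGE_LIST))
for row in to_adjacency_matrix(EDGE_LIST, ["a", "b", "c", "d"]):
    print(row)
```

The matrix costs \(|V|^2\) entries regardless of how many edges exist, while the list and edge-list forms grow with \(|E|\), which usually matters for large sparse networks.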

Credits