## Who's who

• Instructors: Prof. Alan Mislove and Prof. Christo Wilson
• Contact: amislove@ccs.neu.edu and cbw@ccs.neu.edu
• Use cs5750f13-staff@ccs.neu.edu for everything other than personal issues
• Office: 250 WVH (Mislove) and 348 WVH (Wilson)
• Office Hours: 4:30-6:00 Mondays (Mislove) and TBA (Wilson)
• Teaching Assistant: Arash Molavi Kakhki
• Contact: cs5750f13-staff@ccs.neu.edu
• Office: 266 WVH
• Office Hours: TBA

## Paper Presentations and Summaries

• Course is primarily research-oriented; you must be prepared and participate
• Require 500-word summary of readings by midnight before class
• Must be ASCII text
• Students will be presenting the papers we discuss
• High-quality presentation, approx. 15-20 minutes
• Also bring a list of discussion questions
• You are in charge of the class!
• Sign-up sheet linked from Piazza

## Homeworks

• There are four homeworks throughout the semester
• Practical implementations of concepts in class
• Hands-on experience with real data, systems, etc
• Will be completed in teams of two
• You will pick your teammate
• You can switch teammates between projects
• Both teammates must put in equal effort
• You are given four slip days to use on homeworks
• Can turn in late with no penalty
• All team members must have a slip day to use it
• First homework handed out today!

## Course Project

• There is a course research project you must complete with a partner
• Topic of your own choosing, in consultation with course staff
• Research project, not a programming project
• Example ideas of course projects:
• Building an iOS/Android app to conduct an experiment
• Using Amazon Mechanical Turk to solve a problem
• Analysis of large-scale Twitter data
• Upcoming deadlines for course project:
• Meet with instructors: Sept. 30
• Project proposal (~5 pages): Oct. 9
• Interim report (~10 pages): Nov. 6
• Project presentation (~20 min): Nov. 20 - Dec. 4
• Final report (~20 pages): TBA

## Exams

There are no exams.

## Turnin system

• All summaries and homeworks will be submitted using our turnin system
• You must have a CCIS Linux account
• If not, register for one immediately
• Emailed submissions will get no credit
• You must first register with the system:
• bash$ /course/cs5750f13/bin/register-student ID
About to register user 'USER' with student ID 'ID'. Is this correct? [yn]
• Then, you can submit assignments:
• bash$ /course/cs5750f13/bin/turnin summary02 ~/cs5750/summary02.txt
Successfully submitted summary02 for user amislove (confirmation ZiwKE5).
Submitted a total of 1 files (28392 bytes) in 0 directories.


## On cheating

• Do not cheat.
• Ask the course staff if you are unsure.

## Why study networks?

• Networks represent interaction among entities
• Entities are nodes in the network
• Edges represent interaction between two entities
• Examples include
• Information transmission
• Credit and financial flows
• Friendship
• Trust and distrust
• Disease and epidemics
• Interactions (therefore networks) usually have structure

## Friendship between members of a karate club

Source: Zachary, Anth. Res. 1977

## Co-authorship networks of scientists

Source: Porter et al., ArXiv, 2009

## Connections between routers on the Internet

Source: Bill Cheswick, Lumeta Corp

## Connections between friends

### Nodes are colored by body weight

Source: Christakis and Fowler, NEJM 2007

## Interactions between proteins in human cells

Source: Simmonis and Vidal, J. Biology 2009

## 7 bridges of Königsberg

• Graph theory born with Euler's bridges problem
• No way to cross all bridges exactly once and return to starting point
• Convert city map into graph; prove no complete cycle exists
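Euler's argument reduces to a degree check: a connected graph has a closed walk that uses every edge exactly once if and only if every vertex has even degree. The sketch below applies that check to the bridges as a multigraph. The land-mass labels `A` to `D` are illustrative names, not part of the original slides.

```python
from collections import Counter

# The seven bridges as a multigraph edge list. Labels are illustrative:
# A is the central island, B and C are the river banks, D is the eastern land mass.
BRIDGES = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

def degrees(edges):
    """Count how many edge endpoints touch each vertex."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def has_euler_circuit(edges):
    """For a connected graph: True iff every vertex has even degree.
    Connectivity is assumed here, not checked."""
    return all(d % 2 == 0 for d in degrees(edges).values())

print(dict(degrees(BRIDGES)))
print(has_euler_circuit(BRIDGES))  # False: every land mass has odd degree
```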

## How do we represent networks?

• A network (graph) is a set of vertices (nodes) connected by edges (links)
• Typically represented as $$G=(V,E)$$
• $$V=\{v_1, v_2, ... v_n\}$$ is the set of vertices
• $$E=\{e_1, e_2, ... e_m\}$$ is the set of edges, with $$e_i=(v_{from}, v_{to})$$
• Graphs can be directed or undirected
• Links in directed graphs have a direction:
• Links in undirected graphs have no direction:
• When are directed/undirected graphs appropriate?

## Network features

• Edges may have a weight attached to them

• Can represent distance, current usage, cost, strength
• Often used to find "low-cost" paths in the network
• Multiple edges may be allowed between nodes
• An edge may go from a node to itself
• This is known as a self-loop

## Walks, paths, and cycles

• We often reason about sequences of edges on graphs
• We can define the following for $$G=(V,E)$$:
• A walk is a sequence of edges $$\{v_i,v_j\},\{v_j,v_k\},...\{v_m,v_n\}$$
• A trail is a walk with distinct edges
• A path is a walk with distinct nodes
• The sequence $$\{v_i, v_j, ... v_n\}$$ contains no duplicates
• A cycle is a path that starts and ends at the same node
• A geodesic is a path containing the minimum number of edges
• The length of a walk is the number of edges on the walk
• For directed graphs, can only follow edges in intended direction
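The definitions above can be sketched as small predicates over a node sequence. This is a minimal illustration that assumes a simple graph (no parallel edges) given as a list of `(u, v)` pairs; the function names are hypothetical.

```python
def is_walk(nodes, edges, directed=False):
    """True if each consecutive pair of nodes is joined by an edge."""
    edge_set = set(edges)
    if not directed:
        edge_set |= {(v, u) for u, v in edges}
    return all((u, v) in edge_set for u, v in zip(nodes, nodes[1:]))

def is_trail(nodes, edges, directed=False):
    """A walk that never reuses an edge."""
    if not is_walk(nodes, edges, directed):
        return False
    steps = list(zip(nodes, nodes[1:]))
    if not directed:
        # An undirected edge is the same edge in either direction.
        steps = [tuple(sorted(s)) for s in steps]
    return len(steps) == len(set(steps))

def is_path(nodes, edges, directed=False):
    """A walk that never revisits a node."""
    return is_walk(nodes, edges, directed) and len(nodes) == len(set(nodes))

def is_cycle(nodes, edges, directed=False):
    """Starts and ends at the same node; no edge or other node repeats."""
    return (len(nodes) > 1 and nodes[0] == nodes[-1]
            and is_trail(nodes, edges, directed)
            and len(nodes) - 1 == len(set(nodes[:-1])))

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(is_walk(["a", "b", "c", "a", "b"], edges))   # True
print(is_trail(["a", "b", "c", "a", "b"], edges))  # False: edge a-b is reused
print(is_path(["a", "b", "c", "d"], edges))        # True
print(is_cycle(["a", "b", "c", "a"], edges))       # True
```

The length of a walk is then `len(nodes) - 1`.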

## Random walks

• Recall a walk is a series of sequential edges $$\{v_i,v_j\},\{v_j,v_k\},...\{v_m,v_n\}$$
• Often use random walks to understand graph structure
• Walk formed by starting at a vertex $$v_i$$
• At each step, select random outgoing edge from current vertex
• Probability of selecting each outgoing edge is $$\frac{1}{{\rm outdegree}(v)}$$
• Used in many algorithms (e.g., PageRank)
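A minimal sketch of the procedure above, assuming the graph is stored as a dict that maps each vertex to a list of its out-neighbors. The name `random_walk` and the early stop at dead ends are choices made for this illustration.

```python
import random

def random_walk(adj, start, steps, rng=random):
    """Return the list of vertices visited by a random walk.

    Each outgoing edge of the current vertex is chosen with probability
    1/outdegree. The walk stops early at a vertex with no outgoing edges.
    """
    walk = [start]
    for _ in range(steps):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
# Passing a seeded generator makes the walk reproducible.
print(random_walk(adj, "a", 5, random.Random(0)))
```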

## Stationary distribution and mixing time

• One application of random walks: stationary distribution
• Suppose we take longer and longer random walks....
• Eventually probability of being at any node stabilizes
• This distribution is called the stationary distribution
• Can be used to measure mixing time: how long to reach stationary distribution?
• Example of a graph with high mixing time?
• Low mixing time?
• Can also define:
• Hit time $$h(u,v)$$: expected random walk length from $$u$$ to $$v$$
• Commute time $$c(u,v) = h(u,v) + h(v,u)$$
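One way to approximate the stationary distribution is to apply a single random-walk step to a probability vector over and over. The sketch below assumes every node has at least one outgoing edge and that the walk converges (for an undirected graph: connected and not bipartite). For an undirected graph the result is proportional to node degree, which the example uses as a sanity check.

```python
def stationary_distribution(adj, iterations=200):
    """Approximate the stationary distribution by repeated random-walk
    steps on a probability vector, starting from the uniform distribution."""
    nodes = list(adj)
    prob = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            share = prob[u] / len(adj[u])  # each out-edge gets 1/outdegree
            for v in adj[u]:
                nxt[v] += share
        prob = nxt
    return prob

# Triangle a-b-c with a tail c-d (undirected, so edges are listed both ways).
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
pi = stationary_distribution(adj)
total_degree = sum(len(nbrs) for nbrs in adj.values())
for v in adj:
    print(v, round(pi[v], 4), len(adj[v]) / total_degree)
```

The number of iterations needed before the vector stops changing is a rough, empirical proxy for mixing time.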

## Shortest paths

• A geodesic is also known as a shortest path
• Path with minimal hops from $$A$$ to $$B$$
• Often, we want to find such a path
• Dijkstra's algorithm is useful for this
• Gives shortest paths to all destinations from single source
• Can also measure average path length of a graph: $$\frac{\sum_{u,v\in V; u\neq v} {\rm path\_length}(u,v)}{N * (N-1)}$$ (pairs of nodes with no path between them are ignored)

• Often interested in "how far" nodes are in a graph
• Average path length alone doesn't say much about worst off/best off nodes
• Define eccentricity of a node to be $$e(v) = \max_{w\in V, w\neq v} {\rm path\_length}(v,w)$$
• Then can define:
• The radius of a graph is $$\min_{v\in V} e(v)$$
• The diameter of a graph is $$\max_{v\in V} e(v)$$
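For unweighted graphs, where path length is a hop count, breadth-first search is enough to get single-source shortest paths, so the sketch below uses BFS in place of Dijkstra's algorithm. It assumes a connected graph with at least two nodes, stored as an adjacency dict; the average divides by the number of pairs that have a path, which equals $$N(N-1)$$ when the graph is connected.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean distance over ordered pairs (u, v), u != v, that have a path."""
    total = pairs = 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs

def eccentricity(adj, v):
    """Largest distance from v to any other reachable node."""
    return max(d for w, d in bfs_distances(adj, v).items() if w != v)

def radius(adj):
    return min(eccentricity(adj, v) for v in adj)

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)

# Path graph a-b-c-d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(average_path_length(adj))    # 20 / 12
print(radius(adj), diameter(adj))  # 2 3
```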

## Connectivity and components

• An undirected graph is connected if $$\forall_{v_i, v_j} \exists\ {\rm path}(v_i, v_j)$$
• If not, a graph can be decomposed in components
• A directed graph can be:
• Weakly connected if the graph is connected viewed undirected
• Strongly connected if $$\forall_{v_i, v_j} \exists\ {\rm directedpath}(v_i, v_j)$$
• Example of a weakly but not strongly connected graph:
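One small example, offered as a sketch: the directed chain a → b → c is weakly connected (ignoring direction, it is a single path) but not strongly connected, since no directed path leads from c back to a. The code below checks both properties by reachability. It assumes every node appears as a key in the adjacency dict; all function names are illustrative.

```python
def reachable(adj, source):
    """Set of nodes reachable from source by following edges."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def undirected_view(adj):
    """Copy of the graph with the reverse of every edge added."""
    und = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            und[v].add(u)
    return und

def is_strongly_connected(adj):
    return all(reachable(adj, u) == set(adj) for u in adj)

def is_weakly_connected(adj):
    und = undirected_view(adj)
    return reachable(und, next(iter(und))) == set(und)

def components(adj):
    """Connected components, treating every edge as undirected."""
    und = undirected_view(adj)
    remaining, comps = set(und), []
    while remaining:
        comp = reachable(und, next(iter(remaining)))
        comps.append(comp)
        remaining -= comp
    return comps

chain = {"a": ["b"], "b": ["c"], "c": []}
print(is_weakly_connected(chain))    # True
print(is_strongly_connected(chain))  # False
```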

(Figures: complete graph, ring, star, tree, bipartite graph, planar graph)

## Subgraphs

• $$S$$ is a subgraph of $$G$$ if $$S=(V_S,E_S)\ \ \ {\rm and}\ \ \ V_S\subseteq V_G, E_S\subseteq E_G$$ (if so, $$G$$ is called a supergraph of $$S$$)
• $$S$$ is a spanning subgraph if $$V_S=V_G$$
• $$S$$ is a spanning tree if $$S$$ is a spanning subgraph and a tree
• The subgraph induced by $$V^* \subset V$$ has the edge set $$\{(u,v)\in E\ \ {\rm s.t.}\ \ u\in V^*, v\in V^*\}$$
• A subgraph is a clique if it is complete
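These definitions translate almost directly into set operations. The sketch below assumes vertices are stored in a Python `set` and edges in a set of `(u, v)` tuples, with edges treated as undirected for the clique test; the function names are hypothetical.

```python
from itertools import combinations

def is_subgraph(vs, es, vg, eg):
    """S = (vs, es) is a subgraph of G = (vg, eg) if both sets are contained
    in G's and every edge of S joins two vertices of S."""
    return vs <= vg and es <= eg and all(u in vs and v in vs for u, v in es)

def is_spanning(vs, vg):
    """A subgraph is spanning if it keeps every vertex of G."""
    return vs == vg

def induced_edges(v_star, edges):
    """Edges of G with both endpoints in v_star."""
    return {(u, v) for u, v in edges if u in v_star and v in v_star}

def is_clique(v_star, edges):
    """True if every pair of vertices in v_star is joined by an edge."""
    und = {frozenset(e) for e in edges}
    return all(frozenset(pair) in und for pair in combinations(v_star, 2))

V = {"a", "b", "c", "d"}
E = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
print(induced_edges({"a", "b", "c"}, E))
print(is_clique({"a", "b", "c"}, E))  # True
print(is_clique({"a", "b", "d"}, E))  # False
```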

## Isomorphism

• Two graphs $$G$$ and $$H$$ are isomorphic if there exists a bijection $$f : V_G \rightarrow V_H$$ and $$(v_i, v_j) \in E_G \iff (f(v_i), f(v_j)) \in E_H$$
• Graph isomorphism problem is determining whether such a bijection exists
• Does it sound hard or easy?
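To get a feel for the problem, here is a brute-force sketch that simply tries every bijection. It assumes undirected graphs without self-loops and takes factorial time in the number of vertices, so it is only usable on tiny graphs; it is an illustration, not a practical isomorphism algorithm.

```python
from itertools import permutations

def find_isomorphism(vg, eg, vh, eh):
    """Try every bijection f: vg -> vh and return one that maps edges onto
    edges, or None if no such bijection exists."""
    vg, vh = list(vg), list(vh)
    eg = {frozenset(e) for e in eg}
    eh = {frozenset(e) for e in eh}
    if len(vg) != len(vh) or len(eg) != len(eh):
        return None
    for image in permutations(vh):
        f = dict(zip(vg, image))
        # Edge counts match and f is a bijection, so checking one direction
        # of the "if and only if" is enough.
        if all(frozenset((f[u], f[v])) in eh for u, v in eg):
            return f
    return None

path_g = (["a", "b", "c"], [("a", "b"), ("b", "c")])
path_h = ([1, 2, 3], [(1, 3), (3, 2)])
print(find_isomorphism(*path_g, *path_h))

star = (["c", "x", "y", "z"], [("c", "x"), ("c", "y"), ("c", "z")])
path4 = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(find_isomorphism(*star, *path4))  # None: same sizes, different structure
```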

## Isomorphism example

• $$f(a)=1, f(b)=6, f(c)=8, f(d)=3, f(g)=5, f(h)=2, f(i)=4, f(j)=7$$

## Neighborhood and degree

• The neighborhood of node $$i$$ is the set of nodes $$i$$ is connected to $$N(i) = \{v\ \ {\rm s.t.}\ \ (i, v)\in E\}$$
• The degree of $$i$$ is the size of the neighborhood $$|N(i)|$$
• For directed graphs, slightly more complicated
• Define outdegree as the number of outgoing edges from $$i$$
• Define indegree as the number of incoming edges to $$i$$
• Define average [out,in]degree as the average across all nodes
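For a directed graph given as an edge list, the degree definitions can be sketched as follows. Since every edge contributes one outgoing and one incoming endpoint, the average outdegree and average indegree are both $$|E|/|V|$$. The function names are illustrative.

```python
from collections import Counter

def degree_tables(nodes, edges):
    """Return (outdegree, indegree) dicts for a directed edge list."""
    out_count = Counter(u for u, _ in edges)
    in_count = Counter(v for _, v in edges)
    outdeg = {n: out_count[n] for n in nodes}
    indeg = {n: in_count[n] for n in nodes}
    return outdeg, indeg

def average(degree):
    return sum(degree.values()) / len(degree)

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]
outdeg, indeg = degree_tables(nodes, edges)
print(outdeg)  # {'a': 2, 'b': 1, 'c': 0}
print(indeg)   # {'a': 0, 'b': 1, 'c': 2}
print(average(outdeg), average(indeg))
```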

## Degree distributions

• The degree distribution $$p(d)$$ is the probability distribution of degrees
• $$p(1) = \frac{1}{6}$$, $$p(2) = \frac{1}{2}$$, $$p(3) = \frac{1}{6}$$, $$p(d>4) = 0$$
• The $$k_{nn}$$ distribution gives, for each degree $$d$$, the average degree of the neighbors of nodes with degree $$d$$
• $$k_{nn}(1) = 3$$
• $$k_{nn}(2) = \frac{\frac{2+3}{2} + \frac{2+3}{2} + \frac{2+2}{2}}{3} = \frac{7}{3}$$
• $$k_{nn}(3) = \frac{2+2+1}{3} = \frac{5}{3}$$
• $$k_{nn}(d>4)\ {\rm undefined}$$
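The figure for this example is not reproduced here, so the graph in the sketch below is an assumption: a six-node undirected graph (including one isolated node, `f`) chosen because it is consistent with the values listed above. Exact fractions are used so the results can be compared directly.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def degree_distribution(adj):
    """p(d): fraction of nodes that have degree d."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {d: Fraction(c, len(adj)) for d, c in counts.items()}

def knn(adj):
    """k_nn(d): mean, over nodes of degree d, of their neighbors' mean degree."""
    per_degree = defaultdict(list)
    for v, nbrs in adj.items():
        if nbrs:  # undefined for isolated nodes, so skip them
            mean_nbr = Fraction(sum(len(adj[w]) for w in nbrs), len(nbrs))
            per_degree[len(nbrs)].append(mean_nbr)
    return {d: sum(vals) / len(vals) for d, vals in per_degree.items()}

# Assumed graph: edges a-b, b-c, b-d, c-e, d-e, plus isolated node f.
adj = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b", "e"],
       "d": ["b", "e"], "e": ["c", "d"], "f": []}
print(degree_distribution(adj))
print(knn(adj))
```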

## Assortativity

• Captures linking behavior of nodes
• Can be used to measure "homophily" of any node property
• Typically used to measure degree correlation (i.e., do high degree nodes link to other high degree nodes?)
• Defined as Pearson correlation coefficient of node degrees
• Range is $$r\in [-1,1]$$, 0 represents no correlation
• Graph with $$r<0$$ is disassortative, $$r>0$$ is assortative
• For directed networks, can define $$r(in,out)$$, $$r(in,in)$$, etc
• Why would assortativity be useful to understand?
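Degree assortativity for an undirected graph can be sketched as the Pearson correlation between the degrees found at the two ends of each edge, with every edge counted once in each direction. Because of that symmetry both degree sequences share the same mean and variance, which simplifies the formula. The function name is illustrative, and the coefficient is undefined when all degrees are equal.

```python
def degree_assortativity(adj):
    """Pearson correlation of end-point degrees over all (directed) edge
    slots of an undirected adjacency dict."""
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var  # raises ZeroDivisionError if every degree is equal

# A star: the hub links only to low-degree leaves, so r = -1.
star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
print(degree_assortativity(star))
```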

## Clustering

• Often interested in the level of node clustering
• Informally, how often are my friends also friends?
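• First, define the clustering coefficient of a node $$i$$ as $$c(i) = {n \over d_i (d_i-1)}$$ where $$d_i$$ is the degree of $$i$$ and $$n$$ is the number of links between neighbors of $$i$$ (counted per direction, so an undirected edge counts twice)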
• Then, average clustering coefficient is $$C(G) = {\sum_{v\in V} c(v) \over |V|}$$
• What is the avg. clustering coefficient of the graph shown?
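A minimal sketch of both quantities, under two assumptions: $$n$$ counts links between neighbors once per direction (so an undirected edge contributes 2), and nodes with fewer than two neighbors get a coefficient of 0, a convention chosen here because the formula is undefined for them. The example graph is hypothetical, not the one from the slide.

```python
def clustering_coefficient(adj, i):
    """c(i) = n / (d_i * (d_i - 1)), with n the number of ordered pairs of
    neighbors of i that are linked. Assumes no self-loops."""
    nbrs = set(adj[i])
    d = len(nbrs)
    if d < 2:
        return 0.0  # convention assumed here
    n = sum(1 for u in nbrs for v in adj[u] if v in nbrs)
    return n / (d * (d - 1))

def average_clustering(adj):
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Triangle a-b-c with a tail c-d.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
for v in adj:
    print(v, clustering_coefficient(adj, v))
print(average_clustering(adj))  # (1 + 1 + 1/3 + 0) / 4
```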

## Centrality

• Often interested in "importance" of nodes; referred to as centrality
• Many ways to measure; a widely used one is betweenness centrality
• Essentially, how much does this node connect others?
• Defined as $$g(v) = \frac{\sum_{s\neq v\neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}}{(N-1)(N-2)}$$ where
• $$N$$ is the number of nodes
• $$\sigma_{st}$$ is number of shortest paths from $$s$$ to $$t$$
• $$\sigma_{st}(v)$$ is the number of these that pass through $$v$$

## Betweenness centrality example

| from / to | a | b | c | d | e |
|---|---|---|---|---|---|
| a | - | a-b | a-b-c | a-b-c-d, a-b-e-d | a-b-e |
| b | b-a | - | b-c | b-c-d, b-e-d | b-e |
| c | c-b-a | c-b | - | c-d | c-b-e, c-d-e |
| d | d-c-b-a, d-e-b-a | d-c-b, d-e-b | d-c | - | d-e |
| e | e-b-a | e-b | e-b-c, e-d-c | e-d | - |

$$\begin{array}{rcl} g(c) & = & \frac{\frac{\sigma_{ab}(c)}{\sigma_{ab}} + \frac{\sigma_{ad}(c)}{\sigma_{ad}} + \frac{\sigma_{ae}(c)}{\sigma_{ae}} +\ ...}{4 * 3}\\ & = & \frac{0 + 0.5 + 0 +\ ...}{12}\\ & = & \frac{1}{6}\\ \end{array}$$
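The calculation can be checked with a short sketch. It counts shortest paths with BFS and uses the fact that the number of shortest $$s$$-$$t$$ paths through $$v$$ is $$\sigma_{sv}\sigma_{vt}$$ whenever $$v$$ lies on a shortest path. The graph is the one implied by the shortest paths listed above (edges a-b, b-c, b-e, c-d, d-e); the function names are illustrative.

```python
from collections import deque

def shortest_path_counts(adj, s):
    """BFS from s. Returns (dist, sigma), where sigma[t] is the number of
    shortest paths from s to t."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj, v):
    """g(v): sum over ordered pairs (s, t) of the fraction of shortest
    paths through v, normalized by (N-1)(N-2)."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    dist_v, sigma_v = info[v]
    n = len(adj)
    total = 0.0
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in adj:
            if v in (s, t) or s == t or t not in dist_s:
                continue
            on_shortest = (v in dist_s and t in dist_v
                           and dist_s[v] + dist_v[t] == dist_s[t])
            if on_shortest:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total / ((n - 1) * (n - 2))

adj = {"a": ["b"], "b": ["a", "c", "e"], "c": ["b", "d"],
       "d": ["c", "e"], "e": ["b", "d"]}
print(betweenness(adj, "c"))  # 2/12, i.e. 1/6
```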

## Degeneracy and $$k$$-cores

• Often interested in how graphs break down (i.e., how resilient is a graph?)
• Can define $$k$$-core of a graph as
• A maximal connected subgraph (i.e., one that cannot be extended by adding more vertices)
• Where all vertices have degree at least $$k$$ within the subgraph
• How to determine if a $$k$$-core exists? (for a fixed $$k$$)
• Recursively remove all vertices with degree < $$k$$
• If you are left with no vertices, no $$k$$-core exists
• What kind of graph would have a large $$k$$-core? A small $$k$$-core?
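The removal procedure above can be sketched as follows for an undirected adjacency dict. The function returns the set of surviving vertices, which may span more than one connected component; splitting it into components is left out of this sketch.

```python
def k_core(adj, k):
    """Repeatedly delete vertices with fewer than k surviving neighbors and
    return the set of vertices that remain (empty if no k-core exists)."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(1 for w in adj[v] if w in alive) < k:
                alive.remove(v)
                changed = True
    return alive

# Triangle a-b-c with a tail c-d.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(sorted(k_core(adj, 2)))  # ['a', 'b', 'c']
print(sorted(k_core(adj, 3)))  # []
```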


## Representing graphs

• Adjacency matrix $$\left[ \begin{array}{ccccc} 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1\\ \end{array} \right]$$
• Adjacency list $$\begin{array}{lllll} a & b\\b & a & c & d & e\\c & b & e\\ ...\\\end{array}$$
• Edge list $$\begin{array}{ll} a & b\\b & c \\ b & d\\ ...\\\end{array}$$
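As a sketch of how the three representations relate, the code below builds an adjacency list and an adjacency matrix from an edge list. It assumes an undirected graph without self-loops and uses a small hypothetical graph rather than the one shown above.

```python
def to_adjacency_list(nodes, edges):
    """Map each node to the list of its neighbors."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def to_adjacency_matrix(nodes, edges):
    """Square 0/1 matrix; entry [i][j] is 1 if nodes i and j are linked."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1
    return matrix

nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
print(to_adjacency_list(nodes, edges))
for row in to_adjacency_matrix(nodes, edges):
    print(row)
```

The matrix uses memory proportional to $$|V|^2$$ regardless of how many edges exist, while the adjacency list and edge list grow with $$|E|$$, which matters for large sparse networks.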