## Measurement and Analysis of Online Social Networks

Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, Bobby Bhattacharjee

## Motivation

• Real world graphs are essential to research
• Great discoveries of the last 15 years all came from measuring real world graphs
• Models were later built to capture these features
• The next great discoveries will also come from measured graphs
• Graph dynamics
• Interactions, influence, and contagion
• Etc.
• How do researchers acquire real world graphs?
• Not as straightforward as it seems
• Many practical and theoretical challenges to collecting real graphs

## Ways to Acquire Graphs

• Go to a company and beg for data
• Go ask researchers for their data
• Go gather data yourself

## Begging from the Man

• Quick and easy; no need to spend time crawling
• Possibility of getting complete data
• Most companies don't give data to unknown researchers
• Privacy: user data is often sensitive
• Economics: big data is often worth big money
• Company may not give you what you want
• Example: willing to give you the graph, but not user interactions
• You may be forced to sign an NDA
• May give the company right of refusal over your research findings
• Prevents you from sharing data with other researchers

## Asking Researchers for a Hookup

• Quick and easy; somebody already did all the work for you
• Basic analysis has probably already been done
• Set of real graphs in academia is quite limited
• Many graphs are incomplete or of suspicious quality
• Many are tiny (on the order of thousands of nodes)
• Almost all are simple, static snapshots, lacking metadata and temporal information
• Some researchers don't release their data
• Some researchers can't release their data
• Remember those NDAs?

## DIY: Do it Yourself

• By far, the hardest method
• But in many cases, the only method
• Practical challenges
• If the site doesn't have an API, you have to scrape it
• Gathering lots of data requires lots of time
• Data may be hidden by privacy settings
• Crawling methodology may introduce bias into the data
• Graphs are dynamic, constantly evolving
• Traditional graph traversal methods (e.g. BFS, DFS) produce biased samples

## Practical Strategies for Crawling Graphs

• This paper crawls four graph datasets
• Flickr
• LiveJournal
• Orkut
• YouTube
• Each one presents unique challenges
• Excellent case study of how to gather data and show it is representative

## Crawling Flickr

• Photo sharing site
• Users may follow each other
• Graph is directed
• API for querying users and who they follow
• In other words, only forward links can be crawled
• Crawling strategy
• Pick seed users
• Perform a BFS of the graph, starting at the seeds
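
A minimal sketch of this crawl loop (my illustration, assuming a hypothetical `get_followees(user)` wrapper around the Flickr API; a real crawler must also handle rate limits and retries):

```python
from collections import deque

def bfs_crawl(seeds, get_followees):
    """BFS over forward links, starting from a set of seed users."""
    visited = set(seeds)
    queue = deque(seeds)
    edges = []
    while queue:
        user = queue.popleft()
        for followee in get_followees(user):  # API call: who does `user` follow?
            edges.append((user, followee))
            if followee not in visited:
                visited.add(followee)
                queue.append(followee)
    return visited, edges
```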

## Problem: Data Completeness

Did the crawl locate all users?

## Estimating Completeness

• Use uniform random selection to locate an unbiased sample of users
• In this case, 26.9% of 6902 randomly chosen users were in the crawled data
• Are the remaining 73.1% of Flickr users "important"?
• Only 250 (5%) of the remaining 5043 random users were connected to the WCC
• 89.7% of random users have zero forward links
• Only 11K new nodes were discovered by BFSing from the 5043 remaining random users
• Thus, it is likely that the crawl captured the majority of the large WCC
• Corollary: nodes disconnected from the large WCC only form small subgraphs
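
A sketch of this completeness check (hypothetical helpers: `user_exists(uid)` probing the API, and `max_id` bounding Flickr's numeric ID space):

```python
import random

def estimate_coverage(crawled_users, user_exists, max_id, n=7000, seed=0):
    """Probe uniformly random user IDs; report the fraction the crawl found."""
    rng = random.Random(seed)
    valid = found = 0
    while valid < n:
        uid = rng.randrange(max_id)   # uniform random ID
        if not user_exists(uid):      # skip IDs not assigned to any user
            continue
        valid += 1
        found += uid in crawled_users
    return found / valid              # e.g. ~0.269 in the Flickr crawl
```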

## Crawling LiveJournal

• Blogging site
• Users may follow each other's blogs
• Following is much more common than on Flickr
• Graph is directed
• API for querying users, who they follow, and who follows them
• Thus, possible to completely crawl the large WCC
• Completeness estimated via random sampling
• 4773 random users out of 5000 (95.4%) present in the WCC
• Thus, the crawl is highly likely to be comprehensive

## Crawling Orkut

• "Pure" social networking site
• Users friend each other; both sides must approve
• Graph is undirected
• Users must be invited to join by a friend
• Thus, Orkut forms a single, large connected component
• Challenges
• No API, HTTP requests are rate-limited, user population is very large
• Thus, a complete BFS crawl is impossible, even using 58 machines for >1 month
• Crawl gathered 3M of Orkut's 27M users (11.3%)

## The Curse of Partial BFS

• Also known as Snowball sampling
• Partial BFS is biased towards gathering high degree nodes
• By definition, the crawler is more likely to encounter high-degree nodes
• Example: there are only two paths to locating a user with degree 2
• ... whereas there are 5000 paths to a user with degree 5000
• Thus, the degree distribution of the Orkut sample is not representative
• Other metrics are not as severely impacted by this bias
• E.g. clustering and assortativity
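
A toy demonstration of the bias (my construction, not from the paper): BFS-sample 10% of a synthetic scale-free graph and compare mean degrees.

```python
import networkx as nx

G = nx.barabasi_albert_graph(100_000, m=3, seed=42)  # scale-free toy graph

sample = {0}
for u, v in nx.bfs_edges(G, source=0):               # partial BFS (snowball)
    sample.add(v)
    if len(sample) >= 10_000:                        # stop at 10% of the nodes
        break

mean = lambda xs: sum(xs) / len(xs)
print("full graph mean degree:", mean([d for _, d in G.degree()]))
print("BFS sample mean degree:", mean([G.degree(n) for n in sample]))
# The sample's mean degree comes out far higher, because the early BFS
# frontier is dominated by high-degree hubs.
```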

## Crawling YouTube

• Video sharing site
• Users may follow each other
• Graph is directed
• API for querying users and who they follow
• In other words, only forward links can be crawled
• Estimating Completeness
• User identifiers are user-generated strings, not numbers
• Thus, choosing a random sample of users is much more difficult (although not impossible)
• No estimates of completeness in this paper

## Summary: Strategies for DIY Graph Crawling

• Perform a complete BFS
• How long will this take?
• If the graph is directed, how many nodes are missing from the WCC?
• Gather a uniformly random sample of nodes
• May not be possible on sites with irregular identifiers
• Partial BFS/DFS, a.k.a. Snowball Sampling
• Known to produce results with biased degree distribution
• Random Walk
• Also produces results biased towards high degree nodes

## Building a Better Random Walk

• Assuming complete BFS and uniform random sampling are impossible...
• ... is there a way to sample the graph without incurring bias?
• Metropolis-Hastings random walk
• MCMC technique for sampling from a difficult-to-sample distribution
• Converges towards uniform random sampling
import random

# Accepting a move with probability min(1, deg(v)/deg(w)) cancels the
# walk's bias toward high-degree nodes, so the stationary distribution is uniform.
v = initial_seed_node                  # start the walk at a seed node
while not stop_the_crawl:
    w = random.choice(v.neighbors)     # pick a neighbor uniformly at random
    p = random.random()                # 0 <= p < 1.0
    if p <= v.degree / w.degree:       # accept with prob min(1, deg(v)/deg(w))
        v = w                          # walk to node w
    # otherwise stay at node v, which gets sampled again

• Not the only modified random walk algorithm for unbiased graph sampling
• Whole subfield of research dedicated to these algorithms

## Today's Preferred Technique: Twitter Hoses

• Twitter offers streams of tweets sampled uniformly at random (sort of)
• Larger hoses available from third parties like Gnip for a fee
• Gardenhose, a.k.a. Decahose: 10% of all tweets
• Halfhose: 50% of all tweets
• Firehose: 100% of all tweets
• Twitter is (arguably) the most popular OSN for study because of this data

## Final Thoughts

• Thinking critically about data gathering methodologies is key for evaluating study results
• Is the data complete?
• If the data is sampled, how was it sampled?
• Are there hidden assumptions or biases in the data?
• Developing sound methodologies is crucial to conducting your own research
• Think carefully about what data you need
• Is it possible to gather this data? If so, what are the challenges?
• Can you demonstrate that the data is free from bias, or the bias is quantifiable?
• GIGO
• Garbage In - Garbage Out
• If you start with poor data, the results are going to be even worse

## Systematic Topology Analysis and Generation Using Degree Correlations

Priya Mahadevan, Dmitri Krioukov, Kevin Fall, Amin Vahdat

## Limits of Real World Graphs

• Real world data is always good, but it isn't without challenges
• Hard to obtain, often incomplete or biased
• Few datasets in existence
• Are findings statistically significant if you only have 1 graph?
• Possible solution: use models to generate synthetic graphs
• Arbitrarily many sample graphs can be generated, yielding high statistical confidence
• No privacy issues, easy to share data

## Limits of Graph Models

• How do you fit a model to a given real world graph?
• Models have parameters that must be set
• How do you choose parameter values such that model output is close to a given real world graph?
• Existing models are feature-based
• Preferential Attachment produces power-law degree scaling
• Watts-Strogatz produces tight clustering and short path lengths
• Forest Fire produces shrinking graph diameters
• What about other features, or combinations of features?
• Existing models are not future-proof
• How can a model capture features that haven't been identified yet?

## $$d$$K-distributions and $$d$$K-graphs

• Simple but very powerful set of concepts
• First, extract a distribution from a given real graph
• The $$d$$K-distribution acts like a fingerprint for the real graph
• Single parameter, $$d$$, controls amount of information captured by the distribution
• As $$d$$ increases, more information is extracted
• Second, a generator synthesizes a new graph based on a $$d$$K-distribution
• Resulting graph is known as a $$d$$K-graph
• Generator is stochastic; possible to generate many $$d$$K-graphs
• ... however, $$d$$K-graphs are also statistically similar to the original graph
• Similarity increases as $$d$$ gets larger

## $$d$$K-distribution Example

Consider a small example graph with 4 nodes and degree sequence {1, 2, 2, 3}; the values below are counts.

• 0K (average degree):
• $$\overline{k} = 2$$
• 1K (degree distribution):
• P(1)=1, P(2)=2, P(3)=1
• 2K (joint degree distribution):
• P(1,3)=1, P(2,2)=1, P(2,3)=2
• 3K (clustering):
• P(1,3,2)=2, P(2,2,3)=1
• 4K:
• P(3,2,1,2)=1
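
These counts can be reproduced directly; a minimal sketch using networkx, with edges chosen by me to match the degree sequence {1, 2, 2, 3}:

```python
from collections import Counter
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4)])  # degrees: {1: 1, 2: 3, 3: 2, 4: 2}
deg = dict(G.degree())

# 0K: average degree
k_avg = 2 * G.number_of_edges() / G.number_of_nodes()  # 2.0

# 1K: degree distribution (as counts of nodes per degree)
P1 = Counter(deg.values())  # {1: 1, 2: 2, 3: 1}

# 2K: joint degree distribution (as counts of edges per degree pair)
P2 = Counter(tuple(sorted((deg[u], deg[v]))) for u, v in G.edges())
# {(1, 3): 1, (2, 3): 2, (2, 2): 1}
```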

## 1K-graph Generation

• Like a puzzle where all pieces are identical
• Very easy to put all the pieces together
• But, many possible arrangements
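
The paper defines its own generators; as an illustration (not the paper's algorithm), a 1K-graph can be realized with the configuration model, which randomly matches edge "stubs" so the degree sequence is preserved:

```python
import networkx as nx

target = nx.barabasi_albert_graph(10_000, m=3, seed=0)  # stand-in "real" graph
degree_sequence = [d for _, d in target.degree()]       # its 1K distribution

H = nx.configuration_model(degree_sequence, seed=7)     # random stub matching
H = nx.Graph(H)                                         # collapse parallel edges
H.remove_edges_from(nx.selfloop_edges(H))               # drop self-loops
# H approximately matches the target's degree distribution, but its
# higher-order structure (JDD, clustering, ...) is randomized.
```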

## 2K-graph Generation

• Like a puzzle where some pieces are identical
• More types of pieces mean more constraints on valid constructions
• Relatively harder to put together, still easy in an absolute sense
• Fewer possible arrangements than before, but still significant freedom
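
A sketch of one legal 2K-preserving move (the paper's rewiring idea; the implementation details here are mine): exchange endpoints of two edges only when those endpoints have equal degree, so the joint degree distribution is untouched.

```python
import random
import networkx as nx

def jdd_preserving_swap(G, rng=random):
    """Attempt one swap (u, v), (x, y) -> (u, y), (x, v); legal only when
    deg(v) == deg(y), which leaves both degrees and the JDD unchanged."""
    (u, v), (x, y) = rng.sample(list(G.edges()), 2)
    if len({u, v, x, y}) < 4:                  # need four distinct nodes
        return False
    if G.degree(v) != G.degree(y):             # swapped endpoints must match
        return False
    if G.has_edge(u, y) or G.has_edge(x, v):   # keep the graph simple
        return False
    G.remove_edges_from([(u, v), (x, y)])
    G.add_edges_from([(u, y), (x, v)])
    return True
```

A full generator would also try both orientations of each sampled edge and repeat swaps until the graph is well mixed.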

## $$n$$K-graph Generation

• Like a puzzle where no pieces are identical
• Only one possible construction

## Constraining $$d$$K-graphs

• 0K - How many graphs have $$n$$ nodes and average degree $$\overline{k}$$?
• 1K - How many graphs have avg. degree $$\overline{k}$$ and a specific degree distribution?
• Recall the experiment where we randomly rewired the edges of graphs... (a sketch follows this list)
• 2K - How many graphs have avg. degree $$\overline{k}$$, a specific degree distribution, and JDD?
• 3K - How many graphs have avg. degree $$\overline{k}$$, a specific degree distribution, JDD, and a clustering distribution?
• 4K - Etc...
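
A sketch of that rewiring experiment using networkx's built-in double edge swap, which preserves the degree distribution (1K) while destroying higher-order structure:

```python
import networkx as nx

G = nx.barabasi_albert_graph(1_000, m=3, seed=1)
before = nx.average_clustering(G)

# Each swap replaces edges (u, v), (x, y) with (u, x), (v, y): every
# node keeps its degree, but 2K/3K structure is scrambled.
nx.double_edge_swap(G, nswap=10 * G.number_of_edges(), max_tries=10**7)

print(f"clustering before={before:.4f} after={nx.average_clustering(G):.4f}")
```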

## $$d$$K Summary

• Method for deriving distributions from real graphs...
• ... and generating synthetic graphs from those distributions
• As $$d$$ grows toward the size of the graph, the distribution captures the target graph with perfect accuracy
• In practice, $$d = 3$$ is sufficient to capture all currently known features
• Generators for $$d > 2$$ are extremely slow and complex