4.5 A Case Study: Small World Phenomenon
This section under major construction.
Small world phenomenon.
The mathematical model that we use for studying the nature of pairwise connections among entities is known as the graph. Some graphs exhibit a specific property known as the smallworld phenomenon. You may be familiar with this property, which is sometimes known as six degrees of separation. It is the basic idea that, even though each of us has relatively few acquaintances, there is a relatively short chain of acquaintances (the six degrees of separation) separating us from one another.Graphs.
A graph is comprised of a set of vertices and a set of edges. Each edge represents a connection between two vertices. The list below suggests the diverse range of systems where graphs are appropriate starting points for understanding structure.
GRAPH VERTICES EDGES Communication telephones, computers fiber optic cable Circuits gates, registers, processors wires Mechanical joints rods, beams, springs Hydraulic reservoirs, pumping stations pipelines Financial stocks, currency transactions Transportation street intersections, airports highways, air routes Scheduling tasks precedence constraints Software systems functions function calls Internet web pages hyperlinks Games board positions legal moves Social networks people, actors, terrorists friendships, movie casts, associations Protein interaction networks proteins proteinprotein interactions Genetic regulatory networks genes regulatory interactions Neural networks neurons synapses Infectious disease people infections Electrical power grid transmission stations cable Chemical compounds molecules chemical bonds
Graph data type.
The following API supports graph processing:
Graph implementation.
Program Graph.java implements this API. Its internal representation is a symbol table of sets: the keys are vertices and the values are the sets of neighbors (the vertices adjacent to the key) This representation uses the two data types ST.java and SET.java that we introduced in Section 4.4. Our implementation has three important properties: Clients can efficiently iterate through the graph vertices.
 Clients can efficiently iterate through a vertex's neighbors.
 Space usage is proportional to the number of edges.
Performermovie graphs.
The performermovie graph includes a vertex for each performer and movie and an edge between a performer and a movie if the performer appeared in the movie.
We provide a number of sample data files (encoded using UTF8 for compatibility
with In).
Each line in a file consists of a movie, followed by a list of peformers
that appeared in the movie, separated by the / delimiter.
These input files are as of November 2, 2006. The data is taken from the Internet Movie Database.
FILE DESCRIPTION MOVIES ACTORS EDGES cast.06.txt movies release in 2006 8780 84236 103815 cast.0006.txt movies release since 2000 52195 348497 606531 cast.G.txt movies rated G by MPAA 1288 21177 28485 cast.PG.txt movies rated PG by MPAA 3687 74724 116715 cast.PG13.txt movies rated PG13 by MPAA 2538 70325 103265 cast.mpaa.txt movies rated by MPAA 21861 280624 627061 cast.action.txt action movies 14938 139861 266485 cast.rated.txt popular movies 4527 122406 217603 cast.all.txt over 250,000 movies 285462 933864 3317266
Program IndexGraph.java
takes a query which is the name of a movie (in which case it
prints the list of performers that appear in that movie) or the name
of a performer (in which case it prints the lst of movies
in which that performer has appeared).
Shortest paths in graphs.
Given two vertices in a graph, a path is a sequence of edges connecting them. A shortest path is one with minimal length over all such paths (there typically are multiple shortest paths).
We use the following API.
PathFinder.java implements the API using a classic algorithm known as breadthfirst search.
Degrees of separation.
One of the classic applications of shortestpaths algorithms is to find the degrees of separation of individuals in social networks. To fix ideas, we discuss this application in terms of a recently popularized pastime known as the Kevin Bacon game, which uses the movieperformer graph that we just considered. Kevin Bacon is a prolific actor who appeared in many movies. We assign every performer who has appeared in a movie a Bacon number: Bacon himself is 0, any performer who has been in the same cast as Bacon has a Kevin Bacon number of 1, any other performer (except Bacon) who has been in the same cast as a performer whose number is 1 has a Kevin Bacon number of 2, and so forth. For example, Meryl Streep has a Kevin Bacon number of 1 because she appeared in The River Wild with Kevin Bacon. Nicole Kidman's number is 2: although she did not appear in any movie with Kevin Bacon, she was in Cold Mountain with Donald Sutherland, and Sutherland appeared in Animal House with Kevin Bacon.Given the name of a performer, the simplest version of the game is to find some alternating sequence of movies and performers that lead back to Kevin Bacon. Remarkably, PathFinder.java is precisely the program that you need to find a shortest path that establishes the Bacon number of any performer in movies.txt.
The six degrees of separation rule suggests that most actors will have a Kevin Bacon number of 6 or less. In fact, most have a Kevin Bacon number of 3 or less!
% java PathFinder movies.txt "/" "Bacon, Kevin" Kidman, Nicole Bacon, Kevin Animal House (1978) Sutherland, Donald (I) Cold Mountain (2003) Kidman, Nicole Delpy, Julia Bacon, Kevin The Air I Breathe (2007) Delpy, Julia
Q + A
Q. What is formally meant by a small world network?
A. High local clustering and short global separation. Local clustering means that two vertices are more likely to be adjacent if they share a neighbor. Short global separation means that all (or most) pairs of vertices are connected by "short" paths. Here short means the length of chain scales logarithmic (or polylogarithmic) with the number of vertices. For example, empirically, any two documents on the Web are around 19 clicks away (directed) on average. Empirical estimate = .3 + 2.06 log_{10} N, where N = 1 billion. If web grows by factor of 1,000, only 21 clicks according to model.
Q. Other small world applications.
A. The small world phenomenon has numerous applications in the natural and social sciences. The domain of the Kevin Bacon is actormovie relationships, but we can ask the same questions with mathematicians and papers (Erdös number), musicians and rock groups, baseball players and teams, or people and online social network groups. Sociology: looking for a job via connections, marketing products or ideas, formation and spread of fame and fads, train of thought followed in a conversation, and defining representativeness of political bodies. Science: synchronization of neurons, analysis of World Wide Web, secure peertopeer networks, routing algorithms for ad hoc wireless networks, design of electrical power grids, processing of information in the human brain, phase transitions in coupled Kuramoto oscillators, spread of infectious diseases and computer viruses, and evolution of cooperation in multiplayer iterated Prisoner's Dilemma.
In The SmallWorld Phenomenon: An Algorithmic Perspective.
Q. Other applications and extensions.
A. Another important application of breadth first search is crawling the web. Or routing packets on the Internet. An important generalization is when each edge has an associated distance or time and the goal is to find a path that minimizes the sum of the edge lengths. More sophisticated algorithm known as Dijkstra's algorithm solves that problem in linearithmic time. Basis for map routing applications on the web and in GPSequipped automobiles.
Exercises
 Wikipedia graph  a nice dataset.
 Add methods vertices() and edges() to Graph that returns the number of vertices and edges in the graph, respectively.
 Add a boolean method exists(String v) that returns true if v is a vertex in the graph, and false otherwise.
 Add a boolean method exists(String v, String w) that returns true if vw is an edge in the graph, and false otherwise.
 Suppose you use a stack instead of a queue when running breadth first in PathFinder.java. Does it still compute Kevin Bacon numbers correctly?
 Suppose you use a queue instead of a stack when forming the shortest path in path(). What is the effect?
 Add a method isReachable(v) to PathFinder.java that returns true if v can reach s via some path, and false otherwise.

The degree of a vertex is the number of incident edges.
Add a method degree() to Graph
that takes a Vertex
as input and returns the degree of that vertex.
Use this method to find the actor node in the Kevin Bacon graph
with the highest degree, i.e., the actor that has appeared in the most
movies.
public int degree(String v) { if (st.contains(v)) return st.get(v).size(); else return 0; }
 Write a program MovieStats.java that reads in a movie data set and prints out the number of distinct actors and movies.
 Modify Bacon.java so that the user types two actors (one per line) and the program prints a shortest chain between the two actors.
 Create a copy constructor for Graph that takes as input a graph G, and creates and initializes a new, independent, copy of the graph. Any changes you make to G should not affect the newly created graph.
 Array of PathFinder's to save away previous queries?

Modify Graph.java to include a method iterator()
that returns an Iterator to the graph's vertices.
Also, make Graph implement the
Iterable<String> interface so that you can
use it with foreach to iterate over the graph's vertices.
public Iterator<String> iterator() { return st.iterator(); }
 True or false. At some point during breadth first search the queue can contain two vertices, one whose distance from the source is 7 and one whose distance is 9. Answer: False. The queue can contain vertices of at most two distinct distances d and d+1. Breadth first search examines the vertices in increasing order of distance from the source. When examining a vertex at distance d, only vertices of distance d+1 can be enqueued.
Creative Exercises
 Histogram.
Write a program Histogram.java to
print a histogram of Kevin Bacon numbers, indicating how many actors have
a Bacon number of 0, 1, 2, 3, etc.
# Freq  0 1 1 2083 2 187072 3 515582 4 113741 5 8269 6 772 7 93 8 7 Inf 28942
 Large Bacon numbers. Find the actors in cast.txt with the largest, but finite, Kevin Bacon number. Answer: here are a few actors whose number is 8: Ozra Esmaili, Gabriela Pereyra, Yamil Turk, Alejandro Zain.
 Actor graph. An alternate way to compute Kevin Bacon numbers is to build a graph where there is a vertex for each actor (but not for each movie). Instead, two actors are connected by an edge if they appear in a movie together. Calculate Kevin Bacon numbers by running BFS on the actor graph. Compare the running time versus the algorithm described in the text. Explain why this approach is so much slower (and more difficult to reconstructor the movies along the path).
 Connected components. Given an undirected graph, a connected component is a maximal set of vertices that are mutually reachable. Write a program CCFinder.java that computes the connected components of a graph. Include a constructor that takes a Graph as an argument and computes all of the connected components using breadth first search. Also, include a method isReachable(v, w) that returns true if v and w are in the same connected component, and false otherwise. Also include a method components() that returns the number of of connected components. Note: need a way to iterate over the vertices in the symbol table. Then run BFS from each vertex that hasn't yet been visited.
 Flood fill / image procesing. Blob = collection of neighboring pixels of the same color. How many blobs in picture? Given NbyN image of color pixels, identify all clusters of pixels of the same color that are connected. Find connected components.
 Word ladder.
Write a program WordLadder.java
that takes two 5 letter strings as commandline arguments, and reads in
a list of 5 letter words
from standard input, and prints out a shortest
(or word ladder
connecting the two strings (if it exists).
This word game was invented by Lewis Carroll and was orignally known as a doublet.
Two words can be connected in a word ladder chain if they differ
in exactly one letter.
As an example, the following word ladder connects green and brown.
green greet great groat groan grown brown
You can also try out your program on this list of 6 letter words.
 All paths. Given an undirected graph and two vertices s and t, write a program AllPaths.java that can be used to count or print out all simple paths from s to t. A simple path does not revisit any vertex more than once. In twodimensional grids, such paths are referred to as selfavoiding walks. It is a fundamental problem in in statistical physics and theoretical chemistry, e.g., to model the spatial arrangement of linear polymer molecules in a solution. Warning: there might be exponentially paths, so don't run this on large graphs. Reference.
 Garbage collection. Reference counting vs. markandsweep. Automatic memory management in languages like Java is a challenging problem. Allocating memory is easy, but discovering when a program is finished with memory (and reclaiming it) is more difficult. Reference counting: doesn't work with circular linked structure. Markandsweep algorithm. Root = local variables and static variables. Run DFS from roots, marking all variables references from roots, and so on. Then, make second pass: free all unmarked objects and unmark all marked objects. Or a copying GC would then move all of the marked objects to a single memory area. Uses one extra bit per object. JVM must pause while garbage collection occurs. Fragments memory.
 Directed graph. Create a data type Digraph.java for directed graph. It is almost identical to our Graph ADT, except that when we insert an edge vw, we only insert w on the adjacently list of v, but not v onto the adjacency list of w. Include a method outdegree(String v).
 Shortest path in a digraph. Write a directed version of PathFinder that finds the shortest path from s to every other vertex in a directed graph. Hint: breadth search first works in digraphs.
 Transitive closure. Write an ADT TransitiveClosure.java whose constructor takes a Digraph as an argument, and whose method isReachable(v, w) true if w is reachable from v along a directed path in G, and false otherwise. Note: need a way to iterate over the vertices in the symbol table. Then run BFS from each vertex. Might need to store an VbyV table or compute queries on the fly.
 Topological order and cycle detection. Given a directed graph, determine if there is a directed cycle. If so, print one out. Otherwise print out a topological order. Given a list of pairs of variables of the form x < y, find an ordering that is consistent with the partial order. For example, if x < z, y < z, z < v, and v < w, then x < y < z < v < w is one possible solution. Print that the input is inconsistent if there is a cycle, e.g., x < y, y < z, z < x.
 Cover time. Given a connected, undirected graph G, a random walk moves from vertex v to w with probability equal to 1 / degree(v) if (v, w) is an edge (and 0 otherwise). Write a program to simulate how many steps it takes to visit every vertex in the graph. Find an infinite sequence graph where the cover time grows proportional to V, proportional to V^2. Can you find one where it grows proportional to V^3 or 2^V?
 Center of the Hollywood universe. We can measure how good of a center that Kevin Bacon is by computing their Hollywood number or average path length. The Hollywood number of Kevin Bacon is the average Bacon number of all the actors (in its connected component). The Hollywood number of another actor is computed the same way, but we make them be the source instead of Kevin Bacon. Compute Kevin Bacon's Hollywood number and find an actor and actress with better Hollywood numbers than Kevin Bacon. Find the actor or actress (in the same connected component as Kevin Bacon) with the worst Hollywood number.
 Diameter. The distance between two vertices is the length of the shortest path between them. The eccentricity of a vertex is the greatest distance between it and any other vertex. The diameter of a graph is the greatest distance between any two vertices (the maximum eccentricity of any vertex). Write a program Diameter.java to compute the eccentricity of a vertex and the diameter of a graph. Use it to find the diameter of the Kevin Bacon graph.
 Parallel edges. Modify Graph to allow parallel edges, i.e., two or more edges connecting the same pair of vertices. Hint: use a Queue for the adjacency list instead of a SET.
 ErdosRenyi graph model.
Graph on V vertices. Edge possible edge is included
independently with probability p. Verify the following
properties.
 Connectivity threshholds. If p < 1/n and n is large, then most of connected components are small, with largest having O(log n) vertices. If p > 1/n then there is almost surely a giant component containing Theta(n) vertices. If p < ln n / n, the graph is disconnected with high probability; if p > ln n / n the graph is connected with high probability.
 Distribution of degrees. Distribution of degrees follows a binomial distribution. Most vertices have similar degrees. Called exponential since the probability that a vertex is connected to k other vertices decreases exponentially in k.
 No hubs. Maximum degree if p is a constant is O(log n).
 Not many triangles. Random graph model doesn't capture social networks. Typically the neighbors of a vertex are also neighbors of each other.
 Small diameter. If p > log n / n, then diameter is O(log n).
 Power law of web links. (Micahel Mitzenmacher) The indegrees and outdegrees of World Wide Web obey a power law. Can be modeled by preferred attachment process. Suppose that each web page has exactly one outgoing link. Each page is created one at a time, starting with a single page that points to itself. With probability p < 1, it links to one of the existing pages, chosen uniformly at random. With probability 1p it links to an existing page with probability proportional to the number of incoming links of that page. This rule reflects the common tendency for new web pages to point to popular pages. Write a program to simulate this process and plot a histogram of the number of incoming links. Answer: you should observe that the fraction of pages with indegree k is proportional to k^(1 / (1  p)).
Web Exercises
 All shortest pairs. Write a program AllShortestPaths.java that builds a graph from a file, and reads sourcedestination requests from standard input and prints a shortest path in the graph from the source to the destination.
 Faster word ladders.
To speed things up (if the word list is very large),
don't write a nested loop to try all pairs of
words to see if they are adjacent. To handle 5 letter words,
first sort the word list. Words that only differ in their last letter
will appear consecutively in the sorted list. Sort 4 more times,
but cyclically shift the letters one position to the right so
that words that differ in the ith letter will appear consecutively
in one of the sorted lists.
Try out this approach using a larger word list with words of different sizes. Two words of different lengths are neighbors if the smaller word is the same as the bigger word, minus the last letter, e.g., brow and brown.
 Checkers. Extend the rules of checkers to an NbyN checkerboard. Show how to determine whether a checker can become in king in the current move. (Use BFS or DFS.) Show how to determine whether black has a winning move. (Find an Euler path.)
 Combinational circuits. Determining the truth value of a combinational circuit given its inputs is a graph reachability problem (on a directed acyclic graph).
 Hex.
The game of Hex is played on a trapezoidal grid of hexagons....
 Describe how to detect whether white or black has one the game using breadth first search.
 Prove that the game cannot end in a tie. Hint: consider the set of cells reachable from the left side of the board.
 Prove that the first player in Hex can always win if they play optimally. Hint: if the second player had a winning strategy, you could choose a random cell initially, and then just copy the second players winning strategy. This is called strategy stealing.
 Explain the design flaw in the following implementation
of a ring graph.
public class RingGraph extends Graph { public RingGraph(int V) { for (int i = 0; i < V; i++) { String v = "" + i; String w = "" + (i + 1) % V; addEdge(v, w); } } }
Topological robustness. "Scalefree networks are robust to random damage, but vulernable to malicious attack." In a power law distributed small world network, deletion of random node rarely has much effect on length of shortest paths (since most shortest paths are routed through hubs). This is in constrast to random network. However, small world networks can catastrophically fail if the hub is deleted or attacked (whereas random networks have no central point of failure). Random 95% of Internet links (6209 vertices an 24401 links) can be removed and graph stays connected, but destroying 2.3% of hubs would disconnect it. Scalefree networks are also vulnerable to spreading viruses. We should immunize the hubs. Barabasi hypothesized that biological systems may be evolutionarily predisposed to prefer small world networks since they are more robust to perturbations (mutation, viral infection). Viruses do adapt to attack the hubs. Internet will still function if we randomly delete 80% of vertices. Scalefree networks are topologically robust if gamma <= 3. "promiscuous individuals are the vulnerable nodes to target in safesex campaigns." Liljeros.
Power laws and scalefree networks.
Create a histogram of the the degrees of vertices in a graph and look at distribution. Power laws model scale invariance observed in many natural phenomena, including physics, sociology, geography, and economics. Direct result of selforganization. US Road network is approximately random network with bellshaped connectivity distribution; US airport network is scalefree with many hubs. Think hub = airline hub.WWW, Internet, food webs, social webs, pairwise interactions between proteins in human body. Probability than an actor has k neighbors follows a power lawtail for large k: p(k) ~ k^(2.3). Web graph (indegrees and outdegrees) obey a power law. Probability that k documents point to a given web page ~ k^(2.1).
Generative models.
Reference: powerlaws.media.mit.edu. Traditionally, scientists have modeled graphs using completed ordered graphs (a grid or crystal lattice) or completely random graphs. Most real world networks seem fall in between the two extremes. People meet friends via existing friends, so the networks have many local connections and highly clustered. If person moves to a different city or joins a group, they form longrange connections.Such networks grow over time by the gradual addition of vertices and edges with preferential connectivity. New web page likely to connect to wellestablished web pages. Barabasi model: rich get richer. Start at time t = 1 with two vertices linked by d parallel edges. At every time t >= 2, add a new vertex with d edges using preferential attachment. Probability that new vertex will be connected to vertex i is ki / sum (kj), where ki is the connectivity of vertex i (indegree for directed graphs). Add edges one at a time, with subsequent edges doing preferential attachment using updated degrees. (Has property that for general average outdegree d, can run d = 1 model for dn steps and collapse vertices kd, kd  1, ..., (k1)d + 1 to make vertex k.) Fraction of vertices with degree k converges to 2d(d+1)/(k(k+1)(k+2)) as n gets large. In other words, a power law proportional with exponent 3, where d only influences constant. reference. Variants lead to exponents between 2 and infinity. (Exercise: simulate and compute an estimate.) As a result, hubs are formed.
Lovasz. Fixed preferential attachment graph. Fix n vertices with selfloops and m steps. In step i, choose two random vertices independently with probabilityd proportional to degree and connect them. Random graph model of Erdos and Renyi not model many large networks that arise in practice.
Simple random model is design principle for some decentralized peertopeer networks.
Cycle with random matching.
Bollobas and Chung proved that a cycle (of even length) plus
a random matching has log n diameter with high probability,
making it a small world network.