4.5   A Case Study: Small World Phenomenon


This section under major construction.


Small world phenomenon.

The mathematical model that we use for studying the nature of pairwise connections among entities is known as the graph. Some graphs exhibit a specific property known as the small-world phenomenon. You may be familiar with this property, which is sometimes known as six degrees of separation. It is the basic idea that, even though each of us has relatively few acquaintances, there is a relatively short chain of acquaintances (the six degrees of separation) separating us from one another.

Graphs.

A graph is comprised of a set of vertices and a set of edges. Each edge represents a connection between two vertices. The list below suggests the diverse range of systems where graphs are appropriate starting points for understanding structure.


GRAPH VERTICES EDGES
Communication telephones, computers fiber optic cable
Circuits gates, registers, processors wires
Mechanical joints rods, beams, springs
Hydraulic reservoirs, pumping stations pipelines
Financial stocks, currency transactions
Transportation street intersections, airports highways, air routes
Scheduling tasks precedence constraints
Software systems functions function calls
Internet web pages hyperlinks
Games board positions legal moves
Social networks people, actors, terrorists friendships, movie casts, associations
Protein interaction networks proteins protein-protein interactions
Genetic regulatory networks genes regulatory interactions
Neural networks neurons synapses
Infectious disease people infections
Electrical power grid transmission stations cable
Chemical compounds molecules chemical bonds


Graph data type.

The following API supports graph processing:

Graph API
Graph representation


Graph implementation.

Program Graph.java implements this API. Its internal representation is a symbol table of sets: the keys are vertices and the values are the sets of neighbors (the vertices adjacent to the key) This representation uses the two data types ST.java and SET.java that we introduced in Section 4.4. Our implementation has three important properties:

Performer-movie graphs.

The performer-movie graph includes a vertex for each performer and movie and an edge between a performer and a movie if the performer appeared in the movie.

Movie-Performer graph


We provide a number of sample data files (encoded using UTF-8 for compatibility with In). Each line in a file consists of a movie, followed by a list of peformers that appeared in the movie, separated by the / delimiter.

FILE DESCRIPTION MOVIES ACTORS EDGES
cast.06.txt movies release in 2006 8780 84236 103815
cast.00-06.txt movies release since 2000 52195 348497 606531
cast.G.txt movies rated G by MPAA 1288 21177 28485
cast.PG.txt movies rated PG by MPAA 3687 74724 116715
cast.PG13.txt movies rated PG-13 by MPAA 2538 70325 103265
cast.mpaa.txt movies rated by MPAA 21861 280624 627061
cast.action.txt action movies 14938 139861 266485
cast.rated.txt popular movies 4527 122406 217603
cast.all.txt over 250,000 movies 285462 933864 3317266
These input files are as of November 2, 2006. The data is taken from the Internet Movie Database.


Program IndexGraph.java takes a query which is the name of a movie (in which case it prints the list of performers that appear in that movie) or the name of a performer (in which case it prints the lst of movies in which that performer has appeared).

Shortest paths in graphs.

Given two vertices in a graph, a path is a sequence of edges connecting them. A shortest path is one with minimal length over all such paths (there typically are multiple shortest paths).

Shortest paths in a transportation graph
We use the following API.

PathFinder API
PathFinder.java implements the API using a classic algorithm known as breadth-first search.

Degrees of separation.

One of the classic applications of shortest-paths algorithms is to find the degrees of separation of individuals in social networks. To fix ideas, we discuss this application in terms of a recently popularized pastime known as the Kevin Bacon game, which uses the movie-performer graph that we just considered. Kevin Bacon is a prolific actor who appeared in many movies. We assign every performer who has appeared in a movie a Bacon number: Bacon himself is 0, any performer who has been in the same cast as Bacon has a Kevin Bacon number of 1, any other performer (except Bacon) who has been in the same cast as a performer whose number is 1 has a Kevin Bacon number of 2, and so forth. For example, Meryl Streep has a Kevin Bacon number of 1 because she appeared in The River Wild with Kevin Bacon. Nicole Kidman's number is 2: although she did not appear in any movie with Kevin Bacon, she was in Cold Mountain with Donald Sutherland, and Sutherland appeared in Animal House with Kevin Bacon.

Given the name of a performer, the simplest version of the game is to find some alternating sequence of movies and performers that lead back to Kevin Bacon. Remarkably, PathFinder.java is precisely the program that you need to find a shortest path that establishes the Bacon number of any performer in movies.txt.

% java PathFinder movies.txt "/" "Bacon, Kevin" 
Kidman, Nicole 
   Bacon, Kevin 
   Animal House (1978) 
   Sutherland, Donald (I) 
   Cold Mountain (2003) 
   Kidman, Nicole 
Delpy, Julia 
   Bacon, Kevin 
   The Air I Breathe (2007) 
   Delpy, Julia
The six degrees of separation rule suggests that most actors will have a Kevin Bacon number of 6 or less. In fact, most have a Kevin Bacon number of 3 or less!


Q + A

Q. What is formally meant by a small world network?

A. High local clustering and short global separation. Local clustering means that two vertices are more likely to be adjacent if they share a neighbor. Short global separation means that all (or most) pairs of vertices are connected by "short" paths. Here short means the length of chain scales logarithmic (or polylogarithmic) with the number of vertices. For example, empirically, any two documents on the Web are around 19 clicks away (directed) on average. Empirical estimate = .3 + 2.06 log10 N, where N = 1 billion. If web grows by factor of 1,000, only 21 clicks according to model.

Q. Other small world applications.

A. The small world phenomenon has numerous applications in the natural and social sciences. The domain of the Kevin Bacon is actor-movie relationships, but we can ask the same questions with mathematicians and papers (Erdös number), musicians and rock groups, baseball players and teams, or people and online social network groups. Sociology: looking for a job via connections, marketing products or ideas, formation and spread of fame and fads, train of thought followed in a conversation, and defining representative-ness of political bodies. Science: synchronization of neurons, analysis of World Wide Web, secure peer-to-peer networks, routing algorithms for ad hoc wireless networks, design of electrical power grids, processing of information in the human brain, phase transitions in coupled Kuramoto oscillators, spread of infectious diseases and computer viruses, and evolution of cooperation in multi-player iterated Prisoner's Dilemma.

In The Small-World Phenomenon: An Algorithmic Perspective.

Q. Other applications and extensions.

A. Another important application of breadth first search is crawling the web. Or routing packets on the Internet. An important generalization is when each edge has an associated distance or time and the goal is to find a path that minimizes the sum of the edge lengths. More sophisticated algorithm known as Dijkstra's algorithm solves that problem in linearithmic time. Basis for map routing applications on the web and in GPS-equipped automobiles.

Exercises

  1. Wikipedia graph - a nice dataset.
  2. Add methods vertices() and edges() to Graph that returns the number of vertices and edges in the graph, respectively.
  3. Add a boolean method exists(String v) that returns true if v is a vertex in the graph, and false otherwise.
  4. Add a boolean method exists(String v, String w) that returns true if v-w is an edge in the graph, and false otherwise.
  5. Suppose you use a stack instead of a queue when running breadth first in PathFinder.java. Does it still compute Kevin Bacon numbers correctly?
  6. Suppose you use a queue instead of a stack when forming the shortest path in path(). What is the effect?
  7. Add a method isReachable(v) to PathFinder.java that returns true if v can reach s via some path, and false otherwise.
  8. The degree of a vertex is the number of incident edges. Add a method degree() to Graph that takes a Vertex as input and returns the degree of that vertex. Use this method to find the actor node in the Kevin Bacon graph with the highest degree, i.e., the actor that has appeared in the most movies.
    public int degree(String v) {
       if (st.contains(v)) return st.get(v).size();
       else return 0;
    }
    
  9. Write a program MovieStats.java that reads in a movie data set and prints out the number of distinct actors and movies.
  10. Modify Bacon.java so that the user types two actors (one per line) and the program prints a shortest chain between the two actors.
  11. Create a copy constructor for Graph that takes as input a graph G, and creates and initializes a new, independent, copy of the graph. Any changes you make to G should not affect the newly created graph.
  12. Array of PathFinder's to save away previous queries?
  13. Modify Graph.java to include a method iterator() that returns an Iterator to the graph's vertices. Also, make Graph implement the Iterable<String> interface so that you can use it with foreach to iterate over the graph's vertices.
    public Iterator<String> iterator() {
       return st.iterator();
    }
    
  14. True or false. At some point during breadth first search the queue can contain two vertices, one whose distance from the source is 7 and one whose distance is 9. Answer: False. The queue can contain vertices of at most two distinct distances d and d+1. Breadth first search examines the vertices in increasing order of distance from the source. When examining a vertex at distance d, only vertices of distance d+1 can be enqueued.

Creative Exercises

  1. Histogram. Write a program Histogram.java to print a histogram of Kevin Bacon numbers, indicating how many actors have a Bacon number of 0, 1, 2, 3, etc.
      #     Freq
    ------------
      0        1
      1     2083
      2   187072
      3   515582
      4   113741
      5     8269
      6      772
      7       93
      8        7
    Inf    28942
    
  2. Large Bacon numbers. Find the actors in cast.txt with the largest, but finite, Kevin Bacon number. Answer: here are a few actors whose number is 8: Ozra Esmaili, Gabriela Pereyra, Yamil Turk, Alejandro Zain.
  3. Actor graph. An alternate way to compute Kevin Bacon numbers is to build a graph where there is a vertex for each actor (but not for each movie). Instead, two actors are connected by an edge if they appear in a movie together. Calculate Kevin Bacon numbers by running BFS on the actor graph. Compare the running time versus the algorithm described in the text. Explain why this approach is so much slower (and more difficult to reconstructor the movies along the path).
  4. Connected components. Given an undirected graph, a connected component is a maximal set of vertices that are mutually reachable. Write a program CCFinder.java that computes the connected components of a graph. Include a constructor that takes a Graph as an argument and computes all of the connected components using breadth first search. Also, include a method isReachable(v, w) that returns true if v and w are in the same connected component, and false otherwise. Also include a method components() that returns the number of of connected components. Note: need a way to iterate over the vertices in the symbol table. Then run BFS from each vertex that hasn't yet been visited.
  5. Flood fill / image procesing. Blob = collection of neighboring pixels of the same color. How many blobs in picture? Given N-by-N image of color pixels, identify all clusters of pixels of the same color that are connected. Find connected components.
  6. Word ladder. Write a program WordLadder.java that takes two 5 letter strings as command-line arguments, and reads in a list of 5 letter words from standard input, and prints out a shortest (or word ladder connecting the two strings (if it exists). This word game was invented by Lewis Carroll and was orignally known as a doublet. Two words can be connected in a word ladder chain if they differ in exactly one letter. As an example, the following word ladder connects green and brown.
    green greet great groat groan grown brown
    

    You can also try out your program on this list of 6 letter words.

  7. All paths. Given an undirected graph and two vertices s and t, write a program AllPaths.java that can be used to count or print out all simple paths from s to t. A simple path does not revisit any vertex more than once. In two-dimensional grids, such paths are referred to as self-avoiding walks. It is a fundamental problem in in statistical physics and theoretical chemistry, e.g., to model the spatial arrangement of linear polymer molecules in a solution. Warning: there might be exponentially paths, so don't run this on large graphs. Reference.
  8. Garbage collection. Reference counting vs. mark-and-sweep. Automatic memory management in languages like Java is a challenging problem. Allocating memory is easy, but discovering when a program is finished with memory (and reclaiming it) is more difficult. Reference counting: doesn't work with circular linked structure. Mark-and-sweep algorithm. Root = local variables and static variables. Run DFS from roots, marking all variables references from roots, and so on. Then, make second pass: free all unmarked objects and unmark all marked objects. Or a copying GC would then move all of the marked objects to a single memory area. Uses one extra bit per object. JVM must pause while garbage collection occurs. Fragments memory.
  9. Directed graph. Create a data type Digraph.java for directed graph. It is almost identical to our Graph ADT, except that when we insert an edge v-w, we only insert w on the adjacently list of v, but not v onto the adjacency list of w. Include a method outdegree(String v).
  10. Shortest path in a digraph. Write a directed version of PathFinder that finds the shortest path from s to every other vertex in a directed graph. Hint: breadth search first works in digraphs.
  11. Transitive closure. Write an ADT TransitiveClosure.java whose constructor takes a Digraph as an argument, and whose method isReachable(v, w) true if w is reachable from v along a directed path in G, and false otherwise. Note: need a way to iterate over the vertices in the symbol table. Then run BFS from each vertex. Might need to store an V-by-V table or compute queries on the fly.
  12. Topological order and cycle detection. Given a directed graph, determine if there is a directed cycle. If so, print one out. Otherwise print out a topological order. Given a list of pairs of variables of the form x < y, find an ordering that is consistent with the partial order. For example, if x < z, y < z, z < v, and v < w, then x < y < z < v < w is one possible solution. Print that the input is inconsistent if there is a cycle, e.g., x < y, y < z, z < x.
  13. Cover time. Given a connected, undirected graph G, a random walk moves from vertex v to w with probability equal to 1 / degree(v) if (v, w) is an edge (and 0 otherwise). Write a program to simulate how many steps it takes to visit every vertex in the graph. Find an infinite sequence graph where the cover time grows proportional to V, proportional to V^2. Can you find one where it grows proportional to V^3 or 2^V?
  14. Center of the Hollywood universe. We can measure how good of a center that Kevin Bacon is by computing their Hollywood number or average path length. The Hollywood number of Kevin Bacon is the average Bacon number of all the actors (in its connected component). The Hollywood number of another actor is computed the same way, but we make them be the source instead of Kevin Bacon. Compute Kevin Bacon's Hollywood number and find an actor and actress with better Hollywood numbers than Kevin Bacon. Find the actor or actress (in the same connected component as Kevin Bacon) with the worst Hollywood number.
  15. Diameter. The distance between two vertices is the length of the shortest path between them. The eccentricity of a vertex is the greatest distance between it and any other vertex. The diameter of a graph is the greatest distance between any two vertices (the maximum eccentricity of any vertex). Write a program Diameter.java to compute the eccentricity of a vertex and the diameter of a graph. Use it to find the diameter of the Kevin Bacon graph.
  16. Parallel edges. Modify Graph to allow parallel edges, i.e., two or more edges connecting the same pair of vertices. Hint: use a Queue for the adjacency list instead of a SET.
  17. Erdos-Renyi graph model. Graph on V vertices. Edge possible edge is included independently with probability p. Verify the following properties.
    • Connectivity threshholds. If p < 1/n and n is large, then most of connected components are small, with largest having O(log n) vertices. If p > 1/n then there is almost surely a giant component containing Theta(n) vertices. If p < ln n / n, the graph is disconnected with high probability; if p > ln n / n the graph is connected with high probability.
    • Distribution of degrees. Distribution of degrees follows a binomial distribution. Most vertices have similar degrees. Called exponential since the probability that a vertex is connected to k other vertices decreases exponentially in k.
    • No hubs. Maximum degree if p is a constant is O(log n).
    • Not many triangles. Random graph model doesn't capture social networks. Typically the neighbors of a vertex are also neighbors of each other.
    • Small diameter. If p > log n / n, then diameter is O(log n).
  18. Power law of web links. (Micahel Mitzenmacher) The indegrees and outdegrees of World Wide Web obey a power law. Can be modeled by preferred attachment process. Suppose that each web page has exactly one outgoing link. Each page is created one at a time, starting with a single page that points to itself. With probability p < 1, it links to one of the existing pages, chosen uniformly at random. With probability 1-p it links to an existing page with probability proportional to the number of incoming links of that page. This rule reflects the common tendency for new web pages to point to popular pages. Write a program to simulate this process and plot a histogram of the number of incoming links. Answer: you should observe that the fraction of pages with indegree k is proportional to k^(-1 / (1 - p)).

Web Exercises

  1. All shortest pairs. Write a program AllShortestPaths.java that builds a graph from a file, and reads source-destination requests from standard input and prints a shortest path in the graph from the source to the destination.
  2. Faster word ladders. To speed things up (if the word list is very large), don't write a nested loop to try all pairs of words to see if they are adjacent. To handle 5 letter words, first sort the word list. Words that only differ in their last letter will appear consecutively in the sorted list. Sort 4 more times, but cyclically shift the letters one position to the right so that words that differ in the ith letter will appear consecutively in one of the sorted lists.

    Try out this approach using a larger word list with words of different sizes. Two words of different lengths are neighbors if the smaller word is the same as the bigger word, minus the last letter, e.g., brow and brown.

  3. Checkers. Extend the rules of checkers to an N-by-N checkerboard. Show how to determine whether a checker can become in king in the current move. (Use BFS or DFS.) Show how to determine whether black has a winning move. (Find an Euler path.)
  4. Combinational circuits. Determining the truth value of a combinational circuit given its inputs is a graph reachability problem (on a directed acyclic graph).
  5. Hex. The game of Hex is played on a trapezoidal grid of hexagons....
    • Describe how to detect whether white or black has one the game using breadth first search.
    • Prove that the game cannot end in a tie. Hint: consider the set of cells reachable from the left side of the board.
    • Prove that the first player in Hex can always win if they play optimally. Hint: if the second player had a winning strategy, you could choose a random cell initially, and then just copy the second players winning strategy. This is called strategy stealing.
  6. Explain the design flaw in the following implementation of a ring graph.
    public class RingGraph extends Graph {
       public RingGraph(int V) {
          for (int i = 0; i < V; i++) {
             String v = "" + i;
             String w = "" + (i + 1) % V;
             addEdge(v, w);
          }
       }
    }
    
    The client can create a ring graph on V vertices, then invoke the addEdge() method, which would result in something other than a ring graph.

Topological robustness. "Scale-free networks are robust to random damage, but vulernable to malicious attack." In a power law distributed small world network, deletion of random node rarely has much effect on length of shortest paths (since most shortest paths are routed through hubs). This is in constrast to random network. However, small world networks can catastrophically fail if the hub is deleted or attacked (whereas random networks have no central point of failure). Random 95% of Internet links (6209 vertices an 24401 links) can be removed and graph stays connected, but destroying 2.3% of hubs would disconnect it. Scale-free networks are also vulnerable to spreading viruses. We should immunize the hubs. Barabasi hypothesized that biological systems may be evolutionarily predisposed to prefer small world networks since they are more robust to perturbations (mutation, viral infection). Viruses do adapt to attack the hubs. Internet will still function if we randomly delete 80% of vertices. Scale-free networks are topologically robust if gamma <= 3. "promiscuous individuals are the vulnerable nodes to target in safe-sex campaigns." Liljeros.

Power laws and scale-free networks.

Create a histogram of the the degrees of vertices in a graph and look at distribution. Power laws model scale invariance observed in many natural phenomena, including physics, sociology, geography, and economics. Direct result of self-organization. US Road network is approximately random network with bell-shaped connectivity distribution; US airport network is scale-free with many hubs. Think hub = airline hub.

WWW, Internet, food webs, social webs, pairwise interactions between proteins in human body. Probability than an actor has k neighbors follows a power law-tail for large k: p(k) ~ k^(-2.3). Web graph (in-degrees and out-degrees) obey a power law. Probability that k documents point to a given web page ~ k^(-2.1).

Generative models.

Reference: powerlaws.media.mit.edu. Traditionally, scientists have modeled graphs using completed ordered graphs (a grid or crystal lattice) or completely random graphs. Most real world networks seem fall in between the two extremes. People meet friends via existing friends, so the networks have many local connections and highly clustered. If person moves to a different city or joins a group, they form long-range connections.

Such networks grow over time by the gradual addition of vertices and edges with preferential connectivity. New web page likely to connect to well-established web pages. Barabasi model: rich get richer. Start at time t = 1 with two vertices linked by d parallel edges. At every time t >= 2, add a new vertex with d edges using preferential attachment. Probability that new vertex will be connected to vertex i is ki / sum (kj), where ki is the connectivity of vertex i (indegree for directed graphs). Add edges one at a time, with subsequent edges doing preferential attachment using updated degrees. (Has property that for general average outdegree d, can run d = 1 model for dn steps and collapse vertices kd, kd - 1, ..., (k-1)d + 1 to make vertex k.) Fraction of vertices with degree k converges to 2d(d+1)/(k(k+1)(k+2)) as n gets large. In other words, a power law proportional with exponent 3, where d only influences constant. reference. Variants lead to exponents between 2 and infinity. (Exercise: simulate and compute an estimate.) As a result, hubs are formed.

Lovasz. Fixed preferential attachment graph. Fix n vertices with self-loops and m steps. In step i, choose two random vertices independently with probabilityd proportional to degree and connect them. Random graph model of Erdos and Renyi not model many large networks that arise in practice.

  • Watts-Strogatz model. Watts-Strogatz model captures structural property. Watts and Strogatz proposed a hybrid model that contained typical links of vertices near each other (people know their geographic neighbors), plus some random long-range connection links. n-by-n grid digraph (local contacts) with many random shortcuts (long-range contacts). Leads to abundance of short paths.
  • Kleinberg model. No way for participants in Watts-Strogatz model to find short paths in a decentralized But Milgram experiment also had striking algorithmic component - individuals can find short paths! Kleinberg proposed making distribution of shortcuts obey a power law, with probability proportional to the dth power of the distance (in d dimensions). Each vertex has one long-range neighbor. Uniform over all distance scales (same number of links at distances 1-10 as 10-100 or 100-1000. Greedy algorithm (forward message along edge that brings it as close to target as possible in terms of lattice distance) leads to log^2 N length paths. The only exponent that works is d.

    Simple random model is design principle for some decentralized peer-to-peer networks.

    Cycle with random matching. Bollobas and Chung proved that a cycle (of even length) plus a random matching has log n diameter with high probability, making it a small world network.