Percolation theory and finding the critical point

2024 February 15
math

(As mentioned before, current MIT students should not read this material.)

  1. Exercises on random walks
  2. Percolation theory and finding the critical point
  3. Percolation theory and fractal dimension
  4. The coastline paradox
  5. Ant in a labyrinth: random walk with invasion percolation

Consider a grid of cells arranged in a lattice, each of which is randomly (iid) “occupied” with probability p = 0.4:

0  0  1  0  0  0  0  0  0  0
1  0  0  0  1  0  0  0  1  1
0  0  1  0  0  0  0  1  1  1
0  0  1  0  0  0  1  1  1  0
0  0  0  0  1  0  1  1  1  0
1  0  0  0  1  0  1  1  0  0
1  0  1  0  0  0  0  1  0  0
1  1  1  0  0  0  0  0  1  1
1  1  0  1  1  1  0  1  0  0
0  1  0  1  1  0  0  0  1  0

I have marked occupied sites with a 1, and unoccupied sites with a 0. We would like to know how the very simple behavior of each site controls the behavior of the grid as a whole. Certainly if we just count the number of occupied sites in the grid we will not find any surprises, as this does not involve the spatial geometry of the grid. However the behavior becomes much more interesting if we ask a different question: what happens if we group the occupied sites into contiguous clusters?

.  .  1  .  .  .  .  .  .  .
5  .  .  .  2  .  .  .  5  5
.  .  3  .  .  .  .  5  5  5
.  .  3  .  .  .  5  5  5  .
.  .  .  .  4  .  5  5  5  .
6  .  .  .  4  .  5  5  .  .
6  .  6  .  .  .  .  5  .  .
6  6  6  .  .  .  .  .  6  6
6  6  .  8  8  8  .  7  .  .
.  6  .  8  8  .  .  .  9  .

Here, we have labeled each cluster of contiguous occupied sites with a unique integer (note that the grid is toroidal). Of course, each cluster has a random size and shape, but it is hard to see how cluster size is controlled by the individual site probability p = 0.4. Is it improbable for 15 / 100 sites to be in a single cluster?

Note: all grids and diagrams shown here were made with the help of a Python program supplied to students, which is capable of efficiently identifying clusters and producing relevant statistics about them.
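(For readers following along without that program, here is a minimal sketch of one way to label clusters, assuming a two-dimensional toroidal grid stored as a boolean numpy array; the name label_clusters and its interface are my own, not the supplied program's.)

import numpy as np

def label_clusters(occupied):
    """Label contiguous clusters on a 2D toroidal grid of booleans.

    Returns an integer array of the same shape, with 0 for unoccupied
    sites and a distinct positive id for each cluster.
    """
    n_rows, n_cols = occupied.shape
    parent = np.arange(occupied.size)  # union-find forest over flattened site indices

    def find(i):
        while parent[i] != i:          # walk to the root, halving the path as we go
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for r in range(n_rows):
        for c in range(n_cols):
            if not occupied[r, c]:
                continue
            here = r * n_cols + c
            # Join with the right and down neighbours, wrapping around the edges.
            for rr, cc in ((r, (c + 1) % n_cols), ((r + 1) % n_rows, c)):
                if occupied[rr, cc]:
                    union(here, rr * n_cols + cc)

    labels = np.zeros(occupied.shape, dtype=int)
    ids = {}
    for r in range(n_rows):
        for c in range(n_cols):
            if occupied[r, c]:
                root = find(r * n_cols + c)
                labels[r, c] = ids.setdefault(root, len(ids) + 1)
    return labels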

Let us bump up the grid size a little bit and focus on what happens to the largest cluster as we vary the probability p.

Here we have used the same random seed for each run, but only varied the threshold of what is considered an “occupied” site. When we use p = 0.55, more than half the sites in the grid are occupied, but they do not form large clusters: the region of black cells in the bottom-center is the largest such cluster.

As we slowly increase the threshold p, the number of occupied sites also slowly increases, but there are so many closely packed clusters that the newly occupied sites inevitably start merging some of the existing clusters, so the size of the largest cluster grows quite fast; observe the large blocks of solid color above, representing large regions that join the largest cluster all at once. By the time p = 0.6, one large cluster dominates the whole grid, containing most occupied sites.
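A sketch of that trick, using the hypothetical label_clusters helper above: draw a single uniform random field once, and treat a site as occupied when its value falls below the threshold p, so that raising p only ever adds occupied sites and never rearranges the existing ones.

import numpy as np

rng = np.random.default_rng(seed=1)   # fixed seed: the same random field every run
L = 200
field = rng.random((L, L))            # one uniform draw per site, made only once

for p in (0.55, 0.57, 0.59, 0.60):
    occupied = field < p                          # raising p only adds occupied sites
    sizes = np.bincount(label_clusters(occupied).ravel())[1:]
    print(p, sizes.max() if sizes.size else 0)    # size of the largest cluster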

If we repeat this experiment, we find that one massive cluster emerges every time at about the same probability. Indeed, if we move to an infinite lattice, it can be proven that there exists a critical threshold p_c such that when p > p_c there exists an infinitely large cluster with probability 1, while conversely for p \leq p_c the probability of an infinite cluster is 0. (The behavior at p = p_c itself has only been proven for 2D and d \geq 19; for other dimensions it is conjectured.) This abrupt shift in behavior is an example of a phase transition.

For a fixed but large grid size L, let q_1 be the probability that a randomly selected occupied site is part of the largest cluster, and similarly q_2 for the second largest cluster. If C_1, C_2 are the sizes of the largest and second largest clusters, then (approximately)

q_i = \frac {C_i}{pL^d}

where d is the dimension of the grid. Let us estimate q_i by simulating a number of random grids, and graph how q_i depends on p for a fixed L and d. When computing the q_i, it is important to perform many simulations and take an average of the results to reduce noise in the graphs; note that the smaller the choice of L, the more simulations will be needed.
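A sketch of such an estimate in two dimensions, again leaning on the label_clusters helper sketched above (the function name estimate_q and the default of 20 runs are my own choices, not prescribed by the text):

import numpy as np

def estimate_q(p, L, n_runs=20, rng=None):
    """Average q_1 and q_2 = C_i / (p L^2) over n_runs independent 2D grids."""
    rng = rng or np.random.default_rng()
    q = np.zeros(2)
    for _ in range(n_runs):
        occupied = rng.random((L, L)) < p
        sizes = np.sort(np.bincount(label_clusters(occupied).ravel())[1:])[::-1]
        c1 = sizes[0] if sizes.size > 0 else 0    # largest cluster size C_1
        c2 = sizes[1] if sizes.size > 1 else 0    # second largest cluster size C_2
        q += np.array([c1, c2]) / (p * L**2)
    return q / n_runs                             # averaging reduces the noise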

Because q_1 and q_2 are mostly near either 0 or 1, especially for large L, the most appropriate way to present the data is with the q_i on a log scale.

(I performed a greater number of simulations for values of p nearer the critical point, and smoothed the graphs, as will be discussed shortly. I was pleased to discover when redoing these graphs a year later that I had saved the simulation results and didn't have to re-run them; across all graphs I used in total about 2.5 million simulations, for each of which I recorded the 10 largest cluster sizes.)

Observe that when p \ll p_c, q_1 and q_2 are quite close in value: the largest cluster is not especially distinguished, but simply one of many. In two dimensions, take for example p = 0.4 and L = 5000, for which q_1 \approx 10^{-5}. There are 10^7 = pL^2 occupied sites in the grid, so the largest cluster is about 100 sites in size. With the grid containing a few million clusters, it is no surprise that the second largest is close behind in size.

As p increases, the largest and second largest clusters both tend to grow in size, and so do q_1 and q_2. (The drop in the q_i that occurs for very small p is because, even as the number of clusters increases, the largest cluster remains only a few sites in size while the denominator pL^d keeps growing.) But as we get close to p_c there is a sudden shift in behavior: the largest clusters begin to approach the size of the whole grid, and any two big clusters are likely to meet. All of the significant clusters get absorbed into a single massive cluster, and the larger a cluster is, the more likely it is to be absorbed. Thus as p > p_c increases further, q_2 drops quickly.

How can we use this information to estimate the critical point p_c? There are several different approaches we can use here. When 0 < p < p_c, q_1 \to 0 in the limit L \to \infty, and conversely when p > p_c, q_1 \to 1 as L \to \infty, so the critical point can be identified by anything sensitive to this transition; for example, we can look for the value of p where q_1 = 1/2, or where the derivative dq_1 / dp is maximized, either of which will converge to p_c in the limit of large L (though not necessarily at the same rate). (On further thought I think this is an error: q_1 might be continuous in p even at infinite L, so the former method would actually give a value very slightly larger than p_c. I think the latter method still works, though.) I prefer a different method, which is to identify the value of p that maximizes q_2; this also converges to p_c.

Now certainly we can just eyeball the peak of q_2 in the figure, or use the value of p from whichever run had the highest q_2; since we used a large enough value of L, this will give a decent result. However, picking the run with the largest q_2 is not robust against random noise, and we can improve on our results by choosing a numerical method that is less sensitive to noise in the data. Probably the best way to deal with the noise is to fit the graph of q_1 or q_2 to a suitable functional form near the point of interest; this can be somewhat tricky, especially since (for example) the peak of q_2 is asymmetrical. A much simpler method is to smooth the graph before searching for the peak; then any spikes due to a single data point should be softened sufficiently that they no longer stand out as the highest point. (Note, though, that the maximum of q_2 will be lowered by the smoothing process, and since the peak is asymmetric, its location will be shifted slightly toward a lower value of p.)

In my approach, I used Gaussian smoothing (note that this is one of the many applications of the diffusion equation we worked with last time) with the function

f(x) = \exp\left(-\frac {x^2}{2 \sigma^2}\right),

where \sigma is the smoothing length-scale. The larger the choice of \sigma, the more smoothing is applied; the goal is to use \sigma large enough to smooth out any peaks due to noise, but not so large as to actually smooth out the peak we are interested in. A function f used to perform smoothing is called a kernel.

If we have a continuous function y(x), smoothing it with a kernel f gives us

\tilde y(x) = \frac {\int f(x' - x) y(x')\ dx'}{\int f(x' - x)\ dx'},

which is to say we use f to perform a weighted average of the y(x') for x' near x. If we have a discrete series y(x_i), as is more likely in practice, we use instead

\tilde y(x) = \frac {\sum_i f(x_i - x) y(x_i)}{\sum_i f(x_i - x)}.

Note that it is not necessary that the x_i are uniformly spaced or that x is one of the x_i.

Different choices of kernel f have different merits, and choosing an appropriate kernel is hard (and frequently arbitrary). A Gaussian kernel is the most common choice. As another example, using f(x) = 1 for |x| < \sigma and f(x) = 0 for |x| > \sigma gives us a smoothing where \tilde y(x) is just the average of y(x') for |x' - x| < \sigma; this is also referred to as a moving-window average.
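A direct transcription of the discrete formula above into code might look like the following sketch (the function name smooth is mine); it works whether or not the x_i are uniformly spaced, and the evaluation points need not coincide with the data points.

import numpy as np

def smooth(x, y, sigma, x_eval=None):
    """Smooth the series y(x_i) with a Gaussian kernel of length-scale sigma."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_eval = x if x_eval is None else np.asarray(x_eval, float)
    out = np.empty(len(x_eval))
    for k, xk in enumerate(x_eval):
        w = np.exp(-(x - xk) ** 2 / (2 * sigma**2))   # kernel weights f(x_i - x)
        out[k] = np.sum(w * y) / np.sum(w)            # weighted average of the y(x_i)
    return out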

Having smoothed the data, let us zoom in near the peak of q_2:

We used a smoothing length scale of \sigma = 0.0003 in the inset (where I have more data points) and \sigma = 0.003 in the full graphs shown previously. The slope dq_1/dp was estimated by computing

\left. \frac{dq_1}{dp} \right\rvert_{p + 0.5 \Delta p} = \frac {q_1(p + \Delta p) - q_1(p)}{\Delta p}

where \Delta p is the distance between consecutive data points. (Estimating a derivative from just two data points should generally only be done with smoothed data. Note that since smoothing and taking the derivative are both linear operations, it doesn't matter in which order we apply them: taking the derivative of smoothed data gives the same result as smoothing the derivative of the unsmoothed data.)
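In code, and assuming p_grid is the sorted, uniformly spaced array of p values while q1_s and q2_s are the smoothed curves from the previous step (all three names are my own), the two estimators might look like:

import numpy as np

dq1_dp = np.diff(q1_s) / np.diff(p_grid)       # two-point slope, safe only on smoothed data
p_mid = p_grid[:-1] + 0.5 * np.diff(p_grid)    # slope attributed to the midpoint p + 0.5*dp

p_c_from_slope = p_mid[np.argmax(dq1_dp)]      # p where dq_1/dp is largest
p_c_from_peak = p_grid[np.argmax(q2_s)]        # p where the smoothed q_2 peaks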

We get a very good estimate of p_c:

                                          2D          3D
\operatorname{argmax} \frac{dq_1}{dp}     0.59188     0.31201
\operatorname{argmax} q_2                 0.59158     0.31068
True p_c                                  0.592746    0.3116

One alternative method would be to consider the value of p at which the largest cluster first spans the width of the whole domain.

Can we estimate the error in our method without reference to an authoritative value? As our calculation involves random simulations, we may repeat it and get a different answer; by repeating it many times we can build up a probability distribution, and thus estimate the standard deviation of sampling from this distribution. (Say we repeat the calculation 100 times and estimate that the standard deviation of each individual calculation is 0.002, so we might guess that the error in our answer is about 0.002. However, having done 100 times as many simulations, we can now improve on our answer by taking the mean of the repeated estimates; with 100 times as many simulations it will have a much lower error. But now how will we know what this new estimate's error is? Will we have to repeat this procedure 100 times, doing 10000 times as much work as the original calculation? No: when averaging samples from a random distribution, the standard deviation of the mean falls like the square root of the number of samples, so we know the end result will have an error of about 0.002 / \sqrt{100} = 0.0002.)

However, that is not the only source of error, nor even the main one. The main source of error is that we used a finite L, whereas this method only converges to the true value of p_c in the limit L \to \infty. We can reduce the error by using larger values of L, but eventually we must stop at some largest size we can simulate; I stopped at L = 5000 in two dimensions. So how can we improve without using a larger L? By repeating the calculation with multiple smaller values of L, we can see how the computed value varies with L. Extrapolating these values numerically lets us estimate the limit at much larger values of L than we are capable of actually simulating. This general technique is called series acceleration, and there are many approaches, such as Aitken's method.
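As a sketch, Aitken's \Delta^2 method applied to a sequence of p_c estimates at increasing L might look like this (the sequence itself would come from rerunning the analysis above at each L; the function name aitken is mine):

import numpy as np

def aitken(seq):
    """One pass of Aitken's Delta-squared acceleration on a convergent sequence."""
    s = np.asarray(seq, dtype=float)
    d1 = s[1:] - s[:-1]                  # first differences s_{n+1} - s_n
    d2 = d1[1:] - d1[:-1]                # second differences
    return s[:-2] - d1[:-1] ** 2 / d2    # accelerated sequence, two terms shorter

# p_c_at_L = [...]   estimates of p_c computed at successively larger L
# aitken(p_c_at_L) then approximates the L -> infinity limit more closely.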

Indeed this was the whole point of the previous problem on random walks; even in 1D it was infeasible to directly estimate the probability of a random walker never returning to the origin (the average recurrence time is infinite!), so instead we estimated the probability of not returning for various finite times and extrapolated to infinity.

Follow RSS/Atom feed for updates.