(As mentioned before, current MIT students should not read this material.)
Consider a grid of cells arranged in a lattice, each of which is
randomly (iid) “occupied” with probability $p$:
0 0 1 0 0 0 0 0 0 0
1 0 0 0 1 0 0 0 1 1
0 0 1 0 0 0 0 1 1 1
0 0 1 0 0 0 1 1 1 0
0 0 0 0 1 0 1 1 1 0
1 0 0 0 1 0 1 1 0 0
1 0 1 0 0 0 0 1 0 0
1 1 1 0 0 0 0 0 1 1
1 1 0 1 1 1 0 1 0 0
0 1 0 1 1 0 0 0 1 0
I have marked occupied sites with a 1, and unoccupied sites with a 0. We would like to know how the very simple behavior of each site controls the behavior of the grid as a whole. Certainly if we just count the number of occupied sites in the grid we will not find any surprises, as this does not involve the spatial geometry of the grid. However the behavior becomes much more interesting if we ask a different question: what happens if we group the occupied sites into contiguous clusters?
. . 1 . . . . . . .
5 . . . 2 . . . 5 5
. . 3 . . . . 5 5 5
. . 3 . . . 5 5 5 .
. . . . 4 . 5 5 5 .
6 . . . 4 . 5 5 . .
6 . 6 . . . . 5 . .
6 6 6 . . . . . 6 6
6 6 . 8 8 8 . 7 . .
. 6 . 8 8 . . . 9 .
Here, we have labeled each cluster of contiguous occupied sites with
a unique integer (note that the grid is toroidal). Of course, each
cluster has a random size and shape, but it is hard to see how cluster
size is controlled by the individual site probability $p$. Is it improbable for 15 / 100 sites to be
in a single cluster?
Note: all grids and diagrams shown here were made with the help of a Python program supplied to students, which is capable of efficiently identifying clusters and producing relevant statistics about them.
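As a rough illustration of what such a program needs to do (this is my own minimal sketch, not the supplied program, and the function and variable names are made up), clusters on a toroidal grid can be identified with a union-find structure:

```python
import numpy as np

def find(parent, i):
    # Union-find "find" with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def label_clusters(grid):
    """Label contiguous clusters of occupied sites on a toroidal 2D grid.

    grid: boolean array; returns an integer array where each occupied site
    carries its cluster label (unoccupied sites are labeled 0).
    """
    n, m = grid.shape
    parent = list(range(n * m))
    for i in range(n):
        for j in range(m):
            if not grid[i, j]:
                continue
            # Merge with the right and down neighbours, wrapping around the torus.
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = (i + di) % n, (j + dj) % m
                if grid[ni, nj]:
                    a, b = find(parent, i * m + j), find(parent, ni * m + nj)
                    parent[a] = b
    labels = np.zeros((n, m), dtype=int)
    label_of_root = {}
    for i in range(n):
        for j in range(m):
            if grid[i, j]:
                root = find(parent, i * m + j)
                labels[i, j] = label_of_root.setdefault(root, len(label_of_root) + 1)
    return labels

# Example: one 10 x 10 grid occupied with probability p = 0.4.
rng = np.random.default_rng(0)
grid = rng.random((10, 10)) < 0.4
labels = label_clusters(grid)
cluster_sizes = np.bincount(labels.ravel())[1:]   # sizes of clusters 1, 2, ...
```

A production version would be vectorized (for non-toroidal grids, `scipy.ndimage.label` does the labeling step much faster), but this captures the idea.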
Let us bump up the grid size a little bit and focus on what happens
to the largest cluster as we vary the probability $p$.
Here we have used the same random seed for each run, but only varied
the threshold of what is considered an “occupied” site. At the smallest value of $p$ shown, more than half the sites in the grid are occupied, but they do not form large clusters: the region of black cells in the bottom-center is the largest such cluster.
As we slowly increase the threshold $p$, the number of occupied sites also slowly
increases, but there are so many closely packed clusters that the newly
occupied sites inevitably start merging some of the existing clusters,
so the size of the largest cluster grows quite fast; observe the large
blocks of solid color above, representing large regions that join the
largest cluster all at once. By the time $p$ reaches the largest value shown, one large cluster dominates the whole grid, containing most occupied sites.
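For concreteness, here is a sketch of that experiment (reusing the `label_clusters` sketch above; the grid size, seed, and thresholds are placeholders rather than the values used for the figures). One uniform random number is drawn per site, and a site counts as occupied whenever its number falls below the threshold $p$, so raising $p$ only ever adds sites:

```python
import numpy as np

rng = np.random.default_rng(42)       # fixed seed: the random field is drawn once
L = 200
u = rng.random((L, L))                # one uniform number per site

for p in (0.50, 0.55, 0.58, 0.59, 0.60, 0.65):   # illustrative thresholds only
    grid = u < p                      # occupied sites at this threshold
    sizes = np.bincount(label_clusters(grid).ravel())[1:]
    largest = sizes.max() if sizes.size else 0
    print(f"p = {p:.2f}: {grid.mean():.3f} occupied, "
          f"largest cluster holds {largest / max(grid.sum(), 1):.3f} of occupied sites")
```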
If we repeat this experiment, we find that one massive cluster
emerges every time at about the same probability. Indeed if we move to
an infinite lattice, it can be proven that there exists a critical threshold $p_c$ such that for $p > p_c$ there exists an infinitely large cluster with probability 1, while conversely for $p < p_c$ the probability of an infinite cluster is 0.¹ The behavior at $p = p_c$ itself has only been proven for 2D and for sufficiently high dimensions; in other dimensions it is conjectured. This abrupt shift in behavior is an example of a phase transition.
For a fixed but large grid size $L$, let $P_1(p)$ be the probability that a randomly selected occupied site is part of the largest cluster, and similarly $P_2(p)$ for the second largest cluster. If $S_1, S_2$ are the sizes of the largest and second largest clusters, then (approximately) $$P_i \approx \frac{S_i}{p\,L^d},$$ where $d$ is the dimension of the grid. Let us estimate $P_1$ and $P_2$ by simulating a number of random grids, and graph how they depend on $p$ for a fixed $L$ and $d$. When computing the $P_i$, it is important to perform many simulations and take an average of the results to reduce noise in the graphs; note that the smaller the choice of $L$, the more simulations will be needed.
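A sketch of how $P_1$ and $P_2$ might be estimated this way (again building on the `label_clusters` sketch; the grid size, number of runs, and range of $p$ below are illustrative assumptions):

```python
import numpy as np

def estimate_P1_P2(L, p, n_runs, rng):
    """Average P1 and P2 over n_runs random L x L grids at occupation probability p."""
    p1, p2 = 0.0, 0.0
    for _ in range(n_runs):
        grid = rng.random((L, L)) < p
        sizes = np.sort(np.bincount(label_clusters(grid).ravel())[1:])[::-1]
        occupied = grid.sum()
        if occupied:
            p1 += sizes[0] / occupied
            p2 += (sizes[1] / occupied) if sizes.size > 1 else 0.0
    return p1 / n_runs, p2 / n_runs

rng = np.random.default_rng(1)
ps = np.linspace(0.45, 0.75, 61)        # sweep of occupation probabilities (2D example)
P1, P2 = np.transpose([estimate_P1_P2(100, p, 20, rng) for p in ps])
```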
Because $P_1$ and $P_2$ are mostly near either 0 or 1, especially for large $L$, the most appropriate way to present the data is with the $P_i$ on a log scale.
(I have performed a greater number of simulations for values of $p$ nearer the critical point, and smoothed the graphs, as will be discussed shortly.² I was pleased to discover when redoing these graphs a year later that I had saved the simulation results and didn’t have to re-run them; across all graphs I used in total about 2.5 million simulations, for each of which I recorded the 10 largest cluster sizes.)
Observe that when $p$ is small, $P_1$ and $P_2$ are quite close in value: the largest cluster is not especially distinguished, but simply one of many. In two dimensions take, say, a large grid at a small value of $p$, for which $P_1$ is tiny: there are millions of occupied sites in the grid, yet the largest cluster works out to be only about 100 sites in size. With the grid having a few million clusters, it is not a surprise the second largest is close behind in size.
As $p$ increases, the largest and second largest clusters both tend to grow in size, and so do $P_1$ and $P_2$. (The drop in the $P_i$ that occurs for very small $p$ is because even as the number of clusters is increasing, the largest cluster remains only a few sites in size.) But as we get close to $p_c$ there is a sudden shift in behavior; the largest clusters begin to approach the size of the whole grid, and any two big clusters are likely to meet. All of the significant clusters get absorbed into a single massive cluster, and the larger a cluster is the more likely it is to be absorbed. Thus as $p$ increases further, $P_2$ drops quickly.
How can we use this information to estimate the critical point $p_c$? There are several different approaches we can use here. When $p > p_c$, then $P_1$ tends to a positive value in the limit $L \to \infty$, and conversely when $p < p_c$ then $P_1 \to 0$ as $L \to \infty$, so the critical point can be identified by anything sensitive to this transition; for example, we can look for the value of $p$ where $P_1$ crosses some fixed threshold such as $\tfrac{1}{2}$, or where the derivative $dP_1/dp$ is maximized, either of which will converge to $p_c$ in the limit of large $L$ (though not necessarily at the same rate).³ On further thought I think this is an error, and $P_1$ might be continuous in $p$ even at infinite $L$, so the former method would actually give a value very slightly larger than $p_c$. I think the latter method still works, though. I prefer a different method, which is to identify the value of $p$ that maximizes $P_2$, which also converges to $p_c$.
Now certainly we can just eyeball the peak of $P_2$ in the figure, or use the value of $p$ from whichever run had the highest $P_2$, as we used a large enough value of $L$ that this will give a decent result. However, picking the run with the largest $P_2$ is not robust against random noise, and we can improve on our results by choosing a numerical method that is more robust to noise in the data. Probably the best way to deal with the noise is to fit the graph of $P_1$ or $P_2$ to a suitable functional form near the point of interest; this can be somewhat tricky, especially since (for example) the peak of $P_2$ is asymmetrical. A much simpler method is to smooth the graph before searching for the peak; then any spikes due to a single data point should be softened sufficiently that they will not be the highest point anymore.⁴ Though note that the maximum of $P_2$ will be lowered by the smoothing process, and as the peak is asymmetric the location of the peak will be shifted slightly to a lower value of $p$.
In my approach, I used Gaussian smoothing⁵ (note that this is one of the many applications of the diffusion equation we worked with last time) with the function $$K(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2},$$ where $\sigma$ is the smoothing length-scale. The larger the choice of $\sigma$, the more smoothing is applied; the goal is to use a $\sigma$ large enough to smooth out any peaks due to noise, but not so large as to actually smooth out the peak we are interested in. A function $K$ used to perform smoothing is called a kernel.
If we have a continuous function $f(x)$, smoothing it with a kernel $K$ gives us $$\tilde{f}(x) = \int K(x - x')\, f(x')\, dx',$$ which is to say we use $K$ to perform a weighted average of the $f(x')$ for $x'$ near $x$. If we have a discrete series $(x_i, y_i)$, as is more likely in practice, we use instead $$\tilde{y}(x) = \frac{\sum_i K(x - x_i)\, y_i}{\sum_i K(x - x_i)}.$$ Note that it is not necessary that the $x_i$ are uniformly spaced or that $x$ is one of the $x_i$.
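Here is a minimal sketch of that discrete kernel smoothing with the Gaussian kernel written above (the name `gaussian_smooth` and the handling of evaluation points are my own choices):

```python
import numpy as np

def gaussian_smooth(x, y, sigma, x_eval=None):
    """Smooth the series (x_i, y_i) with a Gaussian kernel of length-scale sigma.

    The x_i need not be uniformly spaced, and the evaluation points x_eval
    need not coincide with the data points (they default to the x_i).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_eval = x if x_eval is None else np.asarray(x_eval, float)
    # Kernel weights K(x_eval - x_i); dividing by their sum makes each output
    # value a weighted average of nearby y_i, as in the formula above.
    w = np.exp(-0.5 * ((x_eval[:, None] - x[None, :]) / sigma) ** 2)
    return (w @ y) / w.sum(axis=1)
```

Swapping the Gaussian weights for 0/1 weights inside a fixed window would give the moving-window average discussed next.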
Different choices of kernel have different merits, and choosing an appropriate kernel is hard (and frequently arbitrary). A Gaussian kernel is the most common choice. As another example, using $K(x) = 1$ for $|x| \le w$ and $K(x) = 0$ for $|x| > w$ gives us a smoothing where $\tilde{y}(x)$ is just the average of the $y_i$ with $|x - x_i| \le w$; this is also referred to as a moving-window average.
Having smoothed the data, let us zoom in near the peak of $P_2$: We used a smaller smoothing length-scale $\sigma$ in the inset (where I have more data points) than in the full graphs shown previously.
The slope $dP_1/dp$ was estimated by computing $$\frac{dP_1}{dp} \approx \frac{P_1(p + \Delta p) - P_1(p)}{\Delta p},$$ where $\Delta p$ is the distance between consecutive data points. (Estimating a derivative based on just two data points should generally only be done with smoothed data.⁶ Note that since smoothing and taking the derivative are both linear, it doesn’t matter which order we do them in; taking the derivative of smoothed data gives the same result as smoothing the derivative of unsmoothed data.)
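Putting the pieces together, here is a sketch of both estimators applied to the smoothed curves (using the illustrative `ps`, `P1`, `P2` arrays and the `gaussian_smooth` function from the sketches above; the value of `sigma` is an assumed placeholder):

```python
import numpy as np

sigma = 0.01                                   # assumed smoothing length-scale
P1_s = gaussian_smooth(ps, P1, sigma)
P2_s = gaussian_smooth(ps, P2, sigma)

# Estimator 1: the p at which the smoothed P2 peaks.
pc_from_P2 = ps[np.argmax(P2_s)]

# Estimator 2: the p at which dP1/dp is largest, with the derivative taken as
# a two-point difference of the smoothed P1, evaluated at interval midpoints.
dP1_dp = np.diff(P1_s) / np.diff(ps)
midpoints = 0.5 * (ps[:-1] + ps[1:])
pc_from_slope = midpoints[np.argmax(dP1_dp)]
```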
We get a very good estimate of $p_c$:

| | 2D | 3D |
|---|---|---|
| $p$ at peak of $P_2$ | 0.59188 | 0.31201 |
| $p$ at maximum of $dP_1/dp$ | 0.59158 | 0.31068 |
| True $p_c$ | 0.592746 | 0.3116 |
One alternative method would be to consider the value of $p$ at which the largest cluster first spans the
width of the whole domain.
Can we estimate the error in our method without reference to an authoritative value? As our calculation involves random simulations, we may repeat it and get a different answer; by repeating it many times we can develop a probability distribution, and thus estimate the standard deviation of sampling from this distribution.⁷ Say we repeat the calculation 100 times and estimate the standard deviation of each calculation is 0.002, so we might guess that the error in our answer is about 0.002. However, having done 100 times as many simulations we can now improve on our answer by taking its mean; with 100 times as many simulations it will have a much lower error. But now how will we know what this new method’s error is? Will we have to repeat this method 100 times, doing 10,000 times as much work as the original calculation? No, because when averaging samples from a random distribution the standard deviation of the mean shrinks with the square root of the number of samples, so we know the end result will have an error that scales like $1/\sqrt{N}$ (here about $0.002/\sqrt{100} = 0.0002$).
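In code, the footnote’s argument is nothing more than the standard error of the mean (a sketch; `estimates` stands for a hypothetical array of independent repetitions of the whole calculation):

```python
import numpy as np

def combine_estimates(estimates):
    """Combine repeated independent p_c estimates into a mean and its standard error.

    The scatter of a single estimate (e.g. ~0.002) shrinks by 1/sqrt(N) when
    N estimates are averaged, so no second round of repetitions is needed.
    """
    estimates = np.asarray(estimates, dtype=float)
    scatter = estimates.std(ddof=1)               # spread of a single estimate
    return estimates.mean(), scatter / np.sqrt(len(estimates))
```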
However, that is not the only error, or even the main source of error: the main source is that we used a finite $L$, whereas this method only converges to the true value of $p_c$ in the limit $L \to \infty$. We can reduce the error by using larger values of $L$, but eventually we must stop at the largest size we can afford to simulate, as I did in two dimensions. So how can we improve without using a larger $L$? By repeating this calculation with multiple smaller values of $L$, we can see how the value we compute varies with $L$. Extrapolating these values numerically lets us estimate the behavior at much higher values of $L$ than we are capable of actually simulating. This general technique is called series acceleration, with many approaches such as Aitken’s $\Delta^2$ method.
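As an illustration of the idea (a generic sketch of Aitken’s $\Delta^2$ formula, not the specific extrapolation used here), given a sequence of estimates $s_n$ at increasing $L$ one can form the accelerated sequence $s_n - (s_{n+1}-s_n)^2/(s_{n+2}-2s_{n+1}+s_n)$:

```python
import numpy as np

def aitken(seq):
    """Apply one round of Aitken's Delta-squared acceleration to a sequence.

    seq: estimates s_0, s_1, ... (e.g. p_c measured at increasing grid sizes L);
    returns a shorter sequence that typically converges faster to the same limit.
    """
    s = np.asarray(seq, dtype=float)
    num = (s[1:-1] - s[:-2]) ** 2
    den = s[2:] - 2.0 * s[1:-1] + s[:-2]   # assumes no exactly-zero denominators
    return s[:-2] - num / den
```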
Indeed this was the whole point of the previous problem on random walks; even in 1D it was infeasible to directly estimate the probability of a random walker never returning to the origin⁸ (the average recurrence time is infinite!), so instead we estimated the probability of not returning for various finite times and extrapolated to infinity.