Law of Large Numbers and Central Limit Theorem

In this article we will get an understanding of two theorems of uttermost importance in statistics: the Law of Large Numbers and the Central Limit Theorem.

After a theoretical introduction, we will demonstrate the theorems performing computer simulations of a huge number of Bernoulli trials. We will repeat the experiments with different parameters and observe the outcomes.

All the work related to this homework can be found in the repository, in the homeworks/homework-04/ directory.

Theoretical background

The Law of Large Numbers and the Central Limit Theorem are crucial in statistics. Also, they are extremely useful in order to enter into the statistics’ perspective and to understand how probability works.

Law of Large Numbers

The Encyclopedia Britannica gives a strict and authoritative formulation of the Law of Large Numbers, along with a brief introduction:

law of large numbers, in statistics, the theorem that, as the number of identically distributed, randomly generated variables increases, their sample mean (average) approaches their theoretical mean.

Intuitively, this law states that even if there might be some contradictory outcomes when repeating trials for a small number of times, the more trials we do, the more the measured relative frequencies are close to the values of the theoretical distribution.

The Encyclopedia Britannica also gives a brief historical background about the Law of Large Numbers:

The law of large numbers was first proved by the Swiss mathematician Jakob Bernoulli in 1713. He and his contemporaries were developing a formal probability theory with a view toward analyzing games of chance. Bernoulli envisaged an endless sequence of repetitions of a game of pure chance with only two outcomes, a win or a loss. Labeling the probability of a win p, Bernoulli considered the fraction of times that such a game would be won in a large number of repetitions. It was commonly believed that this fraction should eventually be close to p.

We will talk about Jakob Bernoulli again, as we will use the random variable named after him to perform simulations.

Central Limit Theorem

As it is such an authoritative source, we will cite the Encyclopedia Britannica again in order to give a formulation of the Central Limit Theorem:

central limit theorem, in probability theory, a theorem that establishes the normal distribution as the distribution to which the mean (average) of almost any set of independent and randomly generated variables rapidly converges.

One important implication of the Central Limit Theorem is that it is often possible to replace a complex or unusual distribution with a normal random variable. This can be very useful, given the huge amount of theory formulated around the normal distribution.

Computer simulations

Below, we will proceed with the computer simulations. We will give some definitions to keep a consistent terminology:

trial: a single experiment (e.g. a coin toss) with a known outcome
path or trajectory: a set of trials performed in sequence
relative frequency: an empirical frequency of trial outcomes computed on measured data
distribution: theoretical probability of outcome of a trial

Methodology and resources

All the related work can be found in the homeworks/homework-03 directory of the repository. As of the previous homework, the directory itself is structured as an npm package containing a JavaScript library, bringing all the advantages we already discussed.

A command-line toolkit is present in the repository. It can be used to perform simulations and generate relative charts offline and without a browser.

The chart.js library has been used to plot charts, so the outcome of the simulations can be presented in a clear way.

The yargs library has been used to parse the command-line arguments of the toolkit.

No third party libraries were used to actually perform the simulations.

Simulation software

First, the simulation library itself must be realized. In order to achieve a modular and flexible design, a number of JavaScript classes have been implemented:

Class	Description
Simulator	Object performing the simulations
Simulation	Outcome of a simulation, along with its metadata
Plotter	Abstract configuration generator for `chart.js`
LinePlotter	configuration generator for a `chart.js` line chart
BarPlotter	configuration generator for a `chart.js` Barchart

`Simulator`

The Simulator is used to setup and perform a simulation in a clean way. The code is given below:

export class Simulator {
  constructor({ paths, trials, p = 0.5 }) {
    this.paths = paths
    this.trials = trials
    this.p = p
  }

  simulate() {
    const data = Array.from({ length: this.paths })
      .map(() => this.simulatePath(this.trials))
    return new Simulation({ data, p: this.p })
  }

  simulatePath() {
    return Array.from({ length: this.trials })
      .map(() => Math.random() < this.p ? 1 : -1)
  }
}

This approach encourages composition, eventually passing Simulator object to functions that can then use them in a generic way.

Notice that the trials are performed with a Bernoulli distribution of parametric probability p.

`Simulation`

A Simulation represent the actual outcome of a full simulation. It contains the trajectory with all their trial outcomes and the metadata of the simulator (e.g. the probability of the used Bernoulli distribution).

Also, there are handy instance methods to perform computations over the simulation data, such as relative frequencies and normalization.

Below, the full code is given:

export class Simulation {
  constructor({ data, p }) {
    this.data = data
    this.p = p
    this.paths = data.length
    this.trials = data[0]?.length || 0
  }

  relativeFrequency() {
    return this.data.map(path => {
      const a = Array.from(path)
      let sum = 0
      for (let i = 0; i < a.length; ++i) {
        sum += a[i]
        a[i] = sum / (i + 1)
      }
      return a
    })
  }

  cumulative() {
    return this.data.map(path => {
      const a = Array.from(path)
      let sum = 0
      for (let i = 0; i < a.length; ++i) {
        sum += a[i]
        a[i] = sum
      }
      return a
    })
  }

  gaussianFrequency() {
    const data = this.cumulative()
    const { mean, sigma } = this.getGaussianParameters()

    return data.map(path => path.map((x, i) => {
      const n = i + 1
      return (x - n * mean) / (sigma * Math.sqrt(n))
    }))
  }

  getGaussianParameters() {
    return {
      mean: 2 * this.p - 1,
      sigma: 2 * Math.sqrt(this.p * (1 - this.p))
    }
  }
}

The relativeFrequency method is used to incrementally compute the relative frequencies of the simulation outcome for each trajectory.

`Plotter` and subclasses

The Plotter class offers a generic abstract interface to easily plot charts. As you can see, under the hood, it uses the chart.js library:

export class Plotter {
  constructor({ data, xLabel, yLabel }) {
    this.data = data
    this.xLabel = xLabel
    this.yLabel = yLabel
  }

  static configureChartDefaults(Chart) {
    Chart.defaults.datasets.line.borderWidth = 1
    Chart.defaults.elements.point.radius = 0
    Chart.defaults.plugins.legend.display = false
  }

  plot() {
    throw new Error("Not implemented")
  }
}

This class also offers a convenient way to preconfigure chart.js.

For the purpose of this article, we will need to plot line and bar charts. Each of those have its own class, defined as a subclass of Plotter:

export class LinePlotter extends Plotter {
  constructor({ min, max, ...args }) {
    super({ ...args })
    this.min = min
    this.max = max
  }

  plot() {
    const labels = Array.from({ length: this.data[0]?.length || 0 }, (_, i) => i)
    const labelStep = Math.ceil(this.data.length / 10)

    return {
      type: 'line',
      data: {
        labels,
        datasets: this.data.map(data => ({ data })),
      },
      options: {
        scales: {
          x: {
            title: { display: true, text: this.xLabel },
            ticks: {
              callback: (_, i) => (i % labelStep === 0 ? i.toString() : '')
            }
          },
          y: {
            title: { display: true, text: this.yLabel },
            min: this.min,
            max: this.max,
          },
        },
      },
      plugins: {
        decimation: { algorithm: 'lttb' },
      }
    }
  }
}

export class BarPlotter extends Plotter {
  constructor({ min, max, bars, step, ...args }) {
    super({ ...args })
    this.min = min
    this.max = max
    this.bars = bars
    this.step = step || Math.abs(max - min) / bars
  }

  plot() {
    const bars = this.#computeBarValues()
    const labels = Array.from({ length: bars.length },
      (_, i) => this.min + i * this.step)
      .map(x => x.toFixed(2))

    return {
      type: 'bar',
      data: {
        labels,
        datasets: [
          {
            data: bars,
            backgroundColor: 'rgb(75, 192, 192)',
          },
        ],
      },
      options: {
        scales: {
          x: { title: { display: true, text: this.xLabel } },
          y: { title: { display: true, text: this.yLabel } },
        },
      },
    }
  }

  #computeBarValues() {
    const bars = new Array(this.bars).fill(0)
    for (const path of this.data) {
      for (const z of path) {
        if (z < this.min || z > this.max)
          continue
        const i = Math.floor((z - this.min) / this.step)
        bars[i]++
      }
    }
    for (let i = 0; i < bars.length; ++i)
      bars[i] /= this.data.length * this.data[0]?.length
    return bars
  }
}

Notice that those subclasses can be called in a generic way, encouraging the use of composition:

function plotSomething(plotter, ...otherArgs) {
  // Do some stuff
  plotter.plot() // Who knows which concrete plotter am I using?
}

plotSomething(new LinePlotter({
  data: simulation.relativeFrequency(),
  xLabel: 'Trial',
  yLabel: 'Frequency',
  min: -0.2,
  max: 0.2,
}), './charts/relative-frequency.png')

plotSomething(new BarPlotter({
  data: gaussianFrequency,
  xLabel: 'Trial',
  yLabel: 'Zₙ',
  min: -3,
  max: 3,
  bars: 150,
}), './charts/gaussian-bell.png')

Running the simulations

The simulations can be ran locally without a browser using the dedicated command-line tool. It is written in the spirit of a UNIX tool:

$ yarn toolkit --help
hw04-toolkit plot [--paths n] [--trials n] [--height h] [--width w] <output-dir>

Commands:
  hw04-toolkit plot <outdir>  Plot the charts of a simulation to outdir/ (must
                              exist)

Options:
  --version  Show version number                                       [boolean]
  --help     Show help                                                 [boolean]

Launching the toolkit

First, install the dependency with yarn (or a package manager of your choice):

yarn install

Then, create a directory for the charts images and launch the homework toolkit:

mkdir charts
yarn toolkit --paths 20 --trials 40000 --height 720 --width 1240 charts/

If the command executes correctly, the charts will be found in the specified directory as PNG images.

Source code

Below, the full source code of the toolkit is given, as an example of how the classes can be used:

#!/usr/bin/env node
import yargs from 'yargs';
import { hideBin } from 'yargs/helpers';
import { writeFile } from 'node:fs/promises'
import { ChartJSNodeCanvas } from 'chartjs-node-canvas';
import { Simulator, Plotter, LinePlotter, BarPlotter } from '../src/index.js'


yargs()
  .scriptName("hw04-toolkit")
  .usage('$0 plot [--paths n] [--trials n] [--height h] [--width w] <output-dir>')
  .command('plot <outdir>', 'Plot the charts of a simulation to outdir/ (must exist)',
    y => y
      .option('width', { type: 'number', describe: 'Image width (pixels)', default: 1240 })
      .option('height', { type: 'number', describe: 'Image height (pixels)', default: 720 })
      .option('paths', { type: 'number', describe: 'Number of paths (i.e. trajectories)', default: 20 })
      .option('trials', { type: 'number', describe: 'Number of trials per path', default: 40000 })
      .option('p', { type: 'number', describe: 'Probability of success for each trial', default: 0.5 })
      .positional('outdir', { type: 'string', demandOption: true, describe: 'Output directory' }),
    plot,
  )
  .help()
  .strict()
  .parse(hideBin(process.argv))

function chartCallback(ChartJS) {
  ChartJS.defaults.responsive = true
  ChartJS.defaults.maintainAspectRatio = false
  Plotter.configureChartDefaults(ChartJS)
}

async function plot(args) {
  const simulation = new Simulator({
    paths: args.paths,
    trials: args.trials,
    p: args.p,
  }).simulate()

  exportSimulation(new LinePlotter({
    data: simulation.relativeFrequency(),
    xLabel: 'Trial',
    yLabel: 'Frequency',
    min: -0.2,
    max: 0.2,
  }), args.width, args.height, `${args.outdir}/relative-frequency.png`)

  const gaussianFrequency = simulation.gaussianFrequency()

  exportSimulation(new LinePlotter({
    data: gaussianFrequency,
    xLabel: 'Trial',
    yLabel: 'Zₙ',
  }), args.width, args.height, `${args.outdir}/gaussian-frequency.png`)

  exportSimulation(new BarPlotter({
    data: gaussianFrequency,
    xLabel: 'Trial',
    yLabel: 'Zₙ',
    min: -3,
    max: 3,
    bars: 150,
  }), args.width, args.height, `${args.outdir}/gaussian-bars.png`)
}

async function exportSimulation(plotter, width, height, outfile) {
  const configuration = plotter.plot()

  const chartJSNodeCanvas = new ChartJSNodeCanvas({ width, height, chartCallback })
  const buffer = await chartJSNodeCanvas.renderToBuffer(configuration)
  await writeFile(outfile, buffer, 'base64')
}

Simulation analysis

A simulation basically consists in a set of outcomes of Bernoulli trials, grouped together in trajectories. Below we will discuss the Law of Large Numbers and the Central Limit Theorem, giving the outcomes of some simulation in support of our theses.

Relative frequencies

Let $n$ be the number of trials. The Law of Large Numbers states that with $n \to \infty$ the relative frequency tends to the theoretical mean of the reference distribution (a Bernoulli in our case).

First: let’s perform a simulation with a small number of trials per path:

yarn toolkit plot --paths 10 --trials 80 -p 0.5 charts/trials-small

Relative frequencies with a few
trials Relative frequencies with a low number of trials

As we can see, the relative frequencies in the chart above even go out of the bound we chose, and definitely do not seem to tend to the average mean (which should be 0 for a Bernoulli distribution with $p = 0.5$). However, let’s drastically increase the number of trials per path:

yarn toolkit plot --paths 10 --trials 20000 -p 0.5 charts/trials-high

Relative frequencies with a high number of trials

We can definitely observe that the Law of Large Numbers is working here. It is clear that, after some initial fluctuations, the relative frequencies of each path tend to 0.

Normalizing the relative frequencies

With the Simulation.cumulative instance method, we can compute an empirical Binomial distribution over our simulation. Doing so, we can scale and center it to obtain a Gaussian distribution. Let’s examine the

Relative frequencies with a high number of
trials Normalized empirical Binomial with a low number of trials

Relative frequencies with a high number of
trials Normalized empirical Binomial with a high number of trials

As a surprising result, we observe that the charts are quite similar. The number of trials is too small to see the effects of the Law of Large Numbers. However, if the empirical distribution is normalized, it tends to assume the shape of a Gaussian one. We will see this much better in the following section.

Observing a Gaussian bell

Perhaps the most iconic shape in all statistics is the so-called Gaussian bell. In a Gaussian distribution, the probability tends to aggregate around the mean, with some values at the edges as “exceptions”. We can group the results of the previous normalization of the empirical Binomial distribution and plot a bar chart to see if its shape resembles a Gaussian bell.

Relative frequencies with a high number of
trials Bar chart of the normalized empirical Binomial with a low number of trials

Relative frequencies with a high number of
trials Bar chart of the normalized empirical Binomial with a high number of trials

We can see a rather odd shape in the case with a few trials. The number of single experiments is too low to make the frequencies aggregate around the theoretical mean. The case with a higher number of trials is much more significative and makes us see the effects of the Central Limit Theorem.

A huge number of trials

Below, we give the three charts types we included before, this time for a very large number of trials (took 42 seconds and about 3Gb of memory to generate, and managed to crashed my computer a couple of times trying with higher numbers):

yarn toolkit plot --paths 30 --trials 300000 -p 0.5 charts/trials-huge

Relative frequencies with a huge number of trials

Relative frequencies with a huge number of
trials Normalized empirical Binomial with a huge number of trials

Relative frequencies with a huge number of
trials Bar chart of the normalized empirical Binomial with a huge number of trials

Especially with a higher number of trajectories, we can really appreciate the effects of the Central Limit Theorem in the bar chart.

Conclusions

We developed a small framework to perform computer simulation of a large number of Bernoulli trials.

Then, we managed to leverage such simulations to understand the effects of the Law of Large Numbers and the Central Limit Theorem.