### A Primer on Statistical Estimation and Inference

#### The law of large numbers and sound statistical reasoning are the foundation for effective statistical inference in data science

*The following text draws significantly from my book, “**Data Science — An Introduction to Statistics and Machine Learning**” [Plaue 2023], recently published by **Springer Nature**.*

### Introduction

Through our everyday experience, we have an intuitive understanding of what a typical body height is for people in the population. In much of the world, adult humans are typically between 1.60 m and 1.80 m tall, while people taller than two meters are rare to meet. By providing a frequency distribution of body height, this intuited fact can be backed up with numerical evidence:

These figures are based on a dataset collected by the U.S. Centers for Disease Control and Prevention (CDC) that lists, among other attributes, the height of more than 340,000 individuals [CDC 2018]. An inspection of this frequency table shows that, in fact, more than half of the people interviewed for the survey reported their height to be between 1.60 m and 1.80 m.

Even though the **sample** is of limited size, we are confident that our investigations allow us to draw conclusions about the population as a whole. For example, based on data alone, we can conclude with some confidence that a human being cannot grow to a height of three meters.

One important goal of **stochastics** is to justify such conclusions with mathematical rigor. The field can be divided into two subfields:

- **Probability theory** deals with the mathematical definition and investigation of the concept of probability. Central objects of such an investigation are **random variables**: variables whose values are not specified or known precisely but are subject to uncertainty. In other words, we can only state the probability that a random variable takes values within a certain range.
- **Inferential statistics** is based on the assumption that statistical observations and measures, such as frequencies, means, etc., are values or **realizations** of random variables. Conversely, the field investigates the extent to which characteristics of random variables can be estimated from sampled data. In particular, under certain simplifying assumptions, it is possible to quantify the accuracy or error of such an estimate.

Let us examine a straightforward example of statistical inference: determining whether a coin is fair or biased by observing a sequence of coin tosses. We can assume that the outcome of tossing the coin is determined by a discrete random variable $X_1$ that takes on the values of zero (representing tails) or one (representing heads). If we were to flip the same coin again, we can assume that the outcome can be described by a second random variable $X_2$, which is independent of the first but follows the same distribution.

If we lack any evidence to support the hypothesis that the coin is biased, we may assume that the coin is fair. In other words, we expect that heads will appear with the same probability as tails. Under this assumption, known as the **null hypothesis**, if we were to repeat the experiment multiple times, we would expect heads to turn up about as often as tails.

Conversely, the data allow us to draw conclusions about the underlying true distribution. For example, if we were to observe very different frequencies for heads and tails, such as a 70% frequency for heads compared to 30% for tails, then — if the sample size is sufficiently large — we would be convinced that we need to correct our original assumption of equal probability. In other words, we may need to abandon our assumption that the coin is fair.

In the example above, the frequency of heads appearing in the data acts as an estimator of the probability of the random event “the coin shows heads.” Common sense suggests that our confidence in such estimates increases with the size of the sample. For instance, if the imbalance described earlier were found in only ten coin tosses (seven heads and three tails), we might not yet be convinced that we have a biased coin. It is still possible that the null hypothesis of a fair coin holds true. In everyday terms, the outcome of the experiment could also be attributed to “pure chance.” However, if we observed seventy instances of heads out of one hundred coin tosses, it would be much stronger evidence in favor of the alternative hypothesis that the coin is biased!
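To put numbers on this intuition, the following minimal sketch computes, under the null hypothesis of a fair coin, the exact probability of observing at least a given number of heads. The function name `prob_at_least_heads` is an illustrative choice and not part of any library.

```python
from math import comb


def prob_at_least_heads(k: int, n: int, p: float = 0.5) -> float:
    """Probability of observing at least k heads in n independent tosses."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Seven or more heads in ten tosses is not unusual for a fair coin ...
p_small = prob_at_least_heads(7, 10)
# ... whereas seventy or more heads in one hundred tosses would be very rare.
p_large = prob_at_least_heads(70, 100)

print(f"P(at least 7 heads in 10 tosses)   = {p_small:.4f}")
print(f"P(at least 70 heads in 100 tosses) = {p_large:.6f}")
```

For ten tosses the probability is roughly 17%, so "pure chance" remains a plausible explanation; for one hundred tosses it drops below one in ten thousand.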

### The central limit theorem: from point estimates to confidence intervals

Point estimates are among the most fundamental tools in the toolkit of statisticians and data scientists. For instance, the arithmetic mean, derived from a sizable sample of a population, provides an insight into the typical value that a given variable might assume. In machine learning, we estimate model parameters from training data, which should cover an adequate number of labeled examples.

Through experience and intuition, we have come to believe that larger samples and larger amounts of training data allow for more accurate statistical procedures and better predictive models. Inferential statistics offers a more rigorous foundation for this intuition, which is often referred to as the **law of large numbers**. Furthermore, we gain a deeper understanding of what constitutes a “sufficiently large sample” by calculating **confidence intervals**, as opposed to relying solely on point estimates. Confidence intervals provide us with ranges of values within which we can reasonably assert that the true parameter we seek to estimate resides.

In the following sections, we will present the mathematical framework for computing confidence intervals in a self-contained manner, at the core of which lies the **central limit theorem**.

#### Chebyshev’s law of large numbers

Just as we expect the relative frequency to be a good estimator for the probability of an event or outcome of a binary variable, we expect the arithmetic mean to be a good estimator for the expected value of the random variable that produces the numeric data we observe.

It is important to note that this estimate itself is again a random variable. If we roll a die 50 times and record the average number, and then we repeat the experiment, we will likely get slightly different values. If we repeat the experiment many times, the arithmetic means we recorded will follow some distribution. For large samples, however, we expect them to show only a small dispersion and to be centered around the true expected value. This is the key message of **Chebyshev’s law of large numbers**, which we will detail below.

Before doing so, we introduce an important tool in probability theory: **Chebyshev’s inequality**. Suppose that we are given some random variable *X* with finite mean *μ* and variance *σ²*. Then, for any ε > 0, the following holds, where Pr( · ) means “probability of”:
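In its standard textbook form, and in the notation just introduced, the inequality reads:

$$
\Pr\left(|X - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{\varepsilon^2}
$$

Choosing $\varepsilon = k\sigma$ for some $k > 0$ (a symbol introduced here for illustration) gives the equivalent form $\Pr(|X - \mu| \geq k\sigma) \leq 1/k^2$.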

This result aligns with our intuitive understanding of a measure of dispersion: the smaller the variance, the more likely it is that the random variable will take on values that are close to the mean.

For example, the probability of finding an observed value of the random variable within six standard deviations of its expected value is very high, at least 97%. In other words, the probability that a random variable takes on a value that deviates from the mean by more than six standard deviations is very low, less than 3%. This result holds for distributions of any shape as long as the expected value and variance are finite values.

Now suppose that we observe numeric values in a sample that are the realizations of random variables $X_1, \dots, X_N$. We assume that these random variables are mutually independent and follow the same distribution, a property commonly known as **independent and identically distributed**, or **i.i.d.** for short. This assumption is reasonable when the observations are the result of independently set up and identically prepared trials or when they represent a random selection from a population. However, it is important to note that this assumption may not always be justified.

In addition, we assume that the expected value *μ* and variance *σ²* of every random variable exist and are finite. Since the variables follow the same distribution, these values are the same for each of the variables. Next, we consider the following random variable that produces the arithmetic mean:
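Written out, with the summation index $n$ as a notational choice made here:

$$
\bar{x} = \frac{1}{N} \sum_{n=1}^{N} X_n
$$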

First, we show that the arithmetic mean estimator x̄ is an **unbiased estimator**: its values distribute around the true mean *μ*. This is a result that follows directly from the linearity of the expected value *E*[ · ]:
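The computation can be spelled out in one line, using only linearity and $E[X_n] = \mu$ for every $n$:

$$
E[\bar{x}] = E\left[\frac{1}{N} \sum_{n=1}^{N} X_n\right] = \frac{1}{N} \sum_{n=1}^{N} E[X_n] = \frac{1}{N} \cdot N \mu = \mu
$$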

Next, we want to show that for large samples, the values of the arithmetic mean estimator do not disperse too far from the true mean. Since $X_1, \dots, X_N$ are assumed to be mutually independent, they are pairwise uncorrelated. It is not difficult to check that for pairwise uncorrelated random variables, the variance of the sum can be represented as follows since all cross terms vanish:
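In symbols, writing $\operatorname{Var}[\,\cdot\,]$ for the variance and $\operatorname{Cov}[\,\cdot\,,\cdot\,]$ for the covariance (notation assumed here):

$$
\operatorname{Var}[X_1 + \dots + X_N] = \sum_{n=1}^{N} \operatorname{Var}[X_n] + \sum_{m \neq n} \operatorname{Cov}[X_m, X_n] = \sum_{n=1}^{N} \operatorname{Var}[X_n]
$$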

Therefore, the variance of the arithmetic mean estimator is given as follows:
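Combining the additivity of the variance for uncorrelated variables with the scaling rule $\operatorname{Var}[cX] = c^2 \operatorname{Var}[X]$, the calculation is:

$$
\operatorname{Var}[\bar{x}] = \frac{1}{N^2} \operatorname{Var}[X_1 + \dots + X_N] = \frac{1}{N^2} \cdot N \sigma^2 = \frac{\sigma^2}{N}
$$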

Now that we know the mean and the variance of the arithmetic mean estimator, we can apply Chebyshev’s inequality:
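With mean $\mu$ and variance $\sigma^2 / N$ inserted into the inequality, the bound for any fixed $\varepsilon > 0$ becomes:

$$
\Pr\left(|\bar{x} - \mu| \geq \varepsilon\right) \leq \frac{\sigma^2}{N \varepsilon^2} \xrightarrow{\; N \to \infty \;} 0
$$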

This result shows that the arithmetic mean is a **consistent estimator** of the expected value: it converges in probability to the true mean. In other words, for large samples, the expected value *μ* of the underlying distribution and the arithmetic mean of the sample are unlikely to differ significantly.
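The statement can be checked numerically. The following sketch, which is an illustration rather than part of the original derivation, simulates repeated rolls of a fair die and compares how often the sample mean misses the true mean 3.5 by at least 0.5 with the bound from Chebyshev’s inequality. The tolerance, the sample sizes, and the number of repetitions are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so that the run is reproducible

MU = 3.5          # expected value of a fair six-sided die
SIGMA2 = 35 / 12  # its variance
EPS = 0.5         # tolerated deviation from the true mean
REPEATS = 2000    # number of repeated experiments per sample size


def mean_of_rolls(n: int) -> float:
    """Arithmetic mean of n simulated rolls of a fair die."""
    return sum(random.randint(1, 6) for _ in range(n)) / n


results = {}
for n in (50, 500):
    means = [mean_of_rolls(n) for _ in range(REPEATS)]
    # Empirical frequency of a deviation of at least EPS from the true mean
    freq = sum(abs(m - MU) >= EPS for m in means) / REPEATS
    # Upper bound from Chebyshev's inequality applied to the mean
    bound = SIGMA2 / (n * EPS**2)
    results[n] = (freq, bound)
    print(f"N={n:3d}: observed frequency {freq:.4f}, Chebyshev bound {bound:.4f}")
```

The observed frequencies stay below the bound, which is typically far from tight: Chebyshev’s inequality makes no assumption about the shape of the distribution and is correspondingly conservative.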

#### Lindeberg–Lévy central limit theorem

Chebyshev’s law of large numbers states that, under fairly general conditions, the arithmetic mean of a large sample is very likely to be found close to the true mean of the underlying distribution. Perhaps surprisingly, we can be quite specific on how the averages of large samples distribute around the true expectation. This is the key message of the **Lindeberg–Lévy central limit theorem**. For any numbers *a*, *b* with *a* < *b*:
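In the notation used so far, with $\bar{x}$ the arithmetic mean of $N$ i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, the theorem is commonly stated as follows (the integration variable $t$ is a notational choice):

$$
\lim_{N \to \infty} \Pr\left(a \leq \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} \leq b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-t^2/2} \, \mathrm{d}t
$$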

The integrand on the right-hand side of the equation is the probability density function of the **standard normal distribution**: the **normal distribution** — which has the well-known bell shape — with vanishing mean and unit variance.

In general, a sequence of random variables is said to converge in distribution to some random variable if their cumulative distribution functions converge pointwise to the cumulative distribution function of that random variable (at every point where the latter is continuous). Thus, mathematically, the central limit theorem states that the following sequence of random variables always converges in distribution to a standard normally distributed random variable, no matter how $X_1, \dots, X_N$ are distributed (as long as they are i.i.d. with finite, nonzero variance):
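The sequence in question is the standardized sample mean; the name $Z_N$ is introduced here for illustration:

$$
Z_N = \frac{\sqrt{N} \, (\bar{x} - \mu)}{\sigma} = \frac{X_1 + \dots + X_N - N\mu}{\sigma \sqrt{N}}
$$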

Statistically, the central limit theorem implies that if we repeatedly collect a sufficiently large sample from the same population, the mean values of those samples will be approximately normally distributed. This theorem has practical significance because it allows us to make precise statements about the accuracy of statistical estimations. A common misconception is that this theorem is the reason why many empirical distributions can supposedly be approximated by a normal distribution in practice. However, this is not the case.

Although the proof of the theorem requires advanced analytical tools that we will not discuss here (see, e.g., [Durrett 2019, Theorem 3.4.1]), we can understand its practical implications through a numerical example. Let us consider the following probability density function that we assume produces the data under study:

To emphasize that the theorem holds for any shape of the underlying distribution, notice how the density function does not resemble a bell curve. We can inspect histograms of a large number of means computed from samples of size *N* drawn repeatedly from the distribution through numeric simulation. For samples that only consist of a single instance, *N* = 1, we cannot expect the limit theorem to apply — we simply reproduce the underlying distribution:

However, even for the relatively small sample size *N* = 5, the distribution of the arithmetic means, i.e., repeated sampling and computation of $(x_1 + \dots + x_5) / 5$, shows the typical bell shape of the normal distribution:
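A simulation of this kind can be sketched in a few lines. Since the density used for the figures is not reproduced here, the sketch below substitutes an exponential distribution, which is an assumption made purely for illustration, and uses the skewness of the simulated means as a crude numerical stand-in for inspecting a histogram: a symmetric bell shape has a skewness near zero.

```python
import random
from statistics import fmean, pstdev

random.seed(1)  # fixed seed so that the run is reproducible

REPEATS = 20_000  # number of simulated samples per sample size


def sample_mean(n: int) -> float:
    """Mean of n draws from an exponential distribution with rate 1 (assumed stand-in)."""
    return fmean(random.expovariate(1.0) for _ in range(n))


def skewness(values: list[float]) -> float:
    """Standardized third moment; close to zero for a symmetric histogram."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)


skew_by_n = {}
for n in (1, 5, 50):
    means = [sample_mean(n) for _ in range(REPEATS)]
    skew_by_n[n] = skewness(means)
    print(f"N={n:2d}: mean of means {fmean(means):.3f}, skewness {skew_by_n[n]:.2f}")
```

The skewness of the simulated means shrinks toward zero as *N* grows, in line with the theorem. How quickly the bell shape emerges depends on the underlying distribution; a strongly asymmetric one such as the exponential needs larger samples than a symmetric one.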

Grant Sanderson, on his YouTube channel 3Blue1Brown, made a video that provides additional intuitive insight on the central limit theorem that is delightful to watch.

#### Interval estimation and hypothesis testing

The central limit theorem is important because it allows us to specify a **confidence interval** rather than just a point estimate when estimating the mean of some population: instead of a single estimated value, we specify an interval that we can be reasonably sure contains the true mean. For example, suppose we want to ensure that our estimate is correct with 95% confidence for sufficiently large samples. We can achieve this by setting the confidence interval with a **confidence level** of γ = 0.95:
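One way to formalize this requirement is to demand that the true mean lies within a distance $\Delta > 0$ of the sample mean with probability of approximately γ. The symbol $\Delta$ for the half-width of the interval is introduced here for illustration:

$$
\Pr\left(\bar{x} - \Delta \leq \mu \leq \bar{x} + \Delta\right) \approx \gamma
$$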

We make the following ansatz with the number *z* > 0, which is yet to be determined:
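A natural choice, sketched here with $\Delta$ denoting the half-width of the interval around the sample mean (a symbol assumed for illustration), scales the half-width with the standard deviation $\sigma / \sqrt{N}$ of the arithmetic mean estimator:

$$
\Delta = z \cdot \frac{\sigma}{\sqrt{N}}
$$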

The central limit theorem allows us to conclude:
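Spelled out for an interval of half-width $z \sigma / \sqrt{N}$ around the sample mean, the argument runs as follows for sufficiently large *N*, where the final equality is the requirement that fixes *z*:

$$
\Pr\left(|\bar{x} - \mu| \leq z \frac{\sigma}{\sqrt{N}}\right) = \Pr\left(-z \leq \frac{\sqrt{N} \, (\bar{x} - \mu)}{\sigma} \leq z\right) \approx \frac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-t^2/2} \, \mathrm{d}t = \gamma
$$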

Thus, *z* = *z*(γ) is determined by the integral limits that produce an area of γ under the standard normal bell curve. For example, *z*(0.95) = 1.96 or *z*(0.99) = 2.58.

In conclusion, the interval estimate of the mean at confidence level γ based on a sufficiently large sample (commonly used rules of thumb are *N* > 30 or *N* > 50) is given as follows:
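In terms of the sample mean x̄ and the empirical standard deviation *s*(x), the interval can be written as:

$$
\mu \in \left[\, \bar{x} - z(\gamma) \frac{s(x)}{\sqrt{N}}, \;\; \bar{x} + z(\gamma) \frac{s(x)}{\sqrt{N}} \,\right]
$$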

In order to arrive at the above formula, we have replaced the mean μ and the standard deviation σ with their empirical estimates x̄ and *s*(x), respectively. This is a reasonable approximation for sufficiently large samples and can be justified by **Slutsky’s theorem**, which essentially states that the operations of basic arithmetic commute with taking the limit in distribution as long as at least one of the summands/factors converges to a constant.

Instead of the confidence level γ, the **significance level**, or **probability of error**, α = 1 − γ can be specified.

Let us compute a practical example. The 99.9% confidence interval for the average body height of male respondents in the CDC survey is given by [177.98 cm, 178.10 cm]. This high statistical accuracy is due to the large sample size *N* with more than 190,000 male persons who were interviewed. We want to demonstrate how interval estimation works for a smaller sample size. To this end, we repeatedly draw a random sample of *N* = 50 body height values and compute the corresponding 95% confidence interval. The result can be seen in the following figure:

Notice that most of the confidence intervals, shown as vertical error bars, also contain the true value of 178 cm, shown as a horizontal dashed line. However, some do not contain it, about five in one hundred — this is expected by construction and is consistent with the specified error probability of α = 5%. There is always the possibility that the interval estimate will miss the true mean of the population, especially at low confidence levels.
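The experiment can be imitated without access to the survey data. The sketch below is an illustration under stated assumptions: it draws synthetic body heights from a normal distribution with mean 178 cm and standard deviation 8 cm (both values are assumptions, not taken from the dataset) and counts how often the large-sample 95% interval contains the true mean. The helper `mean_confidence_interval` is a hypothetical name.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean, stdev

random.seed(2)  # fixed seed so that the run is reproducible


def mean_confidence_interval(sample, level=0.95):
    """Large-sample interval estimate of the mean: mean +/- z(level) * s / sqrt(N)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # roughly 1.96 for level = 0.95
    center = fmean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return center - half_width, center + half_width


# Synthetic stand-in for the survey data (assumption): normally distributed
# body heights with mean 178 cm and standard deviation 8 cm.
TRUE_MEAN, TRUE_SD, N, REPEATS = 178.0, 8.0, 50, 1000

hits = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    low, high = mean_confidence_interval(sample)
    hits += low <= TRUE_MEAN <= high  # True counts as 1

coverage = hits / REPEATS
print(f"Fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should come out close to, and for a sample size of 50 typically slightly below, the nominal 95%, because the normal quantile is used in place of the somewhat larger quantile of Student’s *t*-distribution.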

Another important application of the central limit theorem, closely related to interval estimation, is in hypothesis testing. Suppose that we have reason to believe that the expected value of a random variable *X* is *not* equal to some value μ. In that case, we want to disprove the null hypothesis *E*[*X*] = μ. We may say that this null hypothesis is not consistent with the data if the observed mean is not included in the following interval:
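By analogy with the interval estimate of the mean, and assuming that the standard deviation is again replaced by its empirical estimate *s*(x), the interval is centered at the hypothesized value μ:

$$
\left[\, \mu - z(\gamma) \frac{s(x)}{\sqrt{N}}, \;\; \mu + z(\gamma) \frac{s(x)}{\sqrt{N}} \,\right]
$$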

Let us revisit the example of a possibly biased coin from the introduction. We record the result of each coin flip, which yields a sequence of binary values, where a value of one represents heads and a value of zero represents tails. The arithmetic mean of that sequence is equal to the relative frequency of heads, and we can apply what we have learned so far. Suppose we have reason to believe that the coin is not fair. The null hypothesis claims that the coin is fair, i.e., *E*[*X*] = 0.5. In a first experiment, we observe that after ten tosses, the coin lands with heads on top seven times. At a confidence level of γ = 0.95, the null hypothesis interval for this experiment is the following: [0.24, 0.76]. The actually observed proportion of 0.7 is still within this interval. Therefore, the null hypothesis of a fair coin cannot be rejected at the given confidence level.

The sample size is relatively small, and it is actually recommended to use **Student’s t-test**. A *t*-test would correct the critical standard score *z*(0.95) = 1.96 to 2.26, and thus result in an even wider confidence interval.

If, on the other hand, we observed seventy out of one hundred coin tosses with the outcome of heads, the following confidence interval would be the result, assuming the null hypothesis to be true: [0.41, 0.59]. In this case, the actually observed proportion of 0.7 is *not* contained in the confidence interval. Thus, the null hypothesis should be rejected, and we may conclude (at the given confidence level) that the coin is biased.
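The test for one hundred tosses can be retraced with a short sketch. It assumes that the interval is centered at the hypothesized mean and that its width is computed from the empirical standard deviation of the recorded tosses; the function name `null_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev


def null_interval(sample, mu0, level=0.95):
    """Interval around the hypothesized mean mu0 in which the observed mean is expected."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return mu0 - half_width, mu0 + half_width


# One hundred tosses with seventy heads (1) and thirty tails (0)
tosses = [1] * 70 + [0] * 30
observed = fmean(tosses)
low, high = null_interval(tosses, mu0=0.5)
reject = not (low <= observed <= high)

print(f"interval under the null hypothesis: [{low:.2f}, {high:.2f}]")
print(f"observed proportion: {observed:.2f}, reject fair coin: {reject}")
```

The observed proportion lies outside the interval, so the null hypothesis of a fair coin is rejected at the 95% confidence level.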

We can also investigate whether the means of two populations are equal, based on a sample of each. The two-sided, two-sample ***Z*-test** implies a rejection of the null hypothesis of equal means if the following condition is met:
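In its common large-sample form, with sample means $\bar{x}$, $\bar{y}$, empirical standard deviations $s(x)$, $s(y)$, and sample sizes $N_x$, $N_y$ (symbols for the second sample are introduced here for illustration), the condition reads:

$$
|\bar{x} - \bar{y}| > z(\gamma) \sqrt{\frac{s(x)^2}{N_x} + \frac{s(y)^2}{N_y}}
$$

The right-hand side has the same units as the data and can be compared directly with the observed difference of means.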

### Drawing conclusions from data: pitfalls of statistical inference

Performing statistical tests and computing confidence intervals do not replace proper statistical reasoning: statistically significant effects may still have little practical relevance, or may just represent a spurious relationship.

#### Statistical vs. practical significance: effect size

Especially for very large samples, statistical tests quite commonly flag differences in the mean, or other types of effects, as significant. However, those effects might still be small in magnitude.

For example, the CDC dataset allows for comparisons between different U.S. states. We can compare the average height of male respondents in Rhode Island with that of male respondents in New York. Applying the *Z*-test, we obtain a test score of 0.33 cm at a confidence level of 95%. This value is below the observed difference of 0.44 cm. Therefore, the difference is statistically significant. However, it is very small in magnitude, and therefore we can expect it to be of little practical relevance.

In many cases, the effect size can be gauged well by specifying the effect in natural units. In the above example, we chose metric units of length. Another possibility is to specify it in units corresponding to a multiple of the standard deviation. **Cohen’s d** is a measure of the practical relevance of a statistical effect. It is defined as the difference of means divided by the pooled standard deviation [Cohen 1988, p.67]:
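In the usual notation, with sample sizes $N_x$, $N_y$ and empirical standard deviations $s(x)$, $s(y)$ of the two samples (symbols assumed here), the definition is:

$$
d = \frac{\bar{x} - \bar{y}}{s}, \qquad s = \sqrt{\frac{(N_x - 1) \, s(x)^2 + (N_y - 1) \, s(y)^2}{N_x + N_y - 2}}
$$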

The difference 0.44 cm observed in the example above corresponds to a value of 0.05 for Cohen’s *d*. When we compare the average height of respondents in Puerto Rico with those in New York, we get a value of 0.50 for Cohen’s *d*, corresponding to a difference in metric units of 4.1 cm.

Rules of thumb for interpreting values of Cohen’s *d* are noted in the following table [Sawilowsky 2009]:
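As a minimal sketch, Cohen’s *d* and a verbal interpretation can be computed as follows. The thresholds in `describe_effect` are the values commonly quoted from [Sawilowsky 2009]; the function names, the fallback label for effects below the smallest threshold, and the two tiny samples are illustrative assumptions, not data from the survey.

```python
from math import sqrt
from statistics import fmean, variance


def cohens_d(x, y):
    """Difference of means in units of the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (fmean(x) - fmean(y)) / sqrt(pooled_var)


def describe_effect(d):
    """Verbal label for |d|, using thresholds as commonly quoted from Sawilowsky (2009)."""
    thresholds = [(2.0, "huge"), (1.2, "very large"), (0.8, "large"),
                  (0.5, "medium"), (0.2, "small"), (0.01, "very small")]
    for limit, label in thresholds:
        if abs(d) >= limit:
            return label
    return "below very small"  # fallback label chosen for this sketch


# Made-up body heights in cm for two small groups
x = [180, 175, 170, 185, 178]
y = [172, 168, 176, 181, 165]

d = cohens_d(x, y)
print(f"Cohen's d = {d:.2f} ({describe_effect(d)})")
```

Applied to the values reported above, `describe_effect(0.05)` yields "very small" and `describe_effect(0.50)` yields "medium".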

#### Statistical inference vs. causal explanation: Simpson’s paradox

Certainly, one of the most frequently cited pitfalls in statistical inference is the mantra, “correlation does not imply causation.” This concept is often illustrated using examples of correlations that are blatantly spurious and sometimes comical, like attributing a shortage of pirates to global warming.

However, in practical applications, it is often far from obvious whether a statistical association is indeed spurious or indicative of a causal relationship. One source of spurious correlation that isn’t immediately discernible is the presence of unknown confounding variables. In fact, the existence of an unknown confounder can lead to the reversal of a correlation when examining specific subpopulations, a phenomenon known as **Simpson’s paradox**.

Simpson’s paradox can be illustrated by the following example (cf. [Blyth 1972], [Bickel et al. 1975] and [Freedman et al. 2007, Chap. 2, Sect. 4]): At a university’s six largest departments, $p_x$ = 30% of 1835 female applicants are admitted, compared to $p_y$ = 45% of 2691 male applicants. We can use the *Z*-test to conclude that this difference in admission rates is significant at a confidence level of 99%.

These are the numbers broken down by university department:

For each department, we can compute the two-sided test score and compare that score with the absolute value of the observed difference in admission rate, $|p_y - p_x|$. From the available data, we can also compute the rate of admission *p* for each department, irrespective of gender:

Only department *A* exhibits a significant difference in admission rates. Contrary to the comparison across all departments, it is in favor of female applicants. Departments *A* and *B* are the departments where applicants are the most likely to succeed in being admitted, by a large margin. 51% of male applicants apply to these departments, but only 7% of all female applicants do so. Therefore, the data are consistent with the hypothesis that female applicants are more likely to apply to more competitive programs, which implies that they are more likely to be rejected.
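The analysis can be retraced with the following sketch. The counts are the frequently reproduced figures for the case described in [Bickel et al. 1975]; the numbers of admitted applicants are reconstructed from rounded percentages and should be treated as an assumption, as should the choice of an unpooled standard error for the two-sample *Z*-test for proportions.

```python
from math import sqrt
from statistics import NormalDist

# department: (male applicants, males admitted, female applicants, females admitted)
ADMISSIONS = {
    "A": (825, 512, 108, 89),
    "B": (560, 353, 25, 17),
    "C": (325, 120, 593, 202),
    "D": (417, 138, 375, 131),
    "E": (191, 53, 393, 94),
    "F": (373, 22, 341, 24),
}


def z_test_threshold(p_x, n_x, p_y, n_y, level=0.99):
    """Smallest absolute difference in proportions that counts as significant."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sqrt(p_x * (1 - p_x) / n_x + p_y * (1 - p_y) / n_y)


def compare(n_y, k_y, n_x, k_x):
    """Return (male rate, female rate, significant?) for applicant/admission counts."""
    p_y, p_x = k_y / n_y, k_x / n_x
    return p_y, p_x, abs(p_y - p_x) > z_test_threshold(p_x, n_x, p_y, n_y)


# Aggregated over all departments: male applicants appear to be favored
totals = [sum(column) for column in zip(*ADMISSIONS.values())]
overall = compare(*totals)
print(f"overall: men {overall[0]:.0%}, women {overall[1]:.0%}, significant: {overall[2]}")

# Broken down by department: the picture changes
by_department = {dept: compare(*counts) for dept, counts in ADMISSIONS.items()}
for dept, (p_y, p_x, significant) in by_department.items():
    print(f"{dept}: men {p_y:.0%}, women {p_x:.0%}, significant: {significant}")
```

With these counts, the aggregated difference is significant in favor of male applicants, while department *A* is the only one with a significant difference, and there it favors female applicants.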

### Conclusion

The law of large numbers provides a robust foundation for the process of statistical estimation, and its validity is rigorously supported by the central limit theorem. Statistical estimations become increasingly accurate as more data is considered, and in many cases, we can compute metrics that quantify both the accuracy and our confidence in the outcomes.

However, it is important to emphasize that adopting a “shut up and calculate” approach is insufficient for sound statistical reasoning and effective data science. Firstly, even when **random errors** are minimized, statistical results can still be influenced by a variety of **systematic errors**. These may arise from factors such as **response bias**, malfunctioning measurement equipment, or a flawed study design that introduces **sampling bias**. Consequently, a thorough examination of potential sources of bias is imperative for reliable statistical analysis.

Secondly, when interpreting results, it is critical to recognize that statistical significance and correlation alone are inadequate for assessing the practical significance or the underlying reasons behind observed effects. Statistical findings must be contextualized to ascertain their real-world importance and to provide explanations for the observed phenomena.

### References

[Plaue 2023] Matthias Plaue. “*Data Science — An Introduction to Statistics and Machine Learning*”. Springer Berlin, Heidelberg. 2023.

[CDC 2018] Centers for Disease Control and Prevention (CDC). *Behavioral Risk Factor Surveillance System Survey Data*. Atlanta, Georgia: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention. 2018.

*The CDC data are **in the public domain and may be reproduced without permission**.*

[Durrett 2019] Rick Durrett. *Probability: Theory and Examples*. 5th ed. Cambridge University Press, May 2019.

[Cohen 1988] Jacob Cohen. *Statistical power analysis for the behavioral sciences*. 2nd ed. New Jersey, USA: Lawrence Erlbaum Associates, 1988.

[Sawilowsky 2009] Shlomo S. Sawilowsky. “New Effect Size Rules of Thumb”. In: Journal of Modern Applied Statistical Methods 8.2 (Nov. 2009), pp. 597–599.

[Blyth 1972] Colin R. Blyth. “On Simpson’s Paradox and the Sure-Thing Principle”. In: Journal of the American Statistical Association 67.338 (June 1972), pp. 364–366.

[Bickel et al. 1975] P. J. Bickel, E. A. Hammel, and J. W. O’Connell. “Sex Bias in Graduate Admissions: Data from Berkeley”. In: Science 187.4175 (Feb. 1975), pp. 398–404.

[Freedman et al. 2007] David Freedman, Robert Pisani, and Roger Purves. *Statistics*. 4th ed. W. W. Norton & Company, Feb. 2007.

*A primer on statistical estimation and inference* was originally published in Towards Data Science on Medium.