library(ggplot2)
library(patchwork)
library(palmerpenguins)
Introduction to Central Limit Theorem
theme_set(theme_bw())
theme_replace(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Sampling Distributions
Sample mean
The sampling distribution of the sample mean is the theoretical distribution of means that would result from taking all possible samples of size \(n\) from the population.
Suppose we are sampling from a population which comes from a standard Normal distribution such that observations in our sample are iid (independent, identically distributed). Each sample of size \(n\) will then consists of realizations \(\{x_1,\dots,x_n\}\), such that each \(x_i\) will be a realization of the random variable \(X_i \sim N(0,1)\)
To obtain the sampling distribution for the sample mean,
For the total number of samples do the following:
- Generate a random sample of size \(n\) from the population
- Compute the sample mean and store the result in a variable, say
sample_means
- Plot each of the
sample_means
using a histogram
<- 10000
n_samples <- 100 # each sample will be of size 100
sample_size <- numeric(n_samples)
sample_means
for(i in 1:n_samples){
= rnorm(n = sample_size, mean = 0, sd=1) # generate a new sample
sample_i = mean(sample_i) # obtain mean for each sample
sample_means[i] }
Show Code
<- ggplot(data.frame(sample_means),aes(sample_means))+
sampling_dist geom_histogram(fill = 'steelblue',alpha = 0.3,
color = 'black',bins=30)+
labs(title = 'Sampling distribution',
x = 'Sample means',y = '')
The plot above is the distribution of sample means after taking 10,000 samples with size \(n=100\) from the population. The overall distribution appears to be normally distributed (bell shaped curve).
The Central Limit Theorem (CLT) states that if you have a population with mean \(\mu\) and standard deviation \(\sigma\) and take sufficiently large random samples from the population with replacement, then the distribution of the sample means (sampling distribution) will be approximately normally distributed with mean \(\mu\) and standard error \(\frac{\sigma}{\sqrt{n}}\). Note, the standard deviation of the sampling distribution of the sample mean (or any other statistic) is referred to as the standard error.
Recall, the population is distributed as a standard normal distribution, i.e \(N(\mu=0,\sigma=1)\)
mean(sample_means)
#> [1] -0.00146148
which is very close to the population mean of \(\mu =0\)
with standard error
sd(sample_means)
#> [1] 0.1001859
which is approximately \(\frac{\sigma}{\sqrt{n}} = \frac{1}{10}\)
Below is a demonstration of how increasing the sample size affects the distribution of the sample mean and seeing CLT in action
Show Code
<- function(n_samples, mean = 0, sd = 1,
generate_sampling_distribution color = 'black', fill = 'steelblue',
alpha = 0.3, bins=30) {
<- numeric(n_samples)
sample_means
for(i in 1:n_samples){
= rnorm(n = n_samples, mean = mean, sd=sd) # generate a new sample
sample_i = mean(sample_i) # obtain mean for each sample
sample_means[i]
}
<- round(mean(sample_means),3)
sample_mean <- round(sd(sample_means),3)
standard_error
<- ggplot(data.frame(sample_means,n_samples),aes(sample_means))+
sampling_dist geom_histogram(fill = fill,alpha = alpha,
color = color,bins=bins)+
labs(title = 'Sampling distribution',
x = 'Sample means',y = '',
subtitle = paste0('mean=',sample_mean,
', standard error=',standard_error))
return(sampling_dist+facet_grid(~n_samples))
}
<- generate_sampling_distribution(100)+
plots generate_sampling_distribution(1000)+
generate_sampling_distribution(5000)+
generate_sampling_distribution(10000)+
plot_layout(nrow = 2, ncol = 2)
To interactively visualize how the sampling distribution of the sample mean builds up one sample at a time and see when the Central Limit Theorem start to kick in, I highly recommend the art of stats sampling distributions and Central Limit Theorem web applications.
Application: Sample Mean
Now we will apply the the concepts of generating a sampling distribution for the sample mean using a real world dataset. In particular, we will use the penguins
dataset from the palmerpenguins
package.
Using the penguins dataset we will generate a sampling distribution for the sample mean of the penguins’ body mass (grams)
For simplicity, we will remove any missing observations
<- penguins$body_mass_g[!is.na(penguins$body_mass_g)] body_mass_g
Recall, to obtain the sampling distribution for the sample mean, do the following process
For the total number of samples do the following:
- Generate a random sample of size \(n\) from the population,
penguins$body_mass_g
- Compute the sample mean of the penguins body mass and store the result in a variable, say
body_mass_xbar
- Plot each of the
body_mass_xbar
using a histogram
<- 10000
n_samples <- 50
sample_size
<- numeric(n_samples)
body_mass_xbar
for(i in 1:n_samples){
= sample(body_mass_g, size = sample_size) # generate a new sample
sample_i = mean(sample_i) # obtain mean for each sample
body_mass_xbar[i] }
Show Code
<- ggplot(data.frame(body_mass_xbar),
sampling_dist aes(body_mass_xbar))+
geom_histogram(fill = 'steelblue',alpha = 0.3,
color = 'black',bins=30)+
labs(title = 'Sampling distribution of Sample Mean',
subtitle = 'Penguins body mass (grams)',
x = 'Sample means',y = '')
The mean for the sampling distribution is
mean(body_mass_xbar)
#> [1] 4200.661
which is approximately the population mean for the penguins body mass (without missing observations)
mean(body_mass_g)
#> [1] 4201.754
the standard error for the sampling distribution is
sd(body_mass_xbar)
#> [1] 104.0124
Sample proportions
The idea behind generating a sampling distribution for sample proportions is very similar to the procedure described in Sample means section.
To obtain the sampling distribution for the sample proportions;
For the total number of samples do the following:
- Generate a random sample of size \(n\) from the population
- Compute the sample proportion and store the result in a variable, say
sample_proportions
- Plot each of the
sample_proportions
using a histogram
Here the sample proportion is the fraction of samples which were success. We will take a large number of random samples from the population, where the population is generated from a binomial distribution with 10 trials and probability of success \(p=0.25\) for each trial.
<- 10000
n_samples <- 100 # each sample will be of size 100
sample_size <- 10
n_trials <- numeric(n_samples)
sample_proportions
for(i in 1:n_samples){
= rbinom(n = sample_size, size= n_trials,
sample_i prob = 0.25) # generate a new sample
= mean(sample_i)/n_trials # obtain proportion for each sample
sample_proportions[i] }
Show Code
<- ggplot(data.frame(sample_proportions),
sampling_dist_prop aes(sample_proportions))+
geom_histogram(fill = 'steelblue',alpha = 0.3,
color = 'black',bins=30)+
labs(title = 'Sampling distribution',
x = 'Sample proportions',y = '')
The plot above is the distribution of sample proportions after taking 10,000 samples with size \(n=100\) from the population. The overall distribution appears to be normally distributed (bell shaped curve).
Through the CLT, the sampling distribution for the sample proportions will be distributed with mean \(p\) and standard error \(\sqrt{\frac{p(1-p)}{n}}\), \(\hat{p} \sim N(p,\sqrt{\frac{p(1-p)}{n}} )\)
mean(sample_proportions)
#> [1] 0.2501257
which is approximately the population parameter \(p=0.25\).
The standard error is
sd(sample_proportions)
#> [1] 0.01373183
Application: Sample proportion
Using the penguins dataset we will generate a sampling distribution for the sample proportion of the Adelie penguins
For simplicity, we will remove any missing observations
<- penguins$species[!is.na(penguins$species)] species
Recall, to obtain the sampling distribution for the sample proportion
For the total number of samples do the following:
- Generate a random sample of size \(n\) from the population,
penguins$species
(without missing observations) - Compute the sample proportion of the penguins Adelie species and store the result in a variable, say
adelie_proportions
- Plot each of the
adelie_proportions
using a histogram
<- 10000
n_samples <- 50
sample_size
<- numeric(n_samples)
adelie_proportions
for(i in 1:n_samples){
= sample(species, size = sample_size) # generate a new sample from the population
sample_i = mean(sample_i == 'Adelie') # obtain proportion for each sample
adelie_proportions[i] }
Show Code
<- ggplot(data.frame(adelie_proportions),
sampling_dist aes(adelie_proportions))+
geom_bar(fill = 'steelblue',alpha = 0.3,
color = 'black')+
labs(title = 'Sampling distribution of Sample Proportion',
subtitle = "Adelie penguins",
x = 'Sample proportions',y = '')
The mean for the sampling distribution is
mean(adelie_proportions)
#> [1] 0.441898
which is approximately the population proportion for the Adelie penguins (without missing observations)
prop.table(table(species))['Adelie']
#> Adelie
#> 0.4418605
the standard error for the sampling distribution is
sd(adelie_proportions)
#> [1] 0.06484981
which should approximately be \(\sqrt{\frac{p(1-p)}{n}}\)
<- prop.table(table(species))['Adelie']
p sqrt((p*(1-p))/sample_size)
#> Adelie
#> 0.07023102