Why do we divide by $n-1$ instead of $n$ when we calculate the sample variance?
The variance of a random variable $X$ is defined as
$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right].$$
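To estimate the variance from $n$ samples $X_1, \dots, X_n$ with sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, we often use the following sample variance formula.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$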
Why do we divide by $n-1$ instead of $n$?
Short answer
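It’s because the sample variance with the denominator of $n-1$ is an unbiased estimator of the population variance, whereas dividing by $n$ underestimates the population variance on average.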
This adjustment is called Bessel’s correction. Keep in mind that Bessel’s correction assumes the samples are iid. If the samples are not iid (e.g. time-series data, where observations are typically correlated), Bessel’s correction does not generally yield an unbiased estimator.
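The effect of the denominator can be checked numerically. The following is a minimal simulation sketch, not a proof: it draws many iid samples from a normal distribution and averages the two variance estimates. The sample size, true variance, number of trials, and seed are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

n = 5             # sample size (illustrative)
sigma2 = 4.0      # true population variance (illustrative)
trials = 200_000  # number of independent samples drawn

# Each row is one iid sample of size n from N(0, sigma2).
samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, n))

biased_mean = np.var(samples, axis=1, ddof=0).mean()    # divide by n
unbiased_mean = np.var(samples, axis=1, ddof=1).mean()  # divide by n - 1

print(f"true variance:         {sigma2}")
print(f"average, divide by n:  {biased_mean:.3f}")
print(f"average, divide by n-1: {unbiased_mean:.3f}")
print(f"(n - 1) / n * sigma^2: {(n - 1) / n * sigma2:.3f}")
```

With these settings, the average of the estimates that divide by $n$ should land near $\frac{n-1}{n}\sigma^2$, while the average of those that divide by $n-1$ should land near $\sigma^2$.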
Long answer
What is an unbiased estimator?
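When we estimate some unknown parameter $\theta$ with an estimator $\hat{\theta}$ computed from the samples, the estimator is called unbiased if its expected value equals the true parameter, i.e. $E[\hat{\theta}] = \theta$. The difference $E[\hat{\theta}] - \theta$ is called the bias of the estimator.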
Unbiasedness of sample variance
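Let $X_1, \dots, X_n$ be iid samples from a distribution with mean $\mu$ and variance $\sigma^2$, and let $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ be the sample mean.
Note that $E[\bar{X}] = \mu$ and $\mathrm{Var}(\bar{X}) = E[(\bar{X} - \mu)^2] = \frac{\sigma^2}{n}$, and that the sum of squared deviations from the sample mean can be rewritten as
$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2.$$
Taking expectations on both sides gives
$$E\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2.$$
Since the sample variance with the denominator of $n-1$ therefore satisfies $E[s^2] = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2$, it is an unbiased estimator of the population variance. Dividing by $n$ instead gives an expected value of $\frac{n-1}{n}\sigma^2$, which is smaller than $\sigma^2$.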
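A common step in proofs of this kind is the identity $\sum_i (X_i - \mu)^2 = \sum_i (X_i - \bar{X})^2 + n(\bar{X} - \mu)^2$, which holds for any data and any constant $\mu$. The sketch below checks it numerically on one random sample. The distribution parameters and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed

mu = 2.0                                    # constant used in the identity (illustrative)
x = rng.normal(loc=mu, scale=3.0, size=10)  # one sample of size n = 10
n = len(x)
x_bar = x.mean()

# Sum of squared deviations from mu ...
lhs = np.sum((x - mu) ** 2)
# ... equals the sum of squared deviations from the sample mean plus n * (x_bar - mu)^2.
rhs = np.sum((x - x_bar) ** 2) + n * (x_bar - mu) ** 2

print(np.isclose(lhs, rhs))
```

Because $n(\bar{X} - \mu)^2$ is never negative, the deviations measured from $\bar{X}$ are never larger in total than those measured from $\mu$, which is the intuition behind the downward bias of the $n$ denominator.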
What about the sample mean?
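Along the way to showing the unbiasedness of the sample variance, we rely on the fact that $E[\bar{X}] = \mu$. This means the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is already an unbiased estimator of the population mean, so it needs no correction.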
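The same kind of simulation can be used to check the sample mean. This is a sketch under assumed settings: the exponential distribution, sample size, number of trials, and seed are illustrative choices, not part of the argument above.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary fixed seed

n = 5
trials = 200_000
mu, sigma2 = 2.0, 4.0  # mean and variance of an exponential distribution with scale 2

# Each row is one iid sample of size n; take the mean of every row.
samples = rng.exponential(scale=2.0, size=(trials, n))
sample_means = samples.mean(axis=1)

mean_of_means = sample_means.mean()  # should be close to mu
var_of_means = sample_means.var()    # should be close to sigma2 / n

print(f"average of sample means:  {mean_of_means:.3f} (mu = {mu})")
print(f"variance of sample means: {var_of_means:.3f} (sigma^2 / n = {sigma2 / n})")
```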
Implementations
In most cases, we don’t explicitly calculate the sample variance. Instead, we call the implemented function from the numerical library we use. Thus, it is useful to know whether the function uses Bessel’s correction or not, since it varies by the implementation details.
Numpy
NumPy has the np.var function to calculate the variance of a given input. By default, np.var returns the uncorrected variance, which is divided by $n$. The denominator is controlled by the ddof argument: np.var returns the result divided by n-ddof, and ddof defaults to 0. To get the unbiased sample variance, you have to pass ddof=1.
import numpy as np

x = np.array([1, 2, 3])
np.var(x)          # Divide by n where n = len(x)
np.var(x, ddof=1)  # Divide by (n-1)
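To make the role of ddof concrete, here is a small sketch that compares np.var against the variance computed by hand with each denominator. The variable names are only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
n = len(x)

# Sum of squared deviations from the sample mean.
ss = np.sum((x - x.mean()) ** 2)

uncorrected = np.var(x)        # ddof defaults to 0, so this divides by n
corrected = np.var(x, ddof=1)  # divides by n - 1

print(np.isclose(uncorrected, ss / n))
print(np.isclose(corrected, ss / (n - 1)))
# The two results differ only by the factor n / (n - 1).
print(np.isclose(corrected, uncorrected * n / (n - 1)))
```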
Pandas
Pandas has a method in its DataFrame class, pd.DataFrame.var. Unlike np.var, pd.DataFrame.var by default returns the corrected sample variance with the denominator of $n-1$. It also has the ddof argument as in np.var, but the default is ddof=1.
import pandas as pd

x = pd.DataFrame([1, 2, 3])
x.var()        # Divide by (n-1)
x.var(ddof=0)  # Divide by n = (n-ddof)
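Because the two libraries use different defaults, the same data can give different results when no ddof is passed. The sketch below illustrates this on a small example. The column name "x" is a hypothetical label chosen for the illustration.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]
arr = np.array(values)
df = pd.DataFrame({"x": values})  # "x" is an illustrative column name

numpy_default = np.var(arr)              # ddof=0: divide by n
pandas_default = df.var()["x"]           # ddof=1: divide by n - 1
pandas_uncorrected = df.var(ddof=0)["x"]

print(numpy_default, pandas_default)
# The results agree once ddof is set explicitly to the same value.
print(np.isclose(pandas_uncorrected, numpy_default))
print(np.isclose(pandas_default, np.var(arr, ddof=1)))
```

Passing ddof explicitly in both libraries is one way to avoid depending on either default.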
R
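The standard function var in R returns the unbiased sample variance with the denominator of $n-1$. It has no argument for changing the denominator, so to divide by $n$ you have to rescale the result yourself.
"These functions use $n-1$ on the denominator purely for consistency with stats::var() (for the record, I disagree with the rationale for $n-1$)." -R Documentation-
x <- rnorm(10)
n <- length(x)
var(x)              # Divide by (n-1)
var(x) * (n-1) / n  # Divide by n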
Conclusion
Now we know why we divide by $n-1$ instead of $n$ when we calculate the sample variance.