A Means to an End
in Data Science
As data scientists, we often need to estimate a value from a sample data set in order to answer a business question about the whole population. In most cases the value we wish to estimate is a mean, so the sample mean is a natural choice of estimator.
However, the arithmetic mean is not the best choice in every case.
But before we get into the differences between the means and which one suits which case, I would like to start from the top and ask a more fundamental question: what is an estimator?
Well, a good definition that I found states:
An estimator is a statistic that estimates some fact about the population. You can also think of an estimator as the rule that creates an estimate.
Taking this and applying it to what we started with:
Given the business question and provided a sample data set, we wish to construct a rule that will answer the business question about the whole population.
Now let's talk about the different types of means. The Pythagorean Means are the three “classic” means: the Arithmetic mean, the Geometric mean, and the Harmonic mean.
Each of the Pythagorean Means has different properties and is used in different cases.
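As a quick illustration of the three means, here is a minimal sketch using Python's standard `statistics` module (Python 3.8 or later). The sample values are arbitrary and chosen only for this illustration; they do not come from any of the data sets below.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Arbitrary positive values, chosen only for illustration.
values = [2.0, 4.0, 8.0]

arithmetic = mean(values)            # (2 + 4 + 8) / 3
geometric = geometric_mean(values)   # (2 * 4 * 8) ** (1 / 3)
harmonic = harmonic_mean(values)     # 3 / (1/2 + 1/4 + 1/8)

print(f"arithmetic={arithmetic:.4f} geometric={geometric:.4f} harmonic={harmonic:.4f}")
```

For positive values the three means are always ordered harmonic ≤ geometric ≤ arithmetic, which is why picking the wrong one biases an estimate in a predictable direction.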
Now that we know what an estimator is and what the Pythagorean Means are, let's consider several examples and see which mean best fits which case.
In each of the following examples we will do the following:
- Explain the premise
- Define the business question
- Describe the data
- Show the analytical solution
- Finish off with an implementation of the solution including charts to visualize the importance of the right mean.
First Example - Particles traveling fixed distance
Particle i travels a certain distance xᵢ at a certain velocity vᵢ over a duration of tᵢ.
We are given a data set comprised of all the velocities of a sample set of the particles.
Given the sample of particle velocities, we wish to estimate the mean velocity in order to estimate the total distance traveled by all the particles (it is true that the total traveled distance can be easily computed, but I wish to show the effect of using the wrong estimator).
We can assume that the following holds:
\begin{aligned}
\sum_i v_i \cdot t_i &= \sum_i x_i
\end{aligned}
In this case we are looking for a fixed value v̄ such that, if we replace each individual velocity with it, the following still holds:
\begin{aligned}
\sum_i \bar{v} \cdot t_i &= \sum_i x_i
\end{aligned}
Let’s start with the analytical solution where all the particles travel the same distance of x.
For the left hand side of the equation:
\begin{aligned}
\bar{v} \sum_i t_i &= \bar{v} \sum_i \frac{x}{v_i}\\[2em]
\end{aligned}
For the right hand side we get:
\begin{aligned}
\sum_i x_i &= n \cdot x\\[2em]
\end{aligned}
Equating both sides we get:
\begin{aligned}
\bar{v} \sum_i \frac{x}{v_i} &= n \cdot x\\[2em]
\bar{v} &= \frac{n}{\sum_i \frac{1}{v_i}}
\end{aligned}
To conclude this case: we need the harmonic mean to estimate the mean velocity v̄.
Now that we have solved the problem analytically, let's implement this.
We will start with the basic imports:
We will generate the data points by generating the time it took each particle to cover the distance x and deriving the velocities from it. Once the data set is ready we will construct three estimates: the Arithmetic, Geometric and Harmonic means of the velocities. We will sample 70% of the velocities and try to predict the total distance traveled by all the particles.
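The original code cells are not reproduced here, so the following is only a minimal sketch of the experiment described above, not the original implementation. The number of particles, the distance `x`, the uniform distribution of travel times, and the random seed are all assumptions made for this sketch. It also assumes the total travel time of all particles is known, so each mean velocity can be turned into a distance estimate. It prints relative errors instead of drawing charts.

```python
import random
from statistics import mean, geometric_mean, harmonic_mean

random.seed(0)  # fixed seed so the sketch is reproducible

n = 10_000   # number of particles (assumed)
x = 100.0    # common distance traveled by every particle (assumed)

# Assumed data-generating process: travel times uniform between 1 and 10.
times = [random.uniform(1.0, 10.0) for _ in range(n)]
velocities = [x / t_i for t_i in times]

# The data set: a random 70% sample of the velocities.
sample = random.sample(velocities, int(0.7 * n))

total_time = sum(times)   # assumed to be known
true_distance = n * x

estimators = {
    "arithmetic": mean,
    "geometric": geometric_mean,
    "harmonic": harmonic_mean,
}

relative_error = {}
for name, estimator in estimators.items():
    # Replace every individual velocity with the estimated mean velocity.
    estimated_distance = estimator(sample) * total_time
    relative_error[name] = abs(estimated_distance - true_distance) / true_distance
    print(f"{name:>10}: relative error = {relative_error[name]:.4f}")
```

Under these assumptions the harmonic mean should give the smallest error, with the remaining gap coming only from sampling 70% of the data.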
Let's plot the results of all the estimators and see which one gets the best accuracy:
We can clearly see that the Harmonic mean estimator reached the best accuracy in this case.
Second Example - Particles traveling fixed duration
This case is similar to the previous case with one exception: every particle travels for the same duration t before it dissolves.
As in the previous example, given a sample of velocities we wish to estimate the total distance all the particles traveled.
In this example the same assumption holds:
\begin{aligned}
\sum_i v_i \cdot t_i &= \sum_i x_i
\end{aligned}
As in the previous case we are looking for a fixed value that will satisfy the same equation:
\begin{aligned}
\sum_i \bar{v}\cdot t_i &= \sum_i x_i
\end{aligned}
In this case tᵢ is the same for all the particles (tᵢ = t), so for the left hand side we can say:
\begin{aligned}
\sum_i \bar{v}\cdot t_i &= n \cdot t\cdot \bar{v}
\end{aligned}
For the right hand side we can say:
\begin{aligned}
\sum_i x_i &= \sum_i t \cdot v_i = t \sum_i v_i
\end{aligned}
From there we get:
\begin{aligned}
n \cdot t\cdot \bar{v} &= t \sum_i v_i\\[2em]
\bar{v} &= \frac{\sum_i v_i}{n}
\end{aligned}
From the analytical solution we can conclude that the Arithmetic mean is the best fit for estimating the mean velocity v̄.
And now for the fun part - the implementation.
The difference from the previous case is that tᵢ is fixed and xᵢ is generated randomly.
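Again, the original code is not reproduced here, so this is a minimal sketch under stated assumptions: the number of particles, the common duration `t`, the uniform distribution of distances, and the seed are all invented for the illustration. The total distance is estimated as the mean velocity times n·t, matching the derivation above.

```python
import random
from statistics import mean, geometric_mean, harmonic_mean

random.seed(1)  # fixed seed so the sketch is reproducible

n = 10_000   # number of particles (assumed)
t = 5.0      # common travel duration of every particle (assumed)

# Assumed data-generating process: distances uniform between 10 and 100.
distances = [random.uniform(10.0, 100.0) for _ in range(n)]
velocities = [x_i / t for x_i in distances]

# The data set: a random 70% sample of the velocities.
sample = random.sample(velocities, int(0.7 * n))

true_distance = sum(distances)

estimators = {
    "arithmetic": mean,
    "geometric": geometric_mean,
    "harmonic": harmonic_mean,
}

relative_error = {}
for name, estimator in estimators.items():
    # Every particle travels for t, so total distance = n * t * mean velocity.
    estimated_distance = estimator(sample) * n * t
    relative_error[name] = abs(estimated_distance - true_distance) / true_distance
    print(f"{name:>10}: relative error = {relative_error[name]:.4f}")
```

With these assumed settings the ordering flips relative to the first example: the arithmetic mean should be closest, while the geometric and harmonic means underestimate the total distance.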
We can clearly see from the plots that the Arithmetic mean has the best accuracy:
Third Example - Population Growth
In this case we will take a look at a population of germs/rabbits/ewoks, where in each time step i the population grows by a factor of αᵢ. Our data set is comprised of samples of αᵢ from the complete period, and we wish to estimate the size of the population at time t.
We assume that we know or can derive the initial population size.
In this case the population over time can be described as
\begin{aligned}
P_t = P_0 \prod_{i=1}^t \alpha_i
\end{aligned}
and we are looking for a fixed value α that satisfies the following equation:
\begin{aligned}
P_t = P_0 \prod_{i=1}^t \alpha
\end{aligned}
Equating the two expressions we can derive
\begin{aligned}
P_0\prod_{i=1}^t \alpha &= P_0\prod_{i=1}^t \alpha_i\\[2em]
P_0\cdot \alpha^t &= P_0\prod_{i=1}^t \alpha_i\\[2em]
\alpha^t &= \prod_{i=1}^t \alpha_i\\[2em]
\alpha &= \sqrt[t]{\prod_{i=1}^t \alpha_i}
\end{aligned}
From the analytical solution we can see that the needed estimator here is the Geometric mean.
Let's continue to the implementation, where we generate αᵢ randomly, sample it to define the data set and estimate the total population at time t:
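The original code is not reproduced here; what follows is a minimal sketch under assumptions of my own: the number of time steps, the initial population `p0`, the uniform distribution of growth factors, and the seed. Because the population is a product of many factors, the sketch compares estimates on a log scale, which is a choice made here for numerical robustness rather than something taken from the original implementation.

```python
import math
import random
from statistics import mean, geometric_mean, harmonic_mean

random.seed(2)  # fixed seed so the sketch is reproducible

t = 2_000    # number of time steps (assumed)
p0 = 100.0   # initial population size (assumed)

# Assumed data-generating process: growth factors uniform between 0.55 and 1.55.
alphas = [random.uniform(0.55, 1.55) for _ in range(t)]

# The data set: a random 70% sample of the growth factors.
sample = random.sample(alphas, int(0.7 * t))

# log P_t = log P_0 + sum_i log(alpha_i)
log_true = math.log(p0) + math.fsum(math.log(a) for a in alphas)

estimators = {
    "arithmetic": mean,
    "geometric": geometric_mean,
    "harmonic": harmonic_mean,
}

log_error = {}
for name, estimator in estimators.items():
    alpha_hat = estimator(sample)
    # Replace every growth factor with the single estimate: P_t = P_0 * alpha_hat ** t
    log_estimate = math.log(p0) + t * math.log(alpha_hat)
    log_error[name] = abs(log_estimate - log_true)
    print(f"{name:>10}: |log(estimate / truth)| = {log_error[name]:.2f}")

# Sanity check: over the full series the geometric mean reproduces P_t.
log_full = math.log(p0) + t * math.log(geometric_mean(alphas))
print(f"full-series geometric mean log error = {abs(log_full - log_true):.2e}")
```

Over the full series the geometric mean reproduces the population up to floating-point rounding, as the derivation predicts. With only a 70% sample it is still subject to sampling noise, but under these assumptions it should land far closer than the arithmetic or harmonic mean.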
We can clearly see here that the Geometric estimator hits the nail right on the head:
Conclusion
We started by explaining what an estimator is, introduced the Pythagorean Means, and went through several examples, each of which needs a different mean to answer the business question.
The main point that I tried to communicate in this post is that when you are presented with a business question and a data set, you can achieve more accurate results by understanding the “physics” of the problem and by choosing the right estimator for the job.