
Correction of variance formula



Excuse me,
I may have a dumb question, but I will ask it anyway.
To reduce the influence of outliers, can we use the median instead of the mean in the numerator, i.e. s^2 = sum((x_i - median)^2) / (n - 1), or is that incorrect? I think it would give a more robust value than the usual mean-based variance. Is it a bad idea?
Thank you!

1 Answer

365 Team

Hi Fernanda,
Not a dumb question at all! Actually, it's a pretty valid one.
The term ‘variance’ cannot really be altered, because it stems from the mathematical definition.
Here’s something we’ve explained in another question on the difference between variance and standard deviation:
Theoretically, variance makes much more sense. It takes part in describing distributions and has important relationships with other measures. An example you will be able to follow after watching this course: the covariance of a variable with itself equals its variance, Cov(x, x) = Var(x).
Standard deviation and correlation are derivative concepts from the bigger ones (variance and covariance). And while for all practical purposes we use standard deviation and correlation, for all theoretical purposes, variance and covariance are much better.
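If it helps to see the Cov(x, x) = Var(x) identity numerically, here is a minimal sketch in plain Python. The helper name `sample_covariance` and the data values are made up for illustration; both quantities use the n - 1 denominator.

```python
from statistics import mean, variance


def sample_covariance(xs, ys):
    """Sample covariance with the usual n - 1 denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up illustrative values

cov_xx = sample_covariance(data, data)  # covariance of x with itself
var_x = variance(data)                  # statistics.variance also divides by n - 1

# The two numbers should agree up to floating-point rounding.
print(cov_xx, var_x)
```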
A very theoretical example:
In mathematics (mainly in physics and probability), a moment is a quantitative measure describing the shape of a set of points, and it is very widely adopted in academia. A central moment is a moment around the mean.
The zeroth central moment is the total probability (one), and the first central moment is 0, because the first moment around the mean itself is 0 (no moment)*. The second central moment is the variance, the third (once standardized) gives skewness, and the fourth (once standardized) gives kurtosis.
*By the way, the first moment (not central) is the mean.
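The moments above can be sketched in a few lines of Python. This is only an illustration on made-up data: `central_moment` is a hypothetical helper, and it divides by n, so the second central moment matches the population variance (not the n - 1 sample variance). Skewness and kurtosis are computed here with the common standardized definitions, which is an assumption about the convention meant.

```python
from statistics import pvariance


def central_moment(xs, k):
    """k-th central moment of a sample: average of (x - mean)^k, dividing by n."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** k for x in xs) / n


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up illustrative values

m0 = central_moment(data, 0)  # total "probability mass": always 1
m1 = central_moment(data, 1)  # always 0 up to rounding
m2 = central_moment(data, 2)  # population variance
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

# Standardized versions (one common convention):
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2

print(m0, m1, m2, pvariance(data))
print(skewness, kurtosis)
```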
So, based on the theory academia has developed throughout the years, variance is the measure which occurs naturally. Therefore, it has stuck as the main measure of variability in theory (where you don’t care about the units of measurement).
Your question is a bit different, so here’s some elaboration.
Variance occurs naturally in relation to the mean. If you use the median instead, you get a different measure, which won’t be variance but may be as useful, or even more useful, for some problems. While I haven’t used it myself, I have thought about it many times. The problem is that most models use the ‘traditional variance’ rather than the one you suggested, so you can use it for a single problem, but not in general (as so many frameworks have been built on the traditional notion).
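To get a feel for how the two measures behave, here is a small sketch comparing the usual variance with the median-centered version from the question. The helper `spread_about` and the data are made up for illustration. One fact worth keeping in mind: the mean is the value that minimizes the sum of squared deviations, so centering on the median can never give a smaller number than centering on the mean, and the squaring still lets a large outlier dominate.

```python
from statistics import mean, median


def spread_about(xs, center):
    """Sum of squared deviations from `center`, divided by n - 1."""
    if len(xs) < 2:
        raise ValueError("need at least 2 points")
    return sum((x - center) ** 2 for x in xs) / (len(xs) - 1)


clean = [10.0, 11.0, 12.0, 13.0, 14.0]  # made-up values, symmetric sample
with_outlier = clean + [100.0]          # same sample plus one outlier

for label, xs in [("clean", clean), ("with outlier", with_outlier)]:
    mean_based = spread_about(xs, mean(xs))      # the traditional sample variance
    median_based = spread_about(xs, median(xs))  # the measure from the question
    print(label, mean_based, median_based)
```

On the symmetric sample the mean and median coincide, so the two measures are equal; with the outlier added, both grow sharply. This is only one toy sample, so treat it as an illustration rather than a general verdict on the median-based measure.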
Here’s a nice thread if you want to read further (however, examples are in R):
Hope this helps!
365 Team
