The idea of skewness is built into the way we think. When we look at a visualization, we intuitively recognise the pattern in it.
In India, more than 50% of people are under 25, and more than 65% are under 35. A plot of the age distribution of India’s population shows a hump on the left side and a relatively flat stretch on the right. In other words, the data is skewed to the right: the long tail points toward the older ages.
Skewness is a metric for a distribution’s asymmetry. When the left and right sides of a distribution are not mirror reflections, it is asymmetrical. Right (or positive), left (or negative), or zero skewness can all apply to a distribution. It is a fundamental statistic that everyone in data science and analytics should understand. It’s something we just can’t get away from.
I will help you to learn about skewness, its different types, and its significance in data science. It is a concept that will serve you well throughout your data science career.
What is Skewness?
Skewness is a distortion or asymmetry in a set of data relative to a symmetrical bell curve, or normal distribution. In layman’s terms, it describes how lopsided a variable’s distribution is compared with that symmetrical shape.
More formally, skewness measures how far a random variable’s probability distribution departs from symmetry. You might be wondering why I’m bringing up the normal distribution here. The normal distribution is the standard example of a distribution with zero skewness, so it serves as the usual reference point (although any perfectly symmetric distribution also has zero skewness). Let me briefly introduce the normal distribution.
The Normal distribution, commonly referred to as the Gaussian distribution, is seen in many naturally occurring measurements, including height and weight at birth. It is bell-shaped and symmetric.
Its values can range from negative infinity (-∞) to infinity (∞), and the majority of them are clustered around the mean. The portion of the curve below the mean is the mirror image of the portion above it, which is why the mean equals the median.
The probability density function (PDF) expresses the relative likelihood that the value of the random variable falls close to a particular point in the sample space. For a normal distribution, the PDF is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where μ (mu) is the mean and σ (sigma) is the standard deviation.
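As a minimal sketch, the normal density can be written in a few lines of Python using the standard textbook formula. The function name `normal_pdf` and the values chosen for the mean and standard deviation are illustrative assumptions, not something taken from a particular dataset. The check at the end shows the symmetry described above: two points equally far from the mean have the same density.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma.

    Formula: exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Illustrative parameters (assumed values, not real measurements).
mu, sigma = 170.0, 10.0

# Points at the same distance below and above the mean.
left = normal_pdf(mu - 15.0, mu, sigma)
right = normal_pdf(mu + 15.0, mu, sigma)

print(math.isclose(left, right))  # True: the curve is symmetric about the mean
```

The density is highest at the mean itself and falls off equally on both sides, which is what gives the curve its bell shape.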
You can see a symmetrical distribution, which is essentially a normal distribution, in the image below. It is symmetrical on both sides of the dashed line. With respect to skewness, a distribution falls into one of three categories:
- Positive skewness
- Negative skewness
- No skewness (symmetrical distribution)
Positively skewed probability distributions have their tails on the right, whereas negatively skewed distributions have their tails on the left. We’ll look at each case in more depth later. Let’s first explore why skewness is such a crucial idea for you as a data science practitioner.
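Before moving on, the tail directions described above can be checked numerically. The sketch below assumes the common moment-based (Fisher-Pearson) coefficient, the third central moment divided by the second central moment raised to the power 1.5. The article has not committed to a specific formula, so treat this choice, the function name `skewness`, and the small made-up samples as illustrative assumptions.

```python
def skewness(data):
    """Moment-based (Fisher-Pearson) skewness: m3 / m2**1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

# Made-up samples, for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]     # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]     # mirror image: tail on the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True: positive skewness
print(skewness(left_tailed) < 0)   # True: negative skewness
print(skewness(symmetric) == 0)    # True: no skewness
```

Mirroring a sample flips the sign of its skewness, which matches the idea that positive and negative skewness differ only in which side the tail lies on.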
Why is Skewness Important?
We now understand that skewness is a measure of asymmetry and that the type of skewness depends on which side of the probability distribution the tail is located. But why is it crucial to understand the skewness of the data?
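First, linear models are commonly described as assuming that the independent and target variables have similar distributions. Strictly speaking, the normality assumption in linear regression applies to the residuals, but strongly skewed variables often produce skewed residuals and give a few extreme values too much influence. As a result, understanding data skewness helps us build more accurate linear models.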
Skewness also reveals the direction of outliers. As you can see, our distribution is positively skewed, and the majority of the outliers are located on its right side. Skewness does not tell us how many outliers there are; it only tells us on which side they lie.
Now that we know why skewness matters, it is time to look more closely at the distributions I showed earlier.
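To illustrate the difference between the direction and the number of outliers, here is a small sketch. It assumes the widely used 1.5 × IQR rule of thumb for flagging outliers, which is a choice made for this example and not something the skewness value itself provides. The sample is made up, and the helper names `skewness` and `outlier_counts` are hypothetical.

```python
import statistics

def skewness(data):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def outlier_counts(data):
    """Count points below and above the 1.5 * IQR fences (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles, default method
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    low = sum(1 for x in data if x < low_fence)
    high = sum(1 for x in data if x > high_fence)
    return low, high

# Made-up sample with two large values on the right.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20, 25]

print(skewness(sample) > 0)    # True: the sign points to the right-hand side
print(outlier_counts(sample))  # (0, 2): the count needs a separate rule
```

The sign of the skewness says where to look; how many points count as outliers depends on the rule you pick.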
As discussed earlier, the ideal normal distribution is the reference case of a distribution without skewness. In theory it is perfectly symmetric, so its skewness is exactly zero; data that follows it shows almost perfect symmetry.
However, why is real data almost symmetrical rather than entirely symmetrical?
This is because no real data has a perfectly normal distribution. As a result, the measured skewness is close to zero rather than exactly zero. Zero is nevertheless used as the benchmark when assessing a distribution’s skewness.
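A quick simulation makes this concrete. The sketch below draws a finite sample from a normal distribution with Python's standard `random` module and computes its moment-based skewness. The seed, the sample size, and the function name `skewness` are arbitrary choices made for this illustration.

```python
import random

def skewness(data):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # mean 0, standard deviation 1

g1 = skewness(sample)
print(f"sample skewness: {g1:.4f}")  # expected to be small, but not exactly zero
```

Even though the underlying distribution has zero skewness, a finite sample drawn from it will almost never give exactly zero, which is why values near zero are treated as "no skewness" in practice.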
So far, we have used a probability or frequency distribution to understand the skewness of the normal distribution. Let’s now examine it in terms of a boxplot, as that is one of the most common ways to examine a distribution in data science.