An outlier is an observation that lies at an abnormal distance from the other values in a random sample from a population. In simple terms, if a value is much smaller or larger than most of the other values in a dataset, it is treated as an outlier. Outliers can be detected easily using a box plot.
In this example we use the Diabetes dataset from Kaggle. The data was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.
• You can get the dataset by clicking the following link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Most of the features in this dataset contain outliers, but as an example we take the “BloodPressure” column and detect and treat its outliers using the IQR (Interquartile Range) method.
• The 25% quantile is 62.00, i.e., Q1, and the 75% quantile is 80.00, i.e., Q3; Q2 is the 50% quantile, which is the median. The Interquartile Range (IQR) is the difference between Q3 and Q1, that is, IQR = Q3 – Q1.
The figure shows summary statistics for the “BloodPressure” column of the dataset. Among the 768 values the minimum is 0.0, which is practically impossible for a living person, so there are definitely some outliers whose values are extremely low. The commonly quoted normal range of 90-120, with readings below 90 considered low and readings above 120 considered high, applies to systolic pressure, while the dataset documents this column as diastolic pressure in mm Hg. Either way, a reading of 0 cannot be real, so we need to remove only those minimum values. The maximum value is 122, which would be a sign of high blood pressure but is possible, so we do not need to remove it.
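The figure is not reproduced here, and the article does not say which tool produced it (in pandas, a summary like this is commonly obtained with `describe()`). As a minimal sketch using only the Python standard library, the same kind of check can be made on a small list of readings. The values below are hypothetical and are not taken from the real dataset.

```python
from statistics import mean

# Hypothetical readings for illustration only, not the real column.
blood_pressure = [72, 0, 62, 90, 80, 0, 68, 122, 76]

# A few of the summary statistics discussed above.
summary = {
    "count": len(blood_pressure),
    "min": min(blood_pressure),
    "max": max(blood_pressure),
    "mean": round(mean(blood_pressure), 2),
}

# A reading of 0 is not physiologically possible, so count how many there are.
zero_readings = sum(1 for value in blood_pressure if value == 0)

print(summary)
print("zero readings:", zero_readings)
```

A minimum of 0 together with a non-zero count of zero readings is the signal that the low end of the column needs attention.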
In this way we can find Q1 and Q3 of any feature in the data, as shown in the figure, and from them the IQR.
Here the IQR is 18.0, the difference between Q3 and Q1, i.e., 80.00 – 62.00 = 18.00.
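The quartile calculation can be sketched with the standard library function `statistics.quantiles`. The sample below is hypothetical and was chosen only so that its quartiles match the figures quoted in the article (Q1 = 62, Q3 = 80); it is not the real column.

```python
from statistics import quantiles

# Hypothetical readings, chosen so the quartiles match the article's numbers.
blood_pressure = [72, 0, 62, 90, 80, 50, 68, 122, 76]

# n=4 asks for the three cut points that split the data into quartiles.
# method="inclusive" treats the sample minimum and maximum as the 0% and 100% points.
q1, q2, q3 = quantiles(blood_pressure, n=4, method="inclusive")
iqr = q3 - q1

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
# Q1=62.0, median=72.0, Q3=80.0, IQR=18.0
```

With a pandas DataFrame, the equivalent step would typically use `Series.quantile(0.25)` and `Series.quantile(0.75)` on the column.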
Initially, the box plot of the “BloodPressure” column looks like this:
Before treating those outliers, we first have to find the upper and lower limits.
The formula to find the upper limit is [Q3 + (1.5 * IQR)].
The formula to find the lower limit is [Q1 – (1.5 * IQR)].
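Applied to the article's quartiles, the two formulas work out as follows. This is a small sketch; the function name `iqr_limits` and its parameter `k` are illustrative choices, not part of the original article.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return the (lower, upper) limits of the IQR rule; k=1.5 is the multiplier used above."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Q1 = 62.00 and Q3 = 80.00 are the values quoted for the BloodPressure column.
lower_limit, upper_limit = iqr_limits(62.0, 80.0)
print(lower_limit, upper_limit)  # 35.0 107.0
```

So for this column, any reading below 35 or above 107 falls outside the limits.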
Then we replace the outliers with these limits: values above the upper limit are set to the upper limit, and values below the lower limit are set to the lower limit.
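Replacing outliers with the limits is often called capping. A minimal sketch in plain Python follows; the helper name `cap_outliers` and the sample readings are hypothetical, while the limits are computed from the article's Q1 = 62 and Q3 = 80.

```python
def cap_outliers(values, lower, upper):
    """Clamp every value into [lower, upper]; values already inside are left unchanged."""
    return [min(max(value, lower), upper) for value in values]


# Limits from the IQR rule with Q1 = 62, Q3 = 80, IQR = 18.
lower_limit = 62.0 - 1.5 * 18.0  # 35.0
upper_limit = 80.0 + 1.5 * 18.0  # 107.0

# Hypothetical readings for illustration only.
readings = [0, 62, 72, 80, 122]
capped = cap_outliers(readings, lower_limit, upper_limit)
print(capped)  # [35.0, 62, 72, 80, 107.0]
```

In pandas, the same effect is usually achieved with `Series.clip(lower=lower_limit, upper=upper_limit)`. Note that capping keeps every row, whereas dropping the outlying rows would shrink the dataset.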
Now, after treating those outliers, the box plot looks like this:
In this way we can treat the outliers in any given dataset.
Silan Software is one of India's leading providers of offline and online training in Java, Python, AI (Machine Learning, Deep Learning), Data Science, Software Development, and many other emerging technologies.