Handling Outliers with the IQR Method

1. What are Outliers in a Dataset?

An outlier is an observation that lies an abnormal distance from the other values in a random sample from a population. In simple terms, if a value is much smaller or much larger than most of the other values in a dataset, it is treated as an outlier. Outliers can be spotted easily with a box plot.


2. Is It Necessary to Remove Outliers?
  • Outliers are an important aspect of data analysis in a data science project. They are the extreme values, much smaller or much larger than the rest, that we usually come across during the data pre-processing step. Some outliers represent natural variation in the population and should be left unchanged in the dataset. In many cases, however, outliers are problematic and should be removed, because they result from measurement errors, data processing errors, or poor sampling. So it is important to investigate the nature of an outlier before deciding to remove it.
  • Removing outliers can bring the distribution of some variables closer to normal and can make transformations of the other variables more effective.
  • Removing outliers reduces the variance of the training data, which can improve test accuracy along with training accuracy.

Detecting Outliers in the Diabetes Dataset

In this example we use the Diabetes dataset from Kaggle. The data was collected and made available by the “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

• You can get the dataset from the following link: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
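As a rough sketch of the loading step, the snippet below reads the CSV with pandas and prints summary statistics for “BloodPressure”. The helper name load_diabetes, the file name diabetes.csv, and the sample rows are assumptions made for illustration; the in-memory sample only stands in for the downloaded file so that the snippet runs on its own.

```python
import io

import pandas as pd


def load_diabetes(source):
    """Read the diabetes CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Hypothetical stand-in for the downloaded file: a few made-up rows showing
# only three columns. The real file has more columns and 768 rows.
sample_csv = io.StringIO(
    "Glucose,BloodPressure,Outcome\n"
    "120,70,0\n"
    "95,62,0\n"
    "160,80,1\n"
    "110,0,0\n"
)

# With the real data this would be, for example: load_diabetes("diabetes.csv")
df = load_diabetes(sample_csv)

print(df.shape)                         # (rows, columns)
print(df["BloodPressure"].describe())   # count, mean, std, min, quartiles, max
```

The describe() output is where values such as the minimum, the 25% and 75% quantiles, and the maximum can be read off for a column.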


img

Most of the features in this dataset contain outliers. As an example, we take the “BloodPressure” column and detect and remove its outliers using the IQR (Interquartile Range) method.


• The 25% quantile is 62.00, i.e., Q1, and the 75% quantile is 80.00, i.e., Q3. Q2 is the 50% quantile, which is the median. The Interquartile Range (IQR) is the difference between Q3 and Q1.

The figure shows some summary statistics for the “BloodPressure” column of the dataset. Among the 768 values, the minimum is 0.0, which is physiologically impossible for a living person, so there are definitely some outliers whose values are extremely low. The familiar 90-120 range (below 90 considered low, above 120 considered high) describes systolic pressure, whereas this column records diastolic pressure in mm Hg, where normal readings are lower. A reading of 0 is invalid either way, so only those minimum values need to be removed.

img

The maximum value is 122, which would indicate high blood pressure and is possible, so we don’t need to remove that part.
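One way to check how many readings are impossible zeros is sketched below with pandas. The series bp holds hypothetical values made up for illustration, not rows from the dataset; on the real data the same comparison would be applied to the “BloodPressure” column.

```python
import pandas as pd

# Hypothetical readings for illustration only (not taken from the dataset).
bp = pd.Series([0, 50, 62, 66, 72, 76, 80, 90, 122], name="BloodPressure")

# A blood pressure of 0 cannot be a real measurement, so flag those rows.
zero_mask = bp == 0

print(int(zero_mask.sum()), "zero reading(s) out of", len(bp))
print("smallest non-zero reading:", bp[~zero_mask].min())
```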

In this way we can find Q1 and Q3 of any feature in the given data, as shown in the figure, and compute the IQR.

Here the IQR is 18.0, which is the difference between Q3 and Q1, i.e., 80.00 - 62.00 = 18.00.
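The quartile calculation can be sketched with pandas as follows. The function name quartiles_and_iqr is a hypothetical helper, and the sample values are made up, chosen so that their quartiles match the 62 and 80 quoted above. Series.quantile uses linear interpolation by default.

```python
import pandas as pd


def quartiles_and_iqr(series):
    """Return (Q1, Q3, IQR) for a numeric pandas Series."""
    q1 = series.quantile(0.25)   # 25% quantile
    q3 = series.quantile(0.75)   # 75% quantile
    return q1, q3, q3 - q1


# Hypothetical readings for illustration only (not taken from the dataset).
bp = pd.Series([0, 50, 62, 66, 72, 76, 80, 90, 122], name="BloodPressure")

q1, q3, iqr = quartiles_and_iqr(bp)
print(q1, q3, iqr)   # 62.0 80.0 18.0
```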

img

How to Remove the Outliers

Initially, the box plot of the “BloodPressure” column looks like this:

img
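A box plot like the one above could be drawn with matplotlib along these lines. This is only a sketch on made-up values, and the choice of matplotlib is an assumption, since the article does not name a plotting library. By default, Axes.boxplot draws whiskers out to the farthest points within 1.5 * IQR of the box and marks everything beyond them as individual points.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical readings for illustration only (not taken from the dataset).
bp = [0, 50, 62, 66, 72, 76, 80, 90, 122]

fig, ax = plt.subplots()
box = ax.boxplot(bp)             # default whis=1.5, i.e. the 1.5 * IQR rule
ax.set_ylabel("BloodPressure")

# The points drawn beyond the whiskers are the values flagged as outliers.
fliers = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(fliers)   # [0.0, 122.0]

# To view the plot, call plt.show() or save the figure with fig.savefig(...).
```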

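Before removing the outliers, we first have to find the upper and lower limits.
The formula for the upper limit is [Q3 + (1.5 * IQR)].
The formula for the lower limit is [Q1 – (1.5 * IQR)].
For “BloodPressure” this gives an upper limit of 80 + (1.5 * 18) = 107 and a lower limit of 62 – (1.5 * 18) = 35.
Then we replace the outliers: values above the upper limit are set to the upper limit, and values below the lower limit are set to the lower limit.
After the outliers are handled this way, the box plot looks like this: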

img
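The limit formulas and the replacement step can be sketched with pandas as shown below. The helper names iqr_limits and cap_outliers and the parameter k are hypothetical, and the sample values are made up so that Q1 = 62 and Q3 = 80 as in the article. Series.clip replaces anything outside the given bounds with the nearest bound.

```python
import pandas as pd


def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(series, k=1.5):
    """Replace values outside the IQR limits with the nearest limit."""
    lower, upper = iqr_limits(series, k)
    return series.clip(lower=lower, upper=upper)


# Hypothetical readings for illustration only (not taken from the dataset).
bp = pd.Series(
    [0, 50, 62, 66, 72, 76, 80, 90, 122], name="BloodPressure", dtype="float64"
)

lower, upper = iqr_limits(bp)
capped = cap_outliers(bp)

print(lower, upper)      # 35.0 107.0
print(capped.tolist())   # [35.0, 50.0, 62.0, 66.0, 72.0, 76.0, 80.0, 90.0, 107.0]
```

If only the low values should be changed and high but plausible readings kept, one option is to pass just the lower bound, for example bp.clip(lower=lower).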

In this way we can handle the outliers in any numeric column of a dataset.


About the Author



Silan Software is one of India's leading providers of offline and online training for Java, Python, AI (Machine Learning, Deep Learning), Data Science, Software Development, and many more emerging technologies.

We provide academic training, industrial training, corporate training, and internships in Java, Python, AI using Python, Data Science, and more.




