Introduction
Welcome back to our data wrangling journey through academic performance datasets. In this blog post, our objective is to identify and remove outliers so that the data gives a more accurate picture. Using Python with pandas, numpy, matplotlib, and scipy, we will visualize the numeric columns with boxplots, flag outliers with z-scores, and drop the affected rows. The result is a cleaner, more dependable dataset for later analysis.
1. Loading the Academic Performance Dataset
To kickstart our journey, let's load the academic performance dataset using the following code:
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
# Load the dataset
df = pd.read_csv('Academic_Performance.csv')
```
2. Tackling Outliers
Outliers can distort an analysis by skewing statistical measures such as the mean and standard deviation. Before dealing with them, let's check the dataset for missing values using pandas' `isnull().sum()`, which counts the missing entries in each column:
```
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
```
To handle missing values, we'll employ mean imputation, as shown in the previous blog post. Now, let's move on to tackling outliers.
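Since the imputation code lives in the previous post, here is a minimal sketch of what mean imputation could look like. The helper name `impute_with_mean` and the tiny sample frame are illustrative assumptions, not code from the original analysis. Filling the gaps first matters because `zscore` propagates NaN by default, so a column with missing values would produce NaN scores.

```python
import numpy as np
import pandas as pd


def impute_with_mean(frame, columns):
    """Return a copy of frame with NaNs in the given columns replaced by each column's mean."""
    filled = frame.copy()
    # DataFrame.mean() skips NaN by default; fillna aligns the means by column name
    filled[columns] = filled[columns].fillna(filled[columns].mean())
    return filled


# Hypothetical sample data standing in for the real dataset
sample = pd.DataFrame({"Sem1": [6.0, np.nan, 8.0], "Sem2": [7.0, 7.5, np.nan]})
imputed = impute_with_mean(sample, ["Sem1", "Sem2"])
print(imputed)
```

On the real data, the same helper could be called on `df` with the list of numeric column names.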
3. Identifying and Removing Outliers
We'll focus on the numeric columns: 'Sem1', 'Sem2', 'Sem3', 'Sem4', 'Sem5', 'Sem6', 'Sem7', 'Sem8', and 'Average'. Before identifying outliers, let's visualize the boxplot of these variables:
```
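# Numeric columns to check for outliers (the ones listed above)
numeric_columns = ['Sem1', 'Sem2', 'Sem3', 'Sem4', 'Sem5', 'Sem6', 'Sem7', 'Sem8', 'Average']
plt.figure(figsize=(8, 6))
df[numeric_columns].boxplot()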
plt.title("Boxplot Before Handling Outliers")
plt.xlabel("Variable")
plt.ylabel("Value")
plt.show()
```
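The output is a boxplot for each column, with potential outliers drawn as individual points beyond the whiskers.
Using the z-score method, we can identify and remove outliers. A value's z-score is its distance from the column mean, measured in standard deviations, and here we treat any row with an absolute z-score above 3 in at least one column as an outlier. The following code accomplishes this: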
```
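# Z-score of every value: its distance from the column mean in standard deviations
z_scores = zscore(df[numeric_columns])
# Flag a row if any of its columns has |z| > 3, then keep only the unflagged rows
outliers = (np.abs(z_scores) > 3).any(axis=1)
df = df[~outliers]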
```
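To see what this filter does in isolation, here is a small self-contained sketch that applies the same rule with plain pandas to made-up data. The function name `remove_zscore_outliers` and the sample values are assumptions for illustration only. It uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd


def remove_zscore_outliers(frame, columns, threshold=3.0):
    """Drop rows where any listed column lies more than `threshold` standard deviations from its mean."""
    values = frame[columns]
    # Population standard deviation (ddof=0), the same default scipy.stats.zscore uses
    z = (values - values.mean()) / values.std(ddof=0)
    is_outlier = (z.abs() > threshold).any(axis=1)
    return frame[~is_outlier]


# Made-up grades: 20 rows, with one implausibly low Sem1 value in the last row
sem1 = [7.0, 7.2, 7.4, 7.6, 7.8] * 4
sem1[-1] = 0.5
sem2 = [6.5, 6.7, 6.9, 7.1, 7.3] * 4
sample = pd.DataFrame({"Sem1": sem1, "Sem2": sem2})

cleaned = remove_zscore_outliers(sample, ["Sem1", "Sem2"])
print(len(sample), "rows before,", len(cleaned), "rows after")
```

One caveat with this rule: with the population standard deviation, the largest possible absolute z-score in a column of n values is (n - 1) / sqrt(n), so a threshold of 3 can never flag anything in a dataset with fewer than 11 rows.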
After removing the outliers, let's visualize the boxplot again to observe the impact:
```
plt.figure(figsize=(8, 6))
df[numeric_columns].boxplot()
plt.title("Boxplot After Removing Outliers")
plt.xlabel("Variable")
plt.ylabel("Value")
plt.show()
```
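The new boxplot should show fewer extreme points. Some values can still appear beyond the whiskers, because by default boxplots mark points using a different rule (1.5 times the interquartile range) than the z-score threshold.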
4. Saving the Modified Dataset
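The last step is to write the cleaned DataFrame back to disk with `to_csv`. Here is a minimal sketch, assuming an output filename of `Academic_Performance_Cleaned.csv` (a hypothetical name, not one from the original analysis) and using a tiny stand-in frame so the example runs on its own:

```python
import pandas as pd

# Stand-in for the cleaned DataFrame; in this post's workflow it would be df after outlier removal
cleaned_df = pd.DataFrame({"Sem1": [7.5, 8.25], "Average": [7.75, 8.0]})

# Hypothetical output filename; index=False keeps the row index out of the file
output_path = "Academic_Performance_Cleaned.csv"
cleaned_df.to_csv(output_path, index=False)

# Read the file back to confirm it was written as expected
reloaded = pd.read_csv(output_path)
print(reloaded.shape)
```

Writing to a new file rather than overwriting `Academic_Performance.csv` keeps the raw data available in case the outlier threshold needs to be revisited.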
Summary
In this blog post, we set out to conquer the outliers lurking within the academic performance dataset. We checked for missing values, visualized the numeric columns with boxplots, and used z-scores to identify and remove rows with extreme values. The refined dataset sets the stage for more trustworthy analyses.
Stay tuned for more data wrangling adventures, where we unravel the mysteries hidden within datasets and extract valuable insights!
Happy data exploration, and may your outliers never stand in the way of truth!
Click here to access the dataset: Academic_Performance.csv