Introduction
In this blog post, we look at descriptive statistics, focusing on measures of central tendency (mean, median) and variability (standard deviation, minimum, maximum). We will work through two scenarios: calculating summary statistics for a numeric variable grouped by a categorical variable, and examining statistical details for specific categories in a dataset. By the end, you should be able to compute and read these fundamental measures with pandas.
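Before turning to the datasets, here is a minimal sketch of the three core measures on a handful of made-up ages (the values are illustrative and do not come from either dataset). It uses Python's standard `statistics` module, whose `stdev` is the sample standard deviation, the same convention pandas uses by default.

```python
import statistics

# Made-up ages, for illustration only
ages = [25, 30, 30, 35, 50]

mean_age = statistics.mean(ages)      # central tendency: arithmetic average
median_age = statistics.median(ages)  # central tendency: middle value when sorted
std_age = statistics.stdev(ages)      # variability: sample standard deviation (n - 1)

print('mean:', mean_age)
print('median:', median_age)
print('std:', round(std_age, 2))
```

Note how the single large value (50) pulls the mean above the median, which is why it helps to report both.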
1. Summary Statistics Grouped by Categorical Variables
To calculate summary statistics (mean, median, minimum, maximum, and standard deviation) for a numeric variable grouped by a categorical variable, let's consider an example dataset called 'Loan-payments-data.csv'. We will calculate summary statistics for the 'age' variable grouped by the 'education' variable. Note that pandas' describe() reports the median under the '50%' column. Here's the code to perform the analysis:
```python
import pandas as pd
# Read the dataset from the CSV file
df = pd.read_csv('Loan-payments-data.csv')
# Calculate the summary statistics for 'age' grouped by 'education'
age_summary = df.groupby('education')['age'].describe()
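# Extract the numeric values from the summary statistics.
# The list is ordered by statistic: all means, then medians, minimums, maximums, and standard deviations.
age_values = (
    age_summary['mean'].tolist()
    + age_summary['50%'].tolist()  # the 50th percentile is the median
    + age_summary['min'].tolist()
    + age_summary['max'].tolist()
    + age_summary['std'].tolist()
)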
# Print the summary statistics
print('Summary statistics for age grouped by education:')
print(age_summary)
print('')
# Print the list of numeric values
print('Numeric values for age grouped by education:')
print(age_values)
```
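If you only need the five statistics named above, one alternative is to request them directly with `agg` instead of picking columns out of `describe()`. The sketch below runs on a small made-up DataFrame rather than the real CSV; only the column names 'education' and 'age' are taken from the example, and the category labels and ages are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the loan dataset (labels and ages are hypothetical)
df = pd.DataFrame({
    'education': ['college', 'college', 'college', 'High School', 'High School'],
    'age': [25, 35, 30, 40, 50],
})

# Ask for exactly the statistics we want, one column per statistic
summary = df.groupby('education')['age'].agg(['mean', 'median', 'min', 'max', 'std'])

print(summary)
```

The result has one row per education level and one column per statistic, so a single value can be read with `summary.loc['college', 'median']`.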
2. Statistical Details for Specific Categories
Next, let's examine statistical details for specific categories in a dataset. For this example, we will use the 'Iris (1).csv' dataset. We will calculate the percentiles, mean, and standard deviation of each numeric column for the species 'Iris-setosa', 'Iris-versicolor', and 'Iris-virginica'. Here's the code to perform the analysis:
```python
import pandas as pd
# Read the dataset from the CSV file
df = pd.read_csv('Iris (1).csv')
# Filter the dataset for the specified species
setosa_data = df[df['Species'] == 'Iris-setosa']
versicolor_data = df[df['Species'] == 'Iris-versicolor']
virginica_data = df[df['Species'] == 'Iris-virginica']
# Calculate the statistical details for each species
setosa_stats = setosa_data.describe()
versicolor_stats = versicolor_data.describe()
virginica_stats = virginica_data.describe()
# Display the statistical details
print("Statistical details for Iris-setosa:")
print(setosa_stats)
print()
print("Statistical details for Iris-versicolor:")
print(versicolor_stats)
print()
print("Statistical details for Iris-virginica:")
print(virginica_stats)
```
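Filtering each species by hand works, but the same details can also be produced in one pass with `groupby`. The following is a minimal sketch on made-up measurements: the 'Species' column matches the code above, while the 'SepalLengthCm' column name and all values are assumptions for illustration, not rows from the real file.

```python
import pandas as pd

# Made-up measurements; 'SepalLengthCm' is an assumed column name
df = pd.DataFrame({
    'Species': ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 3,
    'SepalLengthCm': [5.0, 5.2, 4.8, 6.0, 6.4, 5.6],
})

grouped = df.groupby('Species')['SepalLengthCm']

# One row per species: count, mean, std, min, 25%, 50%, 75%, max
stats = grouped.describe()

# A percentile that describe() does not report by default
p90 = grouped.quantile(0.9)

print(stats)
print(p90)
```

This keeps all species in a single table, which makes side-by-side comparison easier than three separate printouts.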
Summary
Descriptive statistics, such as measures of central tendency (mean, median) and variability (standard deviation), are powerful tools for summarizing and analyzing datasets. By calculating summary statistics for a numeric variable grouped by a categorical variable and examining statistical details for specific categories, you can gain valuable insights into your data. These statistical measures serve as a foundation for further analysis and decision-making in various fields, including finance, economics, and social sciences.