Descriptive Statistics

Juma Shafara - Aug 24 - - Dev Community
## Uncomment and run this cell to install the packages
# !pip install --upgrade dataidea

Descriptive Statistics and Summary Metrics

In this notebook, we will learn to obtain important values that describe our data including:

  • Measures of central tendency
  • Measures of variability
  • Measures of distribution shape
  • Measures of association
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats  # ensures sp.stats is loaded on older SciPy versions
import matplotlib.pyplot as plt

This notebook has been modified to use the Nobel Prize Laureates dataset, which you can download from opendatasoft.

# load the dataset (modify the path to point to your copy of the dataset)
data = pd.read_csv('../assets/nobel_prize_year.csv')
data = data[data.Gender != 'org'] # removing organizations
data.sample(n=5)
     Year Gender    Category  birth_year  age
505  2014   male     Physics        1929   85
318  1952   male  Literature        1885   67
883  1933   male  Literature        1870   63
481  1995   male       Peace        1908   87
769  2005   male       Peace        1942   63


What is Descriptive Statistics

Descriptive statistics is a branch of statistics that deals with the presentation and summary of data in a meaningful and informative way. Its primary goal is to describe and summarize the main features of a dataset.

Commonly used measures in descriptive statistics include:

  1. Measures of central tendency: These describe the center or average of a dataset and include metrics like mean, median, and mode.

  2. Measures of variability: These indicate the spread or dispersion of the data and include metrics like range, variance, and standard deviation.

  3. Measures of distribution shape: These describe the distribution of data points and include metrics like skewness and kurtosis.

  4. Measures of association: These quantify the relationship between variables and include correlation coefficients.

Descriptive statistics provide simple summaries about the sample and the observations that have been made.

1. Measures of central tendency, i.e. mean, median, and mode:

The Center of the Data:

The center of the data is where most of the values are concentrated.

Mean: It is the average value of a dataset, calculated by summing all (numerical) values and dividing by the total count.

mean_value = np.mean(data.age)
print("Mean:", mean_value)
Mean: 60.21383647798742
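
To connect the definition to the result, here is a minimal sketch that computes a mean by hand and compares it with `np.mean`. The list `ages` is a small made-up sample used only for illustration; it is not taken from the Nobel dataset.

```python
import numpy as np

# Hypothetical ages, for illustration only (not from the Nobel dataset)
ages = [45, 52, 60, 61, 72]

# Mean = sum of all values divided by the number of values
manual_mean = sum(ages) / len(ages)
numpy_mean = np.mean(ages)

print("Manual mean:", manual_mean)
print("NumPy mean: ", numpy_mean)
```

Both lines should report the same number, since `np.mean` applies the same sum-and-divide rule.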

Median: It is the middle value of a dataset when arranged in ascending order. If there is an even number of observations, the median is the average of the two middle values.

median_value = np.median(data.age)
print("Median:", median_value)
Median: 60.0
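
The odd and even cases in the definition can be checked on two small lists. This is only a sketch, and the values in `odd_ages` and `even_ages` are made up for illustration.

```python
import numpy as np

# Hypothetical ages, already sorted, for illustration only
odd_ages = [45, 52, 60, 61, 72]        # 5 values: the middle value is 60
even_ages = [45, 52, 60, 61, 72, 80]   # 6 values: the middle pair is 60 and 61

odd_median = np.median(odd_ages)
even_median = np.median(even_ages)     # average of the two middle values

print("Median (odd count): ", odd_median)
print("Median (even count):", even_median)
```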

Mode: It is the value that appears most frequently in a dataset.

mode_value = sp.stats.mode(data.age)[0]
print("Mode:", mode_value)
Mode: 56
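
To see what "appears most frequently" means, the counting can be done by hand with the standard library. This is a minimal sketch on a made-up list of ages; if several values tie for the highest count, it reports only one of them.

```python
from collections import Counter

# Hypothetical ages, for illustration only
ages = [56, 60, 56, 61, 72, 60, 56]

# Count how many times each value appears
counts = Counter(ages)

# most_common(1) returns a list with the single (value, count) pair
# that has the highest count
mode_value, mode_count = counts.most_common(1)[0]

print("Mode:", mode_value, "appears", mode_count, "times")
```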

Homework:

Find other ways to compute the mode (e.g. using pandas and numpy).

2. Measures of variability

The Variation of the Data:

The variation of the data is how spread out the data are around the center.

a) Variance and Standard Deviation:

Variance: It measures the spread of the data points around the mean, computed as the average of the squared differences from the mean.

# how to implement the variance and standard deviation using numpy
variance_value = np.var(data.age)
print("Variance:", variance_value)
Variance: 159.28551085795658
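
One detail worth knowing: by default `np.var` divides by the number of observations `n` (the population variance, `ddof=0`). Passing `ddof=1` divides by `n - 1` instead (the sample variance), which is also the default used by pandas' `Series.var()`. The sketch below works through both on a small made-up array; the values are for illustration only.

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([45, 52, 60, 61, 72])

# Squared distance of every value from the mean
squared_deviations = (ages - ages.mean()) ** 2

# Population variance: divide by n (what np.var does by default)
population_variance = squared_deviations.sum() / len(ages)

# Sample variance: divide by n - 1 (np.var with ddof=1)
sample_variance = squared_deviations.sum() / (len(ages) - 1)

print("Population variance:", population_variance, np.var(ages))
print("Sample variance:    ", sample_variance, np.var(ages, ddof=1))
```

On a dataset as large as the Nobel ages the two versions differ only slightly, but on small samples the gap is noticeable.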

Standard Deviation: It is the square root of the variance, providing a measure of the typical distance between a data point and the mean.

std_deviation_value = np.std(data.age)
print("Standard Deviation:", std_deviation_value)
Standard Deviation: 12.620836377116873

Summary

In summary, variance provides a measure of dispersion in squared units, while standard deviation provides a measure of dispersion in the original units of the data.

Note!

Smaller variance and standard deviation values mean that the values are similar to each other and close to the mean; larger values mean that the data are more spread out.
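
As a quick illustration of this note, the sketch below compares two made-up arrays that share the same mean but differ in spread. The names `tight` and `wide` are hypothetical and the values are not from the Nobel dataset.

```python
import numpy as np

# Two hypothetical samples with the same mean (60) but different spread
tight = np.array([58, 59, 60, 61, 62])
wide = np.array([20, 40, 60, 80, 100])

for name, values in [("tight", tight), ("wide", wide)]:
    variance = np.var(values)
    std_dev = np.std(values)  # square root of the variance
    print(f"{name}: mean={values.mean()}, variance={variance}, std={std_dev:.2f}")
```

The means are identical, so only the variance and standard deviation reveal that `wide` is far more dispersed than `tight`.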

plt.hist(x=data.age, bins=20, edgecolor='black')
# add standard deviation lines
plt.axvline(mean_value, color='red', linestyle='--', label='Mean')
plt.axvline(mean_value+std_deviation_value, color='orange', linestyle='--', label='1st std Dev')
plt.axvline(mean_value-std_deviation_value, color='orange', linestyle='--')
plt.title('Age of Nobel Prize Winners')
plt.ylabel('Frequency')
plt.xlabel('Age')
# Adjust the position of the legend
plt.legend(loc='upper left')

plt.show()

Mean and Standard Deviation

b) Range and Interquartile Range (IQR):

Range: It is the difference between the maximum and minimum values in a dataset. It is the simplest measure of variation.

# One way to obtain range
min_age = min(data.age)
max_age = max(data.age)
age_range = max_age - min_age
print('Range:', age_range)
Range: 80
# Calculating the range using numpy
range_value = np.ptp(data.age)
print("Range:", range_value)
Range: 80

Interquartile Range (IQR): It is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the dataset.

Quartiles:

Calculating Quartiles

The quartiles (Q0, Q1, Q2, Q3, Q4) are the values that separate each quarter.

Between Q0 and Q1 are the 25% lowest values in the data. Between Q1 and Q2 are the next 25%. And so on.

  • Q0 is the smallest value in the data.
  • Q1 is the value separating the first quarter from the second quarter of the data.
  • Q2 is the middle value (median), separating the bottom from the top half.
  • Q3 is the value separating the third quarter from the fourth quarter.
  • Q4 is the largest value in the data.
# Calculate the quartiles
quartiles = np.quantile(a=data.age, q=[0, 0.25, 0.5, 0.75, 1])

print('Quartiles:', quartiles)
Quartiles: [17. 51. 60. 69. 97.]

Percentiles:

Percentiles are values that separate the data into 100 equal parts.

For example, the 95th percentile separates the lowest 95% of the values from the top 5%.

  • The 25th percentile (P25%) is the same as the first quartile (Q1).
  • The 50th percentile (P50%) is the same as the second quartile (Q2) and the median.
  • The 75th percentile (P75%) is the same as the third quartile (Q3).

Calculating Percentiles with Python

To get several percentile values at once, we can use the np.percentile() function and pass in the data and the list of percentiles, as shown below.

# Getting many percentiles
percentiles = np.percentile(data.age, [25, 50, 75])
print(f'Percentiles: {percentiles}')
Percentiles: [51. 60. 69.]
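
To see what a percentile means in practice, the sketch below uses the integers 1 to 100 as a stand-in dataset (an assumption made purely for illustration) and checks what share of the values lies at or below the 95th percentile.

```python
import numpy as np

# Stand-in data for illustration: the integers 1, 2, ..., 100
values = np.arange(1, 101)

# Value below which roughly 95% of the data falls
p95 = np.percentile(values, 95)

# Fraction of values that are at or below that cut-off
share_at_or_below = np.mean(values <= p95)

print("95th percentile:", p95)
print("Share of values at or below it:", share_at_or_below)

# The 50th percentile is the median
print("50th percentile:", np.percentile(values, 50), "Median:", np.median(values))
```

Note that `np.percentile` interpolates between neighbouring data points by default, so the result need not be one of the original values.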

To get a single percentile value, we can again use the np.percentile() function and pass in the data and the specific percentile we're interested in, e.g.:

# Getting one percentile at a time
first_quartile = np.percentile(a=data.age, q=25) # 25th percentile
middle_percentile = np.percentile(data.age, 50)
third_quartile = np.percentile(data.age, 75) # 75th percentile

print('Q1: ', first_quartile)
print('Q2: ', middle_percentile)  
print('Q3: ', third_quartile)
Q1:  51.0
Q2:  60.0
Q3:  69.0

Note!

Note also that we can use the `np.quantile()` function to calculate the percentiles, passing fractions between 0 and 1 instead of percentages. This makes sense because each value marks a fraction (percentage) of the data.

percentiles = np.quantile(a=data.age, q=[0.25, 0.50, 0.75])
print('Percentiles:', percentiles)
Percentiles: [51. 60. 69.]

Now we can obtain the interquartile range as the difference between the third and first quartiles, as defined above.

# obtain the interquartile
iqr_value = third_quartile - first_quartile
print('Interquartile range: ', iqr_value)
Interquartile range:  18.0

Note: Quartiles and percentiles are both types of quantiles.

Summary

While the range gives an overview of the entire spread of the data from lowest to highest, the interquartile range focuses specifically on the spread of the middle 50% of the data, making it more robust against outliers.
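
The robustness claim can be checked with a small experiment. The sketch below uses a made-up list of ages and a hypothetical helper `spread()`, then adds one unusually low value to see how each measure reacts.

```python
import numpy as np

def spread(values):
    """Return (range, IQR) for a 1-D collection of numbers."""
    q1, q3 = np.percentile(values, [25, 75])
    return np.ptp(values), q3 - q1

# Hypothetical ages, for illustration only
ages = [51, 55, 58, 60, 62, 66, 69]
ages_with_outlier = ages + [17]  # one unusually young winner

base_range, base_iqr = spread(ages)
outlier_range, outlier_iqr = spread(ages_with_outlier)

print("Without outlier: range =", base_range, " IQR =", base_iqr)
print("With outlier:    range =", outlier_range, " IQR =", outlier_iqr)
```

A single extreme value stretches the range considerably, while the IQR moves only a little because it depends on the middle half of the data.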

Frequency Tables

Frequency means the number of times a value appears in the data. A table can quickly show us how many times each value appears. If the data has many different values, it is easier to use intervals of values to present them in a table.

Here are the ages of the 934 Nobel Prize winners up until the year 2020. In the table, each row is an age interval of 10 years.

Age Interval    Frequency
10-19                   1
20-29                   2
30-39                  48
40-49                 158
50-59                 236
60-69                 262
70-79                 174
80-89                  50
90-99                   3

Note: The intervals for the values are also called bins.
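
A frequency table like the one above can be built with `np.histogram`. This is a minimal sketch on a small made-up array of ages (not the Nobel dataset), assuming 10-year bins that start at 10.

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([17, 25, 34, 38, 41, 47, 52, 55, 59, 60, 63, 68, 71, 77, 85, 97])

# Bin edges 10, 20, ..., 100 give the intervals 10-19, 20-29, ..., 90-99
edges = np.arange(10, 101, 10)

# counts[i] is the number of ages in the interval starting at edges[i]
counts, _ = np.histogram(ages, bins=edges)

print("Age Interval  Frequency")
for low, count in zip(edges[:-1], counts):
    print(f"{low}-{low + 9}         {count}")
```

Each bin includes its lower edge and excludes its upper edge (only the last bin also includes its upper edge), so with whole-number ages an age of 60 is counted in 60-69, not 50-59.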

Further Reading

Chapter 3 of An Introduction to Statistical Methods and Data Analysis, 7th Edition

What's on your mind? Put it in the comments!
