Statistics is the backbone of data science, providing the essential tools and methodologies to extract meaningful insights from raw data. Data scientists rely on statistics at every stage of their work, from cleaning messy datasets to creating visualizations and building predictive models that forecast future outcomes. Without these statistical foundations, transforming raw data into actionable insights that drive business success would be impossible.
What are Descriptive Statistics?
Descriptive statistics play a vital role in summarizing and organizing data, making it more understandable. They allow us to understand the central tendencies, variability, and distribution of our datasets.
Types of Descriptive Statistics
Descriptive statistics can be classified into three primary categories, each serving different purposes:
- Measures of Central Tendency
- Measures of Variability
- Measures of Frequency Distribution
1. Measures of Central Tendency
These statistical values describe the central position within a dataset. The three main measures are:
- Mean: The average of the observations, calculated as follows:
x̄ = ∑x / n
Where:
- x = the individual observations
- n = the number of observations
Here’s how to find the mean using Python:
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Output: Mean = 7.333333333333333
- Mode: The most frequently occurring value in the dataset, useful for categorical data.
import scipy.stats as stats
# Sample Data
arr = [1, 2, 2, 3]
# Mode
mode = stats.mode(arr)
# The .mode attribute holds the modal value (a scalar in SciPy >= 1.11)
print("Mode = ", mode.mode)
Output: Mode = 2
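Since the mode is the one measure here that also works for non-numeric data, here is a quick sketch on categorical values (the colour labels are invented for illustration) using the standard-library statistics module:
import statistics
# Categorical sample data (hypothetical labels)
colors = ["red", "blue", "red", "green"]
# statistics.mode returns the most frequently occurring value directly
print("Mode = ", statistics.mode(colors))
Output: Mode = red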
- Median: The middle value that divides the dataset into two halves. If the number of elements is odd, the median is the center element; if even, it’s the average of the two central elements.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Output: Median = 2.5
These measures form the foundation for understanding data distribution and identifying anomalies.
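To make that concrete, here is a minimal sketch, with made-up numbers, of how a single extreme value drags the mean while barely moving the median; comparing the two is a quick way to flag potential anomalies:
import numpy as np
# The same data with and without one extreme value
clean = [5, 6, 11]
with_outlier = [5, 6, 11, 100]
print("Mean:", np.mean(clean), "->", np.mean(with_outlier))
print("Median:", np.median(clean), "->", np.median(with_outlier))
Output: Mean: 7.333333333333333 -> 30.5
Median: 6.0 -> 8.5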
2. Measures of Variability
Understanding how data spreads out is crucial. Measures of variability quantify this spread, which is important for identifying outliers and assessing model assumptions. Key measures include:
- Range: The difference between the largest and smallest data points.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
maximum = max(arr)
minimum = min(arr)
data_range = maximum - minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(maximum, minimum, data_range))
Output: Maximum = 5, Minimum = 1 and Range = 4
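NumPy offers the same calculation in a single call, np.ptp ("peak to peak"), which returns the maximum minus the minimum:
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Peak-to-peak: equivalent to max(arr) - min(arr)
print("Range = ", np.ptp(arr))
Output: Range = 4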
- Variance: The average squared deviation from the mean. Note that statistics.variance below computes the sample variance, dividing by n - 1 rather than n.
import statistics
# Sample Data
arr = [1, 2, 3, 4, 5]
print("Var = ", (statistics.variance(arr)))
Output: Var = 2.5
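Because of that n - 1 divisor, the result is 2.5 rather than 2. For the population variance, which divides by n and is exactly the "average squared deviation" for this data, the same module provides pvariance:
import statistics
# Sample Data
arr = [1, 2, 3, 4, 5]
# pvariance divides by n instead of n - 1
print("Population Var = ", statistics.pvariance(arr))
Output: Population Var = 2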
- Standard Deviation: The square root of the variance, expressing the spread in the same units as the data itself; statistics.stdev returns the sample standard deviation.
import statistics
arr = [1, 2, 3, 4, 5]
print("Std = ", (statistics.stdev(arr)))
Output: Std = 1.5811388300841898
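As a sanity check, this is the square root of the sample variance above (√2.5 ≈ 1.5811). Be aware that NumPy's np.std defaults to the population formula (ddof=0); pass ddof=1 if you want it to match statistics.stdev:
import numpy as np
arr = [1, 2, 3, 4, 5]
# ddof=0 (default) gives the population std; ddof=1 gives the sample std
print("Population Std = ", np.std(arr))
print("Sample Std = ", np.std(arr, ddof=1))
Output: Population Std = 1.4142135623730951
Sample Std = 1.5811388300841898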
3. Measures of Frequency Distribution
A frequency distribution table summarizes how data points are distributed across different categories or intervals. It helps identify patterns, outliers, and the overall structure of the dataset. Key components include the following (a worked sketch follows the list):
- Data intervals or categories
- Frequency counts
- Relative frequencies (percentages)
- Cumulative frequencies
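Here is a minimal sketch of such a table built with the standard library's collections.Counter; the exam scores and the 10-point intervals are invented purely for illustration:
from collections import Counter

# Hypothetical exam scores
scores = [55, 62, 67, 71, 74, 78, 81, 85, 88, 94]

# Bucket each score into a 10-point interval such as "70-79"
intervals = [f"{(s // 10) * 10}-{(s // 10) * 10 + 9}" for s in scores]
freq = Counter(intervals)

# Print frequency, relative frequency, and cumulative frequency per interval
total = len(scores)
cumulative = 0
print("Interval  Freq  Relative  Cumulative")
for interval in sorted(freq):
    count = freq[interval]
    cumulative += count
    print(f"{interval:>8}  {count:>4}  {count / total:>8.0%}  {cumulative:>10}")
Output: Interval  Freq  Relative  Cumulative
   50-59     1       10%           1
   60-69     2       20%           3
   70-79     3       30%           6
   80-89     3       30%           9
   90-99     1       10%          10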
Understanding these measures lays the groundwork for more advanced analytical methods and visualizations such as histograms or pie charts.
For more content, follow me at — https://linktr.ee/shlokkumar2303