Statistics is the backbone of data science, providing the essential tools and methodologies to extract meaningful insights from raw data. Data scientists rely on statistics at every stage of their work, from cleaning messy datasets to creating visualizations and building predictive models that forecast future outcomes. Without these statistical foundations, transforming raw data into actionable insights that drive business success would be impossible.
What are Descriptive Statistics?
Descriptive statistics play a vital role in summarizing and organizing data, making it more understandable. They allow us to understand the central tendencies, variability, and distribution of our datasets.
Types of Descriptive Statistics
Descriptive statistics can be classified into three primary categories, each serving different purposes:
- Measures of Central Tendency
- Measures of Variability
- Measures of Frequency Distribution
1. Measures of Central Tendency
These statistical values describe the central position within a dataset. The three main measures are:
- Mean: The average of the observations, calculated as follows:
x̄ = ∑x / n
Where:
- x = the individual observations
- n = the number of observations
Here’s how to find the mean using Python:
import numpy as np
# Sample Data
arr = [5, 6, 11]
# Mean
mean = np.mean(arr)
print("Mean = ", mean)
Output: Mean = 7.333333333333333
- Mode: The most frequently occurring value in the dataset, useful for categorical data.
import scipy.stats as stats
# Sample Data
arr = [1, 2, 2, 3]
# Mode
mode = stats.mode(arr)
# The .mode attribute holds the modal value (a scalar in SciPy >= 1.11)
print("Mode = ", mode.mode)
Output: Mode = 2
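Since the mode is the one measure here that also works for non-numeric data, here is a quick sketch on categorical values (the colour labels are invented for illustration) using the standard-library statistics module:
import statistics
# Categorical sample data (hypothetical labels)
colors = ["red", "blue", "red", "green"]
# statistics.mode returns the most frequently occurring value directly
print("Mode = ", statistics.mode(colors))
Output: Mode = red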
- Median: The middle value that divides the dataset into two halves. If the number of elements is odd, the median is the center element; if even, it’s the average of the two central elements.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4]
# Median
median = np.median(arr)
print("Median = ", median)
Output: Median = 2.5
These measures form the foundation for understanding data distribution and identifying anomalies.
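To make that concrete, here is a minimal sketch, with made-up numbers, of how a single extreme value drags the mean while barely moving the median; comparing the two is a quick way to flag potential anomalies:
import numpy as np
# The same data with and without one extreme value
clean = [5, 6, 11]
with_outlier = [5, 6, 11, 100]
print("Mean:", np.mean(clean), "->", np.mean(with_outlier))
print("Median:", np.median(clean), "->", np.median(with_outlier))
Output: Mean: 7.333333333333333 -> 30.5
Median: 6.0 -> 8.5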
2. Measures of Variability
Understanding how data spreads out is crucial. Measures of variability quantify this spread, which is important for identifying outliers and assessing model assumptions. Key measures include:
- Range: The difference between the largest and smallest data points.
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
maximum = max(arr)
minimum = min(arr)
data_range = maximum - minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(maximum, minimum, data_range))
Output: Maximum = 5, Minimum = 1 and Range = 4
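NumPy offers the same calculation in a single call, np.ptp ("peak to peak"), which returns the maximum minus the minimum:
import numpy as np
# Sample Data
arr = [1, 2, 3, 4, 5]
# Peak-to-peak: equivalent to max(arr) - min(arr)
print("Range = ", np.ptp(arr))
Output: Range = 4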
- Variance: The average squared deviation from the mean. Note that statistics.variance below computes the sample variance, dividing by n - 1 rather than n.
import statistics
# Sample Data
arr = [1, 2, 3, 4, 5]
print("Var = ", (statistics.variance(arr)))
Output: Var = 2.5
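Because of that n - 1 divisor, the result is 2.5 rather than 2. For the population variance, which divides by n and is exactly the "average squared deviation" for this data, the same module provides pvariance:
import statistics
# Sample Data
arr = [1, 2, 3, 4, 5]
# pvariance divides by n instead of n - 1
print("Population Var = ", statistics.pvariance(arr))
Output: Population Var = 2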
- Standard Deviation: The square root of the variance, expressing the spread in the same units as the data itself; statistics.stdev returns the sample standard deviation.
import statistics
arr = [1, 2, 3, 4, 5]
print("Std = ", (statistics.stdev(arr)))
Output: Std = 1.5811388300841898
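As a sanity check, this is the square root of the sample variance above (√2.5 ≈ 1.5811). Be aware that NumPy's np.std defaults to the population formula (ddof=0); pass ddof=1 if you want it to match statistics.stdev:
import numpy as np
arr = [1, 2, 3, 4, 5]
# ddof=0 (default) gives the population std; ddof=1 gives the sample std
print("Population Std = ", np.std(arr))
print("Sample Std = ", np.std(arr, ddof=1))
Output: Population Std = 1.4142135623730951
Sample Std = 1.5811388300841898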
3. Measures of Frequency Distribution
A frequency distribution table summarizes how data points are distributed across different categories or intervals. It helps identify patterns, outliers, and the overall structure of the dataset. Key components include the following (a worked sketch follows the list):
- Data intervals or categories
- Frequency counts
- Relative frequencies (percentages)
- Cumulative frequencies
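Here is a minimal sketch of such a table built with the standard library's collections.Counter; the exam scores and the 10-point intervals are invented purely for illustration:
from collections import Counter

# Hypothetical exam scores
scores = [55, 62, 67, 71, 74, 78, 81, 85, 88, 94]

# Bucket each score into a 10-point interval such as "70-79"
intervals = [f"{(s // 10) * 10}-{(s // 10) * 10 + 9}" for s in scores]
freq = Counter(intervals)

# Print frequency, relative frequency, and cumulative frequency per interval
total = len(scores)
cumulative = 0
print("Interval  Freq  Relative  Cumulative")
for interval in sorted(freq):
    count = freq[interval]
    cumulative += count
    print(f"{interval:>8}  {count:>4}  {count / total:>8.0%}  {cumulative:>10}")
Output: Interval  Freq  Relative  Cumulative
   50-59     1       10%           1
   60-69     2       20%           3
   70-79     3       30%           6
   80-89     3       30%           9
   90-99     1       10%          10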
Understanding these measures lays the groundwork for more advanced analytical methods and visualizations such as histograms or pie charts.
For more content, follow me at — https://linktr.ee/shlokkumar2303