Hacking NFL data with PostgreSQL, TimescaleDB, and SQL

mirandaauhl - Jul 30 '21 - - Dev Community

Table of contents

  1. The NFL dataset
  2. Accessing the data
  3. Let's start exploring!
  4. The power of SQL
  5. Faster insights with PostgreSQL and TimescaleDB
  6. Faster queries with TimescaleDB continuous aggregates
  7. Advanced SQL data analysis with TimescaleDB hyperfunctions
  8. Where can the data take you?

Learn how to use time-series data provided by the NFL to uncover valuable insights into many player performance metrics – and ways to apply the same methods to improve your fantasy league team, your knowledge of the game, or your viewing experience - all with PostgreSQL, standard SQL, and freely available extensions.

Time-series data is everywhere, including, much to our surprise, the world of professional sports. At Timescale, we're always looking for fun ways to showcase the expanding reach of time-series data. Stock, cryptocurrency, IoT, and infrastructure metrics data are relatively common and widely understood time-series data scenarios. Head to Twitter on any given day, search for #timeseries or #TimescaleDB, and you're sure to find questions about high-frequency trading or massive scale observability data with tools like Prometheus.

You can imagine our excitement, then, when we happened upon the NFL Big Data Bowl, an annual competition that encourages the data science community to use historical player position and play data to create machine learning models.

Did the NFL really give access to 18+ million rows of detailed play data from every regular season NFL game?

For background, the National Football League (NFL) is the US professional sports league for American football, and the NFL season is followed by tens of millions of people, culminating in the annual Super Bowl (which attracts 100M+ global viewers, whether for the game or for the commercials).

Each NFL game takes place as a series of “plays,” in which the two teams try to score and prevent the other team from scoring. There are approximately 200 plays per game, with up to 15 games a week during the regular season. A healthy amount of data, but nothing unmanageable.

So, at first glance, football game metrics might not immediately jump out as anything special.

But then the NFL did something pretty ambitious and amazing.

All NFL players are equipped with RFID chips that track players’ position, speed, and various other metrics, which teams use to identify trends, mitigate risks, and continuously optimize. The NFL started tracking and storing data for every player on the field, for every play, for every game.

As a result, we now have access to a very detailed analysis of exactly how a play unfolded, how quickly various players accelerated during each play, and the play’s outcome. A traditional view of play-by-play metrics is “down and distance” and the result of the play (yards gained, whether or not there was a score, and so on). With the NFL’s dataset, we're able to mine approximately 100 data points at 100-millisecond intervals throughout the play to see speed, distance, involved players, and much more.

This isn’t ordinary data. This is time-series data. Time-series data is a sequence of data points collected over time intervals, giving us the ability to track changes over time. In the case of the NFL’s dataset, we have time-series data that represents how a play changes, including the locations of the players on the field, the location of the ball, the relative acceleration of players in the field of play, and so much more.

Time-series data comes at you fast, sometimes generating millions of data points per second (read more about time-series data). Because of the sheer volume and rate of information, time-series data can already be complex to query and analyze, which is why we built TimescaleDB, a multi-node, petabyte-scale, completely free relational database for time-series.

We couldn't pass up the opportunity to look at the NFL dataset with TimescaleDB, exploring ways we could peer deeper into player performance in hopes of providing insights about overall player performance in the coming season.

Read on for more information about the NFL’s dataset and how you can start using it, plus some sample queries to jumpstart your analysis. They may help you get more enjoyment out of the game.

If you’d like to get started with NFL data, you can spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days. The instructions later in this post will take you through how to ingest the data and start using it for analysis.

If you’re new to time-series data or just have some questions you’d like to ask about the dataset, join our public Slack community, where you’ll find Timescale team members and thousands of time-series enthusiasts, and we’ll be happy to help you.


The NFL dataset

Over the last few years, the NFL and Kaggle have collaborated on the NFL Big Data Bowl. The goal is to use historical data to answer a predetermined genre of questions, typically producing a machine learning model that can help predict the outcome of certain plays during regular season games.

Although the 2020/2021 contest is over, the sample dataset they provided from a prior season is still available for download and analysis. The 2020/2021 competition focused on pass play defense efficiency; therefore, only the tracking data for offensive and defensive "playmakers" is available in the dataset. No offensive or defensive linemen data is included. (You can read more about last year’s winners.)

(Keep watching the NFL website for more information on the next Big Data Bowl.)


Accessing the data

For the purposes of this blog post and accompanying tutorial, we will use the sample data provided by the NFL. This data is from the 2018 NFL season and is available as CSV files, including game-specific data and week-by-week tracking data for each player involved in the "offensive" part of the pass play. Contest participants in the next season of the contest will have access to new weekly game data.

This data is also very relational in nature, which means that SQL is a great medium to start gleaning value – without the need for Jupyter notebooks, other data science specific languages (like Python or R), or additional toolsets.

If you want to follow along - or recreate! - the queries we go through below, follow our tutorial to set up the tables, ingest data, and start analyzing data in TimescaleDB. For those unfamiliar with TimescaleDB, it’s built on PostgreSQL, so you’ll find that all of our queries are standard SQL. If you know SQL, you’ll know how to do everything here. (Some of the more advanced query examples we provide require our new, advanced hyperfunctions, which come pre-installed with any Timescale Forge instance.)


Let's start exploring!

We've provided the steps needed to ingest the dataset into TimescaleDB in the accompanying tutorial, so we won’t go into that here.

The NFL dataset includes the following data:

  • Games: all relevant data about each game of the regular season, including date, teams, time, and location

  • Players: information on each player, including what team they play for and their originating college

  • Plays: a wealth of data about each pass play in the game. Helpful fields include the down, description of the play that happened, line of scrimmage, and total offensive yardage, among other details.

  • Week [1-17]: for each week of the season, the NFL provides a new CSV file with the tracking data of every player, for every play (pass plays for this data). Interesting fields include X/Y position data (relative to the football field) every few hundred milliseconds throughout each play, player acceleration, and the "type" of a route that was taken. (In our tutorial, this data is imported into the tracking table and totals almost 20 million rows of time-series data.)

In addition to the NFL dataset, we also provide some extra data from Wikipedia that includes game scores and stadium conditions for each game, which you can load as part of the tutorial. With other time-series databases, it can be difficult to combine your time-series data with any other data you may have on hand (see our TimescaleDB vs. InfluxDB comparison for reference).

Because TimescaleDB is PostgreSQL with time-series super powers, it supports JOINS, so any extra relational data you want to add for deeper analysis is just a SQL query away. In our case, we’re able to combine the NFL’s play-by-play data along with weather data for each stadium.

Once you have the data ready, the world of NFL playmakers is at your fingertips, so let’s get started!


The power of SQL

Year after year, we see SQL listed as one of the most popular languages among developers on the StackOverflow survey. Sometimes, however, we can be lured into thinking that the only way to gain insights from relational data is to query it with powerful data analytics tools and languages, create data frames, and use specialized regression algorithms before we can do anything productive.

SQL, it often feels, is only useful for getting and storing data in applications and that we need to leave the "heavy lifting" of analysis to more mature tools.

Not so! SQL can data munge with the best of them! Let's look at a first, quick example.

Average yards per position, per game

For this first example, we'll query the tracking table (the player movement data from all 17 weeks of games) and join to the game table to determine the number of yards per player position, per game.

The results give you a quick overview of how many yards different positions ran throughout each game. You could use this later to compare specific players to see how they compared, more or less yards, to that total.

WITH total_position_yards AS (
    SELECT sum(dis) position_yards, POSITION, gameid FROM tracking t 
    GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION;
Enter fullscreen mode Exit fullscreen mode

Number of plays by offensive player

As a season progresses and players get injured (or traded), it's helpful to know which of the available players have more playing experience, rather than those that have been sitting on the sideline for most of the season. Players with more playing time are often able to contribute to the outcome of the game.

This query finds all players that were on the offense for any play and counts how many total passing plays they have been a part of, ordered by total passing plays descending.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam
GROUP BY a.player_id, pl.display_name, a.team_name
ORDER BY play_count DESC;
Enter fullscreen mode Exit fullscreen mode
player_id display_name play_count team_name
2506109 Ben Roethlisberger 725 PIT
2558149 JuJu Smith-Schuster 691 PIT
2533031 Andrew Luck 683 IND
2508061 Antonio Brown 679 PIT
310 Matt Ryan 659 ATL
2506363 Aaron Rodgers 656 GB
2505996 Eli Manning 639 NYG
2543495 Davante Adams 630 GB
2540158 Zach Ertz 629 PHI
2532820 Kirk Cousins 621 MIN
79860 Matthew Stafford 619 DET
2504211 Tom Brady 613 NE

If you’re familiar with American football, you might know that players are substituted in and out of the game based on game conditions. Stronger, larger players may play in some situations, while faster, more agile players may play in others.

Quarterbacks, however, are the most “important” players on the field, and tend to play more than others. However, by omitting quarterbacks, we can get a deeper insight into players across all other positions.

WITH snap_events AS (
-- Create a table that filters the play events to show only snap plays
-- and display the players team information
 SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
   CASE
     WHEN t.team = 'away' THEN g.visitor_team
     WHEN t.team = 'home' THEN g.home_team
     ELSE NULL
     END AS team_name
 FROM tracking t
 LEFT JOIN game g ON t.gameid = g.game_id
 WHERE t.event IN ('snap_direct','ball_snap')
)
-- Count these events & filter results to only display data when the player was
-- on the offensive
SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name, pl."position"
FROM snap_events a
LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
LEFT JOIN player pl ON a.player_id = pl.player_id
WHERE a.team_name = p.possessionteam AND pl."position" != 'QB'
GROUP BY a.player_id, pl.display_name, a.team_name, pl."position"
ORDER BY play_count DESC;
Enter fullscreen mode Exit fullscreen mode

So, now we can see the non-quarterbacks who are on offense the most in a season:

player_id display_name play_count team_name position
2558149 JuJu Smith-Schuster 691 PIT WR
2508061 Antonio Brown 679 PIT WR
2543495 Davante Adams 630 GB WR
2540158 Zach Ertz 629 PHI TE
2541785 Adam Thielen 612 MIN WR
2543468 Mike Evans 610 TB WR
2555295 Sterling Shepard 610 NYG WR
2540169 Robert Woods 604 LA WR
2552600 Nelson Agholor 604 PHI WR
2543488 Jarvis Landry 592 CLE WR
2540165 DeAndre Hopkins 587 HOU WR
2543498 Brandin Cooks 581 LA WR

Sack percentage by quarterback on passing plays

We can start to go a little deeper by extracting specific data from the tracking table and layering queries on top of it to make correlations. One piece of information that might be helpful in your analysis is knowing which quarterbacks are sacked most often during passing plays. In football, a “sack” is a negative play for the offense, and quarterbacks who get sacked more often tend to be lower performers overall.

Once you know those players, you could expand your analysis to see if they are sacked more on specific types of plays (shotgun formation) or maybe if sacks occur more often in a specific quarter of the game (maybe the fourth quarter because the offensive line is more tired, or the team tends to be behind late in games and must pass more often).

Queries like this can quickly show you quarterbacks that are more likely to get sacked, particularly when they play a strong defensive team. To get started, we wanted to find the sack percentage of each quarterback based on the total number of pass plays they were involved in during the regular season. To do that we approached the tracking data by layering on Common Table Expressions so that each query could build upon previous results.

First, we select the distinct list of all plays, for each quarterback (qb_plays). The reason we do a SELECT DISTINCT… is because the tracking table holds multiple entries for each player, for each play. We just need one row for each play, for each quarterback.

With this result, we can then count the number of total plays per quarterback (total_qb_plays), the total number of games each quarterback played (qb_games) and then finally the number of pass plays the quarterback was a part of that resulted in a sack (sacks).

With that data in hand, we can finally query all of the values, do a percentage calculation, and order it by the total sack count.

WITH qb_plays AS (
    SELECT DISTINCT ON (POSITION, playid, gameid) POSITION, playid, player_id, gameid 
    FROM tracking t 
    WHERE POSITION = 'QB'
),
total_qb_plays AS (
    SELECT count(*) play_count, player_id FROM qb_plays
    GROUP BY player_id
),
qb_games AS (
    SELECT count(DISTINCT gameid) game_count, player_id FROM qb_plays 
    GROUP BY player_id
),
sacks AS (
    SELECT count(*) sack_count, player_id 
    FROM play p
    INNER JOIN qb_plays ON p.gameid = qb_plays.gameid AND p.playid = qb_plays.playid
    WHERE p.passresult = 'S'
    GROUP BY player_id
)
SELECT play_count, game_count, sack_count, (sack_count/play_count::float)*100 sack_percentage, display_name FROM total_qb_plays tqp
INNER JOIN qb_games qg ON tqp.player_id = qg.player_id
LEFT JOIN sacks s ON s.player_id = qg.player_id
INNER JOIN player ON tqp.player_id = player.player_id
ORDER BY sack_count DESC NULLS last;
Enter fullscreen mode Exit fullscreen mode

If you're an ardent football fan, the results from 2018 probably don't surprise you.

play_count game_count sack_count sack_percentage display_name
579 16 65 11.23 Deshaun Watson
602 16 55 9.14 Dak Prescott
611 16 53 8.67 Derek Carr
656 16 49 7.47 Aaron Rodgers
462 15 48 10.39 Russell Wilson
639 16 47 7.36 Eli Manning
448 14 45 10.04 Josh Rosen
659 16 43 6.53 Matt Ryan
386 14 43 11.14 Marcus Mariota
619 16 41 6.62 Matthew Stafford
621 15 38 6.12 Kirk Cousins
324 11 37 11.42 Ryan Tannehill
447 11 36 8.05 Carson Wentz

Of course, there are a few quarterbacks that always seem to have a way of avoiding a sack.

play_count game_count sack_count sack_percentage display_name
725 16 25 3.45 Ben Roethlisberger
682 16 22 3.23 Andrew Luck
613 16 21 3.43 Tom Brady

Now, let’s try some more “advanced” queries and analyses.


Faster insights with PostgreSQL and TimescaleDB

So far, the queries we've shown are interesting and help provide insights to various players throughout the season – but if you were looking closely, they're all regular SQL statements.

Examining a season of NFL tracking data isn't like typical time-series data, however. Most of the queries we want to perform need to examine all 20 million rows in some way.

This is where a tool that's been built for time-series analysis, even when the data isn't typical time-series data, can significantly improve your ability to examine the data and save money at the same time.


Faster queries with TimescaleDB continuous aggregates

We noticed that we often needed to build queries that started with the tracking table, filtering data by specific players, positions, and games. Part of the reason is that the play table doesn't list all of the players who were involved in a particular play. As a result, we need to cross-reference the tracking table to identify the players who were involved in any given play.

The first example query we demonstrated - “average yards per position, per game” - is a good example of this. The query begins by summing all yards, by position, for each game.

This means that every row in tracking has to be read and aggregated before we can do any other analysis. Scanning those 20 million rows is pretty boring, repetitive, and slow work – especially compared to the analysis we want to do!

On our small test instance, the "average yards" query takes about 8 seconds to run. We could increase the size of the instance (which will cost us more money), or we could be smarter about how we query the data (which will cost us more time).

Instead, we can use continuous aggregates to pre-aggregate the data we're querying over and over again, which reduces the amount of work TimescaleDB needs to do every time we run the query. (Continuous aggregates are like PostgreSQL materialized views. For more info, check out our continuous aggregates docs.)

CREATE MATERIALIZED VIEW player_yards_by_game_
WITH (timescaledb.continuous) AS
SELECT player_id, position, gameid,
 time_bucket(INTERVAL '1 day', "time") AS bucket,
 SUM(dis) AS yards
FROM tracking t
GROUP BY player_id, position, gameid, bucket;
Enter fullscreen mode Exit fullscreen mode

After running this query and creating a continuous aggregate, we can modify that first query just slightly, using this as our basis table.

WITH total_position_yards AS (
    SELECT sum(yards) position_yards, POSITION, gameid 
FROM player_yards_by_game t 
    GROUP BY POSITION, gameid)
SELECT avg(position_yards), position, game_date
FROM game g
INNER JOIN total_position_yards tpy ON g.game_id = tpy.gameid
WHERE POSITION IN ('QB','RB','WR','TE')
GROUP BY game_date, POSITION
ORDER BY game_date, position;
Enter fullscreen mode Exit fullscreen mode

We get the same result, but now the query runs in 100ms - 800x faster!


Advanced SQL data analysis with TimescaleDB hyperfunctions

Finally, the more we dug into the data, the more and more we found we needed (or wanted) functions specifically tuned for time-series data analysis to answer the types of questions we wanted to ask.

It is for this kind of analysis that we built TimescaleDB hyperfunctions, a series of SQL functions within TimescaleDB that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code.

Grouping data into percentiles

The NFL dataset is a great use case for percentiles. Being able to quickly find players that perform better or worse than some cohort is really powerful.

As an example, we'll use the same continuous aggregate we created earlier (total yards, per game, per player) to find the median total yards traveled by position for each game.

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position, gameid
)
--Find the mean and median for each position type
SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
FROM sum_yards
WHERE POSITION IS NOT null
GROUP BY position
ORDER BY mean_yards DESC;
Enter fullscreen mode Exit fullscreen mode
position mean_yards median_yards
FS 595.583433048431 626.388099960848
CB 572.3336749867212 592.2175990890378
WR 552.6508570179277 555.5030569048633
S 530.6436781609186 550.5961518474892
SS 522.5604103343453 551.1296628916651
MLB 462.70229007633407 490.77906906009343
ILB 402.7882871125599 403.3779668359464
OLB 393.40014271151847 390.6742117791442
QB 334.7025466893028 352.1192705472368
LB 328.9812527472519 257.72003396053884
TE 327.9515596330271 257.72003396053884

Finding extreme outliers

Finally, we can build upon this percentile query to find players at each position that run more than 95% of all other players at that position. For some positions, like wide receiver or free safety, this could help us find the “outlier” players that are able to travel the field consistently throughout a game – and make plays!

WITH sum_yards AS (
--Add position to the table to allow for grouping by it later
 SELECT a.player_id, display_name, SUM(yards) AS yards, p.position
 FROM player_yards_by_game a
 LEFT JOIN player p ON a.player_id = p.player_id
 GROUP BY a.player_id, display_name, p.position
),
position_percentile AS (
    SELECT POSITION, approx_percentile(0.95, percentile_agg(yards)) AS p95
    FROM sum_yards 
    GROUP BY position
)
SELECT a.POSITION, a.display_name, yards, p95
    FROM sum_yards a
    LEFT JOIN position_percentile pp ON a.POSITION = pp.position
    WHERE yards >= p95
AND a.POSITION IN ('WR','FS','QB','TE')
ORDER BY position;
Enter fullscreen mode Exit fullscreen mode
position display_name yards p95
FS Eric Weddle 13869.759999999997 12320.288323166456
FS Adrian Amos 12989.439999999966 12320.288323166456
FS Tyrann Mathieu 12565.219999999956 12320.288323166456
QB Aaron Rodgers 7422.35999999995 6667.51452813257
QB Patrick Mahomes 6985.989999999952 6667.51452813257
QB Matt Ryan 6759.959999999969 6667.51452813257
TE Zach Ertz 13124.58999999995 10667.986199523099
TE Jimmy Graham 12693.679999999982 10667.986199523099
TE Travis Kelce 12218.129999999957 10667.986199523099
TE David Njoku 11502.159999999965 10667.986199523099
TE George Kittle 11058.099999999975 10667.986199523099
TE Kyle Rudolph 10761.949999999968 10667.986199523099
TE Jared Cook 10678.22999999998 10667.986199523099
WR Antonio Brown 16877.559999999965 14271.23409723974
WR Brandin Cooks 15510.01999999995 14271.23409723974
WR JuJu Smith-Schuster 15492.76999999996 14271.23409723974
WR Robert Woods 15253.179999999958 14271.23409723974
WR Nelson Agholor 15180.32999999997 14271.23409723974
WR Tyreek Hill 15106.609999999973 14271.23409723974
WR Zay Jones 14790.589999999967 14271.23409723974
WR Sterling Shepard 14673.79999999996 14271.23409723974
WR Mike Evans 14620.129999999983 14271.23409723974
WR Davante Adams 14574.509999999951 14271.23409723974
WR Kenny Golladay 14354.499999999973 14271.23409723974
WR Jarvis Landry 14281.509999999971 14271.23409723974

Where can the data take you?

As you’ve seen in this example, time-series data is everywhere. Being able to harness it gives you a huge advantage, whether you’re working on a professional solution or a personal project.

We’ve shown you a few ways that time-series queries can unlock interesting insights, give you a greater appreciation for the game and its players, and (hopefully) inspired you to dig into the data yourself.

To get started with the NFL data:

  • Spin up a fully managed TimescaleDB service: create an account to try it for free for 30 days.
  • Follow our complete tutorial for step-by-step instructions for preparing and ingesting the dataset, along with several more queries to help you glean insights from the dataset.

If you’re new to time-series data or just have some questions about how to use TimescaleDB to analyze the NFL’s dataset, join our public Slack community. You’ll find Timescale engineers and thousands of time-series enthusiasts from around the world – and we’ll be happy to help you.

🙏 We’d like to thank the NFL for making this data available, and the millions of passionate fans around the world who make the NFL such an exciting game to watch.

And, Geaux Saints 🏈!

The original blog post was a collaboration between

Attila Toth, Miranda Auhl, Ryan Booz

. . . . . . . . . . . . . . . . . . . . . . . . . .