(This post originally appeared on the Timescale blog on March 19, 2020)
Public datasets can help us gain insight into our business and our world. Combining public datasets, either together or with our own data, often requires a series of steps to clean up (or “normalize”) the data. This blog post walks you through some data normalization techniques.
I enjoy using public datasets, combining them with metrics and business data, and seeing if there are any correlations. It’s kind of a personal hobby. I recently wrote a blog post about analyzing COVID-19 data in that vein, and I’m starting a virtual meetup (with the help of my colleagues at Timescale) so I can meet people with similar interests and continue to learn new things. If you’re a data enthusiast, you’re welcome (and encouraged) to join us at any time - the more the merrier.
My projects’ goals are usually to take data from one source and combine it with data from another source in order to learn something interesting. One of the benefits of using a PostgreSQL time-series database like TimescaleDB is the ability to combine analytics and metrics (this is inherently time-series data, as we care about the specific time we collected each value) with other relational data.
For example, you may want to combine historic weather data with business performance to identify correlations that could predict future business results. You’d need a high-performance time-series database that also supports the traditional relational data for your business (for example, products, customers, and orders). Voilà: Timescale! Combining your external metadata with your time-series data is just a SQL statement.
One problem you are likely to encounter is normalizing your data. This is especially true in the realm of publicly available, open datasets. In my latest project, I wanted to analyze two public datasets: the San Francisco police incident database and the National Oceanic and Atmospheric Administration (NOAA) historic weather database (for the San Francisco area). There were two challenges:
- The format for a “date” in each dataset was different, making it difficult to join the information for analysis. For example, if I wanted to know how many car prowls took place on occasions with more than 1cm of rain, I’d need to join both databases on their date columns: `incident_date` and `weather_date`, respectively. In the police database, dates are formatted as `yyyy-mm-dd`, while in the weather database, it’s `mm/dd/yyyy`.
- The weather data had numerous gaps, which made time-series graphs difficult to chart.
Let’s walk through how to solve both (all too common) issues.
## Cleaning up your data format
The technique you use to clean data depends on how frequently you’ll want to update your dataset and run your analysis.
If you’re looking to download your data once before importing and using it, then you can use anything from Excel (for smaller datasets) to Unix tools like `awk`.
First, download your data in CSV format (step 0). In Excel, open your CSV file, highlight a column, and in the Ribbon (top-nav menu), open the ‘Number Formats’ drop-down (it usually starts by saying “General”). Select ‘Custom’ and, in the ‘Type:’ field, enter your preferred date format.
In this case, we want to reformat the weather data to match the police dataset. Thus, we want to reformat the date field to `yyyy-mm-dd`, which we can enter directly. Press ‘OK’ and your column is now formatted however you’d like.
Option 1: Use Excel "Number Formats" settings to apply changes to your dataset files.
While Excel works for moderately sized datasets, if you have a large dataset, datasets that you want to update frequently, or data that you retrieve programmatically at periodic intervals, you need something that you can run on the command line and/or automate.
For this, we’ll use `awk`, a programming language designed for manipulating text-based data. `awk` is ideal for formatting dates in CSV files. This is the `awk` command for reformatting a date from `mm/dd/yyyy` into `yyyy-mm-dd` format:
```
awk -F, '{split($7,a,"/"); $7=a[3]"-"a[1]"-"a[2]}1' OFS=, Weather_SFO.csv
```
Let’s dissect this command:
- `-F` is the field separator. In this case we have a comma-separated file, so the parameter for this argument is a comma (i.e., a comma separates each of our fields).
- `split`: In our original comma-separated file, the 7th item in each row is the date we are trying to reformat. Therefore, we’ll split the 7th (`$7`) field into strings, based on the forward slash character, and store each part in an array called `a`.
- We will then rearrange the date format to be the third (`a[3]`), first (`a[1]`), and second (`a[2]`) elements in the array, separated by dashes.
- We’ll continue processing the file and, in the output (`OFS`), use commas to separate the fields.
- And the file we will work on is `Weather_SFO.csv`.
Once you run the `awk` command, you’ll see the output in your Terminal, which you can then redirect into another file.
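As a sketch, here is the whole step end to end, with the output redirected into a new file. The sample rows and the filenames (`Weather_SFO.csv`, `Weather_SFO_clean.csv`) are hypothetical stand-ins for the real NOAA export, and I’ve added an `NR==1` guard so the header row passes through untouched:

```shell
# Hypothetical two-row sample in the same shape as the NOAA export,
# with the date in the 7th comma-separated field.
printf '%s\n' \
  'STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,PRCP,DATE' \
  'USC00047767,SAN FRANCISCO OCEANSIDE,37.728,-122.5052,2.4,7.9,01/05/2018' \
  > Weather_SFO.csv

# Reformat the 7th field from mm/dd/yyyy to yyyy-mm-dd, leaving the
# header row (NR==1) untouched, and redirect the output to a new file.
awk -F, 'NR==1 {print; next} {split($7,a,"/"); $7=a[3]"-"a[1]"-"a[2]} 1' \
  OFS=, Weather_SFO.csv > Weather_SFO_clean.csv

cat Weather_SFO_clean.csv
```

This keeps the one-time cleanup scriptable, so you can re-run it whenever you re-download the dataset.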
## Filling in data gaps (NULL values)
When we use public data, we often see “gaps,” or times where the dataset doesn’t show any reported values for a given area, time, or other variable. This can be due to faulty sensors, systems maintenance or outage, human error in creating the tables, or a slew of other reasons.
For example, in the San Francisco weather data we’ve been using as an example in this blog post, we see a NULL in the second row (mid temperature on January 6, 2018):
```
station     | name                           | latitude | longitude | elevation | weather_date | precipitation | temp_max | temp_min | temp_mid
------------+--------------------------------+----------+-----------+-----------+--------------+---------------+----------+----------+----------
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-05   | 7.9           | 18.3     | 6.1      | 11.1
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-06   | 0             | 13.9     | 7.8      | [null]
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-07   | 0             | 11.7     | 8.3      | 10.6
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-08   | 93.5          | 11.1     | 8.9      | 11.1
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-09   | 9.9           | 11.7     | 10       | 10
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-10   | 0.3           | 10       | 8.3      | 10
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-11   | 0.3           | 13.3     | 8.3      | 11.7
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-12   | 0             | 16.7     | 3.9      | 8.9
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-13   | 0             | 17.2     | 4.4      | 8.3
USC00047767 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 2.4       | 2018-01-14   | 0             | 18.3     | 4.4      | 12.8
(10 rows)
```
If you’re trying to visualize data, this presents a problem: we need values for each row and can’t have instances where there isn’t any data to visualize. We see this with monitoring metrics as well; to view systems performance, we need to plot values at each desired interval, or our trend lines and graphs look faulty.
Fortunately, this is an area where TimescaleDB excels (no pun intended), with capabilities like gap filling and `time_bucket`, a more powerful version of SQL’s `date_trunc`. By gap filling the information in our weather dataset, we fill in every NULL value with the last (typically most recent) value for that sensor, by date. We call this “last observation carried forward,” or `locf`.
To do so, we use the Timescale `time_bucket_gapfill` function:
```sql
SELECT
  time_bucket_gapfill('1 day', weather_date) AS date,
  name, latitude, longitude,
  locf(avg(temp_mid), treat_null_as_missing:=true)
FROM weather
WHERE weather_date >= '2018-01-05' AND
      weather_date <= '2018-01-30' AND
      name = 'SAN FRANCISCO OCEANSIDE, CA US'
GROUP BY date, name, latitude, longitude
ORDER BY date
LIMIT 10;
```
You can read more about `time_bucket_gapfill` in our documentation and blog post.
Now, when we run the query, we can see that our NULL value in the second row is replaced with the temperature value from the previous day:
```
date       | name                           | latitude | longitude | locf
-----------+--------------------------------+----------+-----------+---------------------
2018-01-05 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 11.1000000000000000
2018-01-06 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 11.1000000000000000
2018-01-07 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 10.6000000000000000
2018-01-08 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 11.1000000000000000
2018-01-09 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 10.0000000000000000
2018-01-10 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 10.0000000000000000
2018-01-11 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 11.7000000000000000
2018-01-12 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 8.9000000000000000
2018-01-13 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 8.3000000000000000
2018-01-14 | SAN FRANCISCO OCEANSIDE, CA US | 37.728   | -122.5052 | 12.8000000000000000
(10 rows)
```
You can also use averages or your own computation to perform your gap filling operation. Consult the TimescaleDB documentation for more ideas based on your use case.
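For instance, instead of carrying the last value forward, you could estimate missing days by drawing a straight line between the surrounding readings. A sketch of the same query using TimescaleDB’s `interpolate` function (table and column names follow the examples above; note that `interpolate` fills the buckets that `time_bucket_gapfill` generates, and unlike `locf` it has no `treat_null_as_missing` option):

```sql
SELECT
  time_bucket_gapfill('1 day', weather_date) AS date,
  name, latitude, longitude,
  -- Linear interpolation between the surrounding non-missing buckets,
  -- instead of repeating the last observed value.
  interpolate(avg(temp_mid))
FROM weather
WHERE weather_date >= '2018-01-05' AND
      weather_date <= '2018-01-30' AND
      name = 'SAN FRANCISCO OCEANSIDE, CA US'
GROUP BY date, name, latitude, longitude
ORDER BY date
LIMIT 10;
```

Interpolation tends to suit smoothly varying measurements like temperature, while `locf` suits values that hold steady until a new reading arrives.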
## Conclusion
Public datasets are amazing. Whether you combine them with other public datasets or with your own proprietary data, they give you a wealth of resources to query, analyze, and use to better understand your world (whatever that means to you).
If you love public datasets or are intrigued by ways you can better use them, be sure to join our new meetup: DataPub: A Virtual Meetup for Public Data Enthusiasts. We'll meet online every month, and it's a great chance to hear from people using, creating, or building tools for public datasets.
Hopefully this post has given you some ideas and ways to overcome a few obstacles, so you can power through your data preparation and get to the best part: analysis, visualization, and sharing your findings with others.