Relative calculations 1: LAG and LEAD

Chidiebere Ogujeiofor - Sep 20 '20 - - Dev Community

In this post, we are going to continue our discussion by looking at calculations that involve getting the value from the previous/next row. We would see how using ORDER BY clause in an SQL window function can make these calculations pretty easy and possible.

Let's dive right in.

Our Big Question

One particularly important insight is the increase is the way the number of cases has been increasing. Suppose we had the following question:

What is the increase in cases recorded in the UK for each day in the month of June?

Assuming we had access to only the total number of confirmed cases one way we could answer this to generate a table where each row gives us the difference between the case recorded today and the case recorded previous day. Soemthing like this:

date total_confirmed_cases daily_case_recorded
2020-06-01 v1 0
2020-06-02 v2 v2 - v1
2020-06-03 v3 v3 - v2
2020-06-04 v4 v4 - v3
. . .
. . .
. . .
n vn vn - vn-1

Notice how the value in each row in the total_confirmed_cases column uses the value from the previous column to calculate its value.

Note that we assumed that we had access to only the total_confirmed_cases. Google COVID Dataset already gives us the daily_confirmed_cases which is the same as what we want to calculate.

In order to solve this, we would use the LAG aggregate function and ORDER BY in our window function.


The ORDER BY in a window function

Using the ORDER BY as an argument in the window function specifies how you want the window function to sort the rows in the ResultSet while performing calculations.

For example, in order to answer the question we have above, it is important that we sort the result by date so that when getting the value from the previous row, we would be getting the confirmed case from the previous day.

Using the ORDER BY clause makes it possible for us to specify that.

We can use the ORDER BY like so:


OVER(ORDER BY column_1, column_2, ..., column_n [ASC|DESC])

Enter fullscreen mode Exit fullscreen mode

LAG aggregate function

The LAG function is a special aggregate function the works only when used with window functions. It takes a column as an argument and returns the previous value of that column in the ResultSet.

We use the LAG function with the ORDER BY clause like so:

LAG(column_name) OVER(ORDER BY column_1, column_2, ..., column_n {ASC|DESC|)

Enter fullscreen mode Exit fullscreen mode

The SQL QUERY

Combining these together we can answer our Big question using the following query.

SELECT
  date,
  countries_and_territories as country,
  confirmed_cases,
  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
  date >= "2020-06-01"
  AND date <= "2020-06-30"
  AND geo_id = "UK";
Enter fullscreen mode Exit fullscreen mode

This filters the resultset and generates the increase in a number of cases using:

  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase
Enter fullscreen mode Exit fullscreen mode

The window function orders by date while the LAG(confirmed_cases) returns cases confirmed for the previous row.

This generates the following table.

date country confirmed_cases daily_increase
2020-06-01 United_Kingdom 274762 NULL
2020-06-02 United_Kingdom 276332 1570
2020-06-03 United_Kingdom 277985 1653
2020-06-04 United_Kingdom 279856 1871
2020-06-05 United_Kingdom 281661 1805
2020-06-06 United_Kingdom 283311 1650
2020-06-07 United_Kingdom 284868 1557
2020-06-08 United_Kingdom 286194 1326
2020-06-09 United_Kingdom 287399 1205
2020-06-10 United_Kingdom 289140 1741
2020-06-11 United_Kingdom 290143 1003
2020-06-12 United_Kingdom 291409 1266
2020-06-13 United_Kingdom 292950 1541
2020-06-14 United_Kingdom 294375 1425
2020-06-15 United_Kingdom 295889 1514
2020-06-16 United_Kingdom 296857 968
2020-06-17 United_Kingdom 298136 1279
2020-06-18 United_Kingdom 299251 1115
2020-06-19 United_Kingdom 300469 1218
2020-06-20 United_Kingdom 301815 1346
2020-06-21 United_Kingdom 303110 1295
2020-06-22 United_Kingdom 304331 1221
2020-06-23 United_Kingdom 305289 958
2020-06-24 United_Kingdom 306210 921
2020-06-25 United_Kingdom 306862 652
2020-06-26 United_Kingdom 307980 1118
2020-06-27 United_Kingdom 309360 1380
2020-06-28 United_Kingdom 310250 890
2020-06-29 United_Kingdom 311151 901
2020-06-30 United_Kingdom 311965 814

One thing that you may have noticed is that the first value in the table is NULL.

Remember what we said in previous articles that window functions use the ResultSet generated after filtering in the WHERE clause, well, this is one of the consequences.

At the first instance, the LAG function returns the value of the specified column in the previous row but the first row does not have a previous row. Therefore, we get NULL( which means non-existent).

But we know that the confirmed case for the first day of July exists, so how do we get that?

One quick way is to modify the filter in the WHERE clause so that it returns data for a day before July 1st like so:

WHERE
  date >= "2020-05-31"
  AND date <= "2020-06-30"
  AND geo_id = "UK";
Enter fullscreen mode Exit fullscreen mode

Then we omit the first row in our resultset by adding OFFSET 1. The new query would now become

SELECT
  date,
  countries_and_territories as country,
  confirmed_cases,
  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
where
  date >= "2020-05-31"
  AND date <= "2020-06-30"
  AND geo_id = "UK"
LIMIT
  31 OFFSET 1;

Enter fullscreen mode Exit fullscreen mode

This would return the correct data for the first row

date country confirmed_cases daily_increase
2020-06-01 United_Kingdom 274762 1936
2020-06-02 United_Kingdom 276332 1570
2020-06-03 United_Kingdom 277985 1653
2020-06-04 United_Kingdom 279856 1871
2020-06-05 United_Kingdom 281661 1805
2020-06-06 United_Kingdom 283311 1650
2020-06-07 United_Kingdom 284868 1557
2020-06-08 United_Kingdom 286194 1326
2020-06-09 United_Kingdom 287399 1205
2020-06-10 United_Kingdom 289140 1741
2020-06-11 United_Kingdom 290143 1003
2020-06-12 United_Kingdom 291409 1266
2020-06-13 United_Kingdom 292950 1541
2020-06-14 United_Kingdom 294375 1425
2020-06-15 United_Kingdom 295889 1514
2020-06-16 United_Kingdom 296857 968
2020-06-17 United_Kingdom 298136 1279
2020-06-18 United_Kingdom 299251 1115
2020-06-19 United_Kingdom 300469 1218
2020-06-20 United_Kingdom 301815 1346
2020-06-21 United_Kingdom 303110 1295
2020-06-22 United_Kingdom 304331 1221
2020-06-23 United_Kingdom 305289 958
2020-06-24 United_Kingdom 306210 921
2020-06-25 United_Kingdom 306862 652
2020-06-26 United_Kingdom 307980 1118
2020-06-27 United_Kingdom 309360 1380
2020-06-28 United_Kingdom 310250 890
2020-06-29 United_Kingdom 311151 901
2020-06-30 United_Kingdom 311965 814

Now we can generate our chart:

Alt Text

You can interact with the chart by following this link.

More on LAG, intro to LEAD

The LAG function takes an optional integer as value as its second argument this integer indicates how many rows back we want to get. This number is 1 by default, that's why we get the previous row( 1 row before the current row).

Similar to the LAG function is the LEAD function which takes the value of the next row. It has the same constraints and syntax of the LAG function.

So LEAD(col_name, 2) indicate we want to get a value 2 rows after the current row from col_name while LEAD(col_name) indicate we want the next row value.

Conclusion

In this article, we've learnt how to perform a calculation that involves getting a single value from a row before or after our current row.

That's great but what if we want to perform an aggregation based on more than one row before or after our current row? We would see how to do that using ROWS PRECEDING and ROWS FOLLOWING in the next article of our series.

See you soon.

Take the lead

. . . . . . . . . . . . . . .