Geolocation with BigQuery: De-identify 76 million IP addresses in 20 seconds

Felipe Hoffa - Jul 8 '19 - Dev Community

We published our first approach to de-identifying IP addresses four years ago, in "GeoIP geolocation with Google BigQuery," and it's time for an update that includes the best and latest BigQuery features, like using the latest SQL standards, dealing with nested data, and handling joins much faster.

BigQuery is Google Cloud’s serverless data warehouse designed for scalability and fast performance. Using it lets you explore large datasets to find new and meaningful insights. To comply with current policies and regulations, you might need to de-identify the IP addresses of your users when analyzing datasets that contain personal data. For example, under GDPR, an IP address might be considered PII or personal data.

Replacing collected IP addresses with a coarse location is one method to help reduce risk—and BigQuery is ready to help. Let's see how.

How to de-identify IP address data

For this example of how you can easily de-identify IP addresses, let’s use:

  • 76 million IP addresses collected by Wikipedia from anonymous editors between 2001 and 2010
  • MaxMind's Geolite2 free geolocation database
  • BigQuery's improved byte and networking functions, NET.SAFE_IP_FROM_STRING() and NET.IP_NET_MASK() (see the short example after this list)
  • BigQuery's new superpowers that deal with nested data, generate arrays, and run incredibly fast joins
  • The new BigQuery Geo Viz tool that uses Google Maps APIs to chart geopoints around the world.
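
Before diving in, here's a minimal sketch of what those two networking functions return. The address 8.8.8.8 and the /24 prefix length are arbitrary values chosen only for illustration:

#standardSQL
# NET.SAFE_IP_FROM_STRING() parses a text IP into BYTES (or returns NULL if it can't be parsed);
# NET.IP_NET_MASK(4, 24) builds the 4-byte mask for an IPv4 /24 network.
# AND-ing the two together yields the network the address belongs to, which is the trick used below.
SELECT
  NET.SAFE_IP_FROM_STRING('8.8.8.8') ip_bytes
  , NET.IP_NET_MASK(4, 24) mask_bytes
  , NET.SAFE_IP_FROM_STRING('8.8.8.8') & NET.IP_NET_MASK(4, 24) network_bin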

Now let's get into the main query. Use the code below to replace IP addresses with a coarse location.

Top countries editing Wikipedia

Here’s the list of countries where users are making edits to Wikipedia, followed by the query to use:

[Image: list of top countries editing Wikipedia]

#standardSQL

# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT NULL
  GROUP BY 1
)

SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
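    # mask each IPv4 address with every candidate prefix length (/9 through /32)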
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
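  # the join keeps only the (network_bin, mask) pair that actually exists in GeoLite2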
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

Query complete (20.9 seconds elapsed, 1.14 GB processed)

Top cities editing Wikipedia

These are the top cities where users made edits to Wikipedia between 2001 and 2010, followed by the query to use:

[Image: top cities editing Wikipedia]

[Image: map of top cities]

#standardSQL

# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT NULL
  GROUP BY 1
)


SELECT city_name, SUM(c) c, ST_GeogPoint(AVG(longitude), AVG(latitude)) point
FROM (
  SELECT ip, city_name, c, latitude, longitude, geoname_id
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
WHERE city_name IS NOT NULL
GROUP BY city_name, geoname_id
ORDER BY c DESC
LIMIT 5000

Exploring some new BigQuery features

These new queries are compliant with the latest SQL standards, enabling a few new tricks that we'll review here.

New MaxMind tables: Goodbye math, hello IP masks

The downloadable GeoLite2 tables are not based on ranges anymore. Now they use proper IP networks, such as "156.33.241.0/22".

Using BigQuery, we parsed these into binary IP addresses with integer masks. We also did some pre-processing of the GeoLite2 tables, combining the networks and locations into a single table, and adding the parsed network columns, as shown here:

#standardSQL
SELECT *
  , NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin
  , CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4` 
JOIN `fh-bigquery.geocode.201806_geolite2_city_locations_en`
USING(geoname_id)
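
If you'd rather materialize your own copy of this combined lookup table instead of querying the shared fh-bigquery one, here's a minimal sketch. It assumes you have a dataset of your own (the destination name below is a placeholder) and that the two source tables only share the geoname_id column, as the pre-processing above implies:

#standardSQL
# `your-project.your_dataset` is a placeholder; point it at a dataset you control
CREATE OR REPLACE TABLE `your-project.your_dataset.geolite2_city_ipv4_locs` AS
SELECT *
  , NET.IP_FROM_STRING(REGEXP_EXTRACT(network, r'(.*)/' )) network_bin
  , CAST(REGEXP_EXTRACT(network, r'/(.*)' ) AS INT64) mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4`
JOIN `fh-bigquery.geocode.201806_geolite2_city_locations_en`
USING(geoname_id)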

Geolocating one IP address out of millions

To find one IP address within this table, like "103.230.141.7", you might expect something like this to work:

SELECT country_name, city_name, mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs` 
WHERE network_bin = NET.IP_FROM_STRING('103.230.141.7')

But that doesn't work. We need to apply the correct mask:

SELECT country_name, city_name, mask
FROM `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs` 
WHERE network_bin = NET.IP_FROM_STRING('103.230.141.7') & NET.IP_NET_MASK(4, 24)

And that gets an answer: this IP address seems to live in Antarctica.

Scaling up

That looked easy enough, but we need a few more steps to figure out the right mask when joining the GeoLite2 table (more than 3 million rows) with a massive source of IP addresses.

And that's what these lines in the main query do:

SELECT *
     , NET.SAFE_IP_FROM_STRING(ip) 
       & NET.IP_NET_MASK(4, mask) network_bin
  FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask

This effectively performs a CROSS JOIN with all the possible masks (numbers between 9 and 32) and uses each one to mask the source IP addresses. Then comes the really neat part: BigQuery handles the correct JOIN remarkably fast:

USING (network_bin, mask)

BigQuery here keeps only one of the masked IPs: the one where the masked IP matches a GeoLite2 network with that same mask. If we dig deeper in the execution details tab, we'll find that BigQuery did an "INNER HASH JOIN EACH WITH EACH ON", which requires a lot of shuffling resources but still avoids a full CROSS JOIN between two massive tables.

[Image: BigQuery execution details for the join]
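
To see this mechanism on a single address, here's a minimal sketch that reuses the IP from the earlier example; the masked subquery name is just for illustration. Only one of the 24 candidate rows should survive the join:

#standardSQL
# expand one IP into 24 candidate (network_bin, mask) rows, then let the join
# keep the row whose masked address matches a GeoLite2 network with that same mask
WITH masked AS (
  SELECT ip, mask
    , NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
  FROM UNNEST(['103.230.141.7']) ip, UNNEST(GENERATE_ARRAY(9,32)) mask
)
SELECT ip, mask, country_name, city_name
FROM masked
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)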

Go further with anonymizing data

This is how BigQuery can help you replace IP addresses with coarse locations and aggregate individual rows. It's just one technique to help reduce the risk of handling your data. GCP provides other tools, including Cloud Data Loss Prevention (DLP), that can help you scan and de-identify data. You now have several options for exploring and using datasets while complying with regulations. What interesting ways are you using de-identified data? Let us know.

Find the latest MaxMind GeoLite2 table in BigQuery, thanks to our Google Cloud Public Datasets.
