Reverse geocoding position using Bigquery and Jupyter

Use a notebook to get the country code of GPS position using BigQuery and public dataset

Avatar of Author
Jean-Christophe BaeyJune 01, 2022
Featured image
Photo by Waldemar Brandt on Unsplash

TL;DR;

GCP Bigquery is a powerful tool for analysis data and has geography functions coming with the GEOGRAPHY data type. As said in the article Query AWS Athena using Jupyter, the console interface is great for a quick query but when you need to run analysis for several hours, Jupyter can be a better option.

The first part of this article will give you the steps to run Bigquery queries inside a Jupyter notebook and the second part will show you how to use a public dataset to reverse geocoding position: we will get the country code from a GPS position.

Local Jupyter installation with GCP dependencies

Install Jupyter lab

$ brew update && brew upgrade python
$ brew reinstall pipenv
  • in a dedicated folder:
$ mkdir jupyter && cd "$_"
$ pipenv --python 3

then :

$ pipenv shell
$ pip install jupyterlab 
$ pip install --upgrade pandas
$ pip install google-cloud-bigquery
$ pip install --upgrade "google-cloud-bigquery[bqstorage,pandas]"
$ pip install  pydata_google_auth
$ pip install tqdm
$ pip install ipywidgets
  • Run Jupyter lab:
$ jupyter lab

Usage

Right in the Jupyter notebook, setup the GCP connection with few dedicated cells:

%reload_ext google.cloud.bigquery
import pydata_google_auth
credentials = pydata_google_auth.get_user_credentials(['https://www.googleapis.com/auth/bigquery'])
# Creds are stored in $HOME/.config/pydata/pydata_google_credentials.json
# Use the credentials in other libraries, such as the Google BigQuery client library.
from google.cloud import bigquery
from google.cloud.bigquery import magics 
import pandas as pd

client = bigquery.Client(project='<your gcp project id>', credentials=credentials)
magics.context.credentials = credentials
magics.context.project = '<your gcp project id>'


# Set the number of rows to display before displaying partial result display
pd.options.display.max_rows = 200

Test the connection

  • In a Jupyter cell, request BigQuery using the magic keyword %%bigquery. You can type the SQL directly in the cell:
%%bigquery

SELECT * 
FROM `<project id>.<dataset id>.<table name>` as myTable 
limit 10

If you get the result, you are ready for the next part.

Reverse geo-coding of GPS points

Sample data

In my daily job, I have to deal with million of GPS positions coming from car trips. In one project, I have the need to extract only trips located in France.

To be able to explain the whole process in this article, I will use a sample table. Please import this file into your BigQuery Dataset:

{"id": "1", "longitude": 7.994068000, "latitude": 52.333028000}
{"id": "2", "longitude": 8.799564000, "latitude": 45.676491000}
{"id": "3", "longitude": 4.834516000, "latitude": 45.548067000}
...

To import the file, the simple way is to use the console:

From the dataset, click on create table

Choose the import options

Visualization

We can visualize the data easily using bigquerygeoviz

Once you setup the connection to your GCP account, enter the following query:

SELECT
  ST_GEOGPOINT(longitude, latitude) as geoPt
FROM
  `<project id>.<dataset id>.geo-sample`
LIMIT 1000;

As you can see, each red point is a GPS position around EMEA:

Viz on GeoViz

Extract country names, codes and corresponding boundaries from a public dataset

To identify the countries, we need country boundaries. Luckily for us, Google provides some public datasets in Bigquery.

We will use the table: bigquery-public-data.geo_openstreetmap.planet_features located in the US region.

Notice that the region matters especially because it is not possible to join data in different regions.

This query will extract the required rows from geo_openstreetmap:

%%bigquery

WITH CountryCodeToCountryName AS (
SELECT 
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') as countryISO2Code,
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'int_name') as countryName, 
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'is_in:continent') as is_in,
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'boundary') as boundary,
  (SELECT value FROM UNNEST(all_tags) WHERE key = 'place') as place,
  geometry AS countryGeometry
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE 
   EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') AND
   EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'int_name')  AND
   EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'boundary' AND value = 'administrative') AND 
   EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'admin_level' AND value = '2') 
)


SELECT * FROM CountryCodeToCountryName ORDER BY countryName limit 5

Few remarks:

  • the tag is_in:continent is not reliable. At first I thought I could use it to filter european countries but most of the time the tag is empty.
  • admin_level = 2 AND boundary = administrative is the country filter. Refer to: Openstreetmap documentation
  • int_name is there to get the country name in addition to the country code ISO3166-1:alpha2
  • geometry is a list of GEOMETRY objects defining the boundaries of a country

Country boundaries

Reverse geo-coding: get the country code from a GPS position

Create a local table with country boundaries

As our dataset is in EU and the public dataset is in US region, we won't be able to join. As a workaround, we will extract the result of our previous query in a JSON file and import it in a new EU table in our dataset.

To export in JSON, I recommend using a temp table to avoid getting multiple empty JSON files:

%%bigquery

BEGIN
    CREATE TEMP TABLE _SESSION.tmpExportTable AS (
        WITH CountryBoundaries AS (
            SELECT 
                (SELECT value FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') as countryISO2Code,
                (SELECT value FROM UNNEST(all_tags) WHERE key = 'int_name') as countryName, 
                geometry AS countryGeometry
            FROM `bigquery-public-data.geo_openstreetmap.planet_features`
            WHERE 
              EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'boundary' AND value = 'administrative')
              AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'int_name') 
              AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'admin_level' AND value = '2') 
              AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2')
        )
        SELECT countryName, countryISO2Code, countryGeometry FROM CountryBoundaries
    );

    EXPORT DATA OPTIONS(
      uri='gs://<a valid GCS dir>/*.json',
      format='JSON',
      overwrite=true) 
    AS
    (
        SELECT * FROM _SESSION.tmpExportTable
    );
    
    SELECT count(*) FROM _SESSION.tmpExportTable;
END;
  • We got 102 countries in a single JSON file:

GCS file

Import the JSON file from the console.

  • Create a new table eu_countries_boundaries from the console by importing the generated JSON file from GCS with the following schema:
countryGeometry:GEOGRAPHY,
countryISO2Code:STRING,
countryName:STRING

JOIN using ST_CONTAINS

Finally, we can get the country code using the ST_CONTAINS operator:

%%bigquery

SELECT
  longitude,
  latitude,
  countryName,
  countryISO2Code
FROM
  `<project id>.<dataset id>.geo-sample`
JOIN `<project id>.<dataset id>.eu_countries_boundaries` as eu_countries_boundaries
    ON ST_CONTAINS(
        eu_countries_boundaries.countryGeometry,
        ST_GEOGPOINT(longitude, latitude)
    )
LIMIT 15

Result

Filter data

It is now easy to filter the data based on the country code. For instance, to get only GPS point located in France:

%%bigquery

SELECT
  longitude,
  latitude,
  countryName,
  countryISO2Code
FROM
  `<project id>.<dataset id>.geo-sample`
JOIN `<project id>.<dataset id>.eu_countries_boundaries` as eu_countries_boundaries
    ON ST_CONTAINS(
        eu_countries_boundaries.countryGeometry,
        ST_GEOGPOINT(longitude, latitude)
    )
WHERE countryISO2Code = "FR"
LIMIT 15

BONUS

If you want to extract point from a custom region, you can use this website to hand draw a region: https://gismap.huq.io/

gismap

and then use the result in a WHERE condition:

%%bigquery

SELECT
  longitude,
  latitude
FROM
  `<project id>.<dataset id>.geo-sample`

WHERE ST_CONTAINS(
        st_geogfromtext("POLYGON((0.6632024330069441 44.273035751331605,0.025995401756944148 43.78329268647236,2.135370401756944 42.98486217000221,2.838495401756944 43.767427158391826,2.223261026756944 44.52422285028316,0.6632024330069441 44.273035751331605))"),
        ST_GEOGPOINT(longitude, latitude)
    )

LIMIT 15

👉 Don't forget to follow me on Twitter to be notified when new posts are available!