Use a notebook to get the country code of GPS position using BigQuery and public dataset
GCP Bigquery is a powerful tool for analysis data and has geography functions coming with the GEOGRAPHY data type. As said in the article Query AWS Athena using Jupyter, the console interface is great for a quick query but when you need to run analysis for several hours, Jupyter can be a better option.
The first part of this article will give you the steps to run Bigquery queries inside a Jupyter notebook and the second part will show you how to use a public dataset to reverse geocoding position: we will get the country code from a GPS position.
$ brew update && brew upgrade python
$ brew reinstall pipenv
$ mkdir jupyter && cd "$_"
$ pipenv --python 3
then :
$ pipenv shell
$ pip install jupyterlab
$ pip install --upgrade pandas
$ pip install google-cloud-bigquery
$ pip install --upgrade "google-cloud-bigquery[bqstorage,pandas]"
$ pip install pydata_google_auth
$ pip install tqdm
$ pip install ipywidgets
$ jupyter lab
Right in the Jupyter notebook, setup the GCP connection with few dedicated cells:
%reload_ext google.cloud.bigquery
import pydata_google_auth
credentials = pydata_google_auth.get_user_credentials(['https://www.googleapis.com/auth/bigquery'])
# Creds are stored in $HOME/.config/pydata/pydata_google_credentials.json
# Use the credentials in other libraries, such as the Google BigQuery client library.
from google.cloud import bigquery
from google.cloud.bigquery import magics
import pandas as pd
client = bigquery.Client(project='<your gcp project id>', credentials=credentials)
magics.context.credentials = credentials
magics.context.project = '<your gcp project id>'
# Set the number of rows to display before displaying partial result display
pd.options.display.max_rows = 200
%%bigquery
. You can type the SQL directly in the cell:%%bigquery
SELECT *
FROM `<project id>.<dataset id>.<table name>` as myTable
limit 10
If you get the result, you are ready for the next part.
In my daily job, I have to deal with million of GPS positions coming from car trips. In one project, I have the need to extract only trips located in France.
To be able to explain the whole process in this article, I will use a sample table. Please import this file into your BigQuery Dataset:
{"id": "1", "longitude": 7.994068000, "latitude": 52.333028000}
{"id": "2", "longitude": 8.799564000, "latitude": 45.676491000}
{"id": "3", "longitude": 4.834516000, "latitude": 45.548067000}
...
To import the file, the simple way is to use the console:
We can visualize the data easily using bigquerygeoviz
Once you setup the connection to your GCP account, enter the following query:
SELECT
ST_GEOGPOINT(longitude, latitude) as geoPt
FROM
`<project id>.<dataset id>.geo-sample`
LIMIT 1000;
As you can see, each red point is a GPS position around EMEA:
To identify the countries, we need country boundaries. Luckily for us, Google provides some public datasets in Bigquery.
We will use the table: bigquery-public-data.geo_openstreetmap.planet_features
located in the US region.
Notice that the region matters especially because it is not possible to join data in different regions.
This query will extract the required rows from geo_openstreetmap:
%%bigquery
WITH CountryCodeToCountryName AS (
SELECT
(SELECT value FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') as countryISO2Code,
(SELECT value FROM UNNEST(all_tags) WHERE key = 'int_name') as countryName,
(SELECT value FROM UNNEST(all_tags) WHERE key = 'is_in:continent') as is_in,
(SELECT value FROM UNNEST(all_tags) WHERE key = 'boundary') as boundary,
(SELECT value FROM UNNEST(all_tags) WHERE key = 'place') as place,
geometry AS countryGeometry
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE
EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') AND
EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'int_name') AND
EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'boundary' AND value = 'administrative') AND
EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'admin_level' AND value = '2')
)
SELECT * FROM CountryCodeToCountryName ORDER BY countryName limit 5
Few remarks:
is_in:continent
is not reliable. At first I thought I could use it to filter european countries but most of the time the tag is empty.admin_level = 2 AND boundary = administrative
is the country filter. Refer to: Openstreetmap documentationint_name
is there to get the country name in addition to the country code ISO3166-1:alpha2
geometry
is a list of GEOMETRY objects defining the boundaries of a countryAs our dataset is in EU and the public dataset is in US region, we won't be able to join. As a workaround, we will extract the result of our previous query in a JSON file and import it in a new EU table in our dataset.
To export in JSON, I recommend using a temp table to avoid getting multiple empty JSON files:
%%bigquery
BEGIN
CREATE TEMP TABLE _SESSION.tmpExportTable AS (
WITH CountryBoundaries AS (
SELECT
(SELECT value FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2') as countryISO2Code,
(SELECT value FROM UNNEST(all_tags) WHERE key = 'int_name') as countryName,
geometry AS countryGeometry
FROM `bigquery-public-data.geo_openstreetmap.planet_features`
WHERE
EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'boundary' AND value = 'administrative')
AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'int_name')
AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'admin_level' AND value = '2')
AND EXISTS (SELECT 1 FROM UNNEST(all_tags) WHERE key = 'ISO3166-1:alpha2')
)
SELECT countryName, countryISO2Code, countryGeometry FROM CountryBoundaries
);
EXPORT DATA OPTIONS(
uri='gs://<a valid GCS dir>/*.json',
format='JSON',
overwrite=true)
AS
(
SELECT * FROM _SESSION.tmpExportTable
);
SELECT count(*) FROM _SESSION.tmpExportTable;
END;
eu_countries_boundaries
from the console by importing the generated JSON file from GCS with the following schema:countryGeometry:GEOGRAPHY,
countryISO2Code:STRING,
countryName:STRING
Finally, we can get the country code using the ST_CONTAINS
operator:
%%bigquery
SELECT
longitude,
latitude,
countryName,
countryISO2Code
FROM
`<project id>.<dataset id>.geo-sample`
JOIN `<project id>.<dataset id>.eu_countries_boundaries` as eu_countries_boundaries
ON ST_CONTAINS(
eu_countries_boundaries.countryGeometry,
ST_GEOGPOINT(longitude, latitude)
)
LIMIT 15
It is now easy to filter the data based on the country code. For instance, to get only GPS point located in France:
%%bigquery
SELECT
longitude,
latitude,
countryName,
countryISO2Code
FROM
`<project id>.<dataset id>.geo-sample`
JOIN `<project id>.<dataset id>.eu_countries_boundaries` as eu_countries_boundaries
ON ST_CONTAINS(
eu_countries_boundaries.countryGeometry,
ST_GEOGPOINT(longitude, latitude)
)
WHERE countryISO2Code = "FR"
LIMIT 15
If you want to extract point from a custom region, you can use this website to hand draw a region: https://gismap.huq.io/
and then use the result in a WHERE condition:
%%bigquery
SELECT
longitude,
latitude
FROM
`<project id>.<dataset id>.geo-sample`
WHERE ST_CONTAINS(
st_geogfromtext("POLYGON((0.6632024330069441 44.273035751331605,0.025995401756944148 43.78329268647236,2.135370401756944 42.98486217000221,2.838495401756944 43.767427158391826,2.223261026756944 44.52422285028316,0.6632024330069441 44.273035751331605))"),
ST_GEOGPOINT(longitude, latitude)
)
LIMIT 15
👉 Don't forget to follow me on Twitter to be notified when new posts are available!
Follow @jcbaey