P3D9: Programming with spatial data (Python)

These slides map to the R example slides on Day 17 or P3D3

Getting spatial tools onto our computers

First we need to make sure we have the spatial tools, mainly GDAL, on our computers.

Macs

If you are having issues on the mac, I recommend Homebrew.

Paste the below code into your Terminal.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

After you have Hombrew installed you can run the following brew commands in your Terminal.

brew install gdal

You could also use KyngChaos’s .dmg file.

Within our interactive python window run the following

import sys
!{sys.executable} -m pip install pyarrow folium mapclassify geopandas rtree

Windows

You will most likely have issues on a Windows computer. I like to avoid using anaconda which is one option that GeoPandas recommends. We should be able to get everything working using the following commands.

import sys
!{sys.executable} -m pip install wheel pipwin pyarrow folium mapclassify rtree
!{sys.executable} -m pipwin install gdal 
!{sys.executable} -m pipwin install pyproj
!{sys.executable} -m pipwin install fiona
!{sys.executable} -m pipwin install shapely
!{sys.executable} -m pip install geopandas # notice this isn't pipwin

Thanks to JDOaktown and Manuel for the above process.

Finding spatial data

Move from R

Install the pertinent packages. We probably just need the third package sfarrow.

sfarrow is a package for reading and writing Parquet and Feather files with sf objects using arrow in R.

Simple features are a popular format for representing spatial vector data using data.frames and a list-like geometry column, implemented in the R package sf. Apache Parquet files are an open-source, column-oriented data storage format (https://parquet.apache.org/) which enable efficient read/writing for large files. Parquet files are becoming popular across programming languages and can be used in R using the package arrow.

The sfarrow implementation translates simple feature data objects using well-known binary (WKB) format for geometries and reads/writes Parquet/Feather files. A key goal of the package is for interoperability of the files (particularly with Python GeoPandas), so coordinate reference system information is maintained in a standard metadata format (https://github.com/geopandas/geo-arrow-spec). Note to users: this metadata format is not yet stable for production uses and may change in the future.

remotes::install_github("ropensci/USAboundaries")
remotes::install_github("ropensci/USAboundariesData")
install.packages("sfarrow")

Load our libraries.

library(tidyverse)
library(USAboundaries)
library(sf)
library(sfarrow)

Get the data from USAboundaries

usa <- USAboundaries::us_boundaries() %>%
    st_transform(4326)

usa_counties <- USAboundaries::us_counties() %>%
    select(-state_name) %>%
    st_transform(4326)

usa_cities <- USAboundaries::us_cities() %>%
    st_transform(4326)

Write the data out to the .parquet and .feather formats.

sfarrow::st_write_feather(usa, "data/usa.feather")
sfarrow::st_write_parquet(usa, "data/usa.parquet")

sfarrow::st_write_feather(usa_counties, "data/usa_counties.feather")
sfarrow::st_write_parquet(usa_counties, "data/usa_counties.parquet")

sfarrow::st_write_feather(usa_cities, "data/usa_cities.feather")
sfarrow::st_write_parquet(usa_cities, "data/usa_cities.parquet")

Download from internet

Using R to get spatial data probably isn’t the most optimal path. However, I highly recommend the simple feature arrow format for storing and sharing spatial data within Python and R. We can also download data from the internet and load the files.

We can load geo.json files. Here is one for Cobb Count
Most government institutions use ArcGIS shape files. Here are files from the US Census: Cartographic Boundary Files - Shapefile. They provide details on the files and data at their Cartographic Boundary File Description.

Leveraging the GeoPandas

GeoPandas has a great User Guide to help us get started. We should have everything installed based on the installation guides above. Now we can load our libraries.

import pandas as pd
import numpy as np
import geopandas as gpd

`gpd.read_file()`

From zipped Shapefiles

ESRI- ArcGIS defined the format.

A shapefile is an Esri vector data storage format for storing the location, shape, and attributes of geographic features. It is stored as a set of related files and contains one feature class. Shapefiles often contain large features with a lot of associated data and historically have been used in GIS desktop applications such as ArcMap.

The primary way to make shapefile data available for others to view through a web browser is to add it to a .zip file, upload it, and publish a hosted feature layer. The .zip file must contain at least the .shp, .shx, .dbf, and .prj files components of the shapefile.

arcGIS

census_url  = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip"
county_shp = gpd.read_file(census_url)

From GeoJSON files

County GeoJSON files for Cobb county provide enough of an example.

cobb_url = "https://github.com/johan/world.geo.json/raw/master/countries/USA/GA/Cobb.geo.json"
cobb_gj = gpd.read_file(cobb_url)

From `.parquet` or `.feather` geo files

We can use either format based on GeoPandas documentation.

usa = gpd.read_parquet("data/usa.parquet")
county = gpd.read_parquet("data/usa_counties.parquet")
cities = gpd.read_parquet("data/usa_cities.parquet")

Getting lat and long columns to spatial objects

Let’s load our SafeGraph data on Chipotle stores and convert the latitude and longitude to a geometry column.

# you may need to change your file path.
dat = pd.read_csv("SafeGraph - Patterns and Core Data - Chipotle - July 2021/Core Places and Patterns Data/chipotle_core_poi_and_patterns.csv")

dat_sp = gpd.GeoDataFrame(
    dat.filter(["placekey", "latitude", "longitude", "median_dwell"]), 
    geometry=gpd.points_from_xy(
        dat.longitude,
        dat.latitude))

Plotting polygons and points

Let’s plot our polygons using the GeoPandas methods.

import folium
from plotnine import *

matplotlib

Maps with layers

county = gpd.read_parquet("data/usa_counties.parquet")
c48=county.query('stusps not in ["HI", "AK", "PR"]')

base = c48.plot(color="white", edgecolor="darkgrey")
dat_sp.plot(ax=base, color="red", markersize=5)

Folium

They also have an interactive option that depends on Folium which leverages leaflet.js like the leaflet package in R. We need to install some additional dependencies.

import sys
!{sys.executable} -m pip install folium matplotlib mapclassify

Now we can build interactive plots using a similar structure as above.

.explore() arguments

dat_sp_lt100 = dat_sp.query("median_dwell < 100")
c48 = county.query('statusps not in ["HI", "AK", "PR"]')

# %%
base_inter = c48.explore(
    style_kwds = {"fill":False, 
        "color":"darkgrey",
        "weight":.4}
)

theplot=dat_sp_lt100.explore(
    m=base_inter,
    column='median_dwell',
    cmap="Set1",
    marker_kwds={"radius":2, "fill":True},
    style_kwds={"fillOpacity":1})

folium.TileLayer('Stamen Toner', control=True).add_to(base_inter)  # use folium to add alternative tiles
folium.LayerControl().add_to(base_inter)  

theplot

If we want to get the plot out of the interactive viewer we save the map object as an .html file and open the file in our web browser.

theplot.save("map.html")

Manipulating geometries

What units are the awater and aland variables in?
Does aland include awater?

We need to manage the projection to get the projection into meters for the units and then calculate the area.

c48 = c48.assign(
    aland_calc = c48.geometry.to_crs(epsg = 3310).area
)

Now we can plot the two columns.

(ggplot(c48, aes(
    x="aland_calc/aland",
    fill="awater")) + 
geom_histogram())