P3D9: Programming with spatial data (Python)
These slides map to the R example slides on Day 17 or P3D3
Getting spatial tools onto our computers
First we need to make sure we have the spatial tools, mainly GDAL, on our computers.
Macs
If you are having issues on the mac, I recommend Homebrew.
- Paste the below code into your Terminal.
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- After you have Hombrew installed you can run the following brew commands in your Terminal.
brew install gdal
You could also use KyngChaos’s .dmg
file.
- Within our interactive python window run the following
import sys
!{sys.executable} -m pip install pyarrow folium mapclassify geopandas rtree
Windows
You will most likely have issues on a Windows computer. I like to avoid using anaconda which is one option that GeoPandas recommends. We should be able to get everything working using the following commands.
import sys
!{sys.executable} -m pip install wheel pipwin pyarrow folium mapclassify rtree
!{sys.executable} -m pipwin install gdal
!{sys.executable} -m pipwin install pyproj
!{sys.executable} -m pipwin install fiona
!{sys.executable} -m pipwin install shapely
!{sys.executable} -m pip install geopandas # notice this isn't pipwin
Thanks to JDOaktown and Manuel for the above process.
Finding spatial data
Move from R
Install the pertinent packages. We probably just need the third package sfarrow
.
sfarrow
is a package for reading and writing Parquet and Feather files withsf
objects usingarrow
inR
.
Simple features are a popular format for representing spatial vector data using
data.frames
and a list-like geometry column, implemented in theR
packagesf
. Apache Parquet files are an open-source, column-oriented data storage format (https://parquet.apache.org/) which enable efficient read/writing for large files. Parquet files are becoming popular across programming languages and can be used inR
using the packagearrow
.
The
sfarrow
implementation translates simple feature data objects using well-known binary (WKB) format for geometries and reads/writes Parquet/Feather files. A key goal of the package is for interoperability of the files (particularly with PythonGeoPandas
), so coordinate reference system information is maintained in a standard metadata format (https://github.com/geopandas/geo-arrow-spec). Note to users: this metadata format is not yet stable for production uses and may change in the future.
remotes::install_github("ropensci/USAboundaries")
remotes::install_github("ropensci/USAboundariesData")
install.packages("sfarrow")
Load our libraries.
library(tidyverse)
library(USAboundaries)
library(sf)
library(sfarrow)
Get the data from USAboundaries
usa <- USAboundaries::us_boundaries() %>%
st_transform(4326)
usa_counties <- USAboundaries::us_counties() %>%
select(-state_name) %>%
st_transform(4326)
usa_cities <- USAboundaries::us_cities() %>%
st_transform(4326)
Write the data out to the .parquet
and .feather
formats.
sfarrow::st_write_feather(usa, "data/usa.feather")
sfarrow::st_write_parquet(usa, "data/usa.parquet")
sfarrow::st_write_feather(usa_counties, "data/usa_counties.feather")
sfarrow::st_write_parquet(usa_counties, "data/usa_counties.parquet")
sfarrow::st_write_feather(usa_cities, "data/usa_cities.feather")
sfarrow::st_write_parquet(usa_cities, "data/usa_cities.parquet")
Download from internet
Using R to get spatial data probably isn’t the most optimal path. However, I highly recommend the simple feature arrow format for storing and sharing spatial data within Python and R. We can also download data from the internet and load the files.
- We can load geo.json files. Here is one for Cobb Count
- Most government institutions use ArcGIS shape files. Here are files from the US Census: Cartographic Boundary Files - Shapefile. They provide details on the files and data at their Cartographic Boundary File Description.
Leveraging the GeoPandas
GeoPandas has a great User Guide to help us get started. We should have everything installed based on the installation guides above. Now we can load our libraries.
import pandas as pd
import numpy as np
import geopandas as gpd
gpd.read_file()
From zipped Shapefiles
ESRI- ArcGIS defined the format.
A shapefile is an Esri vector data storage format for storing the location, shape, and attributes of geographic features. It is stored as a set of related files and contains one feature class. Shapefiles often contain large features with a lot of associated data and historically have been used in GIS desktop applications such as ArcMap.
The primary way to make shapefile data available for others to view through a web browser is to add it to a .zip file, upload it, and publish a hosted feature layer. The .zip file must contain at least the .shp, .shx, .dbf, and .prj files components of the shapefile.
census_url = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_500k.zip"
county_shp = gpd.read_file(census_url)
From GeoJSON files
County GeoJSON files for Cobb county provide enough of an example.
cobb_url = "https://github.com/johan/world.geo.json/raw/master/countries/USA/GA/Cobb.geo.json"
cobb_gj = gpd.read_file(cobb_url)
From .parquet
or .feather
geo files
We can use either format based on GeoPandas documentation.
usa = gpd.read_parquet("data/usa.parquet")
county = gpd.read_parquet("data/usa_counties.parquet")
cities = gpd.read_parquet("data/usa_cities.parquet")
Getting lat and long columns to spatial objects
Let’s load our SafeGraph data on Chipotle stores and convert the latitude
and longitude
to a geometry
column.
# you may need to change your file path.
dat = pd.read_csv("SafeGraph - Patterns and Core Data - Chipotle - July 2021/Core Places and Patterns Data/chipotle_core_poi_and_patterns.csv")
dat_sp = gpd.GeoDataFrame(
dat.filter(["placekey", "latitude", "longitude", "median_dwell"]),
geometry=gpd.points_from_xy(
dat.longitude,
dat.latitude))
Plotting polygons and points
Let’s plot our polygons using the GeoPandas methods.
import folium
from plotnine import *
matplotlib
county = gpd.read_parquet("data/usa_counties.parquet")
c48=county.query('stusps not in ["HI", "AK", "PR"]')
base = c48.plot(color="white", edgecolor="darkgrey")
dat_sp.plot(ax=base, color="red", markersize=5)
Folium
They also have an interactive option that depends on Folium which leverages leaflet.js like the leaflet package in R. We need to install some additional dependencies.
import sys
!{sys.executable} -m pip install folium matplotlib mapclassify
Now we can build interactive plots using a similar structure as above.
.explore()
arguments
dat_sp_lt100 = dat_sp.query("median_dwell < 100")
c48 = county.query('statusps not in ["HI", "AK", "PR"]')
# %%
base_inter = c48.explore(
style_kwds = {"fill":False,
"color":"darkgrey",
"weight":.4}
)
theplot=dat_sp_lt100.explore(
m=base_inter,
column='median_dwell',
cmap="Set1",
marker_kwds={"radius":2, "fill":True},
style_kwds={"fillOpacity":1})
folium.TileLayer('Stamen Toner', control=True).add_to(base_inter) # use folium to add alternative tiles
folium.LayerControl().add_to(base_inter)
theplot
If we want to get the plot out of the interactive viewer we save the map object as an .html
file and open the file in our web browser.
theplot.save("map.html")
Manipulating geometries
- What units are the
awater
andaland
variables in? - Does
aland
includeawater
?
We need to manage the projection to get the projection into meters for the units and then calculate the area.
c48 = c48.assign(
aland_calc = c48.geometry.to_crs(epsg = 3310).area
)
Now we can plot the two columns.
(ggplot(c48, aes(
x="aland_calc/aland",
fill="awater")) +
geom_histogram())