P5D1: We don’t have time for the SQL

We are not going to make it to the project where we us SQL with Python and R (Project 5: Proving your point using SQL). Today’s class will cover how to leverage SQL with pandas and dplyr.

We will use the the 2019 Lahman’s Baseball Database which should be in the Project 5: Proving your point using SQL Github repository.

A history of SQL (and me)

Hathaway’s history

In 2005 I started work at Pacific Northwest National Laboratory, having a master’s in statistics completed and never having seen the SQL language.
In late 2005, I had to pull data from a SQL database for use in R related to our airline network. It wasn’t fun building a tidy table from a SQL database with no background in SQL. I am on the team that wins the R&D 100 Award based primarily on work by Brett Amidan.
In 2012 Ryan Hafen and Jeremiah Rounds introduced me to Hadoop and the concept of key-value databases, and I am enamored.
From 2013 - 2015, I am in deep with Hadoop, using it to answer climate science problems using data well over terabytes in size.
I start developing the data science degree in 2016 with Scott Burton and Brent Morring, and I wonder if SQL is needed in the program. Upon researching, I realize that SQL as a language is having a resurgence regardless of the back-end database.
In 2017, BYU-I’s DS program used CIT 111 and CIT 225 to give students a SQL background.
In 2020, CSE 250 and CSE 451 provide DS students more access to the SQL language for data science applications.

Big data and SQL timeline

Google’s influence on Big Data and Timescale’s narrative on SQL

2004 Paper on MapReduce released by Google
2012 Hadoop 1.0 released for use.
2017 Hadoop hype has come and gone.
Beyond 2017 the rise of new scalable databases that embrace SQL - Spark SQL

SQL origin story

Ray and I were impressed by how compactly Codd’s languages could represent complex queries. However, at the same time, we believed that it should be possible to design a relational language that would be more accessible to users without formal training in mathematics or computer programming. We believed that barriers to widespread acceptance of Codd’s languages existed on two levels.

The first barrier came from the mathematical notation, which was hard to enter at a keyboard. This barrier was superficial and could be easily dealt with by replacing symbols with keywords.

The more difficult barrier was at the semantic level. The basic concepts of Codd’s languages were adapted from set theory and symbolic logic. This was natural given Codd’s background as a mathematician, but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users. We also hoped to extend the language to encompass database updates and administrative tasks such as the creation of new tables and views, which had traditionally been outside the scope of a query language.

but Ray and I hoped to design a relational language based on concepts that would be familiar to a wider population of users.

When we moved to the San Jose Research Laboratory in 1973 to join the System R project, we began work on another new language that we called Sequel. Sequel allowed the well-paid-employee query to be represented in a readable form free from mathematical concepts and symbols. … In 1977, because of a trademark issue, the name Sequel was shortened to SQL.

reference

How does SQL code work then?

From EverSQL we can get some background.

This is the logical order of operations, also known as the order of execution, for an SQL query:

FROM, including JOINs

WHERE

GROUP BY

HAVING

WINDOW functions

SELECT

DISTINCT

UNION

ORDER BY

LIMIT and OFFSET

But the reality isn’t that easy nor straight forward. As we said, the SQL standard defines the order of execution for the different SQL query clauses. Modern databases are already challenging that default order by applying some optimization tricks which might change the actual order of execution, though they must end up returning the same result as if they were running the query at the default execution order.

Don’t think too hard about optimization at this point. Let the database figure out the optimized routine. If we were in Pandas, we would need to think about the optimized order.

Most SQL queries are typed in the following pattern;

SELECT -- <columns> and <column calculations>
FROM -- <table name>
  JOIN -- <table name>
  ON -- <columns to join>
WHERE -- <filter condition on rows>
GROUP BY -- <subsets for column calculations>
HAVING -- <filter conditions on groups>
ORDER BY -- <how the output is returned in sequence>
LIMIT -- <number of rows to return>

What is SQLite?

Wikipedia: SQLite is a popular choice as embedded database software for local/client storage in application software such as web browsers. It is arguably the most widely deployed database engine, as it is used today by several widespread browsers, operating systems, and embedded systems (such as mobile phones), among others. SQLite has bindings to many programming languages.

SQLite.org: SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. The code for SQLite is in the public domain and is thus free for use for any purpose, commercial or private. SQLite is the most widely deployed database in the world with more applications than we can count, including several high-profile projects.

Codecademy: SQLite is a database engine. It is software that allows users to interact with a relational database. In SQLite, a database is stored in a single file — a trait that distinguishes it from other database engines. This fact allows for a great deal of accessibility: copying a database is no more complicated than copying the file that stores the data, sharing a database can mean sending an email attachment.

Python: Pandas and SQL

The connection

# %%
import pandas as pd 
import numpy as np
import sqlite3
from plotnine import *

# %%
sqlite_file = 'lahmansbaseballdb.sqlite'
con = sqlite3.connect(sqlite_file)
# %%
# See the tables in the database
table = pd.read_sql_query(
    "SELECT * FROM sqlite_master WHERE type='table'",
    con)
print(table.filter(['name']))
print('\n\n')
# 8 is collegeplaying
print(table.sql[8])

R: RSQLite and SQL

# install.packages("dbplyr")
library(DBI)
library(tidyverse)

sqlite_file <- "lahmansbaseballdb.sqlite"

con <- DBI::dbConnect(RSQLite::SQLite(), sqlite_file)

res <- dbSendQuery(con, "SELECT * FROM sqlite_master WHERE type='table'")

table <- dbFetch(res)
table %>% select(name)

table %>%
    slice(9) %>%
    pull(sql) %>%
    cat()

dplyr and SQL

RStudio has detailed documentation on how to use dblyr based commands with a database. You can read their documentation.

library(tidyverse)

sqlite_file <- "lahmansbaseballdb.sqlite"
con <- DBI::dbConnect(RSQLite::SQLite(), sqlite_file)

DBI::dbListTables(con)  

asfull <- tbl(con, "AllstarFull")
bpost <- tbl(con, "BattingPost")

Exploratory Data Analysis in databases

Questions

For seasons after 1999, which year had the most players selected as All Stars but didn’t play in the All Star game?
Of those players selected as All Stars but didn’t play in the All Star game after 1999, provide a summary of how many games, hits, and at bats occurred by those players in each year’s post season.

Practicing with dbplyr

R: SQL queries with DBI

The DBI package provides access to sending SQL queries.

Python SQL queries with Pandas

The Duck is coming (DuckDB)

I think DuckDB is the future of quick larger than memory analysis in Python. But, I need more time to play with it.