
P1D4: Python for data science

Visualization discussion (Effectively Communicating Numbers)

To be truthful and revealing, data graphics must bear on the question at the heart of quantitative thinking: “Compared to what?” The emaciated, data-thin design should always provoke suspicion, for graphics often lie by omission, leaving out data sufficient for comparisons.
Edward Tufte

Remember to follow this process for graph selection and design in order to communicate your information in the most effective manner:

  • Determine your message and identify your data
  • Determine if a table, graph, or combination of both is needed to communicate your message
  • Determine the best means to encode the values
  • Determine where to display each variable
  • Determine the best design for the remaining objects
  • Determine if particular data should be featured, and if so, how

Learning code discussion

Why does this feel hard?

Because learning new tools is almost always confusing. I want to make sure you don’t drown, but I also don’t want you to think that you get a floaty for the rest of your life.

How do you define cheating?

Cheating is when a person misleads, deceives, or acts dishonestly on purpose. (Source: Cheating (for Kids) - Nemours Kidshealth)

However, that definition often focuses on cheating’s effect on others. How could we define cheating in relation to ourselves?

Cheating is when I skip the skill development and fail to progress in my learning, retained skills, or relationships.

  • Story about Kevin’s colleagues
  • Story about a 5th grade social studies test

Making sure we have our Python packages

Installing the essential packages

For ‘small’ data work, these are my primary packages.

import sys
!{sys.executable} -m pip install numpy pandas scikit-learn plotnine altair 
  • numpy: We will not get into the science of this package, but we will use a few of its elements. pandas uses it heavily.
  • pandas is the center of the universe for data in Python. We will use it heavily.
  • scikit-learn and pandas define the center of the universe for data science.
  • plotnine is a port of ggplot2 to Python.
  • Altair is a great declarative visualization package connected to Vega/VegaLite/D3.
import sys
!{sys.executable} -m pip install --upgrade pip
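After installing, it can help to confirm that each package is visible to the interpreter you are running. The check below is a minimal sketch using only the standard library; the `packages` dictionary and the `missing` list are names introduced here for illustration. Note that scikit-learn is installed under one name but imported as `sklearn`.

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in `import` statements.
# Most names match, but scikit-learn is imported as `sklearn`.
packages = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "plotnine": "plotnine",
    "altair": "altair",
}

# find_spec() returns None when a top-level module cannot be found.
missing = [
    pip_name
    for pip_name, module_name in packages.items()
    if find_spec(module_name) is None
]

print("Missing packages:", missing or "none")
```

If anything is listed as missing, rerunning the install command above for just those names is usually enough.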

Why are we not using Anaconda?

Because VS Code fixed all the problems that I have historically had with pip, and because we are not getting too deep into scientific computing.

Why are we not using Jupyter notebooks?

  1. If we hope to have our code work in a production environment, then Jupyter is problematic.
  2. Caching and code chunks are problematic.

Read this reference for more details.

If we have plotnine, why are we using Altair?

  • It uses a clean grammar of graphics that gives us access to interactive charts and web charts (What is Altair?). It feels more Pythonic.

My Python disclaimers

What are we not learning in this course?

  • Indexing, .loc[], and .iloc[]: I may not be experienced enough to understand why I should teach you these. They all add complexity to what we are learning in the course, and we have elected to avoid them. We will use reset_index() a lot. I think MultiIndex features create complications. I have also elected to use .filter() instead of .loc[] because I like it.
  • Virtual environments: Virtual environments appear to be an essential tool as you continue to use Python. We will not be teaching or supporting them in this course.
  • matplotlib (and most derivatives): It feels old, has a bad API, and isn’t declarative. Note that plotnine is based on matplotlib. However, it hides most of matplotlib’s problems from the average user.
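As a rough illustration of the first disclaimer, the sketch below shows `.filter()` selecting columns by label and `reset_index()` turning a grouped index back into an ordinary column. The `toy` DataFrame and its column names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "group": ["x", "x", "y"],
    "a": [4, 5, 6],
    "b": [7, 8, 9],
})

# .filter() keeps columns by label; the rows are untouched.
cols = toy.filter(["group", "b"])

# groupby() moves `group` into the index; reset_index() turns it
# back into an ordinary column with a default integer index.
means = (
    toy
    .groupby("group")
    .agg(b_mean=("b", "mean"))
    .reset_index()
)

print(cols)
print(means)
```

Keep in mind that `.filter()` works on labels, not on values; keeping rows that meet a condition is a separate step (for example with `.query()`).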

Our first pandas code

Setting up our script

import pandas as pd
import numpy as np
import altair as alt
from plotnine import *

Getting our data

df = pd.DataFrame({
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12],
})

url = 'https://github.com/byuidatascience/data4soils/raw/master/data-raw/cfbp_handgrenade/cfbp_handgrenade.csv'
dat = pd.read_csv(url)

Writing code

Pandas Cheat Sheet

Use the cheat sheet to find the functions you would need to implement the following steps.

I want to:

  1. sort my table by column a, then
  2. only use the first two rows, then
  3. calculate the mean of column b.

I want to:

  1. rename column a to duck, then
  2. subset to only have duck and b columns, then
  3. keep all rows where b is less than 9, then
  4. find the min of duck.
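Once you have tried both tasks with the cheat sheet, you can compare against the sketch below. It is one possible reading of the steps, not the only valid answer; it rebuilds the small `df` from "Getting our data" so it runs on its own, and `mean_b` and `min_duck` are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [4, 5, 6],
    "b": [7, 8, 9],
    "c": [10, 11, 12],
})

# Task 1: sort by a, keep the first two rows, take the mean of b.
mean_b = (
    df
    .sort_values("a")
    .head(2)
    ["b"]
    .mean()
)

# Task 2: rename a to duck, keep duck and b, keep rows with b < 9,
# then take the minimum of duck.
min_duck = (
    df
    .rename(columns={"a": "duck"})
    .filter(["duck", "b"])
    .query("b < 9")
    ["duck"]
    .min()
)

print(mean_b, min_duck)
```

Each line maps to one step of the task, which makes it easy to check the code against the list.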

What is method chaining?

Method chaining is pandas’ and Altair’s counterpart to %>% and dplyr’s verbs in R. It has the same flow and logic as the tidyverse. You can read more details from Adiamman Keerthi.

Nested Calls

tumble_after(
  broke(
    fell_down(
      fetch(went_up(jack_jill, "hill"), "water"),
      "jack"),
    "crown"),
  "jill"
)

Method Chaining

(
jack_jill
  .went_up("hill")
  .fetch("water")
  .fell_down("jack")
  .broke("crown")
  .tumble_after("jill")
)

One obvious advantage of method chaining is that it reads top-down, with each argument placed next to its function. In nested calls, matching each function to its arguments is demanding.

What do you notice about the method chaining code above?
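Chaining works because each method returns an object that the next method can be called on. The toy class below is a minimal sketch of that idea; `Story` and its methods are hypothetical and only mirror a few of the names in the rhyme. pandas follows the same pattern: most DataFrame methods return a new DataFrame.

```python
class Story:
    """Hypothetical toy class: every method returns a new Story."""

    def __init__(self, who, events=()):
        self.who = who
        self.events = list(events)

    def _then(self, event):
        # Return a new object rather than None, so another call can follow.
        return Story(self.who, self.events + [event])

    def went_up(self, place):
        return self._then(f"went up the {place}")

    def fetch(self, thing):
        return self._then(f"fetched {thing}")

    def fell_down(self, who):
        return self._then(f"{who} fell down")


jack_jill = Story("Jack and Jill")

story = (
    jack_jill
    .went_up("hill")
    .fetch("water")
    .fell_down("jack")
)

print(story.events)
```

The original `jack_jill` object is left unchanged, which is also how most pandas methods behave by default.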

Using pandas to build our continent-level data

You can see mappings to dplyr in the Python for Data Science book.

I think these are suitable methods for building your continent-level data for use in plotnine. Altair will require you to use only one dataset.