Packages in R and Python
Python and R both date back to the early 1990s. R was designed as a programming language for statisticians, while Python was created for general-purpose software development. Neither language ships with all the tools data scientists need for their day-to-day work. To fill those gaps, we rely on packages that others have built to facilitate data science programming.
Both languages require a two-step process.
- Install the package from the internet or a local file (once per language installation)
- Initiate the library (once per script)
Installation
Python
Python uses pip as its package manager. pip runs from the terminal rather than inside Python, and it has a reputation for being a little picky when new Python programmers want to install packages. Conda was built for the scientific computing community to get tools installed on their computers without the headaches of pip.
I prefer to stay with pip. Some of you may get the terminal to work with a simple pip install pandas. If pip will not install packages into your Python environment, use the following command within interactive Python in VS Code.
import sys
!{sys.executable} -m pip install pandas
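The leading ! in the command above is interactive (IPython/Jupyter) syntax, so it will not run in a plain .py script. As a minimal sketch of the same idea for a regular script, the code below builds the pip command from sys.executable so the package lands in the interpreter that is actually running. The helper names pip_install_command and pip_install are made up for illustration and are not part of pip.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this code (hypothetical helper)."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run the command; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


# Show the command without running it.
print(" ".join(pip_install_command("pandas")))

# To actually install, call: pip_install("pandas")
```

Printing the command first is a quick way to confirm which Python installation pip will target before anything is installed.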
R
R uses install.packages() and requires you to type the package name in quotes. This function is run within the R console.
install.packages("tidyverse")
Initiation
Python
Python provides a few methods to initiate a library or functions from a library. We will use import ___ as ___ and from ___ import ___. Many of the data science Python packages have standard abbreviations that are used. The code below provides some standard examples.
- The first three, which use import ___ as ___, all require that you start your package command with the abbreviation to access the respective functions (for example, pd.read_csv()).
- from plotnine import * initiates all the functions in plotnine without requiring the abbreviation or package name to be typed. This allows you to write ggplot2-style code in Python.
- from sklearn.tree import DecisionTreeClassifier imports only one object from a package (here, a single class and its methods).
import pandas as pd
import altair as alt
import numpy as np
from plotnine import *
from sklearn.tree import DecisionTreeClassifier
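To see how the import styles change the way you call functions, here is a small sketch that uses only standard-library modules as stand-ins, so it runs without installing anything. The alias st and the sample values are made up for illustration; the same pattern applies to pd, alt, and np.

```python
# Alias style, like `import pandas as pd`: every call needs the prefix.
import statistics as st

# Single-name style, like `from sklearn.tree import DecisionTreeClassifier`:
# only `sqrt` is brought in, and it is called without any prefix.
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = st.mean(values)  # prefixed with the alias
root = sqrt(16)               # no prefix needed

print(mean_value, root)
```

The alias style keeps it obvious where each function comes from, which is one reason the star import is usually reserved for packages like plotnine where the unprefixed syntax is the point.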
R
Unlike Python users, R users typically load the entire library. Once the library is loaded, any of its functions that share a name with a base R function or a function from a previously loaded library mask (take precedence over) the earlier version.
library(tidyverse)
I prefer to use the pacman package to initiate the packages for my script. It provides a couple of advantages. If the package is not installed, it automatically installs the package before starting the library. Additionally, I don’t have to remember whether to put the name in quotes.
pacman::p_load(tidyverse, httpgd)
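Python has no built-in equivalent of this install-if-missing behavior, but the idea can be sketched in a few lines. The helper load_or_install below is hypothetical and not part of any package. It also assumes the name used with pip matches the name used with import, which is not always true (for example, scikit-learn installs as scikit-learn but imports as sklearn).

```python
import importlib
import importlib.util
import subprocess
import sys


def load_or_install(name):
    """Import `name`, installing it with pip first if it cannot be found.

    Hypothetical helper; assumes the pip name and the import name match.
    """
    if importlib.util.find_spec(name) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", name])
        importlib.invalidate_caches()  # let the import system see the new package
    return importlib.import_module(name)


# `json` ships with Python, so this imports it without installing anything.
json_module = load_or_install("json")
print(json_module.dumps({"a": 1}))
```

With an installable package the call would look like pd = load_or_install("pandas"), which plays the role of both the install step and the import step.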