Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data-manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:
Additional Recommended Resources:
Let's get started with our first pandas notebook!
Import Libraries
import pandas as pd
Introduction to pandas Data Structures
pandas Series
A pandas Series is a one-dimensional labeled array.
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
ser.index
ser.loc[['nancy','bob']]
# can even use
# ser[['nancy', 'bob']]
There are three main options for selection and indexing in pandas.
The iloc indexer for a pandas DataFrame is used for integer-location based indexing / selection by position.
The iloc indexer syntax is data.iloc[&lt;row selection&gt;, &lt;column selection&gt;].
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)
Multiple columns and rows can be selected together using the .iloc indexer
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
data.iloc[0:5, 5:8] # first 5 rows and 5th, 6th, 7th columns of data frame (county -> phone1)
Note that .iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected. To counter this, pass a single-valued list if you require DataFrame output.
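The Series-vs-DataFrame distinction above can be seen directly. This is a minimal sketch using a small made-up frame (standing in for the `data` DataFrame used in the examples), contrasting a scalar position with a single-valued list:

```python
import pandas as pd

# Hypothetical stand-in for the `data` DataFrame used above
df = pd.DataFrame({'first_name': ['Aleshia', 'Evan', 'Mi'],
                   'last_name': ['Tomkiewicz', 'Zigomalas', 'Richan']})

single = df.iloc[0]     # one row selected by scalar -> Series
framed = df.iloc[[0]]   # same row via single-valued list -> one-row DataFrame

print(type(single).__name__)  # Series
print(type(framed).__name__)  # DataFrame
```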
The pandas loc indexer can be used with DataFrames for two main use cases: selecting rows by label/index, and selecting rows with a boolean/conditional lookup.
Select rows with first name Antonio, and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']
Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]
Select rows where first_name is one of several values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]
Select rows with first name Antonio AND a gmail email address
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]
Select rows with the id column between 100 and 200, and return just the 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]
A lambda function that yields True/False values can also be used.
Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
Selections can be achieved outside of the main .loc for clarity:
Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company']]
ser[[4, 3, 1]]
ser.iloc[2]
'bob' in ser
# checking whether an index is in a Series
'dan' in ser
'amit' in ser
ser * 2
ser
ser = ser * 2
ser[['nancy', 'eric']] ** 2
# strings cannot be squared, so we explicitly select only the labels whose values are numeric
ser
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
df
df.index
df.columns
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Create DataFrame from list of Python dictionaries
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
# here the default index values will be 0 and 1, to add custom values for indexes use
# pd.DataFrame(data, index=['zero', 'one'])
pd.DataFrame(data, index=['orange', 'red'])
pd.DataFrame(data, columns=['joe', 'dora','alice'])
Basic DataFrame operations
df
df['one']
df['three'] = df['one'] * df['two']
df
df['flag'] = df['one'] > 250
df
three = df.pop('three')
three, type(three)
df
del df['two']
df
df.insert(2, 'copy_of_one', df['one']) # df.insert(column_position, column_name, data_source_for_column)
df
df['one_upper_half'] = df['one'][:2]
df
Case Study: Movie Data Analysis
Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.
Here are the links to the data source and location:
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
# Note: these shell commands won't work natively on Windows; ls, cat, head and wc are UNIX utilities invoked through the shell, which Windows lacks by default
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
Using the read_csv function in pandas, we will ingest these three files.
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
# head(n) displays the first n rows of the DataFrame
movies.head(15)
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
tags
# timestamps here are epoch seconds; parse_dates would misinterpret the raw integers,
# so we read them as-is (pd.to_datetime with unit='s' converts them properly)
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()
# For current analysis, we will remove timestamp (we will come back to it!)
del ratings['timestamp']
del tags['timestamp']
# Extract row 0: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
print(row_0)
# same operation as above, using label-based loc instead of iloc
# (here the row label 0 happens to match position 0)
row_0_loc = tags.loc[0]
type(row_0_loc)
print(row_0_loc)
row_0.index
row_0['userId']
'rating' in row_0
row_0.name
row_0 = row_0.rename('first_row')
row_0.name
row_0.head(5)
tags.head()
tags.index
tags.columns
# Extract row 0, 11, 2000 from DataFrame
tags.iloc[ [0,11,2000] ]
#print(type(tags.iloc[[0, 11, 2000]])) ## type of the tags.iloc[[0, 11, 2000]] is DataFrame
Let's look at how the ratings are distributed!
ratings['rating'].describe()
ratings.describe()
ratings['rating'].mean()
ratings.mean()
ratings['rating'].min()
ratings['rating'].max()
ratings['rating'].std()
ratings['rating'].mode()
#Compute pairwise correlation of columns, excluding NA/null values
ratings.corr()
# a negative correlation score here means those features in our dataset
# are inversely correlated: when one goes up, the other goes down
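A small made-up example (hypothetical values, not from the MovieLens data) makes the sign of the correlation concrete — two columns that move in exactly opposite directions have a Pearson correlation of -1:

```python
import pandas as pd

# Toy data: 'down' decreases by 2 exactly as 'up' increases by 1,
# so the two columns are perfectly inversely correlated
df = pd.DataFrame({'up': [1, 2, 3, 4], 'down': [8, 6, 4, 2]})

corr = df.corr()
print(corr.loc['up', 'down'])  # approximately -1.0
```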
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any() #Return whether any element is True over requested axis.
filter_2 = ratings['rating'] > 0
filter_2.all()
#Return whether all elements are True over series or dataframe axis.
#Returns True if all elements within a series or along a dataframe axis are non-zero, not-empty or not-False.
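The any/all behaviour described above can be sketched on a tiny Series (hypothetical rating values), mirroring the two filters applied to the real ratings:

```python
import pandas as pd

s = pd.Series([3.5, 4.0, 2.0])  # toy ratings, all between 0 and 5

print((s > 5).any())  # False: no element exceeds 5
print((s > 0).all())  # True: every element is positive
```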
movies.shape
#is any row NULL ?
movies.isnull().any()
That's nice! No NULL values!
ratings.shape
#is any row NULL ?
ratings.isnull().any()
That's nice! No NULL values!
tags.shape
#is any row NULL ?
tags.isnull().any()
We have some tags which are NULL.
tags = tags.dropna() #axis=0 for dropping rows and axis=1 for dropping columns with NaN
#Check again: is any row NULL ?
tags.isnull().any()
tags.shape
That's nice! No NULL values! Notice that the number of rows has decreased.
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
movies[['title','genres']].head()
ratings[1000:1010]
ratings[:10]
ratings[-10:]
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
movies[is_animation].head(15)
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
tags.head()
movies.head()
t = movies.merge(tags, on='movieId', how='inner')
t.head()
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
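As a quick sketch of how the `how` argument changes the result (toy frames with hypothetical ids, not the MovieLens data): an inner join keeps only keys present in both frames, while an outer join keeps every key and fills the gaps with NaN.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'movieId': [2, 3, 4], 'tag': ['funny', 'dark', 'long']})

inner = left.merge(right, on='movieId', how='inner')  # ids 2 and 3 only
outer = left.merge(right, on='movieId', how='outer')  # ids 1-4, NaN where missing

print(len(inner))  # 2
print(len(outer))  # 4
```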
Combine aggregation, merging, and filters to get useful analytics
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
box_office[is_comedy & is_highly_rated][-5:]
movies.head()
Split 'genres' into multiple columns
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
Add a new column for comedy genre flag
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
Extract year from title e.g. (1995)
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False) # raw string avoids escape warnings; expand=False returns a Series
movies.tail()
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
Unix time / POSIX time / epoch time records
time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags.head(5)
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data Type datetime64[ns] maps to either &lt;M8[ns] or &gt;M8[ns], depending on the hardware's byte order (little- vs big-endian).
tags['parsed_time'].dtype
tags.head(2)
Selecting rows based on timestamps
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
Sorting the table using the timestamps
tags.sort_values(by='parsed_time', ascending=True)[:10]
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
Do some years look better for box-office movies than others?
Does any data point seem like an outlier in some sense?