Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

  • Generating descriptive statistics on data
  • Data cleaning using built in pandas functions
  • Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
  • Merging multiple datasets using dataframes
  • Working with timestamps and time-series data

Additional Recommended Resources:

Let's get started with our first pandas notebook!


Import Libraries

In [1]:
import pandas as pd

Introduction to pandas Data Structures


*pandas* has two main data structures it uses, namely, *Series* and *DataFrames*.

pandas Series

pandas Series one-dimensional labeled array.

In [2]:
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
In [3]:
ser
Out[3]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
In [4]:
ser.index
Out[4]:
Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')
In [5]:
ser.loc[['nancy','bob']]
# can even use 
# ser[['nancy', 'bob']]
Out[5]:
nancy    300
bob      foo
dtype: object

Pandas Data Selection

There’s three main options to achieve the selection and indexing activities in Pandas.

  • Selecting data by row numbers (.iloc)
  • Selecting data by label or by a conditional statment (.loc)
  • Selecting in a hybrid approach (.ix) (now Deprecated in Pandas 0.20.1)

Selecting pandas data using “iloc”

The iloc indexer for Pandas Dataframe is used for integer-location based indexing / selection by position. The iloc indexer syntax is data.iloc[, ]

1. Single selections using iloc and DataFrame

Rows:

data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)

Columns:

data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)

Multiple columns and rows can be selected together using the .iloc indexer

Multiple row and column selections using iloc and DataFrame

data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
data.iloc[0:5, 5:8] # first 5 rows and 5th, 6th, 7th columns of data frame (county -> phone1

Note that .iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected. To counter this, pass a single-valued list if you require DataFrame output.

2. Selecting pandas data using “loc”

The Pandas loc indexer can be used with DataFrames for two different use cases:

  • Selecting rows by label/index1
  • Selecting rows with a boolean / conditional lookup The Pandas loc indexer can be used with DataFrames for two different use cases:
  • Selecting rows by label/index
  • Selecting rows with a boolean / conditional lookup

Select rows with first name Antonio, # and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']

Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]

Select rows with last_name equal to some values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]

Select rows with first name Antonio AND hotmail email addresses
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]

select rows with id column between 100 and 200, and just return 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]

A lambda function that yields True/False values can also be used.
Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]

Selections can be achieved outside of the main .loc for clarity:
Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company']]

In [6]:
ser[[4, 3, 1]]
Out[6]:
eric    500
dan     bar
bob     foo
dtype: object
In [7]:
ser.iloc[2]
Out[7]:
300
In [8]:
'bob' in ser
# checking whether an index is in a Series
Out[8]:
True
In [9]:
'dan' in ser
Out[9]:
True
In [10]:
'amit' in ser
Out[10]:
False
ser
In [11]:
ser * 2
Out[11]:
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object
In [12]:
ser
Out[12]:
tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
In [13]:
ser = ser * 2
In [14]:
ser[['nancy', 'eric']] ** 2
#squares of strings cannot be done, so we have to explicitly supply the indexes which arent strings
Out[14]:
nancy     360000
eric     1000000
dtype: object
In [15]:
ser
Out[15]:
tom         200
bob      foofoo
nancy       600
dan      barbar
eric       1000
dtype: object

pandas DataFrame

pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

In [16]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
In [17]:
df = pd.DataFrame(d)
print(df)
df
          one     two
apple   100.0   111.0
ball    200.0   222.0
cerill    NaN   333.0
clock   300.0     NaN
dancy     NaN  4444.0
Out[17]:
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
In [18]:
df.index
Out[18]:
Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
In [19]:
df.columns
Out[19]:
Index(['one', 'two'], dtype='object')
In [20]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
Out[20]:
one two
dancy NaN 4444.0
ball 200.0 222.0
apple 100.0 111.0
In [21]:
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Out[21]:
two five
dancy 4444.0 NaN
ball 222.0 NaN
apple 111.0 NaN

Create DataFrame from list of Python dictionaries

In [22]:
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [23]:
pd.DataFrame(data)
# here the default index values will be 0 and 1, to add custom values for indexes use
# pd.DataFrame(data, index=['zero', 'one'])
Out[23]:
alex alice dora ema joe
0 1.0 NaN NaN NaN 2.0
1 NaN 20.0 10.0 5.0 NaN
In [24]:
pd.DataFrame(data, index=['orange', 'red'])
Out[24]:
alex alice dora ema joe
orange 1.0 NaN NaN NaN 2.0
red NaN 20.0 10.0 5.0 NaN
In [25]:
pd.DataFrame(data, columns=['joe', 'dora','alice'])
Out[25]:
joe dora alice
0 2.0 NaN NaN
1 NaN 10.0 20.0

Basic DataFrame operations

In [26]:
df
Out[26]:
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
In [27]:
df['one']
Out[27]:
apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64
In [28]:
df['three'] = df['one'] * df['two']
df
Out[28]:
one two three
apple 100.0 111.0 11100.0
ball 200.0 222.0 44400.0
cerill NaN 333.0 NaN
clock 300.0 NaN NaN
dancy NaN 4444.0 NaN
In [29]:
df['flag'] = df['one'] > 250
df
Out[29]:
one two three flag
apple 100.0 111.0 11100.0 False
ball 200.0 222.0 44400.0 False
cerill NaN 333.0 NaN False
clock 300.0 NaN NaN True
dancy NaN 4444.0 NaN False
In [30]:
three = df.pop('three')
In [31]:
three, type(three)
Out[31]:
(apple     11100.0
 ball      44400.0
 cerill        NaN
 clock         NaN
 dancy         NaN
 Name: three, dtype: float64, pandas.core.series.Series)
In [32]:
df
Out[32]:
one two flag
apple 100.0 111.0 False
ball 200.0 222.0 False
cerill NaN 333.0 False
clock 300.0 NaN True
dancy NaN 4444.0 False
In [33]:
del df['two']
In [34]:
df
Out[34]:
one flag
apple 100.0 False
ball 200.0 False
cerill NaN False
clock 300.0 True
dancy NaN False
In [35]:
df.insert(2, 'copy_of_one', df['one']) #df.insert(position_of_column, name_of_column, data_sorce_to_create_column)
df
Out[35]:
one flag copy_of_one
apple 100.0 False 100.0
ball 200.0 False 200.0
cerill NaN False NaN
clock 300.0 True 300.0
dancy NaN False NaN
In [36]:
df['one_upper_half'] = df['one'][:2]
df
Out[36]:
one flag copy_of_one one_upper_half
apple 100.0 False 100.0 100.0
ball 200.0 False 200.0 200.0
cerill NaN False NaN NaN
clock 300.0 True 300.0 NaN
dancy NaN False NaN NaN

Case Study: Movie Data Analysis


This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore with it using pandas.

Download the Dataset

Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the folder turned out to be too large to fit on the edX platform due to size constraints.

Here are the links to the data source and location:

Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.

Let us look at the files in this dataset using the UNIX command ls.

In [ ]:
# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
# these commands wont work on windows as im running Jupyter through anaconda which is a python distribution and python uses bash by default but windows doesnt
In [ ]:
!cat ./movielens/movies.csv | wc -l
In [ ]:
!head -5 ./movielens/ratings.csv

Use Pandas to Read the Dataset


In this notebook, we will be using three CSV files:

  • ratings.csv : userId,movieId,rating, timestamp
  • tags.csv : userId,movieId, tag, timestamp
  • movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.

In [37]:
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
# head(no_of_items) is used to display the first no_of_elements items in the DataFrame object 
movies.head(15)
<class 'pandas.core.frame.DataFrame'>
Out[37]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror
12 13 Balto (1995) Adventure|Animation|Children
13 14 Nixon (1995) Drama
14 15 Cutthroat Island (1995) Action|Adventure|Romance
In [38]:
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
Out[38]:
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078
In [39]:
tags
Out[39]:
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078
5 65 668 bollywood 1368149876
6 65 898 screwball comedy 1368150160
7 65 1248 noir thriller 1368149983
8 65 1391 mars 1368150055
9 65 1617 neo-noir 1368150217
10 65 1694 jesus 1368149925
11 65 1783 noir thriller 1368149983
12 65 2022 jesus 1368149925
13 65 2193 dragon 1368151314
14 65 2353 conspiracy theory 1368151266
15 65 2662 mars 1368150055
16 65 2726 noir thriller 1368149983
17 65 2840 jesus 1368149925
18 65 3052 jesus 1368149926
19 65 5135 bollywood 1368149876
20 65 6539 treasure 1368149949
21 65 6874 dark hero 1368150079
22 65 7013 noir thriller 1368149983
23 65 7318 jesus 1368149925
24 65 8529 stranded 1368150012
25 65 8622 conspiracy theory 1368151266
26 65 27803 Oscar (Best Foreign Language Film) 1305008715
27 65 27866 New Zealand 1304957153
28 65 48082 surreal 1304958354
29 65 48082 unusual 1304958359
... ... ... ... ...
465534 138436 81932 big hair 1298754654
465535 138436 81932 Mark Wahlberg 1298754636
465536 138436 81932 pigs 1298754658
465537 138436 81932 prostitution 1298754620
465538 138437 77154 This movie should have been called "How Cocain... 1357384633
465539 138446 317 Christmas 1358983914
465540 138446 317 funny 1358983914
465541 138446 317 Judge Reinhold 1358983914
465542 138446 317 Tim Allen 1358983914
465543 138446 317 whiny kid 1358983946
465544 138446 837 family friendly 1358983693
465545 138446 918 halloween scene 1358984062
465546 138446 918 quirky 1358984051
465547 138446 2396 topless scene 1358973995
465548 138446 3086 Christmas 1358983979
465549 138446 3086 classic 1358983979
465550 138446 3086 funny 1358983979
465551 138446 3086 scary 1358984001
465552 138446 3489 Peter Pan 1358983822
465553 138446 3489 soundtrack 1358983822
465554 138446 3489 visually appealing 1358983822
465555 138446 7045 family friendly 1358983660
465556 138446 7045 Scary Movies To See on Halloween 1358983660
465557 138446 7164 Peter Pan 1358983855
465558 138446 7164 visually appealing 1358983855
465559 138446 55999 dragged 1358983772
465560 138446 55999 Jason Bateman 1358983778
465561 138446 55999 quirky 1358983778
465562 138446 55999 sad 1358983772
465563 138472 923 rise to power 1194037967

465564 rows × 4 columns

In [40]:
ratings = pd.read_csv('./movielens/ratings.csv', sep=',', parse_dates=['timestamp'])
ratings.head()
Out[40]:
userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580
In [41]:
# For current analysis, we will remove timestamp (we will come back to it!)

del ratings['timestamp']
del tags['timestamp']

Data Structures

Series

In [42]:
#Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)
Out[42]:
pandas.core.series.Series
In [43]:
print(row_0)
userId              18
movieId           4141
tag        Mark Waters
Name: 0, dtype: object
In [44]:
#same operation as above using loc instead of iloc
row_999 = tags.loc[0]
type(row_999)
Out[44]:
pandas.core.series.Series
In [45]:
print(row_999)
userId              18
movieId           4141
tag        Mark Waters
Name: 0, dtype: object
In [46]:
row_0.index
Out[46]:
Index(['userId', 'movieId', 'tag'], dtype='object')
In [47]:
row_0['userId']
Out[47]:
18
In [48]:
'rating' in row_0
Out[48]:
False
In [49]:
row_0.name
Out[49]:
0
In [50]:
row_0 = row_0.rename('first_row')
row_0.name
Out[50]:
'first_row'
In [51]:
row_0.head(5)
Out[51]:
userId              18
movieId           4141
tag        Mark Waters
Name: first_row, dtype: object

DataFrames

In [52]:
tags.head()
Out[52]:
userId movieId tag
0 18 4141 Mark Waters
1 65 208 dark hero
2 65 353 dark hero
3 65 521 noir thriller
4 65 592 dark hero
In [53]:
tags.index
Out[53]:
RangeIndex(start=0, stop=465564, step=1)
In [54]:
tags.columns
Out[54]:
Index(['userId', 'movieId', 'tag'], dtype='object')
In [55]:
# Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]
#print(type(tags.iloc[[0, 11, 2000]]))   ## type of the tags.iloc[[0, 11, 2000]] is DataFrame
Out[55]:
userId movieId tag
0 18 4141 Mark Waters
11 65 1783 noir thriller
2000 910 68554 conspiracy theory

Descriptive Statistics

Let's look how the ratings are distributed!

In [56]:
ratings['rating'].describe()
Out[56]:
count    2.000026e+07
mean     3.525529e+00
std      1.051989e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64
In [57]:
ratings.describe()
Out[57]:
userId movieId rating
count 2.000026e+07 2.000026e+07 2.000026e+07
mean 6.904587e+04 9.041567e+03 3.525529e+00
std 4.003863e+04 1.978948e+04 1.051989e+00
min 1.000000e+00 1.000000e+00 5.000000e-01
25% 3.439500e+04 9.020000e+02 3.000000e+00
50% 6.914100e+04 2.167000e+03 3.500000e+00
75% 1.036370e+05 4.770000e+03 4.000000e+00
max 1.384930e+05 1.312620e+05 5.000000e+00
In [58]:
ratings['rating'].mean()
Out[58]:
3.5255285642993797
In [59]:
ratings.mean()
Out[59]:
userId     69045.872583
movieId     9041.567330
rating         3.525529
dtype: float64
In [60]:
ratings['rating'].min()
Out[60]:
0.5
In [61]:
ratings['rating'].max()
Out[61]:
5.0
In [62]:
ratings['rating'].std()
Out[62]:
1.051988919275684
In [63]:
ratings['rating'].mode()
Out[63]:
0    4.0
dtype: float64
In [64]:
#Compute pairwise correlation of columns, excluding NA/null values
ratings.corr()
#a negative correlation score, here means those features in our data sets
#are inversely correlated, so if one go up the other goes down
Out[64]:
userId movieId rating
userId 1.000000 -0.000850 0.001175
movieId -0.000850 1.000000 0.002606
rating 0.001175 0.002606 1.000000
In [65]:
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any()   #Return whether any element is True over requested axis.
0           False
1           False
2           False
3           False
4           False
5           False
6           False
7           False
8           False
9           False
10          False
11          False
12          False
13          False
14          False
15          False
16          False
17          False
18          False
19          False
20          False
21          False
22          False
23          False
24          False
25          False
26          False
27          False
28          False
29          False
            ...  
20000233    False
20000234    False
20000235    False
20000236    False
20000237    False
20000238    False
20000239    False
20000240    False
20000241    False
20000242    False
20000243    False
20000244    False
20000245    False
20000246    False
20000247    False
20000248    False
20000249    False
20000250    False
20000251    False
20000252    False
20000253    False
20000254    False
20000255    False
20000256    False
20000257    False
20000258    False
20000259    False
20000260    False
20000261    False
20000262    False
Name: rating, Length: 20000263, dtype: bool
Out[65]:
False
In [66]:
filter_2 = ratings['rating'] > 0
filter_2.all()   
#Return whether all elements are True over series or dataframe axis.
#Returns True if all elements within a series or along a dataframe axis are non-zero, not-empty or not-False.
Out[66]:
True

Data Cleaning: Handling Missing Data

In [67]:
movies.shape
Out[67]:
(27278, 3)
In [68]:
#is any row NULL ?

movies.isnull().any()
Out[68]:
movieId    False
title      False
genres     False
dtype: bool

Thats nice ! No NULL values !

In [69]:
ratings.shape
Out[69]:
(20000263, 3)
In [70]:
#is any row NULL ?

ratings.isnull().any()
Out[70]:
userId     False
movieId    False
rating     False
dtype: bool

Thats nice ! No NULL values !

In [71]:
tags.shape
Out[71]:
(465564, 3)
In [72]:
#is any row NULL ?

tags.isnull().any()
Out[72]:
userId     False
movieId    False
tag         True
dtype: bool

We have some tags which are NULL.

In [73]:
tags = tags.dropna() #axis=0 for dropping rows and axis=1 for dropping columns with NaN
In [74]:
#Check again: is any row NULL ?

tags.isnull().any()
Out[74]:
userId     False
movieId    False
tag        False
dtype: bool
In [75]:
tags.shape
Out[75]:
(465548, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

In [76]:
%matplotlib inline

ratings.hist(column='rating', figsize=(15,10))
Out[76]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000002B6840FD2B0>]],
      dtype=object)
In [77]:
ratings.boxplot(column='rating', figsize=(15,20))
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b68254da20>

Slicing Out Columns

In [78]:
tags['tag'].head()
Out[78]:
0      Mark Waters
1        dark hero
2        dark hero
3    noir thriller
4        dark hero
Name: tag, dtype: object
In [79]:
movies[['title','genres']].head()
Out[79]:
title genres
0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 Jumanji (1995) Adventure|Children|Fantasy
2 Grumpier Old Men (1995) Comedy|Romance
3 Waiting to Exhale (1995) Comedy|Drama|Romance
4 Father of the Bride Part II (1995) Comedy
In [80]:
ratings[1000:1010]
Out[80]:
userId movieId rating
1000 11 527 4.5
1001 11 531 4.5
1002 11 541 4.5
1003 11 546 5.0
1004 11 551 5.0
1005 11 586 4.0
1006 11 587 4.5
1007 11 588 5.0
1008 11 589 4.5
1009 11 592 4.5
In [81]:
ratings[:10]
Out[81]:
userId movieId rating
0 1 2 3.5
1 1 29 3.5
2 1 32 3.5
3 1 47 3.5
4 1 50 3.5
5 1 112 3.5
6 1 151 4.0
7 1 223 4.0
8 1 253 4.0
9 1 260 4.0
In [82]:
ratings[-10:]
Out[82]:
userId movieId rating
20000253 138493 60816 4.5
20000254 138493 61160 4.0
20000255 138493 65682 4.5
20000256 138493 66762 4.5
20000257 138493 68319 4.5
20000258 138493 68954 4.5
20000259 138493 69526 4.5
20000260 138493 69644 3.0
20000261 138493 70286 5.0
20000262 138493 71619 2.5
In [83]:
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
Out[83]:
performance                       1
quirkiness                        1
Rodrigo De la Serna               1
screenwriter:Adam Mazer           1
stuart townsend                   1
Jonas Mekas                       1
lifestyle                         1
fuck the pain away                1
it's so bad it's actually good    1
good for sema                     1
Name: tag, dtype: int64
In [85]:
tag_counts[:10].plot(kind='bar', figsize=(15,10))
Out[85]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b684374240>

Filters for Selecting Rows

In [86]:
is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]
Out[86]:
userId movieId rating
68 1 2021 4.0
69 1 2100 4.0
70 1 2118 4.0
71 1 2138 4.0
72 1 2140 4.0
73 1 2143 4.0
74 1 2173 4.0
75 1 2174 4.0
76 1 2193 4.0
79 1 2288 4.0
80 1 2291 4.0
81 1 2542 4.0
82 1 2628 4.0
90 1 2762 4.0
92 1 2872 4.0
94 1 2944 4.0
96 1 2959 4.0
97 1 2968 4.0
101 1 3081 4.0
102 1 3153 4.0
In [87]:
is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]
Out[87]:
movieId title genres
310 313 Swan Princess, The (1994) Animation|Children
360 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
388 392 Secret Adventures of Tom Thumb, The (1993) Adventure|Animation
547 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
553 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
582 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
588 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
589 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
590 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
604 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi
In [88]:
movies[is_animation].head(15)
Out[88]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
12 13 Balto (1995) Adventure|Animation|Children
47 48 Pocahontas (1995) Animation|Children|Drama|Musical|Romance
236 239 Goofy Movie, A (1995) Animation|Children|Comedy|Romance
241 244 Gumby: The Movie (1995) Animation|Children
310 313 Swan Princess, The (1994) Animation|Children
360 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX
388 392 Secret Adventures of Tom Thumb, The (1993) Adventure|Animation
547 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical
553 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy
582 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical
588 594 Snow White and the Seven Dwarfs (1937) Animation|Children|Drama|Fantasy|Musical
589 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX
590 596 Pinocchio (1940) Animation|Children|Fantasy|Musical
604 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

Group By and Aggregate

In [89]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
Out[89]:
movieId
rating
0.5 239125
1.0 680732
1.5 279252
2.0 1430997
2.5 883398
3.0 4291193
3.5 2200156
4.0 5561926
4.5 1534824
5.0 2898660
In [90]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
Out[90]:
rating
movieId
1 3.921240
2 3.211977
3 3.151040
4 2.861393
5 3.064592
In [91]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
Out[91]:
rating
movieId
1 49695
2 22243
3 12735
4 2756
5 12161
In [92]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
Out[92]:
rating
movieId
131254 1
131256 1
131258 1
131260 1
131262 1

Merge Dataframes

In [93]:
tags.head()
Out[93]:
userId movieId tag
0 18 4141 Mark Waters
1 65 208 dark hero
2 65 353 dark hero
3 65 521 noir thriller
4 65 592 dark hero
In [94]:
movies.head()
Out[94]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
In [95]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()
Out[95]:
movieId title genres userId tag
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1644 Watched
1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 computer animation
2 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Disney animated feature
3 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Pixar animation
4 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1741 Téa Leoni does not star in this movie


Combine aggreagation, merging, and filters to get useful analytics

In [96]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
Out[96]:
movieId rating
0 1 3.921240
1 2 3.211977
2 3 3.151040
3 4 2.861393
4 5 3.064592
In [97]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
Out[97]:
movieId title genres rating
26739 131254 Kein Bund für's Leben (2007) Comedy 4.0
26740 131256 Feuer, Eis & Dosenbier (2002) Comedy 4.0
26741 131258 The Pirates (2014) Adventure 2.5
26742 131260 Rentun Ruusu (2001) (no genres listed) 3.0
26743 131262 Innocence (2014) Adventure|Fantasy|Horror 4.0
In [98]:
is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]
Out[98]:
movieId title genres rating
26737 131250 No More School (2000) Comedy 4.0
26738 131252 Forklift Driver Klaus: The First Day on the Jo... Comedy|Horror 4.0
26739 131254 Kein Bund für's Leben (2007) Comedy 4.0
26740 131256 Feuer, Eis & Dosenbier (2002) Comedy 4.0
26743 131262 Innocence (2014) Adventure|Fantasy|Horror 4.0
In [99]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]
Out[99]:
movieId title genres rating
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.921240
2 3 Grumpier Old Men (1995) Comedy|Romance 3.151040
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.861393
4 5 Father of the Bride Part II (1995) Comedy 3.064592
6 7 Sabrina (1995) Comedy|Romance 3.366484
In [100]:
box_office[is_comedy & is_highly_rated][-5:]
Out[100]:
movieId title genres rating
26736 131248 Brother Bear 2 (2006) Adventure|Animation|Children|Comedy|Fantasy 4.0
26737 131250 No More School (2000) Comedy 4.0
26738 131252 Forklift Driver Klaus: The First Day on the Jo... Comedy|Horror 4.0
26739 131254 Kein Bund für's Leben (2007) Comedy 4.0
26740 131256 Feuer, Eis & Dosenbier (2002) Comedy 4.0

Vectorized String Operations

In [101]:
movies.head()
Out[101]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy


Split 'genres' into multiple columns

In [102]:
movie_genres = movies['genres'].str.split('|', expand=True)
In [103]:
movie_genres[:10]
Out[103]:
0 1 2 3 4 5 6 7 8 9
0 Adventure Animation Children Comedy Fantasy None None None None None
1 Adventure Children Fantasy None None None None None None None
2 Comedy Romance None None None None None None None None
3 Comedy Drama Romance None None None None None None None
4 Comedy None None None None None None None None None
5 Action Crime Thriller None None None None None None None
6 Comedy Romance None None None None None None None None
7 Adventure Children None None None None None None None None
8 Action None None None None None None None None None
9 Action Adventure Thriller None None None None None None None


Add a new column for comedy genre flag

In [104]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
In [105]:
movie_genres[:10]
Out[105]:
0 1 2 3 4 5 6 7 8 9 isComedy
0 Adventure Animation Children Comedy Fantasy None None None None None True
1 Adventure Children Fantasy None None None None None None None False
2 Comedy Romance None None None None None None None None True
3 Comedy Drama Romance None None None None None None None True
4 Comedy None None None None None None None None None True
5 Action Crime Thriller None None None None None None None False
6 Comedy Romance None None None None None None None None True
7 Adventure Children None None None None None None None None False
8 Action None None None None None None None None None False
9 Action Adventure Thriller None None None None None None None False


Extract year from title e.g. (1995)

In [106]:
movies['year'] = movies['title'].str.extract('.*\((.*)\).*', expand=True)
In [107]:
movies.tail()
Out[107]:
movieId title genres year
27273 131254 Kein Bund für's Leben (2007) Comedy 2007
27274 131256 Feuer, Eis & Dosenbier (2002) Comedy 2002
27275 131258 The Pirates (2014) Adventure 2014
27276 131260 Rentun Ruusu (2001) (no genres listed) 2001
27277 131262 Innocence (2014) Adventure|Fantasy|Horror 2014


More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

Parsing Timestamps

Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!

In [108]:
tags = pd.read_csv('./movielens/tags.csv', sep=',')
In [109]:
tags.dtypes
Out[109]:
userId        int64
movieId       int64
tag          object
timestamp     int64
dtype: object

Unix time / POSIX time / epoch time records time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970

In [110]:
tags.head(5)
Out[110]:
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078
In [111]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

Data Type datetime64[ns] maps to either M8[ns] depending on the hardware

In [112]:
tags['parsed_time'].dtype
Out[112]:
dtype('<M8[ns]')
In [113]:
tags.head(2)
Out[113]:
userId movieId tag timestamp parsed_time
0 18 4141 Mark Waters 1240597180 2009-04-24 18:19:40
1 65 208 dark hero 1368150078 2013-05-10 01:41:18

Selecting rows based on timestamps

In [114]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

tags.shape, selected_rows.shape
Out[114]:
((465564, 5), (12130, 5))

Sorting the table using the timestamps

In [115]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
Out[115]:
userId movieId tag timestamp parsed_time
333932 100371 2788 monty python 1135429210 2005-12-24 13:00:10
333927 100371 1732 coen brothers 1135429236 2005-12-24 13:00:36
333924 100371 1206 stanley kubrick 1135429248 2005-12-24 13:00:48
333923 100371 1193 jack nicholson 1135429371 2005-12-24 13:02:51
333939 100371 5004 peter sellers 1135429399 2005-12-24 13:03:19
333922 100371 47 morgan freeman 1135429412 2005-12-24 13:03:32
333921 100371 47 brad pitt 1135429412 2005-12-24 13:03:32
333936 100371 4011 brad pitt 1135429431 2005-12-24 13:03:51
333937 100371 4011 guy ritchie 1135429431 2005-12-24 13:03:51
333920 100371 32 bruce willis 1135429442 2005-12-24 13:04:02

Average Movie Ratings over Time

In [116]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
Out[116]:
movieId rating
26739 131254 4.0
26740 131256 4.0
26741 131258 2.5
26742 131260 3.0
26743 131262 4.0
In [117]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
Out[117]:
movieId rating
movieId 1.000000 -0.090369
rating -0.090369 1.000000
In [118]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
Out[118]:
year rating
0 1891 3.000000
1 1893 3.375000
2 1894 3.071429
3 1895 3.125000
4 1896 3.183036
5 1898 3.850000
6 1899 3.625000
7 1900 3.166667
8 1901 5.000000
9 1902 3.738189
In [119]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
Out[119]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b6896fc550>

Do some years look better for the boxoffice movies than others?

Does any data point seem like an outlier in some sense?

In [ ]: