Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data-manipulation functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:
Additional Recommended Resources:
Let's get started with our first pandas notebook!
Import Libraries
import pandas as pd
Introduction to pandas Data Structures
pandas Series
A pandas Series is a one-dimensional labeled array.
ser = pd.Series([100, 'foo', 300, 'bar', 500], ['tom', 'bob', 'nancy', 'dan', 'eric'])
ser
ser.index
ser.loc[['nancy','bob']]
# can even use
# ser[['nancy', 'bob']]
There are three main options for selection and indexing in pandas.
The iloc indexer for a pandas DataFrame is used for integer-location based indexing / selection by position.
The iloc indexer syntax is data.iloc[&lt;row selection&gt;, &lt;column selection&gt;].
data.iloc[0] # first row of data frame (Aleshia Tomkiewicz) - Note a Series data type output.
data.iloc[1] # second row of data frame (Evan Zigomalas)
data.iloc[-1] # last row of data frame (Mi Richan)
data.iloc[:,0] # first column of data frame (first_name)
data.iloc[:,1] # second column of data frame (last_name)
data.iloc[:,-1] # last column of data frame (id)
Multiple columns and rows can be selected together using the .iloc indexer
data.iloc[0:5] # first five rows of dataframe
data.iloc[:, 0:2] # first two columns of data frame with all rows
data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th rows + 1st, 6th, 7th columns
data.iloc[0:5, 5:8] # first 5 rows and 5th, 6th, 7th columns of data frame (county -> phone1)
Note that .iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected. To counter this, pass a single-valued list if you require DataFrame output.
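The Series-vs-DataFrame distinction above can be seen directly. This is a minimal sketch using a small made-up frame (standing in for the `data` DataFrame used in the examples), contrasting a scalar position with a single-valued list:

```python
import pandas as pd

# Hypothetical stand-in for the `data` DataFrame used above
df = pd.DataFrame({'first_name': ['Aleshia', 'Evan', 'Mi'],
                   'last_name': ['Tomkiewicz', 'Zigomalas', 'Richan']})

single = df.iloc[0]     # one row selected by scalar -> Series
framed = df.iloc[[0]]   # same row via single-valued list -> one-row DataFrame

print(type(single).__name__)  # Series
print(type(framed).__name__)  # DataFrame
```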
The pandas loc indexer can be used with DataFrames for two main use cases: selecting rows by label/index, and selecting rows with a boolean/conditional lookup.
Select rows with first name Antonio, and all columns between 'city' and 'email'
data.loc[data['first_name'] == 'Antonio', 'city':'email']
Select rows where the email column ends with 'hotmail.com', include all columns
data.loc[data['email'].str.endswith("hotmail.com")]
Select rows where first_name is one of several values, all columns
data.loc[data['first_name'].isin(['France', 'Tyisha', 'Eric'])]
Select rows with first name Antonio AND a gmail email address
data.loc[data['email'].str.endswith("gmail.com") & (data['first_name'] == 'Antonio')]
Select rows with the id column between 100 and 200, and return just the 'postal' and 'web' columns
data.loc[(data['id'] > 100) & (data['id'] <= 200), ['postal', 'web']]
A lambda function that yields True/False values can also be used.
Select rows where the company name has 4 words in it.
data.loc[data['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
Selections can be achieved outside of the main .loc for clarity:
Form a separate variable with your selections:
idx = data['company_name'].apply(lambda x: len(x.split(' ')) == 4)
Select only the True values in 'idx' and only the 3 columns specified:
data.loc[idx, ['email', 'first_name', 'company']]
ser[[4, 3, 1]]
ser.iloc[2]
'bob' in ser
# checking whether an index is in a Series
'dan' in ser
'amit' in ser
ser * 2
ser
ser = ser * 2
ser[['nancy', 'eric']] ** 2
# strings cannot be squared, so we explicitly select only the labels whose values are numeric
ser
pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.
Create DataFrame from dictionary of Python Series
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)
print(df)
df
df.index
df.columns
pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])
Create DataFrame from list of Python dictionaries
data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
pd.DataFrame(data)
# here the default index values will be 0 and 1, to add custom values for indexes use
# pd.DataFrame(data, index=['zero', 'one'])
pd.DataFrame(data, index=['orange', 'red'])
pd.DataFrame(data, columns=['joe', 'dora','alice'])
Basic DataFrame operations
df
df['one']
df['three'] = df['one'] * df['two']
df
df['flag'] = df['one'] > 250
df
three = df.pop('three')
three, type(three)
df
del df['two']
df
df.insert(2, 'copy_of_one', df['one']) # df.insert(column_position, column_name, data_source_for_column)
df
df['one_upper_half'] = df['one'][:2]
df
Case Study: Movie Data Analysis
Please note that you will need to download the dataset. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.
Here are the links to the data source and location:
Once the download completes, please make sure the data files are in a directory called movielens in your Week-3-pandas folder.
Let us look at the files in this dataset using the UNIX command ls.
# Note: Adjust the name of the folder to match your local directory
!ls ./movielens
# Note: these shell commands won't work natively on Windows; ls, cat, head and wc are UNIX utilities invoked through the shell, which Windows lacks by default
!cat ./movielens/movies.csv | wc -l
!head -5 ./movielens/ratings.csv
Use Pandas to Read the Dataset
Using the read_csv function in pandas, we will ingest these three files.
movies = pd.read_csv('./movielens/movies.csv', sep=',')
print(type(movies))
# head(n) displays the first n rows of the DataFrame
movies.head(15)
# Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.head()
tags
# timestamps here are epoch seconds; parse_dates would misinterpret the raw integers,
# so we read them as-is (pd.to_datetime with unit='s' converts them properly)
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
ratings.head()
# For current analysis, we will remove timestamp (we will come back to it!)
del ratings['timestamp']
del tags['timestamp']
# Extract row 0: notice that it is in fact a Series
row_0 = tags.iloc[0]
type(row_0)
print(row_0)
# same operation as above, using label-based loc instead of iloc
# (here the row label 0 happens to match position 0)
row_0_loc = tags.loc[0]
type(row_0_loc)
print(row_0_loc)
row_0.index
row_0['userId']
'rating' in row_0
row_0.name
row_0 = row_0.rename('first_row')
row_0.name
row_0.head(5)
tags.head()
tags.index
tags.columns
# Extract row 0, 11, 2000 from DataFrame
tags.iloc[ [0,11,2000] ]
#print(type(tags.iloc[[0, 11, 2000]])) ## type of the tags.iloc[[0, 11, 2000]] is DataFrame
Let's look at how the ratings are distributed!
ratings['rating'].describe()
ratings.describe()
ratings['rating'].mean()
ratings.mean()
ratings['rating'].min()
ratings['rating'].max()
ratings['rating'].std()
ratings['rating'].mode()
#Compute pairwise correlation of columns, excluding NA/null values
ratings.corr()
# a negative correlation score here means those features in our dataset
# are inversely correlated: when one goes up, the other goes down
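A small made-up example (hypothetical values, not from the MovieLens data) makes the sign of the correlation concrete — two columns that move in exactly opposite directions have a Pearson correlation of -1:

```python
import pandas as pd

# Toy data: 'down' decreases by 2 exactly as 'up' increases by 1,
# so the two columns are perfectly inversely correlated
df = pd.DataFrame({'up': [1, 2, 3, 4], 'down': [8, 6, 4, 2]})

corr = df.corr()
print(corr.loc['up', 'down'])  # approximately -1.0
```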
filter_1 = ratings['rating'] > 5
print(filter_1)
filter_1.any() #Return whether any element is True over requested axis.
filter_2 = ratings['rating'] > 0
filter_2.all()
#Return whether all elements are True over series or dataframe axis.
#Returns True if all elements within a series or along a dataframe axis are non-zero, not-empty or not-False.
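The any/all behaviour described above can be sketched on a tiny Series (hypothetical rating values), mirroring the two filters applied to the real ratings:

```python
import pandas as pd

s = pd.Series([3.5, 4.0, 2.0])  # toy ratings, all between 0 and 5

print((s > 5).any())  # False: no element exceeds 5
print((s > 0).all())  # True: every element is positive
```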
movies.shape
#is any row NULL ?
movies.isnull().any()
That's nice! No NULL values!
ratings.shape
#is any row NULL ?
ratings.isnull().any()
That's nice! No NULL values!
tags.shape
#is any row NULL ?
tags.isnull().any()
We have some tags which are NULL.
tags = tags.dropna() #axis=0 for dropping rows and axis=1 for dropping columns with NaN
#Check again: is any row NULL ?
tags.isnull().any()
tags.shape
That's nice! No NULL values! Notice that the number of rows has decreased.
%matplotlib inline
ratings.hist(column='rating', figsize=(15,10))
ratings.boxplot(column='rating', figsize=(15,20))
tags['tag'].head()
movies[['title','genres']].head()
ratings[1000:1010]
ratings[:10]
ratings[-10:]
tag_counts = tags['tag'].value_counts()
tag_counts[-10:]
tag_counts[:10].plot(kind='bar', figsize=(15,10))
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][30:50]
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]
movies[is_animation].head(15)
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.tail()
tags.head()
movies.head()
t = movies.merge(tags, on='movieId', how='inner')
t.head()
More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
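As a quick sketch of how the `how` argument changes the result (toy frames with hypothetical ids, not the MovieLens data): an inner join keeps only keys present in both frames, while an outer join keeps every key and fills the gaps with NaN.

```python
import pandas as pd

left = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'movieId': [2, 3, 4], 'tag': ['funny', 'dark', 'long']})

inner = left.merge(right, on='movieId', how='inner')  # ids 2 and 3 only
outer = left.merge(right, on='movieId', how='outer')  # ids 1-4, NaN where missing

print(len(inner))  # 2
print(len(outer))  # 4
```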
Combine aggregation, merging, and filters to get useful analytics
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
avg_ratings.head()
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()
is_highly_rated = box_office['rating'] >= 4.0
box_office[is_highly_rated][-5:]
is_comedy = box_office['genres'].str.contains('Comedy')
box_office[is_comedy][:5]
box_office[is_comedy & is_highly_rated][-5:]
movies.head()
Split 'genres' into multiple columns
movie_genres = movies['genres'].str.split('|', expand=True)
movie_genres[:10]
Add a new column for comedy genre flag
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')
movie_genres[:10]
Extract year from title e.g. (1995)
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False) # raw string avoids escape warnings; expand=False returns a Series
movies.tail()
More here: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
Timestamps are common in sensor data or other time series datasets. Let us revisit the tags.csv dataset and read the timestamps!
tags = pd.read_csv('./movielens/tags.csv', sep=',')
tags.dtypes
Unix time / POSIX time / epoch time records
time in seconds
since midnight Coordinated Universal Time (UTC) of January 1, 1970
tags.head(5)
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')
Data Type datetime64[ns] maps to either &lt;M8[ns] or &gt;M8[ns], depending on the hardware's byte order (little- vs big-endian).
tags['parsed_time'].dtype
tags.head(2)
Selecting rows based on timestamps
greater_than_t = tags['parsed_time'] > '2015-02-01'
selected_rows = tags[greater_than_t]
tags.shape, selected_rows.shape
Sorting the table using the timestamps
tags.sort_values(by='parsed_time', ascending=True)[:10]
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()
joined.corr()
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[:10]
yearly_average[-20:].plot(x='year', y='rating', figsize=(15,10), grid=True)
Do some years look better for box-office movies than others?
Does any data point seem like an outlier in some sense?