TMDB Movie Dataset Analysis
In this report, I use Python to analyse trends in the movie market.
Packages: pandas, NumPy, Matplotlib, seaborn, json
IDE: PyCharm
Major questions:
- How do movie genres change over time?
- How do Universal Pictures and Paramount Pictures compare?
- How do movies based on novels compare with original movies?
1. Data import and cleaning
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import json
import numpy as np
moviesdf = pd.read_csv('movies.csv')
movdf = pd.read_csv('credits.csv')
1) Fill missing values
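# Find the rows with a missing release date and inspect them
null = moviesdf["release_date"].isnull()
moviesdf.loc[null, :]
# Fill the missing release dates with a fixed placeholder date
moviesdf['release_date'] = moviesdf['release_date'].fillna('2017-11-01')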
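Before filling, it can help to count the missing values in each column. The following is a minimal sketch on a toy frame; the rows in `toy` are invented for illustration and only stand in for `moviesdf`.

```python
import pandas as pd

# Toy stand-in for moviesdf; the rows are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_date": ["2009-12-10", None, "2015-10-26"],
})

# Count missing values per column before deciding how to fill them.
missing_counts = toy.isnull().sum()

# Fill the missing release date with the same placeholder used in the report.
toy["release_date"] = toy["release_date"].fillna("2017-11-01")
remaining = int(toy["release_date"].isnull().sum())

print(missing_counts)
print(remaining)
```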
2) Convert data type
Date
# Assign the whole column so it takes the datetime dtype; unparseable dates become NaT
moviesdf['release_date'] = pd.to_datetime(moviesdf['release_date'],
                                          format='%Y-%m-%d',
                                          errors='coerce')
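With `errors='coerce'`, any value that does not match the format becomes `NaT` instead of raising an error, so it is worth checking how many dates were lost. A small sketch, using made-up date strings:

```python
import pandas as pd

# Made-up date strings; the second one is deliberately invalid.
dates = pd.Series(["2009-12-10", "not a date", "2015-10-26"])

# Invalid entries are turned into NaT rather than raising an exception.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Number of values that could not be parsed.
n_bad = int(parsed.isna().sum())
print(parsed)
print(n_bad)
```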
JSON into strings
#genres
moviesdf['genres'] = moviesdf['genres'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['genres']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'genres'] = str(l)
#keywords
moviesdf['keywords'] = moviesdf['keywords'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['keywords']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'keywords'] = str(l)
#production_companies
moviesdf['production_companies'] = moviesdf['production_companies'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['production_companies']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'production_companies'] = str(l)
#production_countries
moviesdf['production_countries'] = moviesdf['production_countries'].apply(json.loads)
for index, i in zip(moviesdf.index, moviesdf['production_countries']):
    l = []
    for j in range(len(i)):
        l.append((i[j]['name']))
    moviesdf.loc[index, 'production_countries'] = str(l)
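The four loops above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do it, not the code used in this report: `extract_names` is a hypothetical helper, the toy rows are invented, and it keeps real Python lists instead of the `str(l)` strings produced above.

```python
import json
import pandas as pd

def extract_names(json_text):
    """Return the 'name' field of every object in a JSON array string."""
    return [item["name"] for item in json.loads(json_text)]

# Toy stand-in for moviesdf; the rows are invented for illustration.
toy = pd.DataFrame({
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]', "[]"],
    "keywords": ['[{"id": 1, "name": "based on novel"}]', "[]"],
})

# Apply the same conversion to every JSON column in one loop.
for column in ["genres", "keywords"]:
    toy[column] = toy[column].apply(extract_names)

print(toy)
```

Keeping lists avoids the later step of stripping brackets and quotes and splitting on commas, at the cost of columns that no longer hold plain strings.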
2. Data processing and visualising
Summarise genres in list
moviesdf['genres']=moviesdf['genres'].str.strip('[]').str.replace(' ','').str.replace("'",'')
moviesdf['genres']=moviesdf['genres'].str.split(',')
list1 = []
for i in moviesdf['genres']:
    list1.extend(i)
genres = pd.Series(list1).value_counts().sort_values(ascending=False)
genres[:10]
# Name the count column explicitly; its default name differs between pandas versions
genres = genres[:10].to_frame(name='total')
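If the genres are stored as lists, the same count can be obtained without an explicit loop. This is a sketch under that assumption, on invented rows; `explode` turns each list element into its own row and an empty list into a missing value, which is dropped before counting.

```python
import pandas as pd

# Toy stand-in for moviesdf['genres'] holding lists; the rows are invented.
toy = pd.DataFrame({"genres": [["Action", "Drama"], ["Drama"], ["Comedy", "Drama"], []]})

# One row per (movie, genre) pair, then count how often each genre occurs.
genre_counts = toy["genres"].explode().dropna().value_counts()

# Keep the ten most common genres in a one-column frame named 'total'.
top = genre_counts.head(10).to_frame(name="total")
print(top)
```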
1) Bar plot: number of movies per genre
f, ax = plt.subplots(figsize=(12, 10))
g = sns.barplot(y=genres.index, x="total", data=genres, palette="Blues_d", ax=ax)
plt.show()
2) Q1: How do movie genres change over time?
years=[]
for x in moviesdf["release_date"]:
    year = x.year
    years.append(year)
Years=pd.Series(years)
moviesdf['year']=Years
moviesdf['year'].head()
min_year = moviesdf['year'].min()
max_year = moviesdf['year'].max()
liste_genres = set()
for s in moviesdf['genres']:
    liste_genres = set().union(s, liste_genres)
liste_genres = list(liste_genres)
liste_genres
genre_df = pd.DataFrame( index = liste_genres,columns= range(min_year, max_year + 1))
genre_df = genre_df.fillna(value = 0)
year = np.array(moviesdf['year'])
z = 0
for i in moviesdf['genres']:
    split_genre = list(i)
    for j in split_genre:
        genre_df.loc[j, year[z]] = genre_df.loc[j, year[z]] + 1
    z += 1
genre_df
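The genre-by-year table can also be built with `pd.crosstab`. This is an alternative sketch on invented rows, assuming a `year` column and list-valued genres. Unlike the loop above, it only creates columns for years that actually occur in the data.

```python
import pandas as pd

# Toy stand-in for moviesdf; the rows are invented for illustration.
toy = pd.DataFrame({
    "year": [1999, 1999, 2000],
    "genres": [["Drama", "Comedy"], ["Drama"], ["Action"]],
})

# One row per (movie, genre) pair; reset the index so it is unique again.
long = toy.explode("genres").reset_index(drop=True)

# Rows are genres, columns are years, cells are movie counts.
genre_by_year = pd.crosstab(long["genres"], long["year"])
print(genre_by_year)
```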
plt.figure(figsize=(15,8))
plt.plot(genre_df.T)
plt.title('Number of movies per genre by year')
plt.xticks(range(1910,2020,5))
plt.legend(genre_df.index)
plt.show()
* The number of movies in every genre increases over time, with a boom from 1975 to 1995.
* After 1995, dramas, comedies and thrillers increased dramatically.
3) Q2: How do Universal Pictures and Paramount Pictures compare?
plt.figure(figsize = (7,4))
two = ['Universal Pictures', 'Paramount Pictures']
num = [77015832,70100000]
plt.bar(np.arange(len(two)), num, color = 'c', width = 0.1, align = 'center')
plt.ylabel('revenue')
plt.xticks(np.arange(len(two)), two)
plt.title('Universal Pictures vs Paramount Pictures')
plt.grid(True)
plt.show()
* As of 2017, Universal Pictures has slightly higher revenue than Paramount Pictures.
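The two revenue figures in the bar chart are hardcoded, and the report does not show how they were derived. One possible way to compute such figures from the data is sketched below. It assumes a numeric `revenue` column, uses invented rows, and `revenue_stats` is a hypothetical helper; whether the sum or the mean matches the chart is not known.

```python
import pandas as pd

# Toy stand-in for moviesdf; companies and revenues are invented.
toy = pd.DataFrame({
    "production_companies": [
        ["Universal Pictures"],
        ["Paramount Pictures", "Universal Pictures"],
        ["Paramount Pictures"],
        ["Other Studio"],
    ],
    "revenue": [100, 50, 30, 999],
})

def revenue_stats(df, company):
    """Total and mean revenue of the movies that list the given company."""
    mask = df["production_companies"].apply(lambda names: company in names)
    return df.loc[mask, "revenue"].agg(["sum", "mean"])

universal = revenue_stats(toy, "Universal Pictures")
paramount = revenue_stats(toy, "Paramount Pictures")
print(universal)
print(paramount)
```

The `company in names` test also works when the column holds the `str(l)` strings created earlier, but it then matches substrings, so a company whose name contains another company's name would be counted as well.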
4) Q3: How do movies based on novels compare with original movies?
keylist = ['based on novel','original']
nums = [197,4606]
plt.figure(figsize=(7, 4))
plt.bar(np.arange(len(keylist)), nums, color = 'c' , width = 0.1, align = 'center')
plt.ylabel('Amount',fontsize = 12)
plt.xticks(np.arange(len(keylist)), keylist,fontsize = 12)
plt.title('Original VS Based on novel',fontsize = 14)
plt.grid(True)
plt.show()
* As of 2017, most movies are original rather than based on a novel.
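The counts 197 and 4606 are also hardcoded. They sum to 4,803, which suggests that "original" here means every movie without the keyword `based on novel`, though the report does not say so. A sketch of that counting, assuming list-valued keywords and using invented rows:

```python
import pandas as pd

# Toy stand-in for moviesdf['keywords'] holding lists; the rows are invented.
toy = pd.DataFrame({
    "keywords": [["based on novel", "war"], ["space"], [], ["based on novel"]],
})

# True for movies tagged with the keyword 'based on novel'.
is_novel = toy["keywords"].apply(lambda kws: "based on novel" in kws)

# Every movie without the keyword is treated as original (an assumption).
n_novel = int(is_novel.sum())
n_original = int((~is_novel).sum())
print(n_novel, n_original)
```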

