We all watch movies for entertainment, some of us never rate it, while some viewers always rate every movie they watch. This type of viewer helps in rating movies for people who go through the movie reviews before watching any movie to make sure they are about to watch a good movie. So, if you are new to data science and want to learn how to analyze movie ratings using the Python programming language, this article is for you. In this article, I will walk you through the task of Movie Rating Analysis using Python.
Movie Rating Analysis using Python
Analyzing the rating given by viewers of a movie helps many people decide whether or not to watch that movie. So, for the Movie Rating Analysis task, you first need to have a dataset that contains data about the ratings given by each viewer. For this task, I have collected a dataset from Kaggle that contains two files:
- one file contains the data about the movie Id, title and the genre of the movie
- and the other file contains the user id, movie id, ratings given by the user and the timestamp of the ratings
You can download both these datasets from here.
Now let's get started with the task of movie rating analysis by importing the necessary Python libraries and the datasets:
import numpy as np import pandas as pd movies = pd.read_csv("movies.dat", delimiter='::') print(movies.head())
0 10 La sortie des usines Lumière (1895) Documentary|Short 1 12 The Arrival of a Train (1896) Documentary|Short 2 25 The Oxford and Cambridge University Boat Race ... NaN 3 91 Le manoir du diable (1896) Short|Horror 4 131 Une nuit terrible (1896) Short|Comedy|Horror
In the above code, I have only imported the movies dataset that does not have any column names, so let's define the column names:
movies.columns = ["ID", "Title", "Genre"] print(movies.head())
ID Title Genre 0 10 La sortie des usines Lumière (1895) Documentary|Short 1 12 The Arrival of a Train (1896) Documentary|Short 2 25 The Oxford and Cambridge University Boat Race ... NaN 3 91 Le manoir du diable (1896) Short|Horror 4 131 Une nuit terrible (1896) Short|Comedy|Horror
Now let's import the ratings dataset:
ratings = pd.read_csv("ratings.dat", delimiter='::') print(ratings.head())
1 0114508 8 1381006850 0 2 499549 9 1376753198 1 2 1305591 8 1376742507 2 2 1428538 1 1371307089 3 3 75314 1 1595468524 4 3 102926 9 1590148016
The rating dataset also doesn't have any column names, so let's define the column names of this data also:
ratings.columns = ["User", "ID", "Ratings", "Timestamp"] print(ratings.head())
User ID Ratings Timestamp 0 2 499549 9 1376753198 1 2 1305591 8 1376742507 2 2 1428538 1 1371307089 3 3 75314 1 1595468524 4 3 102926 9 1590148016
Now I am going to merge these two datasets into one, these two datasets have a common column as ID, which contains movie ID, so we can use this column as the common column to merge the two datasets:
data = pd.merge(movies, ratings, on=["ID", "ID"]) print(data.head())
ID Title ... Ratings Timestamp 0 10 La sortie des usines Lumière (1895) ... 10 1412878553 1 12 The Arrival of a Train (1896) ... 10 1439248579 2 25 The Oxford and Cambridge University Boat Race ... ... 8 1488189899 3 91 Le manoir du diable (1896) ... 6 1385233195 4 91 Le manoir du diable (1896) ... 5 1532347349 [5 rows x 6 columns]
As it is a beginner level task, so I will first have a look at the distribution of the ratings of all the movies given by the viewers:
ratings = data["Ratings"].value_counts() numbers = ratings.index quantity = ratings.values import plotly.express as px fig = px.pie(data, values=quantity, names=numbers) fig.show()
So, according to the pie chart above, most movies are rated 8 by users. From the above figure, it can be said that most of the movies are rated positively.
As 10 is the highest rating a viewer can give, let's take a look at the top 10 movies that got 10 ratings by viewers:
data2 = data.query("Ratings == 10") print(data2["Title"].value_counts().head(10))
Joker (2019) 1479 Interstellar (2014) 1382 1917 (2019) 819 Avengers: Endgame (2019) 808 The Shawshank Redemption (1994) 699 Gravity (2013) 653 The Wolf of Wall Street (2013) 581 Hacksaw Ridge (2016) 570 Avengers: Infinity War (2018) 534 La La Land (2016) 510 Name: Title, dtype: int64
So, according to this dataset, Joker (2019) got the highest number of 10 ratings from viewers. This is how you can analyze movie ratings using Python as a data science beginner.
Summary
So this is how you can do movie rating analysis by using the Python programming language as a data science beginner. Analyzing the ratings given by viewers of a movie helps many people decide whether or not to watch that movie. I hope you liked this article on Movie rating analysis using Python. Feel free to ask your valuable questions in the comments section below.
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.