Machine Learning with Python Project: Building a Recommender System



Project Summary

This project strengthened my understanding of the Python functions used to read, gather, manipulate, and analyze data, and to generate insights that support informed decision-making. These are the four main things I did in this project:

  1. Conducted data cleaning and preparation.
  2. Conducted exploratory data analysis, including data aggregation, descriptive statistics, and visualization.
  3. Created a recommendation system using Python.
  4. Created the insights for the project.

Project Files

For a more comprehensive analysis and visualization, please open the project files.

Project Background & Dataset

Project Background

We will create a recommendation system using Python. The data is a movie database from IMDb, complete with metadata. In this case, we will combine the average rating and the number of votes into a new metric, then sort the films by this metric from highest to lowest.

Dataset: Dataset 1

Contains details of the movies, including title, release year, runtime, genres, etc.

Dataset: Dataset 2

Contains movie rating details.

1. Cleaning the movie_df table

First, we check the data type and other information for each column in the movie table (movie_df).

The table has 9,025 rows, but the primaryTitle, originalTitle, and genres columns contain some NULL values, so these columns need cleaning.
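
The check described above can be sketched as follows. This is a minimal illustration, assuming movie_df is a pandas DataFrame; the three-row frame is a made-up stand-in for the real table, and tconst is assumed to be the IMDb identifier column.

```python
import pandas as pd

# Hypothetical stand-in for the real movie table (the real one has 9,025 rows).
movie_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Film A", None, "Film C"],
    "originalTitle": ["Film A", None, "Film C"],
    "genres": ["Drama,Romance", "Comedy", None],
})

# info() prints each column's dtype and non-null count, so columns with
# missing values stand out because their count is below the row total.
movie_df.info()

# The same non-null counts as a Series, handy for programmatic checks.
non_null_counts = movie_df.count()
```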

1.1 Checking NULL Values in the movie_df table

Next, we check whether any column in the movie table (movie_df) contains NULL values.

The output shows that the primaryTitle and originalTitle columns contain NULL values.
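
One way to count NULL values per column is sketched below, again on a made-up stand-in for movie_df (the data and the tconst column are assumptions for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the movie table.
movie_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Film A", None, "Film C"],
    "originalTitle": ["Film A", None, "Film C"],
    "genres": ["Drama,Romance", "Comedy", None],
})

# isnull() marks missing cells as True; sum() counts them per column.
null_counts = movie_df.isnull().sum()
print(null_counts)
```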


1.2 Analysis of NULL Values in the movie_df table

Next, we inspect the rows where primaryTitle or originalTitle is NULL, to see whether one or both of the columns are missing.

All of these rows have no title at all, so we can discard them.
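
A minimal sketch of this inspection, assuming a boolean mask is used to select the affected rows (the sample data is invented):

```python
import pandas as pd

# Hypothetical stand-in for the movie table; the second row has no title.
movie_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Film A", None, "Film C"],
    "originalTitle": ["Film A", None, "Film C"],
})

# Select rows where at least one of the two title columns is missing.
missing_title = movie_df.loc[
    movie_df["primaryTitle"].isnull() | movie_df["originalTitle"].isnull()
]
print(missing_title)

# True when every selected row is missing both titles.
both_missing = (
    missing_title["primaryTitle"].isnull() & missing_title["originalTitle"].isnull()
).all()
```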


1.3 Discarding Data with NULL Values in the movie_df table

Next, we remove the rows with NULL values and check how many rows remain.

This leaves 9,011 rows out of the original 9,025.
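
Dropping these rows could look like the sketch below. It assumes pandas' dropna with a subset of columns; the sample frame is invented, so the counts differ from the real 9,025 and 9,011.

```python
import pandas as pd

# Hypothetical stand-in for the movie table; the second row has no title.
movie_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Film A", None, "Film C"],
    "originalTitle": ["Film A", None, "Film C"],
})

rows_before = len(movie_df)
# Drop every row where either title column is missing.
movie_df = movie_df.dropna(subset=["primaryTitle", "originalTitle"])
rows_after = len(movie_df)
print(rows_before, rows_after)
```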


1.4 Analysis of NULL Values in the genres column of the movie_df table

Besides 'primaryTitle' and 'originalTitle', one other column contains NULL values: the 'genres' column. We handle it the same way as the two title columns, starting by inspecting the rows where genres is NULL.

These rows have no genre, so we can discard them.


1.5 Discarding Data with NULL genres Values in the movie_df table

Next, we discard the rows with NULL values in genres and check how many rows remain.

This leaves 9,000 rows out of the previous 9,011.
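
The genres check and removal follow the same pattern as the title columns. A minimal sketch on invented data:

```python
import pandas as pd

# Hypothetical stand-in for the movie table after the title cleanup.
movie_df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003", "tt0000004"],
    "primaryTitle": ["Film A", "Film C", "Film D"],
    "genres": ["Drama,Romance", None, "Comedy"],
})

# Inspect the rows whose genres value is missing.
missing_genres = movie_df.loc[movie_df["genres"].isnull()]
print(missing_genres)

# Drop them and count what is left.
movie_df = movie_df.dropna(subset=["genres"])
print(len(movie_df))
```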



1.6 Changing the '\\N' Values in the movie_df table

The 'startYear', 'endYear', and 'runtimeMinutes' columns contain the value '\\N', which means NULL.
Next, we replace '\\N' with np.nan and cast the startYear, endYear, and runtimeMinutes columns to float64.
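
A minimal sketch of this conversion, assuming the three columns are stored as strings and that the NULL marker is the two-character string written '\\N' in Python source (the sample values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: numeric columns stored as strings, with '\\N' for NULL.
movie_df = pd.DataFrame({
    "startYear": ["1999", "\\N", "2010"],
    "endYear": ["\\N", "\\N", "2012"],
    "runtimeMinutes": ["95", "120", "\\N"],
})

for col in ["startYear", "endYear", "runtimeMinutes"]:
    # Swap the NULL marker for NaN, then cast the whole column to float64.
    movie_df[col] = movie_df[col].replace("\\N", np.nan).astype("float64")

print(movie_df.dtypes)
```

The cast to float64 rather than an integer type is what allows NaN to be stored in these columns.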



1.7 Change the genres value to a list in the movie_df table

We will create a function called transform_to_list to convert the genres value to a list.
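
One possible shape for transform_to_list is sketched below. The function body is an assumption: it supposes that genres is a comma-separated string such as "Drama,Romance" and that a single genre should become a one-element list.

```python
import pandas as pd

def transform_to_list(x):
    # Assumed behavior: split a comma-separated genres string into a list.
    # A value without a comma becomes a one-element list.
    return x.split(",")

# Hypothetical stand-in for the cleaned movie table.
movie_df = pd.DataFrame({"genres": ["Drama,Romance", "Comedy"]})
movie_df["genres"] = movie_df["genres"].apply(transform_to_list)
print(movie_df)
```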


2. Cleaning the rating_df table

We check the data type and other information for each column in the rating table (rating_df).

The rating data has no NULL values.

3. Inner Join of the movie table and rating table

Let's do an inner join between rating_df and movie_df to get the rating for each available movie, then display the first 5 rows and the data type of each column.
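
A minimal sketch of the join, assuming both tables share an identifier column (called tconst here, as in the IMDb datasets) and that the result is stored under the hypothetical name movie_rating_df:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
movie_df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["Film A", "Film B", "Film C"],
})
rating_df = pd.DataFrame({
    "tconst": ["tt1", "tt3", "tt9"],
    "averageRating": [7.5, 6.1, 8.0],
    "numVotes": [1200, 45, 10],
})

# An inner join keeps only the films that appear in both tables.
movie_rating_df = pd.merge(movie_df, rating_df, on="tconst", how="inner")

print(movie_rating_df.head())   # first 5 rows
print(movie_rating_df.dtypes)   # data type of each column
```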




4. Reduce table size

Next, we reduce the table size by removing all rows with NULL values in the startYear and runtimeMinutes columns, because a film with an unknown release year or duration is of little use here.
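
This step could be written as below. The table name movie_rating_df and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the joined table.
movie_rating_df = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C"],
    "startYear": [1999.0, np.nan, 2010.0],
    "runtimeMinutes": [95.0, 120.0, np.nan],
})

# Keep only rows where both the release year and the runtime are known.
movie_rating_df = movie_rating_df.dropna(subset=["startYear", "runtimeMinutes"])
print(movie_rating_df)
```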



5. IMDb Formula with Weighted Rating

The IMDb weighted rating (WR) is:

WR = (v / (v + m)) × R + (m / (v + m)) × C

where:

  • v: the number of votes for the film
  • m: the minimum number of votes needed to enter the chart
  • R: the average rating of the film
  • C: the average rating across the entire film universe

5.1 The Value of "C"

First, we compute C, which is the mean of averageRating across all films.

This gives C = 6.829581673306773.
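
Computing C is a single call to mean(). The sketch below uses invented ratings, so its result differs from the value reported above.

```python
import pandas as pd

# Hypothetical stand-in for the joined table.
movie_rating_df = pd.DataFrame({"averageRating": [8.0, 6.0, 7.0]})

# C is the mean rating over every film in the table.
C = movie_rating_df["averageRating"].mean()
print(C)
```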


5.2 The Value of "m"

As an example, we only consider films whose numVotes is above that of 80% of the population, so only the top 20% of films by vote count qualify. In other words, m is the 80th percentile of numVotes.

This gives m = 229.0.
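
A minimal sketch of this threshold, assuming it is taken with pandas' quantile (the vote counts are invented, so the result is not 229.0):

```python
import pandas as pd

# Hypothetical stand-in for the joined table.
movie_rating_df = pd.DataFrame({"numVotes": [10, 20, 30, 40, 50, 60]})

# m is the vote count that 80% of the films do not exceed.
m = movie_rating_df["numVotes"].quantile(0.8)
print(m)
```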


5.3 Create a weighted formula function

Next, we create a function that takes a dataframe as its argument.
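
One way to write such a function is sketched below. The name imdb_weighted_rating and the choice to pass m and C as arguments are assumptions; the expression itself is the weighted rating built from v, m, R, and C as defined in this section.

```python
import pandas as pd

def imdb_weighted_rating(df, m, C):
    """Return the weighted rating for each row of df (hypothetical helper)."""
    v = df["numVotes"]        # number of votes per film
    R = df["averageRating"]   # average rating per film
    # Films with few votes are pulled toward the global mean C.
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical stand-in data and parameter values.
movie_rating_df = pd.DataFrame({
    "numVotes": [100, 300],
    "averageRating": [9.0, 7.0],
})
movie_rating_df["score"] = imdb_weighted_rating(movie_rating_df, m=100, C=8.0)
print(movie_rating_df)
```

With m = 100 and C = 8.0, the first film (100 votes, rating 9.0) scores 0.5 × 9.0 + 0.5 × 8.0 = 8.5, and the second (300 votes, rating 7.0) scores 0.75 × 7.0 + 0.25 × 8.0 = 7.25.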



5.4 Create a simple recommender system

The table now has an additional 'score' column. We filter for films whose numVotes is greater than m, then sort by score from highest to lowest and take the top entries.
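
The filter-and-sort step can be sketched as follows. The function name simple_recommender, the value m = 300, and the sample films are all assumptions for illustration.

```python
import pandas as pd

def simple_recommender(df, m, top=100):
    # Keep films with more than m votes, then rank them by score.
    qualified = df.loc[df["numVotes"] > m]
    return qualified.sort_values(by="score", ascending=False).head(top)

# Hypothetical stand-in for the joined table.
movie_rating_df = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C", "Film D"],
    "numVotes": [1000, 50, 400, 800],
    "averageRating": [7.0, 9.5, 8.0, 6.0],
})

m = 300                                          # assumed vote threshold
C = movie_rating_df["averageRating"].mean()      # global mean rating
v = movie_rating_df["numVotes"]
R = movie_rating_df["averageRating"]
movie_rating_df["score"] = (v / (v + m)) * R + (m / (v + m)) * C

top_films = simple_recommender(movie_rating_df, m=m, top=2)
print(top_films[["primaryTitle", "numVotes", "averageRating", "score"]])
```

In this toy data, Film B has the highest average rating but is excluded because it has only 50 votes.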



5.5 Create a simple recommender system with user preferences

The list of films is now sorted from the highest score to the lowest. Films with a high average rating do not always rank above films with a lower average rating, because the number of votes is also factored in. This recommendation system can still be improved by adding specific filters on titleType, startYear, or other columns.
Next, we create a function that filters based on isAdult, startYear, and genres.
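
A minimal sketch of such a filter is shown below. The function name user_prefer_recommender, its parameters, and the sample data are assumptions; it supposes the table already has a 'score' column and that genres holds lists of genre names.

```python
import pandas as pd

def user_prefer_recommender(df, is_adult=None, min_start_year=None, genre=None, top=10):
    """Filter by the user's preferences, then rank by score (hypothetical helper)."""
    result = df
    if is_adult is not None:
        # isAdult is assumed to be stored as 0 or 1.
        result = result.loc[result["isAdult"] == int(is_adult)]
    if min_start_year is not None:
        # Keep films released in or after the requested year.
        result = result.loc[result["startYear"] >= min_start_year]
    if genre is not None:
        wanted = genre.lower()
        # Case-insensitive match against each film's list of genres.
        has_genre = result["genres"].apply(
            lambda gs: wanted in [g.lower() for g in gs]
        )
        result = result.loc[has_genre]
    return result.sort_values(by="score", ascending=False).head(top)

# Hypothetical stand-in for the scored table.
movie_rating_df = pd.DataFrame({
    "primaryTitle": ["Film A", "Film B", "Film C", "Film D"],
    "isAdult": [0, 0, 1, 0],
    "startYear": [1995.0, 2015.0, 2018.0, 2020.0],
    "genres": [["Drama"], ["Comedy", "Drama"], ["Drama"], ["Action"]],
    "score": [8.1, 7.4, 9.0, 6.5],
})

picks = user_prefer_recommender(
    movie_rating_df, is_adult=False, min_start_year=2000, genre="drama"
)
print(picks[["primaryTitle", "startYear", "genres", "score"]])
```

Leaving a preference as None skips that filter, so the same function also covers the unfiltered ranking.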