This project revisits work I did in college, rebuilt with new methods I have learned since. The original version was very simple, and whenever I had a little time to apply something new I would create a new notebook.
I developed a Jupyter notebook on the Kaggle platform; you can find it on my profile through the references at the end, where you can download it, clone it, analyze it, or use it however you see fit.
Let's start !!!!!!!!!
✏️ Goal ✏️
Develop a movie and TV show recommendation function: the user searches for a movie or TV show by name, receives recommendations, and can choose how many recommendations to get.
🚧 Environment 🚧
In this project I used:
- Pandas: To analyze and manipulate my dataset.
- NumPy: To use some array resources.
- Scikit-learn: To take advantage of the K-Means algorithm and apply it to this project.
- Yellowbrick: To use its visualization tools in the K-Means analysis.
- Dataset: I used the "Netflix Movies and TV Shows" dataset, which I found on the Kaggle platform.
- GPU: I used the GPU T4 x2 option, due to the need for heavy processing.
🔎 Analyze Data 🔎
First I read my dataset using pandas.
import pandas as pd

data_netflix = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
data_netflix
I noticed that the columns are mostly text. Focusing on the objective, I will use the columns "title" and "listed_in".
- title: the names of the movies and TV shows
- listed_in: the genres of the movies and TV shows (action, adventure, horror, ...)
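As a quick look at just these two columns (a simple peek, using the column names from the original CSV):
# Look only at the columns the recommendation will be built on
data_netflix[['title', 'listed_in']].head()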
Next, out of habit, I checked for missing data.
data_netflix.isnull().sum()
There is some missing data; we will correct it in the next step.
Let's also look at how the data is distributed within the set.
data_netflix.describe(include='all')
🛠️ Clean the Dataset and Prepare a New Dataset 🛠️
Arriving at this stage, something I didn't like was the naming of some columns, so I renamed them. I like to be very explicit on that point; it helps a lot during development.
data_netflix = data_netflix.rename(columns={'date_added': 'date_added_platform', 'duration': 'duration_seconds', 'listed_in': 'gender_type', 'type': 'movie_or_tv_show'})
data_netflix[:5]
The most important of the renamed columns for the objective was "listed_in", now "gender_type".
I also dropped two columns that are not needed at this moment:
data_netflix.drop(columns=['rating', 'show_id'], axis=1, inplace=True)
For the missing values, I substituted a placeholder that references its own column:
data_netflix['cast'] = data_netflix['cast'].fillna('uninformed cast')
data_netflix['director'] = data_netflix['director'].fillna('uninformed director')
data_netflix['country'] = data_netflix['country'].fillna('uninformed country')
At this point I want to draw attention to something that can make a difference in the data, using the word "Documentaries" as an example. Python treats this string as unique; if we also have "documentaries", it is understood as a different string. The language cannot see that they mean the same thing, because their characters differ.
if 'Documentaries' != 'documentaries':
    print('True')
else:
    print('False')
# OUTPUT: True
To avoid having the same genre counted as two different values, I used a simple method. Of course there are other ways, but at this moment this resource was enough.
data_netflix['gender_type'] = data_netflix['gender_type'].apply(lambda x: x.upper())
Now the entire column is upper case, so we no longer have values starting with "D" or "d", or even something like "DocuMentariEs".
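A quick sanity check on the column we just transformed (a small sketch; since the upper() step above ran without errors, the column should have no missing values, and this should print True):
# Every entry in gender_type should now be fully upper case
print(data_netflix['gender_type'].str.isupper().all())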
Let's now sort out the genre types of the movies and TV shows. Looking at the "gender_type" column, notice how much information is bundled together, with several genres packed into a single value separated by ",".
For that I used:
df_split = data_netflix['gender_type'].str.split(',', expand=True)
df_split = df_split.fillna('-')
df_split
I separated all the values that were marked off by ",", and generated a new set based on that data. In addition, for the entries that were missing, I added a simple "-" so I know there was no data there.
In the next step I use a Pandas feature called get_dummies. Why did I do this? After the split I have 3 columns, each holding different classes (action, horror, anime, ...). The algorithm does not understand categorical data, so it needs to be transformed into numerical data. With this Pandas feature, I can turn each of these classes into its own column and convert the values in each row into numeric data pointing at that column: 0 if the value is not there and 1 if it is. Since I have 3 columns, I had to do this processing for each one, generating a new set for each column.
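To make the idea concrete, here is a small toy illustration of what get_dummies produces (made-up values, not the project data):
# Toy example: three rows, one genre each, turned into 0/1 columns
toy = pd.Series(['ACTION', 'HORROR', 'ACTION'])
pd.get_dummies(toy, dtype='int')
#    ACTION  HORROR
# 0       1       0
# 1       0       1
# 2       1       0
The actual processing for all three split columns is done in a single list comprehension below.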
group_dummies = [pd.get_dummies(df_split[y].apply(lambda x: x.strip()), dtype='int') for y in df_split.columns]
group_dummies[0].shape
# output: (8807, 36)
At the end, I used shape to check that it kept the same number of rows as the input.
Now, after processing the columns and their data, we need to join the 3 sets generated for each column into a single set.
group_dummies = pd.concat(group_dummies, axis=1)
group_dummies = group_dummies.fillna(0).astype('uint8')
group_dummies
Continuing, I changed the text of the "title" column for the same reason explained above; to have something standardized, I chose this approach. Note that this is not the same set: the group_dummies set holds all the values from the "gender_type" column, split out and numeric only. The line below works on the initial set, which still needed this change.
data_netflix['title'] = data_netflix['title'].apply(lambda x : x.upper())
Finally, I took the "group_dummies" set and transformed it into a numpy array in order to have the "X" input for the algorithm.
import numpy as np

X_genre_type = np.array(group_dummies)
X_genre_type
💪 Elbow method 💪
We have arrived at an important stage: the Elbow Method. This method aims to determine a suitable number of clusters, whether we should group into 3 clusters, 4 clusters, 30 clusters and so on. Since I want the most suitable number of clusters for the "group_dummies" set, I tested a wide range of values to find the ideal value for K. You won't see all the plots I made here; I will present the one that was chosen and that I evaluated as best, but you can check the rest in the repository through the link at the end.
For this evaluation I used a very good Python library called Yellowbrick. This library has several features, and one of them is for the Elbow Method.
It has three types of metrics for evaluation, as shown in the documentation:
- distortion: mean sum of squared distances to centers
- silhouette: mean ratio of intra-cluster and nearest-cluster distance
- calinski_harabasz: ratio of within to between cluster dispersion
Of all the ones I tested, the one that pleased me the most and gave the best value of "K" was "distortion", which is already the default.
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

model = KMeans()
visualizer = KElbowVisualizer(model, k=(3, 3 * 107), metric='distortion')
visualizer.fit(X_genre_type)
visualizer.show()
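If you want to compare the other metrics listed above, the same visualizer can simply be re-run with a different metric argument. A small sketch, using the same X_genre_type input (note that the silhouette metric can be slow over a large range of k):
# Same elbow search, scored with the silhouette metric instead of distortion
visualizer_sil = KElbowVisualizer(KMeans(random_state=0), k=(3, 3 * 107), metric='silhouette', timings=False)
visualizer_sil.fit(X_genre_type)
visualizer_sil.show()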
After the "KElbowVisualizer" function, it gave me the ideal k 34, I didn't think it was bad. I liked it quite a lot and I will continue like this.
From that result, I use the "KMeans" algorithm imported from the scikit-learn library, defining "n_clusters" as 34, following the previous result. I didn't go into much detail here; I wanted something simple. I also didn't use the fit method, but fit_predict, which already gives me the cluster assigned to each row and lets me define my y, which will become a new column in the set.
kmeans_model = KMeans(n_clusters = 34, random_state=0)
y_Kmeans34 = kmeans_model.fit_predict(X_genre_type)
print(y_Kmeans34)
print(np.unique(y_Kmeans34))
print(len(np.unique(y_Kmeans34)))
print(f'Amount genre: {len(group_dummies.columns)}')
# output:
# [1 6 3 ... 2 5 8]
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33]
# 34
# Amount genre: 108
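It can also be useful to check how many titles landed in each cluster. A quick count, not part of the original notebook:
# Count how many titles fall into each of the 34 clusters
cluster_ids, cluster_sizes = np.unique(y_Kmeans34, return_counts=True)
for cluster_id, size in zip(cluster_ids, cluster_sizes):
    print(f'cluster {cluster_id}: {size} titles')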
Here I made a copy of the original set, so I don't modify the original in case I need to return to it later.
data_netflix_cluster = data_netflix.copy()
Now I'm going to take every prediction made by K-Means, and add it to a new column.
data_netflix_cluster['clusters_genre'] = y_Kmeans34
Now you might be wondering why I didn't just merge the "group_dummies" set I prepared earlier back into the original. Before, I had a single column full of data, but it was compressed, and it is difficult to get results that way. So what I did was expand this data, clean it, and then group it all through clustering. Now each value of the old column has a "K" number as a result of K-Means, which makes it much easier to find which movies or TV shows are related.
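To see what one of these clusters looks like, you can filter on the new column directly (an illustrative query; the exact contents depend on the K-Means run):
# Titles that K-Means placed in the same cluster as the first row of the set
sample_cluster = data_netflix_cluster.at[0, 'clusters_genre']
data_netflix_cluster[data_netflix_cluster['clusters_genre'] == sample_cluster][['title', 'gender_type']].head(10)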
🍿 Recommend Movies and Tv Shows 🍿
Now we arrive at the long-awaited stage: the recommendations. With the whole set ready, let's create the function that will receive the client's data. Notice that I pass the prepared data set into the function. It wouldn't have to be this way, because in a real situation this set would not be held by the client; I added it so you can follow the context of the steps involved. I used a dataclass to simulate a request situation where JSON would be received, which is pretty much like a Python dictionary. But since the focus here is not on creating an API with the requests library, I kept it that simple.
The "QueryRecommends" class has the following data:
- dataset: The data we have from movies and TV Shows
- name: The name of the movie that the user will choose
- top_n: The amount of results, whether the client wants top 10 movies or top 20.
Inside the function, I search the data set by filtering the title against the name of the movie the customer chose. I transform the text to upper case so we don't have problems with how the client typed it; this point could be improved, but for testing this functionality it was acceptable. Then I reset the index, because the filter result comes with the indices of the input set. Next I use the "at" method to capture the value of the "clusters_genre" column, and from that result I perform a new filter to obtain the other titles in the same cluster. Finally I select the output columns, "title" and "gender_type", and the number of results requested.
from dataclasses import dataclass

@dataclass
class QueryRecommends:
    dataset: pd.DataFrame
    name: str
    top_n: int = 10

def recommends(query: QueryRecommends) -> pd.DataFrame:
    # Find the cluster of the requested title (titles are stored in upper case)
    result = query['dataset'][query['dataset']['title'] == query['name'].upper()][['clusters_genre']].reset_index()
    result = result.at[0, 'clusters_genre']
    # Return the first top_n titles that share that cluster
    return query['dataset'][query['dataset']['clusters_genre'] == int(result)][['title', 'gender_type']][:query['top_n']]
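Note that the function indexes the query like a dictionary, so the calls below pass plain dicts. If you prefer to build the query with the dataclass first, its fields can be handed to the function as a dict (a small optional variation, not how the original notebook makes the calls):
# Optional: build the query as a dataclass, then pass its fields as a dict
query = QueryRecommends(dataset=data_netflix_cluster, name='Narcos', top_n=10)
result = recommends(vars(query))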
Now I make the calls in my function:
International
result = recommends({'dataset': data_netflix_cluster, 'name': 'Narcos', 'top_n': 10})
result
Action
result = recommends({'dataset': data_netflix_cluster, 'name': 'The Stronghold', 'top_n': 10})
result
Comedies
result = recommends({'dataset': data_netflix_cluster, 'name': 'Zombieland', 'top_n': 10})
result
Anime
result = recommends({'dataset': data_netflix_cluster, 'name': 'Yu-Gi-Oh! Arc-V', 'top_n': 10})
result
result = recommends({'dataset': data_netflix_cluster, 'name': 'ATTACK ON TITAN', 'top_n': 10})
result
Horror
result = recommends({'dataset': data_netflix_cluster, 'name': 'Would You Rather', 'top_n': 10})
result
Comments
Thank you for reading this far. I hope this helps you understand the process. If you find any code or text errors, please don't hesitate to reach out. Don't forget to leave a like so this can reach more people.
Resources
About the author:
A little more about me...
I graduated with a Bachelor's degree in Information Systems, and in college I had contact with different technologies. Along the way, I took an Artificial Intelligence course, where I had my first contact with machine learning and Python. From there, learning about this area became my passion. Today I work with machine learning and deep learning, developing communication software. Along the way, I created a blog where I post about subjects I am studying and share them to help other users.
I'm currently learning TensorFlow and Computer Vision
Curiosity: I love coffee