You don't need to be a data scientist to work with data. In fact, you most probably use a lot of data every day in both your professional and personal life. For example, price comparison sites, movie rankings, sales leads, and trending topics are all examples of data that is available online in some sort of table form.
Why web-scraping?
If you want to collect or analyse this data, copy-pasting each record is most likely out of the question. This manual task is not only time-consuming, but also expensive and prone to error – not to mention it's also downright boring.
Here's where automatic web-scraping can help! Web scraping is a method of automatically gathering data from websites in a structured manner and storing it into a local database or spreadsheet.
There are many no-code tools for web scraping, like browser plug-ins (e.g. Webscraper) and software (e.g. Parsehub). However, if you need more advanced scraping settings and have basic coding skills, I recommend the Python libraries Beautiful Soup or Selenium, and the R package rvest. The latter is the one I used for scraping IMDb and you can find the commented code on my GitHub.
Before I proceed to the fun part, note that the legality of web scraping is not clearly defined around the world, so you should check the website's terms of use before scraping it!
Now let's dive in. I wanted my data analysis to answer three questions:
- What are the most successful movies released in 2018?
- What genres do the popular movies belong to?
- What is the duration of the most popular movies?
Top popular movies
I used IMDb as a reference, because it contains all the information I need. On the website I selected the movies released between 01.01.-31.12.2018, sorted by popularity, and limited my search to the first page, so the top 50 movies.
library(rvest)
url <- "https://www.imdb.com/search/title?year=2018"
imdb <- read_html(url)
head(imdb)
The top 5 popular movies in 2018 were:
- Aquaman
- Green Book
- Bohemian Rhapsody
- Spider-Man: Into the Spider-Verse
- Avengers: Infinity War
Top movie genres
Scraping the genre tags of each movie is pretty straightforward:
genre_data_html <- html_nodes(imdb, ".genre")
genre_data <- html_text(genre_data_html)
head(genre_data)
However, this returns a list of genres for each movie, because the movies are labeled with multiple genres. The text data needs to be cleaned a bit:
#remove the \n in front of the genres
genre_data <- gsub("\n", "", genre_data)
#remove the spaces between genres
genre_data <- gsub(" ", "", genre_data)
Now, another tricky thing is that the genres of each movie are enumerated alphabetically, not in order of importance. To simplify my work, I selected only the first genre:
#display only the first genre in the list
genre_data <- gsub(",.*", "", genre_data)
At a glance, I noticed that 3 out 5 are action-hero movies, so I visualized closer at the genre distribution:
#plot the number of movies by genre
library(ggplot2)
ggplot(imdb_df, aes(x=genre_data)) +
geom_bar(color="purple", fill="green", alpha=0.3) +
ggtitle("Number of movies by genre") +
xlab("Genre") + ylab("Number of movies")
My initial observation was confirmed: Action and Drama are the most popular genres, followed by Biography. I guess most people enjoy, on one hand, movies that transport them into wild worlds and simulate experiences out of the ordinary, and on the other hand, movies that depict dramatic life stories and relate to some extent to their real life.
Top movies duration
Next, I analyzed the distribution of movie duration:
#plot the movies by runtime
barplot(table(imdb_df$Runtime))
hist(imdb_df$Runtime)
ggplot(imdb_df, aes(x=runtime_data)) +
geom_histogram(color="purple", fill="green", alpha=0.3) +
ggtitle("Distribution of movie runtimes") +
xlab("Minutes") + ylab("Number of movies")
The plot shows that most popular movies last on average 104 minutes (median 117 minutes). The longest movie is Avengers: Infinity War (149 minutes) and the shortest movie (excluding TV-shows) is A.I. Rising (85 minutes). From the histogram it is clear that the bars on the left represent the TV-shows (under 60 minutes).
I also analyzed the runtime distribution by genre. First, I aggregated the movies by genre:
#group movies by genre
library(dplyr)
genre_cat = group_by(imdb_df, Genre)
genre_runtime = summarize(genre_cat, Minutes=mean(Runtime))
plot(genre_runtime)
Then, I visualized the average duration for each genre:
counts = table(genre_runtime$Genre, genre_runtime$Minutes)
ggplot(data=genre_runtime, aes(x=Genre, y=Minutes)) +
geom_bar(stat = "identity", color="purple", fill="green", alpha=0.3) +
ggtitle("Mean movie duration by genre")
I found that among genres Biographies are longest (on average 127 minutes) and Crimes are shortest (on average 85 minutes). This is not entirely surprising, since I think that, first, it is quite a challenge to pack a lifetime in a biographical movie, and second, there's only so much nerve-wrecking tension a person can take following a crime. However, I was expecting the average duration of Animations to be shorter than 110 minutes, because they are produced mainly for children, who have a short attention span and low patience to sit through a two-hour movie. But then again, we are talking about the most popular movies of last year on IMDb, which means that adults made up the large audience.
Conclusion
This is a simple web scraping project that can reveal a lot of information about people's movie preferences. It would be interesting to also analyze the total gross and see which movies and genres have sold best in 2018. Now you could try to scrape and analyze this information with your preferred tool and let me know what you found out!