Introduction
In this post, I will showcase my implementation of a Scrapy spider that generates movie recommendations: starting from a favorite movie, it ranks other movies and TV shows by the number of actors they share with it.
Scrapy
First, we start a Scrapy project. All of the code that follows goes in a new Python file inside the project's directory called “spiders”.
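If you have not created the project yet, Scrapy can scaffold it from the command line (the project name TMDB_scraper is just a placeholder of my choosing):

scrapy startproject TMDB_scraper
cd TMDB_scraper

Scrapy generates the “spiders” directory for you as part of the project template.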
Creating our Spider
First, we need to import Scrapy and its Request class:
import scrapy
from scrapy.http import Request
Then, we create our spider:
class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def __init__(self, subdir="", *args, **kwargs):
        super().__init__(*args, **kwargs)  # keep Scrapy's own setup
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]
Inside this class, we will need three parse functions. The first will send the spider from the main page(s) to the credits page. The second will send the spider from the credits page to the pages of the actors. Finally, the third parse function will glean information from the actors’ pages.
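As a roadmap, here is the overall shape of the spider we are about to build; this is just a sketch, and each method is written out in full in the sections below.

import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def parse(self, response):
        # movie page -> credits page
        ...

    def parse_full_credits(self, response):
        # credits page -> each actor's page
        ...

    def parse_actor_page(self, response):
        # actor's page -> one dictionary per acting credit
        ...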
parse
This function takes the main urls and sends the spider to the credits page. The path can be hard-coded, because The Movie Database is consistent with its url naming conventions.
def parse(self, response):
    cast_page_urls = [url + "cast" for url in self.start_urls]  # get absolute urls

    for url in cast_page_urls:
        yield Request(url, callback=self.parse_full_credits)  # send the spider to parse credits
This function is pretty simple, as we are essentially just tacking “cast” onto the end of each url, which sends us to the credits page. Then, we send the spider off by yielding a Request for each url. Setting callback=self.parse_full_credits makes the spider use the appropriate function to parse the credits page.
parse_full_credits
This method receives the credit page and finds the actors’ urls, which it gives the spider to parse further. First, we create the function with the required parameters from Scrapy:
def parse_full_credits(self, response):
Isolating the urls
We want to get the different urls that lead to each of the actors’ pages, so inside the function we start by selecting the elements that contain all of the people on the page:

people = response.css("section.panel.pad")
Now we have all of the people who worked on this film. However, there are cast and crew members, so we need to figure out how they are ordered.
cast_index = people.css("h3::text").getall().index("Cast ")  # make a list of the headers, find where the actors are
Then, we use this cast_index to isolate the elements that contain the urls:

actors = people[cast_index].css("a")  # isolate the elements with the links
Finally, since we have a list of the relative urls, we need to add them on to the current url to create a complete link:
actor_urls = [response.urljoin(actor_url.attrib["href"]) for actor_url in actors]
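For instance, if an actor's link had the (hypothetical) relative href /person/123-some-actor, response.urljoin would resolve it against the current page's url to produce https://www.themoviedb.org/person/123-some-actor.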
Yielding the urls
Now we want to give the actors’ urls to our spider to investigate further with another method that we will define later.
for actor_url in actor_urls:
    yield Request(actor_url, callback=self.parse_actor_page)
In full, our function looks like this:
def parse_full_credits(self, response):
    people = response.css("section.panel.pad")  # isolate the people
    cast_index = people.css("h3::text").getall().index("Cast ")  # find where the actors are
    actors = people[cast_index].css("a")  # find where the actors' urls are
    actor_urls = [response.urljoin(actor_url.attrib["href"]) for actor_url in actors]  # make the full urls

    for actor_url in actor_urls:
        yield Request(actor_url, callback=self.parse_actor_page)  # send the spider to parse the actors' pages
parse_actor_page
Now, we will create our final parsing method that takes in an actor’s page and yields data about their acting roles.
def parse_actor_page(self, response):
Isolating data
First, we need to get the name from our actor’s page. We can find the right selector by inspecting the page’s source code; the element that holds the name is consistent across different actors’ pages.
= response.css("h2.title a::text").get() name
Now we need to get the acting roles that this actor has been in. First, we isolate the part of their page that has what we want in it.
= response.css("div.credits_list") credits
This credits_list has more than just acting roles, however, so we need to find out where their acting roles are:

acting_index = credits.css("h3::text").getall().index("Acting")
Now that we know where it is, we can get the titles of the movies or TV shows that they were in by specifying which table we want to look at.
= credits.css("table.card.credits")[acting_index].css("bdi::text").getall() acting_titles
Yielding the data
In order for Scrapy to be able to put this data into a .csv file, we need to format how the data is yielded, so we yield a dictionary for each of the different TV shows or movies that the actor was in.
for title in acting_titles:
    yield {"name": name, "movie_or_TV_name": title}
In full, our function looks like this:
def parse_actor_page(self, response):
    name = response.css("h2.title a::text").get()  # get the name of the actor

    credits = response.css("div.credits_list")  # isolate their roles

    acting_index = credits.css("h3::text").getall().index("Acting")  # make sure we only find their acting roles

    acting_titles = credits.css("table.card.credits")[acting_index].css("bdi::text").getall()  # get the movies they were in

    for title in acting_titles:
        yield {"name": name, "movie_or_TV_name": title}  # format the parsed data
Running scrapy crawl tmdb_spider -o results.csv -a subdir=149871 in the Anaconda prompt from inside your project folder creates a file called results.csv that we can analyze further to make recommendations. This run scrapes my favorite movie, The Tale of the Princess Kaguya. If you want to use your own movie instead, go to its page on The Movie Database and set subdir equal to the part of the url that comes after “/movie/”.
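For example, Fight Club's page (a movie I am using purely for illustration) lives at https://www.themoviedb.org/movie/550-fight-club, so you would run scrapy crawl tmdb_spider -o results.csv -a subdir=550-fight-club. Whatever the movie, the resulting results.csv has two columns, name and movie_or_TV_name, with one row per acting credit yielded by parse_actor_page.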
Creating recommendations
Now, we will create our recommendations. Our first step is to import pandas and read in our dataframe:

import pandas as pd

actors = pd.read_csv("results.csv")

Then, we group the actors by common movies:
show_groups = actors.groupby("movie_or_TV_name")
After that, we get a pandas Series of the size of each group, indexed by the movie it corresponds to, sorted in descending order.
sorted_actor_counts = show_groups.size().sort_values(ascending=False)
We then format our Series so that it is a dataframe with a proper column label, dropping the first row, which is the obvious movie that has all of these actors together (the original movie itself).
recommendations = sorted_actor_counts.to_frame().rename(columns={0: "Number of Shared Actors"}).iloc[1:, :].reset_index(drop=False)
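As an aside, the same numbers can be computed more compactly with value_counts; this is just an alternative sketch, not the code used above:

shared_counts = actors["movie_or_TV_name"].value_counts().iloc[1:]  # sorted descending by default; drop the original movie

This produces the same counts as recommendations, only as a Series rather than a labeled dataframe.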
Now, we have our recommendations!
recommendations[0:10]
|   | movie_or_TV_name | Number of Shared Actors |
|---|---|---|
| 0 | Isao Takahata and His Tale of the Princess Kaguya | 12 |
| 1 | AIBOU: Tokyo Detective Duo | 3 |
| 2 | Solitary Gourmet | 3 |
| 3 | After School | 3 |
| 4 | Supermarket Woman | 3 |
| 5 | 北の国から 記憶 前編 | 2 |
| 6 | The Return | 2 |
| 7 | First Class | 2 |
| 8 | Amachan | 2 |
| 9 | Woman in Witness Protection | 2 |
Visualization
We can make a scatter plot that shows how common movies with a given number of shared actors are. First, we import plotly:
import plotly.express as px
Then we count, for each number of shared actors, how many movies have exactly that many actors in common with the original movie.
"Number of Movies"] = recommendations.groupby("Number of Shared Actors").transform("size") recommendations[
Then we plot.
fig = px.scatter(recommendations, x="Number of Shared Actors", y="Number of Movies")
fig.show()
We can see that movies sharing many actors with the original are rare: the vast majority of the projects these actors worked on were not with each other.