Introduction
In this post, I will showcase my implementation of a Scrapy spider that generates movie recommendations: starting from a favorite movie, it ranks other movies and TV shows by the number of actors they share with it.
Scrapy
First, we start a Scrapy project. All of the code that follows goes in a new Python file inside the project's directory called “spiders”.
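If you have not created the project yet, Scrapy can scaffold it from the command line (the project name TMDB_scraper is just a placeholder of my choosing):

scrapy startproject TMDB_scraper
cd TMDB_scraper

Scrapy generates the “spiders” directory for you as part of the project template.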
Creating our Spider
First, we need to import Scrapy and its Request class:
import scrapy
from scrapy.http import Request
Then, we create our spider:
class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def __init__(self, subdir="", *args, **kwargs):
        super().__init__(*args, **kwargs)  # keep Scrapy's own setup
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]
Inside this class, we will need three parse functions. The first will send the spider from the main page(s) to the credits page. The second will send the spider from the credits page to the pages of the actors. Finally, the third parse function will glean information from the actors’ pages.
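As a roadmap, here is the overall shape of the spider we are about to build; this is just a sketch, and each method is written out in full in the sections below.

import scrapy

class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'

    def parse(self, response):
        # movie page -> credits page
        ...

    def parse_full_credits(self, response):
        # credits page -> each actor's page
        ...

    def parse_actor_page(self, response):
        # actor's page -> one dictionary per acting credit
        ...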
parse
This function takes the main urls and sends the spider to the credits page. The path can be hard-coded, because The Movie Database is consistent with its url naming conventions.
def parse(self, response):
    cast_page_urls = [url + "cast" for url in self.start_urls]  # get absolute urls

    for url in cast_page_urls:
        yield Request(url, callback=self.parse_full_credits)  # send the spider to parse credits
This function is pretty simple, as we are essentially just tacking “cast” onto the end of each url, which sends us to the credits page. Then, we send the spider off by yielding a Request for each url. Setting callback=self.parse_full_credits makes the spider use the appropriate function to parse the credits page.
parse_full_credits
This method receives the credit page and finds the actors’ urls, which it gives the spider to parse further. First, we create the function with the required parameters from Scrapy:
def parse_full_credits(self, response):
Isolating the urls
We want to get the different urls that lead to each of the actors’ pages, so inside the function we start by selecting the elements that contain all of the people on the page:

people = response.css("section.panel.pad")
Now we have all of the people who worked on this film. However, there are cast and crew members, so we need to figure out how they are ordered.
cast_index = people.css("h3::text").getall().index("Cast ")  # make a list of the headers, find where the actors are
Then, we use this cast_index to isolate the elements that contain the urls:

actors = people[cast_index].css("a")  # isolate the elements with the links
Finally, since we have a list of the relative urls, we need to add them on to the current url to create a complete link:
actor_urls = [response.urljoin(actor_url.attrib["href"]) for actor_url in actors]
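For instance, if an actor's link had the (hypothetical) relative href /person/123-some-actor, response.urljoin would resolve it against the current page's url to produce https://www.themoviedb.org/person/123-some-actor.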
Yielding the urls
Now we want to give the actors’ urls to our spider to investigate further with another method that we will define later.
for actor_url in actor_urls:
    yield Request(actor_url, callback=self.parse_actor_page)
In full, our function looks like this:
def parse_full_credits(self, response):
    people = response.css("section.panel.pad")  # isolate the people
    cast_index = people.css("h3::text").getall().index("Cast ")  # find where the actors are
    actors = people[cast_index].css("a")  # find where the actors' urls are
    actor_urls = [response.urljoin(actor_url.attrib["href"]) for actor_url in actors]  # make the full urls

    for actor_url in actor_urls:
        yield Request(actor_url, callback=self.parse_actor_page)  # send the spider to parse the actors' pages
parse_actor_page
Now, we will create our final parsing method that takes in an actor’s page and yields data about their acting roles.
def parse_actor_page(self, response):
Isolating data
First, we need to get the name from our actor’s page. We can find the right selector by inspecting the page’s source code; the element that holds the name is consistent across different actors’ pages.
= response.css("h2.title a::text").get() name
Now we need to get the acting roles that this actor has been in. First, we isolate the part of their page that has what we want in it.
= response.css("div.credits_list") credits
This credits_list has more than just acting roles, however, so we need to find out where their acting roles are:

acting_index = credits.css("h3::text").getall().index("Acting")
Now that we know where it is, we can get the titles of the movies or TV shows that they were in by specifying which table we want to look at.
= credits.css("table.card.credits")[acting_index].css("bdi::text").getall() acting_titles
Yielding the data
In order for Scrapy to be able to put this data into a .csv file, we need to format how the data is yielded, so we yield a dictionary for each of the different TV shows or movies that the actor was in.
for title in acting_titles:
    yield {"name": name, "movie_or_TV_name": title}
In full, our function looks like this:
def parse_actor_page(self, response):
    name = response.css("h2.title a::text").get()  # get the name of the actor

    credits = response.css("div.credits_list")  # isolate their roles

    acting_index = credits.css("h3::text").getall().index("Acting")  # make sure we only find their acting roles

    acting_titles = credits.css("table.card.credits")[acting_index].css("bdi::text").getall()  # get the movies they were in

    for title in acting_titles:
        yield {"name": name, "movie_or_TV_name": title}  # format the parsed data
Running scrapy crawl tmdb_spider -o results.csv -a subdir=149871 in the Anaconda prompt from inside your project folder creates a file called results.csv that we can analyze further to make recommendations. This run scrapes my favorite movie, The Tale of the Princess Kaguya. If you want to use your own movie instead, go to its page on The Movie Database and set subdir equal to the part of the url that comes after “/movie/”.
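For example, Fight Club's page (a movie I am using purely for illustration) lives at https://www.themoviedb.org/movie/550-fight-club, so you would run scrapy crawl tmdb_spider -o results.csv -a subdir=550-fight-club. Whatever the movie, the resulting results.csv has two columns, name and movie_or_TV_name, with one row per acting credit yielded by parse_actor_page.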
Creating recommendations
Now, we will create our recommendations. Our first step is to import pandas and read in our dataframe:

import pandas as pd

actors = pd.read_csv("results.csv")

Then, we group the actors by common movies:
show_groups = actors.groupby("movie_or_TV_name")
After that, we get a pandas Series of the size of each group, indexed by the movie it corresponds to, sorted in descending order.
sorted_actor_counts = show_groups.size().sort_values(ascending=False)
We then format our Series so that it is a dataframe with a proper column label, dropping the first row, which is the obvious movie that has all of these actors together (the original movie itself).
recommendations = sorted_actor_counts.to_frame().rename(columns={0: "Number of Shared Actors"}).iloc[1:, :].reset_index(drop=False)
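As an aside, the same numbers can be computed more compactly with value_counts; this is just an alternative sketch, not the code used above:

shared_counts = actors["movie_or_TV_name"].value_counts().iloc[1:]  # sorted descending by default; drop the original movie

This produces the same counts as recommendations, only as a Series rather than a labeled dataframe.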
Now, we have our recommendations!
recommendations[0:10]
|   | movie_or_TV_name | Number of Shared Actors |
|---|---|---|
| 0 | Isao Takahata and His Tale of the Princess Kaguya | 12 |
| 1 | AIBOU: Tokyo Detective Duo | 3 |
| 2 | Solitary Gourmet | 3 |
| 3 | After School | 3 |
| 4 | Supermarket Woman | 3 |
| 5 | 北の国から 記憶 前編 | 2 |
| 6 | The Return | 2 |
| 7 | First Class | 2 |
| 8 | Amachan | 2 |
| 9 | Woman in Witness Protection | 2 |
Visualization
We can make a scatter plot that shows how common movies with a given number of shared actors are. First, we import plotly:
import plotly.express as px
Then we count, for each number of shared actors, how many movies have exactly that many actors in common with the original movie.
"Number of Movies"] = recommendations.groupby("Number of Shared Actors").transform("size") recommendations[
Then we plot.
fig = px.scatter(recommendations, x="Number of Shared Actors", y="Number of Movies")
fig.show()
We can see that movies sharing many actors with the original are rare: the vast majority of the projects these actors worked on were not with each other.