Creating a Network of Harry Potter Characters - Part 1: Web Scraping
The magical world of J.K. Rowling's Harry Potter series is a treasure trove of rich characters, intricate relationships, and captivating narratives. Analyzing the interactions and connections between these characters can unveil fascinating insights into the wizarding universe. In this blog series, we'll explore how to create a character interaction network for the Harry Potter series.
Part 1: Web Scraping the List of Characters
To kickstart our journey into the Harry Potter universe, the first step is to gather a comprehensive list of characters, including details such as their first appearance, aliases, loyalties, and more. Fortunately, the Harry Potter Fandom Wiki's Character Indexes page provides this information, making it an ideal source for our web scraping endeavor. For each book in the series, these indexes contain a list of characters by the chapter in which they first appear. For example, here's the list of characters from the first chapter of the first book in the series, Harry Potter and the Sorcerer's Stone:
Vernon Dursley
Lily Potter
Lord Voldemort
Vernon Dursley's secretary
Dedalus Diggle
Rubeus Hagrid
Petunia Dursley
James Potter
Jim McGuffin
Albus Dumbledore
Poppy Pomfrey
Sirius Black
Dudley Dursley
Harry Potter
Ted
Minerva McGonagall
Gemma Jones
We'll use Python and a few libraries to scrape and extract the character data from the wiki. The first step is to identify the HTML elements that contain the character names, which we can do by inspecting the HTML of the page, starting with the entry for the first character in the list above, Vernon Dursley.
The characters are contained in a table with the class article-table. Each character name appears in an <a> tag whose title attribute is set to the character's name. We can use the requests library to send an HTTP request to the page and the BeautifulSoup library to parse the HTML and extract the character names.
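To make that concrete, here's a small sketch that parses a simplified copy of the markup with BeautifulSoup. The sample HTML below is an approximation of the wiki's structure, not its exact source:

```python
from bs4 import BeautifulSoup

# A simplified sketch of the wiki's markup (illustrative only; the live
# page has more attributes and surrounding elements).
sample_html = """
<table class="article-table">
  <tr><td><a href="/wiki/Vernon_Dursley" title="Vernon Dursley">Vernon Dursley</a></td></tr>
  <tr><td><a href="/wiki/Lily_Potter" title="Lily Potter">Lily Potter</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="article-table")
# Every <a> tag with a title attribute inside the table is a character.
names = [a["title"] for a in table.find_all("a", title=True)]
print(names)  # ['Vernon Dursley', 'Lily Potter']
```

Exactly the same calls work on the real page once we fetch it with requests.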
Setting up the Environment
Before we dive into the web scraping process, let's make sure we have the necessary tools and libraries at our disposal.
Python: Ensure you have Python installed on your system.
Requests: This library allows us to send HTTP requests to the web pages we want to scrape.
Beautiful Soup: This Python library helps parse HTML and XML documents. It makes it easy to navigate, search, and modify the parse tree.
tqdm: This library displays progress bars for long-running loops; we'll use it while scraping multiple books.
We can install these libraries using the pip package manager. If you don't have pip installed, you can follow the official pip installation instructions. Once you have pip installed, you can install the libraries by running the following command in your terminal:
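Note that the Beautiful Soup package is published on PyPI as beautifulsoup4:

```shell
pip install requests beautifulsoup4 tqdm
```
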
Scraping the List of Characters
We first define the URL of the page we want to scrape. We'll use the URL of the first book in the series, Harry Potter and the Sorcerer's Stone, as an example.
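Assuming the index page follows the wiki's usual naming scheme, the URL looks roughly like this (the exact page name is an assumption; confirm it in your browser before scraping):

```python
# URL of the first book's character index on the Harry Potter Fandom wiki.
# This exact page name is an assumption; verify it before running the scraper.
url = (
    "https://harrypotter.fandom.com/wiki/"
    "Harry_Potter_and_the_Philosopher%27s_Stone_(character_index)"
)
```
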
Each character's page on the wiki contains additional information about the character, such as their aliases, loyalties, and family relations.
We can use the requests library to send an HTTP request to the page and the BeautifulSoup library to parse the HTML and extract these fields.
Other fields such as blood_status, nationality, species, house, and gender can be extracted using a similar approach.
We'll define a list of tuples, where each tuple contains the name of the field and a function that extracts the value of the field from the HTML. We'll use this list to extract the values of these fields from the HTML.
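Here is a sketch of that approach. Fandom character pages carry a "portable infobox" whose fields are marked with data-source attributes; the specific data-source keys below are assumptions based on inspecting character pages, so verify them against the live wiki:

```python
from bs4 import BeautifulSoup

def infobox_value(soup, source):
    """Return the text of a portable-infobox field, or None if absent.
    (Fandom infoboxes tag each field with a data-source attribute.)"""
    node = soup.find("div", attrs={"data-source": source})
    value = node.find("div", class_="pi-data-value") if node else None
    return value.get_text(strip=True) if value else None

# (field name, extractor) pairs; the data-source keys are assumptions.
FIELDS = [
    ("aliases", lambda s: infobox_value(s, "alias")),
    ("blood_status", lambda s: infobox_value(s, "blood status")),
    ("nationality", lambda s: infobox_value(s, "nationality")),
    ("species", lambda s: infobox_value(s, "species")),
    ("house", lambda s: infobox_value(s, "house")),
    ("gender", lambda s: infobox_value(s, "gender")),
]

# A trimmed-down infobox standing in for a real character page.
sample = BeautifulSoup(
    '<aside class="portable-infobox">'
    '<div data-source="house"><div class="pi-data-value">Gryffindor</div></div>'
    '<div data-source="species"><div class="pi-data-value">Human</div></div>'
    "</aside>",
    "html.parser",
)

info = {name: extract(sample) for name, extract in FIELDS}
print(info["house"], info["species"])  # Gryffindor Human
```

Fields missing from a character's infobox simply come back as None, so incomplete pages don't break the scraper.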
We define a Character dataclass to represent a character. The class contains the character's name, the URL of their page on the wiki, and the additional information we extract from the page. We also define a Chapter dataclass to represent a chapter in a Harry Potter book and the characters mentioned in it.
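A minimal version of those dataclasses might look like this (the exact set of fields is a sketch; extend it to match whatever you extract from the infobox):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Character:
    """A character plus details scraped from their wiki page."""
    name: str
    url: str
    aliases: Optional[str] = None
    loyalty: Optional[str] = None
    blood_status: Optional[str] = None
    nationality: Optional[str] = None
    species: Optional[str] = None
    house: Optional[str] = None
    gender: Optional[str] = None

@dataclass
class Chapter:
    """A chapter of a book and the characters first appearing in it."""
    number: int
    characters: list = field(default_factory=list)

# Hypothetical usage:
harry = Character(
    name="Harry Potter",
    url="https://harrypotter.fandom.com/wiki/Harry_Potter",
    house="Gryffindor",
)
chapter_one = Chapter(number=1, characters=[harry])
print(chapter_one.characters[0].name)  # Harry Potter
```
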
We use the get_characters_by_chapter method to scrape the list of characters from the wiki. The method takes the URL of the page to scrape as an argument and returns a list of Chapter objects, each containing a chapter number and a list of Character objects.
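Here is a standalone sketch of that method, split into a fetch step and a parse step so the parsing can be exercised without network access. It assumes, as described earlier, one article-table per chapter in document order:

```python
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, field

BASE_URL = "https://harrypotter.fandom.com"

@dataclass
class Character:
    name: str
    url: str

@dataclass
class Chapter:
    number: int
    characters: list = field(default_factory=list)

def parse_chapters(html: str) -> list:
    """Build Chapter objects from an index page's HTML, assuming one
    article-table per chapter in document order (adjust if the wiki
    structures its chapters differently)."""
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    for number, table in enumerate(
        soup.find_all("table", class_="article-table"), start=1
    ):
        characters = [
            Character(name=a["title"], url=BASE_URL + a.get("href", ""))
            for a in table.find_all("a", title=True)
        ]
        chapters.append(Chapter(number=number, characters=characters))
    return chapters

def get_characters_by_chapter(url: str) -> list:
    """Fetch an index page and return its chapters. A User-Agent header
    is set because some wikis reject anonymous scripted requests."""
    response = requests.get(url, headers={"User-Agent": "hp-network-scraper"})
    response.raise_for_status()
    return parse_chapters(response.text)
```

The original implementation is a method on a scraper class; the functional form above keeps the sketch self-contained.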
Finally, we can put everything together to scrape the list of characters from the wiki. We define a scrape method that takes a dictionary of book numbers and their corresponding URLs as an argument and returns a dictionary of book numbers and a list of Chapter objects. We use the tqdm library to display a progress bar while the scraping is in progress.
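The driver loop might be sketched like this. As a design choice for this sketch, the per-book fetcher is passed in as a parameter (rather than called directly, as the post's scraper class likely does) so the loop can be tested without hitting the network:

```python
from tqdm import tqdm

def scrape(book_urls: dict, get_chapters) -> dict:
    """Scrape every book's character index with a progress bar.

    book_urls maps book numbers to index-page URLs; get_chapters is a
    callable(url) -> list of Chapter objects, such as the
    get_characters_by_chapter method described above.
    """
    results = {}
    # tqdm wraps the iterable and renders a progress bar as we go.
    for book, url in tqdm(sorted(book_urls.items()), desc="Scraping books"):
        results[book] = get_chapters(url)
    return results
```

With the real fetcher plugged in, scrape({1: url}, get_characters_by_chapter) returns a dictionary mapping book 1 to its list of Chapter objects.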