Ozgur's Blog

Random ramblings of personal nature

Web Scraping with Python & Beautiful Soup


Installation of BeautifulSoup4

To install BeautifulSoup and its requirements we can use pip:

pip install beautifulsoup4  # "bs4" on PyPI is only a stub package that pulls this one in
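To confirm the install worked, a quick sanity check is to parse a one-line snippet:

```python
from bs4 import BeautifulSoup

# Parse a tiny snippet to confirm the import and the parser both work
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # → hello
```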

Goals and Initial Steps

As an example I will be using a downloaded copy of my website's homepage, test.html, which sits in the same folder as my script.

From here I want to scrape the

  • title of the post
  • date of publication
  • summary of the post
  • URL of the post

First I need to check the structure of the data I want to scrape. To do that, I open the webpage in my browser, right-click on a post, and click Inspect to see its HTML.

I see that my articles reside in an article tag. So if I can somehow get ALL the articles and parse their data, I can fulfill my goals.

To do that I create a script like this and run an initial check that my selection is correct:

from bs4 import BeautifulSoup


webpage = open("test.html")
soup = BeautifulSoup(webpage,"html.parser")  # Passing the file to the bs4 parser, notifying that it should interpret this as html



for article in soup.find_all("article",attrs={"class":"row teaser"}):  # "article" alone would match, but adding the class makes the intent clearer
    print(article)  # Let's see what we get here

The script opens the document like any other file and hands it to the BeautifulSoup parser, which gives us access to methods like find_all.
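One aside: the script above never closes the file handle. A with block is the more idiomatic way to read the file; the snippet below shows the same parsing logic, first writing a tiny one-article stand-in for test.html so it runs on its own:

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the downloaded homepage, so this snippet is self-contained
with open("test.html", "w", encoding="utf-8") as f:
    f.write('<article class="row teaser"><h4><a href="./post.html">A post</a></h4></article>')

# Reading inside a "with" block closes the handle automatically
with open("test.html", encoding="utf-8") as webpage:
    soup = BeautifulSoup(webpage, "html.parser")

for article in soup.find_all("article", attrs={"class": "row teaser"}):
    print(article)
```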

The printout of this statement gives us this result:

<article class="row teaser">
<header class="col-sm-4 text-muted">
<ul>
<li title="2019-11-19T10:20:00+03:00">
<i class="fa fa-clock-o"></i>
            Tue 19 November 2019
        </li>
<li>
<i class="fa fa-folder-open-o"></i>
<a href="./category/20mins.html">20Mins</a>
</li>
<li>
<i class="fa fa-user-o"></i>
<a href="./author/ozgur-ozan-cakmak.html">Ozgur Ozan Cakmak</a> </li>
</ul>
</header>
<div class="col-sm-8">
<h4 class="title">
<a href="./nietzsche-20-mins.html">Nietzsche - 20 Mins</a>
</h4>
<div class="content">
        Nietzsche is a philosopher I love and adore. Here is me trying to sum up his thoughts in 20 Mins
        </div>
</div>
</article>

<article class="row teaser">
<header class="col-sm-4 text-muted">
<ul>
<li title="2019-11-19T00:00:00+03:00">
<i class="fa fa-clock-o"></i>
            Tue 19 November 2019
        </li>
<li>
<i class="fa fa-folder-open-o"></i>
<a href="./category/20mins.html">20Mins</a>
</li>
<li>
<i class="fa fa-user-o"></i>
<a href="./author/ozgur-ozan-cakmak.html">Ozgur Ozan Cakmak</a> </li>
</ul>
</header>
<div class="col-sm-8">
<h4 class="title">
<a href="./two-factor-authentication-20-mins.html">Two Factor Authentication - 20 Mins</a>
</h4>
<div class="content">
        What is Two Factor Authentication written in 20 Mins
        </div>
</div>
</article>
[...]

As you can see, we are getting the correct part of the website: only article nodes are output, and they carry all the information we need. Now let's parse them to extract that information, stripped of the surrounding markup.

Data Extraction

Title

Let's print out the title first. To do that I must find the h4 node in each article item and access its text:

from bs4 import BeautifulSoup

webpage = open("test.html")
soup = BeautifulSoup(webpage,"html.parser")  # Passing the file to the bs4 parser, notifying that it should interpret this as html

for article in soup.find_all("article",attrs={"class":"row teaser"}): 
    print(article.find("h4").text)

Boom! Let's see the results:

Nietzsche - 20 Mins


Two Factor Authentication - 20 Mins


Staged Scanning with Nmap

The problem is that the text itself is not clean: the surrounding whitespace and newline characters mess up the output. To clean it up we use our friend strip():

for article in soup.find_all("article",attrs={"class":"row teaser"}):  
    print(article.find("h4").text.strip())

Yaas! We are on fire:

Nietzsche - 20 Mins
Two Factor Authentication - 20 Mins
Staged Scanning with Nmap
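What strip() does here is purely string-level: it removes leading and trailing whitespace, newlines included. A standalone illustration, using the kind of raw text the h4 node produces:

```python
# The raw .text of the h4 node carries the template's indentation and newlines
raw = "\n            Nietzsche - 20 Mins\n        "
print(repr(raw))    # the messy original
print(raw.strip())  # → Nietzsche - 20 Mins
```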

Date of Writing

At first glance the date of writing seems to reside in an i with the class fa fa-clock-o. But if we want a machine-readable value, we see that it actually lives on the li: its title attribute carries the data we want. To access a node's attribute we use [] notation. Notice that no .text is involved:

print(article.find("li")["title"]) # Date of writing vol1 2019-10-29T19:15:00+03:00

This is all well and good, but let's say all I want from this is the year-month-day part. To get it, I split on the T character and take the first element:

print(article.find("li")["title"].split("T")[0]) # 2019-10-29

The result could be split further into separate year, month, and day variables, but this is enough for me.
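If you ever need more than the plain string, the full timestamp also parses cleanly with the standard library; datetime.fromisoformat understands the +03:00 offset from Python 3.7 on. A quick sketch:

```python
from datetime import datetime

stamp = "2019-10-29T19:15:00+03:00"  # the li title attribute from above

# Quick and dirty: keep only the date part
print(stamp.split("T")[0])  # → 2019-10-29

# Proper parsing into a timezone-aware datetime
dt = datetime.fromisoformat(stamp)
print(dt.year, dt.month, dt.day)  # → 2019 10 29
```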

Summary

When we want to scrape the summary, we see that the summaries reside in a div with the class content. Let's grab that sucka! We'll use our trusty find and run strip() on the text it returns:

print(article.find("div", attrs={"class":"content"}).text.strip())

URL

Finally, let's nab the URL. But here we have a problem: there are three a tags in each article, and we only want the blog post URL, not my author bio or the category link. We can do this in two ways: the bad way, involving magic numbers, and the good way, involving proper selection of the a node.

The bad way runs find_all on all the a tags, grabs the last one by index, and accesses its href attribute. Again, notice that no .text is involved when reading an attribute:

print(article.find_all("a")[2]["href"])

This gives us the correct data, but I personally dislike this usage: the hard-coded index is a magic number, and it will break the moment I add another link.

The good way is to notice that our target a node resides inside an h4. No matter how many links I add, the a inside the h4 tag will always be the post link. So let's put the title node in a variable and get the a within it:

title = article.find("h4")
print(title.find("a")["href"])
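As an aside, the same "a inside an h4" selection can be written in one step with a CSS selector via select_one. The article snippet below is a trimmed-down stand-in for one of the real posts:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for one of the real article nodes
html = """
<article class="row teaser">
  <h4 class="title"><a href="./nietzsche-20-mins.html">Nietzsche - 20 Mins</a></h4>
</article>
"""
article = BeautifulSoup(html, "html.parser").article

# "h4 a" is a CSS selector: an <a> anywhere inside an <h4>
print(article.select_one("h4 a")["href"])  # → ./nietzsche-20-mins.html
```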

Of course, if I am putting title in a variable, I should restructure my program so I don't repeat myself (and so it reads better):

title = article.find("h4")
print(title.text.strip()) # title
print(article.find("li")["title"].split("T")[0]) # date of writing
print(article.find("div", attrs={"class":"content"}).text.strip()) # Summary    
print(title.find("a")["href"]) # URL

In its final state, our program looks something like this:

from bs4 import BeautifulSoup

webpage = open("test.html") # Open the file
soup = BeautifulSoup(webpage,"html.parser")  # Passing the file to the bs4 parser, notifying that it should interpret this as html

for article in soup.find_all("article",attrs={"class":"row teaser"}): # Get all article nodes with the classes row teaser
    title = article.find("h4")  # Get the h4 tag and set a variable
    print(title.text.strip()) # Get the title text
    print(article.find("li")["title"].split("T")[0]) # date of writing
    print(article.find("div", attrs={"class":"content"}).text.strip()) # Summary    
    print(title.find("a")["href"]) # URL
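A natural next step, beyond what this post needs, is to collect each article into a dictionary instead of printing, so the scraped data can be reused elsewhere. Here is that idea sketched against a minimal inline stand-in for test.html:

```python
from bs4 import BeautifulSoup

# Minimal one-post stand-in for the downloaded homepage
html = """
<article class="row teaser">
  <header><ul><li title="2019-11-19T10:20:00+03:00">Tue 19 November 2019</li></ul></header>
  <div class="col-sm-8">
    <h4 class="title"><a href="./nietzsche-20-mins.html">Nietzsche - 20 Mins</a></h4>
    <div class="content"> Nietzsche in 20 Mins </div>
  </div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

posts = []
for article in soup.find_all("article", attrs={"class": "row teaser"}):
    title = article.find("h4")
    posts.append({
        "title": title.text.strip(),
        "date": article.find("li")["title"].split("T")[0],
        "summary": article.find("div", attrs={"class": "content"}).text.strip(),
        "url": title.find("a")["href"],
    })

print(posts)
```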

Thanks for reading!