Ozgur's Blog

Random ramblings of personal nature

Running Beautiful Soup in the Real World


Introduction

Yesterday we looked at how to use Beautiful Soup on a local web page and showed some basic methods of scraping. Today we are going to deploy this against a real web page, which is again my own site. But this time we are going to make requests and use the response data as our scraping source. I am also going to show some good neighbor policies so you don't get banned for acting like a Denial of Service agent. Speaking of... please do not DDoS me, people!

Making Requests

Python has a built-in request module within its urllib package. To make requests, we import a class and a function from it:

from urllib.request import Request, urlopen

With those imported, we will craft our Request first and then fire it off with urlopen:

url = "https://ozgurcakmak.com.tr"
request = Request(url,headers={"User-Agent":"Mozilla/5.0"})  # Header setup - I will explain this below
webpage = urlopen(request).read() # Sending the request and reading the response
print(webpage)

The crucial piece here is the headers part. Without the headers, the website you are sending your request to may see your attempt as illegitimate and answer with a forbidden response. With this header we are basically saying:

Hi Mr. Web Server, sir! I am a user with a Mozilla web browser! No! I am totes not lying! Why do you think so?

Then we hand this request to the urlopen function, which fetches the URL specified in it and gives us a response we can read.
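If you are curious what a rejection looks like, here is a minimal sketch, assuming the server actually blocks the default Python user agent (not every server does); it catches the resulting HTTPError instead of crashing:

from urllib.request import Request, urlopen
from urllib.error import HTTPError

url = "https://ozgurcakmak.com.tr"

try:
    bare_request = Request(url)  # No User-Agent header this time
    urlopen(bare_request)
except HTTPError as error:
    print(error.code, error.reason)  # e.g. 403 Forbidden, if the server is picky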

Back to the successful case: the output is too ugly to copy-paste here, but what you'll get is the HTML structure of my web page with all the \n's and whatnot.
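If you want a more readable peek, remember that read() gives us a bytes object, so you can decode it before printing. A quick sketch, assuming the page is served as UTF-8:

print(webpage.decode("utf-8")[:500])  # Decode the raw bytes and show the first 500 characters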

Passing the Response to Beautiful Soup

If everything worked correctly and we saw this HTML data, we can pass it to Beautiful Soup just like yesterday and print out the pieces we care about:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://ozgurcakmak.com.tr"
request = Request(url,headers={"User-Agent":"Mozilla/5.0"})  
webpage = urlopen(request).read()

soup = BeautifulSoup(webpage,"html.parser")  # Passing the response to the bs4 parser, notifying that it should interpret this as html

for article in soup.find_all("article",attrs={"class":"row teaser"}):  
    title = article.find("h4")
    print(title.text.strip()) # title
    print(article.find("li")["title"].split("T")[0]) # date of writing
    print(article.find("div", attrs={"class":"content"}).text.strip()) # Summary    
    print(title.find("a")["href"]) # URL

Being a Good Neighbor

As you can imagine, I could wrap this kind of code in a for loop and fire off tens of thousands of requests in a very short time. There is no nice way of saying this: don't do it. It reads like a denial of service attack, and practically speaking, it is one. To give the server some breathing space between our requests, we can make the loop sleep with the time.sleep() function:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import time

for url in some_url_list:  # some_url_list stands in for whatever list of URLs you are scraping
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    webpage = urlopen(request).read()

    soup = BeautifulSoup(webpage, "html.parser")

    for article in soup.find_all("article", attrs={"class": "row teaser"}):
        title = article.find("h4")
        print(title.text.strip())  # Title
        print(article.find("li")["title"].split("T")[0])  # Date of writing
        print(article.find("div", attrs={"class": "content"}).text.strip())  # Summary
        print(title.find("a")["href"])  # URL

    time.sleep(3)  # Sleep for 3 seconds before moving to the next iteration
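A fixed three-second pause already helps, but adding a bit of random jitter makes the traffic look even less mechanical. A tiny sketch (the 3-to-5 second range is just an assumption, tune it to whatever the site can comfortably handle):

import random
import time

time.sleep(3 + random.random() * 2)  # Wait somewhere between 3 and 5 seconds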

Thanks for reading!