Want some awesome facts to start your day or just want to learn something new or out here for some fun !
Well, in this blog I will be sharing a script that I did over a weekend just for fun (and for a side project ๐).
What I will not discuss here:
How to Scrap ? That we will discuss !!
TechStack
- Python
- BeautifulSoup
Final Result (Before diving into the steps)
We will be saving all the scraped quotes in a text file. That would look something like this
Let's Do Scraping
Analyse The Website To Scrape
As we want to scrape the awesome facts, we are going to get that from Computer Facts
A look at webpage
Now, we want facts like 1. The first electronic computer ENIAC weighed more than 27 tons and took up 1800 square feet.
, so we should take a look at its source page first and find where is this present, to get the pattern (we will use that pattern later to scrape these quotes).
As we can see on page source, the facts start from <p>
tag on line 475
and there are several <p>
tags that contains these facts. If we look further, we would find that there are a lot of <p>
tags being used within the HTML. How would we differentiate our facts with other <p>
tags ?
If you looked hard, then you know ! There is a unique descriptor where facts are present, each fact precedes a with number (0-9) followed by (.) and space. Hence, we have our pattern ๐
We can ignore all other <p>
tags and take only that matches a particular pattern we identified. To match pattern, we will use regex module of python, re
.
Hence, the pattern we identified is
- All facts are inside
<p>
tags. - There are other contents inside
<p>
tag as well. - To get facts, we will use regex to extract required content, leaving other
<p>
tags as it is !
Import Required Libraries And Define Constants
Before doing the task of scraping, we need to import libraries and define some constants.
# Standard Imports
import os
import re
from typing import List
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from bs4.element import Tag
import os
will import python standard library, used to make directory and other OS stuff.re
will be used to extract a pattern using some regex.- Then we import
List
fromtyping
, to explicitly type hint some arguments passed to some functions. Request
andurlopen
from standard libraryurllib
. They would help us to get HTML data we want.BeautifulSoup
from third-party librarybs4
, this creates a easy to manage BeautifulSoup object, that represents HTML content.Tag
frombs4.element
, again to be used for type hinting.
QUOTES_FILENAME = "/funfacts.txt"
QUOTES_TXT_PATH = os.getcwd() + "/funfacts"
QUOTES_FILE_PATH = QUOTES_TXT_PATH + QUOTES_FILENAME
QUOTES_URL = "https://gotechug.com/interesting-facts-about-computers-you-didnt-know/"
QUOTES_FILENAME
represents the filename of the text file where the facts will be stored.QUOTES_TXT_PATH
represents the folder where thefunfacts.txt
will reside.QUOTES_FILE_PATH
file path, to be used further to store data.QUOTES_URL
url to webpage we want to scrape.
Why name the variables as QUOTES
? I had another script to scrape some great quotes and reused that same for scraping awesome facts. Need to know more about scraping quotes ? Visit this link.
Create BeautifulSoup Object
Now, we have some constants to play with, we will use QUOTES_URL
to get the HTML of the page and create a BeautifulSoup object to filter our facts.
def get_bs4_obj(url: str) -> BeautifulSoup:
'''
Get BeautifulSoup object for given QUOTES_URL.
'''
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
bs4Obj = BeautifulSoup(html, 'html.parser')
return bs4Obj
We create a function get_bs4_obj
that takes url
a string type object and returns the required object bs4Obj
.
Request
helps us to bypass the security that is blocking the use ofurllib
based on the user agent. Hence, we need to giveMozilla
or any other browser user agent to let us access their source page.- Then we do
urlopen
with the defined user agent inRequest
and read it's content. - At last we create
BeautifulSoup
object usinghtml.parser
to parse the html contents and return the required object.
Steps I used to check if we are allowed to scrape the site or not
import requests #pip install requests
req = requests.get(QUOTES_URL)
print(req.status_code) #should be 200, other than 200 means scraping not/partially allowed
Filter Tags
With having bs4Obj
, now it is possible to filter the HTML on <p>
tags, we can achieve that using below functions
def filter_p_tags(ptag: Tag) -> bool:
'''
Get required p tags only.
'''
return re.match(r"[0-9]+.", ptag.get_text())
def get_p_tags(bs4Obj: BeautifulSoup) -> List[Tag]:
'''
Get all p tags from the BeautifulSoup obj.
Note: It is the requirement for given QUOTES_URL, it shall be different for different URL to scrap.
'''
allP = bs4Obj.find_all('p')
allReleventP = list(filter(filter_p_tags, allP))
return allReleventP
See how we used the List
and Tag
for type hints and BeautifulSoup
as well. Type hints are not mandatory and are not checked by python, these are just for developers and know that Python will always be dynamically typed !
allP
list contains all <p>
tags available in HTML. We filter them using python's inbuilt filter
function on the condition that, if text of any <p>
tag does not have a pattern matching regex [0-9]+.
then filter them out from allP
list.
Pattern matching regex, we define that in function filter_p_tags
, where we use re.match
to match the occurence of our pattern in the given text. You can do that in get_p_tags
itself, but I think it is better to have such task done in pure functions instead of making them in single function.
Store all relevent tags in variable allReleventP
and return that (to be used by another function !)
Get Facts
Now, we will create a generator to yield the facts we need to save in a text file. We will be using allReleventP
variable passed to the new function.
def get_all_facts(ptags: List[Tag]):
'''
Yield all facts present in p tags.
'''
for p in ptags:
fact = re.sub(r"[0-9]+\. ", "", p.get_text())
yield fact + "\n"
Here we are extracting text present in <p>
tag. But wait !! We don't need preceding number, right ? That's why we are using re.sub
to substitute it with empty string. Regex we used "[0-9]+\. "
, it will take any number followed by .
and a space.
We will use this generator in next section, inside another function.
Save Facts
Here is the function
def save_fun_facts(ptags: List[Tag]):
'''
Save extracted facts in a text file, create a new folder if not already present
'''
global QUOTES_TXT_PATH, QUOTES_FILE_PATH
if not os.path.exists(QUOTES_TXT_PATH):
os.mkdir(QUOTES_TXT_PATH)
with open(QUOTES_FILE_PATH, 'w') as file:
for txt in get_all_facts(ptags):
file.write(txt)
print(f'All Fun Facts written to file: {QUOTES_FILE_PATH}')
We take global constants QUOTES_TXT_PATH
and QUOTES_FILE_PATH
to be used for writing to the file. We check if the directory exists or not (the folder where we will save our file). Then we open the file in the created directory (if it not exists).
Now here comes the use of generator, we call generator on each <p>
tag that yields text present in it.
Main
Now the script to execute the code
if __name__ == "__main__":
bs4Obj = get_bs4_obj(QUOTES_URL)
allP = get_p_tags(bs4Obj)
save_fun_facts(allP)
If you check now, the facts should be appearing in the file created funfacts.txt
inside the folder /funfacts
.
Well, that was it, if you followed this then you now know the gist of general process for web scraping and how to scrape Awesome Facts !
It was just a drop in the ocean full of different functionalities you can achieve with web scraping, but the general idea remains the same:
- Get the website we want to scrape.
- Analyse the required components we need and find the pattern.
- Create BeautifulSoup object to ease the process and access HTML without much problem.
- Create small functions to get the task done.
- Finally integrate all function and see the magic of the script !
Full python script can be found here.
GitHub action where this script is used. Would love to hear your feedback on this as well.
Just starting your Open Source Journey ? Don't forget to check out Hello Open Source
Want to make a simple and awesome game from scratch ? Check PongPong
Till next time !
Namaste ๐