Web scraping with BeautifulSoup - tutorial series

from bs4 import BeautifulSoup

BeautifulSoup is one of the best tools for web scraping with Python. But hey! BeautifulSoup isn’t actually a scraper; it is better to call it a reader that lets us read and extract useful information from a web page.

So let’s import BeautifulSoup from bs4…

I know you probably have a lot of questions about web scraping, Python and data cleaning… We will get to them, so relax.

Web scraping with BeautifulSoup – Tutorial Series 1

Web scraping in Python with BeautifulSoup – Process

Many people confuse bs4 with Scrapy and Selenium, but these are totally different tools. We will discuss how they differ from each other.

But first…

What is BeautifulSoup?

BeautifulSoup is a library that reads a web page and lets you extract information from it in a well-organized way. BeautifulSoup is easy to learn and has a gentle learning curve.

BS4, or BeautifulSoup, makes our web scraping journey super easy.

Getting started with BeautifulSoup

This article is part of a tutorial series, which is why I am going to keep this part super short.

Install beautifulsoup4 and requests

Here is how you can install bs4 (the beautifulsoup4 package) and requests:

pip install beautifulsoup4 requests

So, what is requests now?

What does requests do in Python?

Requests allows us to send HTTP/1.1 requests and download the requested pages using Python. Headers, form data, multipart files, and query parameters can be added to a request via simple Python dictionaries.
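To make that a bit more concrete, here is a minimal sketch of attaching parameters and headers to a request. The URL, the parameter values and the User-Agent string are made up for illustration, and the request is only prepared, not sent, so you can see what would go out.

```python
import requests

# Hypothetical endpoint and values, used only for illustration.
url = "https://example.com/search"
params = {"q": "cabbage", "page": 2}
headers = {"User-Agent": "my-scraper/0.1"}

# Build the request without sending it, so we can inspect it offline.
prepared = requests.Request("GET", url, params=params, headers=headers).prepare()

print(prepared.url)                    # the query string is built from params
print(prepared.headers["User-Agent"])  # the custom header is attached

# To actually download the page you would call something like:
# response = requests.get(url, params=params, headers=headers, timeout=10)
```

Preparing the request first is just a convenient way to look at the final URL; in everyday scraping you would usually call requests.get directly.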

What can you do with BeautifulSoup?

As I’ve told you, BeautifulSoup converts the downloaded web page into a more readable and accessible format, from which the required information can be extracted easily.
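Here is a small sketch of what that looks like in practice. The HTML string is a made-up stand-in for a downloaded page, and I am assuming Python's built-in 'html.parser' so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A made-up HTML string standing in for a downloaded page.
html = (
    "<html><head><title>My title</title></head>"
    "<body><p class='intro'>Hello <b>soup</b></p></body></html>"
)

# Parse the raw text into a BeautifulSoup object.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string          # text inside <title>
intro = soup.find("p", class_="intro")  # first <p> with class "intro"
intro_text = intro.get_text()           # all text inside that <p>

print(title_text)
print(intro_text)
```

Once the page is parsed, tags can be reached by name (soup.title) or searched for with find, which is the "more accessible format" part.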

Import beautifulsoup4 and requests

Let’s import requests and beautifulsoup4 to get started…

from bs4 import BeautifulSoup
import requests

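Instead of from bs4 import BeautifulSoup, you can also do import bs4 and later use bs4.BeautifulSoup. Keep in mind that the class name is case-sensitive, so from bs4 import beautifulsoup or from bs4 import Beautifulsoup raises an ImportError.

import bs4
soup = bs4.BeautifulSoup('<p>Hello</p>', 'html.parser')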

Writing some code

Here is a simple example of scraping a page with beautifulsoup4. Suppose the page at URL_TO_SCRAPE.html contains the following HTML:

<!-- INSIDE URL_TO_SCRAPE.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My title</title>
</head>
<body>
    <ul class="vegetables">
        <li class="vegetable">Cabbage</li>
        <li class="vegetable">Cauliflower</li>
        <li class="vegetable">Tomato</li>
    </ul>
</body>
</html>

This web page can be scraped with the bs4 code below in Python.

# URL_TO_SCRAPE.html is a placeholder; replace it with the full URL of the page
response = requests.get('URL_TO_SCRAPE.html')
# 'html.parser' ships with Python; 'html5lib' also works after pip install html5lib
soup = BeautifulSoup(response.text, 'html.parser')

# 'vegetables' and 'vegetable' here are the values of the class attribute
veggies = soup.find('ul', class_='vegetables').find_all('li', class_='vegetable')
print(veggies)

Here, URL_TO_SCRAPE means the URL of the site or web app you want to scrape, and veggies is the list of <li> elements scraped from that URL.

[<li class="vegetable">Cabbage</li>, <li class="vegetable">Cauliflower</li>, <li class="vegetable">Tomato</li>]
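Usually you want the text inside the tags rather than the tags themselves. Here is a minimal sketch of that step; to keep it runnable without a network connection, it parses a trimmed-down, made-up copy of the list from a string instead of downloading it.

```python
from bs4 import BeautifulSoup

# A trimmed-down, made-up stand-in for the sample page, embedded as a
# string so the example runs offline.
html = """
<ul>
    <li>Cabbage</li>
    <li>Cauliflower</li>
    <li>Tomato</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the list, then collect every <li> inside it.
items = soup.find("ul").find_all("li")

# get_text(strip=True) returns the tag's text without surrounding whitespace.
names = [li.get_text(strip=True) for li in items]
print(names)  # ['Cabbage', 'Cauliflower', 'Tomato']
```

find_all returns Tag objects, so calling get_text on each one is what turns the scraped elements into plain Python strings.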

All about web scraping

Now, I am going to answer some of the most frequently asked questions about web scraping. Web scraping deserves respect, so read the whole part.

Is web scraping legal?

Web scraping is generally legal if you use it to scrape public content and articles available on the internet. Using it to scrape login-required content, however, could be an issue. Asking permission from the companies you’re scraping is a better and more authorized way to scrape login-required content.

How difficult is web scraping?

Web scraping is extremely easy if the site you’re scraping is a static website. Problems may arise when you try to scrape a JavaScript-based website.
But most JavaScript-based websites load their data from an API through HTTP requests (often POST requests), and getting information from an API is not that hard.
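To show why API data is easy to work with, here is a small sketch. The JSON payload, its field names and the endpoint in the comment are all invented for illustration; a real site will have its own structure, which you can usually find in the browser's network tab.

```python
import json

# Invented JSON payload, standing in for what an API endpoint might return.
payload = (
    '{"vegetables": ['
    '{"name": "Cabbage", "price": 1.5}, '
    '{"name": "Tomato", "price": 0.8}]}'
)

# JSON maps straight onto Python dicts and lists, so no HTML parsing is needed.
data = json.loads(payload)
names = [item["name"] for item in data["vegetables"]]
print(names)  # ['Cabbage', 'Tomato']

# With a real (here hypothetical) endpoint, requests does the decoding for you:
# response = requests.get("https://example.com/api/vegetables", timeout=10)
# data = response.json()
```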

Which programming language is best for web scraping?

Without a doubt, Python, ’cause it’s easy and flexible and its syntax is super short. Languages like Java and JavaScript (Node.js) also have libraries for web scraping, and if you can invest your time in these languages, you’re good to go.

Why is BeautifulSoup used in Python?

BeautifulSoup is basically a Python library that converts HTML and XML files into a hierarchical, more readable format through which we can easily access the data in that file, and yes, it makes our web scraping process much easier.
It builds a parse tree from the page source code, which can be used to extract information in a hierarchical and more readable way.
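A quick sketch of what "parse tree" means here: every tag knows its parent, its children and its siblings. The HTML string is a made-up example, written without whitespace between tags so that the only children are the tags themselves.

```python
from bs4 import BeautifulSoup

# Made-up HTML with no whitespace between tags, so children are tags only.
html = "<html><body><ul><li>Cabbage</li><li>Tomato</li></ul></body></html>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.body.ul                                    # walk down by tag name
parent_name = ul.parent.name                         # walk up to the parent
child_names = [child.name for child in ul.children]  # direct children
second_item = ul.li.find_next_sibling("li").string   # move sideways

print(parent_name)   # body
print(child_names)   # ['li', 'li']
print(second_item)   # Tomato
```

In real pages the whitespace between tags shows up as text nodes among the children, so find_all is often the more convenient way to collect tags.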

Thanks for reading.

I’ll soon post Best Practices For Web Scraping and How to scrape websites without getting blocked. You can also check out our Web scraping with Python Part 2.

Happy Web Scraping and Automation!
