Scrape a website using Python

Welcome back to the Python web scraping tutorial series. In this second article, we will scrape a website with Python and extract data from it with BeautifulSoup.

The website we are going to scrape is a simple HTML/CSS image-sharing site, so we can concentrate on the core concepts of web scraping rather than worrying about the website's structure.

Now, let’s start our mission of scraping the given website with Python.

Configuring Virtual Environment

Creating a virtual environment and working inside it is one of the best practices for Python projects.

Run the following commands if you're on a Windows device.

pip install virtualenv   # install virtualenv (optional: venv below is built into Python 3)
python -m venv env       # create a virtual environment named env
env\Scripts\Activate.ps1 # activate it (PowerShell)

Run the following commands if you're on a Linux device. Use the package manager that matches your distro (apt/yum/pacman, ...) to install virtualenv.

sudo apt install virtualenv # install Virtual Environment
virtualenv env           # create Virtual Environment
source env/bin/activate  # activate Virtual Environment

Install dependencies and requirements

After activating the virtual environment for our scraping project, let's install the requirements.

pip install requests       # Python Requests
pip install beautifulsoup4 # BeautifulSoup4 (the bs4 package on PyPI wraps this)
# the html.parser used later is built into Python, so nothing extra is needed
pip install requests beautifulsoup4 # or install both in one go

Initializing web scraping

Create a new file named scrape.py (or any name you want), then import BeautifulSoup and requests in that file.

from bs4 import BeautifulSoup
import requests

Now, time for some variables! Let's create our first variable, site_url:

site_url = 'https://img.webmatrices.com/'

If you run into any issues with this site, I've provided the HTML source in the GitHub repo.

Extracting HTML source

We need the HTML source to scrape the site, because the HTML source is where the data actually lives.

To download the HTML source of a website, we use Python Requests: a requests.get call fetches the page, and the response is then converted to a text string.

response = requests.get(site_url)
source = response.text

Here, requests.get sends the request and receives the response from the server, and response.text gives the whole HTML source of the website as a string.
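If you want a small safety net (not part of the original snippet), you can ask requests to fail loudly on bad responses:

response = requests.get(site_url, timeout=10)  # timeout stops the call from hanging forever
response.raise_for_status()                    # raises an HTTPError for 4xx/5xx responses
source = response.text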

Parsing HTML with BeautifulSoup

Now comes the parsing step. We will use BeautifulSoup with Python's built-in html.parser to parse the HTML.

soup = BeautifulSoup(source, 'html.parser')

Here, BeautifulSoup cooks and parses the HTML source of the webpage using html.parser. The parsed soup is organized so that we can extract any HTML element with just the find and find_all methods.
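As a quick illustration of what the soup lets you do (a hypothetical example, not part of the scraper), you can grab single elements straight away:

title = soup.find('title')   # the first <title> element, or None if absent
print(title.text)            # the page title as plain text

first_link = soup.find('a')  # the first <a> element in the document
print(first_link['href'])    # its href attribute (assumes the page has a link)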

Scraping data begins

First, we need to understand how the relevant chunk of HTML is rendered in the webpage, and to do that we need to pin down our data requirements.

Data requirements: we will extract the description, image source, and image alt text of every image.

With the requirements in hand, we need to isolate the relevant piece of HTML from the rendered page.

<div class="gallery">
  <a target="_blank" href="./site-source-code_files/Bishwas_Bhandari.jpg">
    <img
      src="./site-source-code_files/Bishwas_Bhandari.jpg"
      alt="Subscribe to Developer Bishwas Bhandari"
      width="600"
      height="400"
    />
  </a>
  <div class="desc">Subscribe to Developer Bishwas Bhandari</div>
</div>

Now, we will find all the div elements with the gallery class name (i.e. <div class="gallery">), because those elements contain the data we need.

gallery_elements = soup.find_all('div', class_='gallery')
# or .find_all('div', {'class': 'gallery'})

soup.find_all('div', class_='gallery') returns all matching elements as a list, which we store in the gallery_elements variable.
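Since find_all returns an empty list when nothing matches, a small sanity check before looping can save some head-scratching; this is an optional addition:

print(len(gallery_elements))  # how many <div class="gallery"> blocks were found
if not gallery_elements:
    print('No gallery elements found - check the selector or the HTML source')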

Now, we will be looping through gallery_elements in order to access each element inside it.

for index, gallery_element in enumerate(gallery_elements):

Inside this loop, we find the img tag in each gallery_element.

image = gallery_element.find('img')

So, we've got the img tag; now we'll read its attributes. In general, to get an attribute of a souped HTML element:

<tag attribute_name='my-attribute'>Inside Tag.. </tag>
souped_HTML_element = soup.find('tag', class_='some-class')
attribute = souped_HTML_element['attribute_name']

So, to get the source/path of an image, we do:

image_source = image['src']

and the same for the image alt text:

image_alt = image['alt']
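One caveat: indexing like image['src'] raises a KeyError if the attribute is missing. If you're not sure every img tag carries both attributes, the element's .get() method returns None (or a default) instead:

image_source = image.get('src')              # None if the attribute is missing
image_alt = image.get('alt', 'no alt text')  # or fall back to a default value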

Something similar goes for the description:

description = gallery_element.find('div', class_="desc")

So far we've looped through gallery_elements, but we still need to collect clean data inside the loop.

for index, gallery_element in enumerate(gallery_elements):
    image = gallery_element.find('img')
    image_source = image['src']
    image_alt = image['alt']
    # for the description
    description = gallery_element.find('div', class_="desc")
    # use .text to get the text out of the description element

To collect clean data, we create an empty dictionary before the loop starts: gallery_data = {}.

gallery_data = {}
for index, gallery_element in enumerate(gallery_elements):
    ...
    

Then we update gallery_data within the loop.

...
gallery_data.update({
    index: {
        'source': image_source,
        'alt': image_alt,
        'description': description.text
    }
})

Finally, let's print gallery_data:
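print(gallery_data)

This prints: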

{0: {'source': '/media/Bishwas_Bhandari.jpg', 'alt': 'Subscribe to Developer Bishwas Bhandari', 'description': 'Subscribe to Developer Bishwas Bhandari'}, 1: {'source': '/media/Existence_of_1d_object_and_its_motion_with1.gif', 'alt': 'Existance of 2d object', 'description': 'Existance of 2d object'}, 2: {'source': '/media/7495.jpg', 'alt': 'Developer Illustration - Programming Image', 'description': 'Developer Illustration - Programming Image'}, 3: {'source': '/media/linkedin_hashtag_generator.png', 'alt': 'linkedin hashtag generator', 'description': 'linkedin hashtag generator'}, 4: {'source': '/media/LinkedIn_Hashtag_Generator_icon.png', 'alt': 'LinkedIn Hashtag Generator Icon', 'description': 'LinkedIn Hashtag Generator Icon'}, 5: {'source': '/media/coder2.gif', 'alt': 'programmer coding gif', 'description': 'programmer coding gif'}, 6: {'source': '/media/suprised_with_big_eyes.gif', 'alt': 'Suprised with BIG eyes', 'description': 'Suprised with BIG eyes'}, 7: {'source': '/media/laughing_gif_meme.gif', 'alt': 'Laughing Gif Meme', 'description': 'Laughing Gif Meme'}, 8: {'source': '/media/lemme_see_lemme_see_gif_meme.gif', 'alt': 'Lemme see gif meme', 'description': 'Lemme see gif meme'}, 9: {'source': '/media/sheldon_meme_gif.gif', 'alt': "It's funny because it's true - sheldon meme", 'description': "It's funny because it's true - sheldon meme"}, 10: {'source': '/media/wow_meme_gif_template.gif', 'alt': 'Wow meme gif template', 'description': 'Wow meme gif template'}, 11: {'source': '/media/tej_magar_clone.jpg', 'alt': "tej magar clone'", 'description': "tej magar clone'"}, 12: {'source': '/media/dekha_apne_laparwahi_ka_natija_gif.gif', 'alt': 'dekha apne laparwahi ka natija gif', 'description': 'dekha apne laparwahi ka natija gif'}, 13: {'source': '/media/Affiliate_marketing_flarum_extension_free.png', 'alt': 'Affiliate marketing flarum extension free', 'description': 'Affiliate marketing flarum extension free'}, 14: {'source': '/media/linux-physics-chemistry.jpg', 'alt': 'Linux wallpaper for mobile, linux: physics, chemist', 'description': 'Linux wallpaper for mobile, linux: physics, chemist'}}

How to clean scraped data in Python?

But hey, that data is not easy to read and understand, so we will format it as JSON with the help of Python's built-in json library.

Import the json library

In order to use the json library, we first need to import it:

import json

Dump the dictionary into JSON format

Dumping the Python dictionary to JSON format makes the output far easier to read, and passing indent=4 makes it prettier still.

gallery_data = json.dumps(gallery_data, indent=4)

After printing this data, we will get…

{
    "0": {
        "source": "/media/Bishwas_Bhandari.jpg",
        "alt": "Subscribe to Developer Bishwas Bhandari",
        "description": "Subscribe to Developer Bishwas Bhandari"
    },
    "1": {
        "source": "/media/Existence_of_1d_object_and_its_motion_with1.gif",
        "alt": "Existance of 2d object",
        "description": "Existance of 2d object"
    },
    "2": {
        "source": "/media/7495.jpg",
        "alt": "Developer Illustration - Programming Image",
        "description": "Developer Illustration - Programming Image"
    },
    ...
    "14": {
        "source": "/media/linux-physics-chemistry.jpg",
        "alt": "Linux wallpaper for mobile, linux: physics, chemist",
        "description": "Linux wallpaper for mobile, linux: physics, chemist"
    }
}
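If you'd rather keep this output in a file than just on the screen, you can write the dumped string to disk; the filename gallery_data.json here is just an example:

# gallery_data now holds the formatted JSON string from json.dumps above
with open('gallery_data.json', 'w') as f:
    f.write(gallery_data)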

Task for you

The source values look like /media/image.png, which isn't the full path of the image. Why don't you turn each one into a full URL like https://img.webmatrices.com/media/image.png?
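As a hint (one possible approach, not the only one), urljoin from Python's built-in urllib.parse joins a base URL with a relative path:

from urllib.parse import urljoin

site_url = 'https://img.webmatrices.com/'
full_url = urljoin(site_url, '/media/image.png')
print(full_url)  # https://img.webmatrices.com/media/image.png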

Also, have a look at the final code (GitHub repo).
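That repo contains the finished script; for quick reference, here is a sketch of the full program assembled from the steps in this article (your filenames and variable names may differ):

import json

import requests
from bs4 import BeautifulSoup

site_url = 'https://img.webmatrices.com/'

# download the HTML source
response = requests.get(site_url)
source = response.text

# parse it with the built-in html.parser
soup = BeautifulSoup(source, 'html.parser')

# collect data from every <div class="gallery">
gallery_data = {}
gallery_elements = soup.find_all('div', class_='gallery')
for index, gallery_element in enumerate(gallery_elements):
    image = gallery_element.find('img')
    image_source = image['src']
    image_alt = image['alt']
    description = gallery_element.find('div', class_='desc')
    gallery_data.update({
        index: {
            'source': image_source,
            'alt': image_alt,
            'description': description.text,
        }
    })

# pretty-print the result as JSON
print(json.dumps(gallery_data, indent=4))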

Happy coding.
