Extract Data from HTML With BeautifulSoup

Hello everyone, I hope you are having a fun time going through our tutorial. In this article we will learn how to pull data out of HTML and XML, in other words web scraping (extracting information from websites).

For this I will be making use of the BeautifulSoup module.

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML files. The module provides a handful of methods and Pythonic idioms for navigating, searching, and modifying a parse tree, which can save programmers hours of coding.

Installation

These are the modules we need to get started; install them on your local machine using pip:

$ pip install requests
$ pip install beautifulsoup4
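
If you want a quick sanity check that both packages landed in the Python installation you plan to use (this step is optional and not part of the tutorial itself), try importing them; if the command exits without an ImportError, they are available.

$ python -c "import bs4, requests"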

Let us learn the module's features with examples; we will use the following HTML document throughout. Note that the interactive sessions in this tutorial use Python 2.7 syntax (print without parentheses and raw_input); on Python 3 you would write print(...) and input() instead.

html_doc = """
<html>
<head><title>Python Lovers</title></head>
<body>
<p class="hello">We are the co-founders of Python Lovers</p>
<p>1. Kamal </p>
<p>2. Ankur </p>
<p>3. Manish </p>
<p>4. Jaswinder </p>
<p>5. Mulasi </p>
<p>6. Aditya </p>
<a href="https://www.pythonlovers.net/">Python Lovers</a>
</body>
</html>"""

Passing the document to the BeautifulSoup constructor returns a BeautifulSoup object; when you run the commands below you should see output like this.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print soup.prettify()
<html>
 <head>
  <title>
   Python Lovers
  </title>
 </head>
 <body>
  <p class="hello">
   We are the co-founders of Python Lovers
  </p>
  <p>
   1. Kamal
  </p>
  <p>
   2. Ankur
  </p>
  <p>
   3. Manish
  </p>
  <p>
   4. Jaswinder
  </p>
  <p>
   5. Mulasi
  </p>
  <p>
   6. Aditya
  </p>
  <a href="https://www.pythonlovers.net/">
   Python Lovers
  </a>
 </body>
</html>
>>>

As mentioned in the introduction, we can navigate through this data structure. Let us explore it in a few simple steps.

>>> soup.title
<title>Python Lovers</title>
>>> soup.title.name
u'title'
>>> soup.p
<p class="hello">We are the co-founders of Python Lovers</p>
>>> soup.p['class']
[u'hello']
>>> soup.a
<a href="https://www.pythonlovers.net/">Python Lovers</a>
>>> soup.find_all('p')
[<p class="hello">We are the co-founders of Python Lovers</p>, <p>1. Kamal </p>, <p>2. Ankur </p>, <p>3. Manish </p>, <p>4. Jaswinder </p>, <p>5. Mulasi </p>, <p>6. Aditya </p>]
>>> print soup.get_text()
Python Lovers
We are the co-founders of Python Lovers
1. Kamal 
2. Ankur 
3. Manish 
4. Jaswinder 
5. Mulasi 
6. Aditya 
Python Lovers
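
The session above navigates by tag name only. As a small extra sketch beyond the original walkthrough, find_all() also accepts attribute filters: in beautifulsoup4 the class_ keyword (note the trailing underscore) matches by CSS class, and a tag's attributes can be read like dictionary keys.

>>> soup.find_all('p', class_='hello')
[<p class="hello">We are the co-founders of Python Lovers</p>]
>>> soup.a['href']
u'https://www.pythonlovers.net/'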

In the next section, let us consider how we can extract URLs from a given website.

Here is a small program that achieves the task. Before moving ahead, I expect you have installed the necessary modules; if not, please do so now. We only require beautifulsoup4 and requests.

from bs4 import BeautifulSoup
import requests

# Ask the user for a URL and download that page
UrlEntered = raw_input("Please enter a Website to fetch the various URL's ( begin with https://) : ")
requesting = requests.get(UrlEntered)
information = requesting.text

# Parse the downloaded HTML into a BeautifulSoup object
soupObject = BeautifulSoup(information)

# Every <a> tag's href attribute is a URL the page links to
for urls in soupObject.find_all('a'):
    print(urls.get('href'))

This small program asks the user for a website and displays the URLs that the page contains. For the demonstration I will request the "https://www.google.com/" site and print the URLs it returns.

We get the following output:

Ankurs-MacBook-Pro:documents ankurgupta$ python beauti.py
Please enter a Website to fetch the various URL's ( begin with https://) : https://www.google.com/
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
To get rid of this warning, change this:
 BeautifulSoup([your markup])
to this:
 BeautifulSoup([your markup], "html.parser")
  markup_type=markup_type))
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/options/
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.co.in/%3Fgfe_rd%3Dcr%26ei%3DLQW2VumcGsaFvATzwbKYBw
/chrome/browser/?hl=en&brand=CHNG&utm_source=en-hpp&utm_medium=hpp&utm_campaign=en
/advanced_search?hl=en-IN&authuser=0
/language_tools?hl=en-IN&authuser=0
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=hi&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=bn&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=te&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=mr&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=ta&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=gu&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=kn&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=ml&source=homepage
https://www.google.co.in/setprefs?sig=0_ir4pnyw3rqkaQ4mhDB4c0hux8zA%3D&hl=pa&source=homepage
/intl/en/ads/
http://www.google.co.in/services/
https://plus.google.com/104205742743787718296
/intl/en/about.html
https://www.google.co.in/setprefdomain?prefdom=US&sig=__a8OSuO-Cwp6aJdUiXH2B0HRAhDw%3D
/intl/en/policies/privacy/
/intl/en/policies/terms/
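
As the UserWarning at the top of the output suggests, you can silence it by naming the parser explicitly when building the soup object; it is a one-line change to the script above (and if you run it on Python 3, also swap raw_input() for input()):

soupObject = BeautifulSoup(information, "html.parser")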

That is pretty much all we will cover in this article to get you started; the rest of the exploration is up to you. I hope you have understood this part of our tutorial. If you want to explore the more interesting sides of BeautifulSoup4, you can visit the following sites:

http://www.crummy.com/software/BeautifulSoup/

http://docs.python-requests.org/en/latest/index.html

In case of any queries or questions, do reach out to us; we will help you in the best possible way to keep things going in your favour. Thank you.