Implementing Web Scraping in Python with BeautifulSoup


BeautifulSoup is a class in the bs4 module of Python. Its basic purpose is to parse HTML or XML documents.

Installing bs4 (in short, beautifulsoup)

It is easy to install beautifulsoup using the pip module. Just run the below command in your command shell.

pip install bs4

After running the above command in your terminal, you will see something like this −

C:\Users\rajesh>pip install bs4
Collecting bs4
Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in c:\python\python361\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
Building wheel for bs4 (setup.py) ... done
Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1

To verify whether BeautifulSoup is successfully installed on your machine, run the below command in the same terminal−

C:\Users\rajesh>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>

Successful, great!
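If you also want to see which release of beautifulsoup4 was pulled in, the bs4 package exposes a version string (the exact value depends on what pip installed on your machine):

```python
import bs4

# bs4 exposes the installed beautifulsoup4 version; the exact value
# depends on which release pip resolved on your machine.
print(bs4.__version__)
```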

Example 1 

Find all the links from an HTML document.

Assume we have an HTML document and want to collect all the reference links in it. First, we store the document as a string, like below −

html_doc='''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.linkedin.com'></a>
<a href='www.finance.google.com'></a>'''

Now we will create a soup object by passing the above variable html_doc to the BeautifulSoup constructor.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Now that we have the soup object, we can apply the methods of the BeautifulSoup class on it. We can find all the tags in html_doc and the values of their attributes.

for tag in soup.find_all('a'):
    print(tag.get('href'))

In the above code, we loop over every <a> tag in the document returned by find_all() and read its href attribute with get().
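As a side note, get() is only one way to read an attribute from a tag. A minimal sketch with a made-up one-link document shows the common alternatives and how they differ when the attribute is missing:

```python
from bs4 import BeautifulSoup

# A tiny made-up document, just to illustrate attribute access.
doc = "<a href='www.example.com' id='home'></a>"
soup = BeautifulSoup(doc, 'html.parser')

tag = soup.find('a')
print(tag.get('href'))   # safe lookup, like dict.get()
print(tag['href'])       # subscript form; raises KeyError if absent
print(tag.attrs)         # the whole attribute dictionary
print(tag.get('title'))  # None -- get() does not raise for missing attributes
```

The subscript form is convenient when you already filtered with href=True; get() is safer when a tag may lack the attribute.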

Below is our complete code to get all the links from the html_doc string.

from bs4 import BeautifulSoup

html_doc='''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for tag in soup.find_all('a'):
    print(tag.get('href'))

Result

www.Tutorialspoint.com
www.nseindia.com
www.codesdope.com
www.google.com
www.facebook.com
www.wikipedia.org
www.twitter.com
www.microsoft.com
www.github.com
www.nytimes.com
www.youtube.com
www.reddit.com
www.python.org
www.stackoverflow.com
www.amazon.com
www.rediff.com
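The same links can also be collected with a CSS selector: soup.select('a[href]') matches only the <a> tags that actually carry an href attribute. A minimal sketch with a made-up three-link document:

```python
from bs4 import BeautifulSoup

# A made-up document: two links with href, one without.
html_doc = """<a href='www.google.com'></a>
<a href='www.python.org'></a>
<a>no href here</a>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# 'a[href]' is a CSS attribute-presence selector, so tags without
# an href are skipped and the subscript access below is safe.
links = [tag['href'] for tag in soup.select('a[href]')]
print(links)
```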

Example 2

Print all the links from a website that contain a specific string (for example: “python”) in the link.

The below program will print all the URLs from a specific website which contain “python” in their link.

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html = urlopen("http://www.python.org")
content = html.read()
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all('a', href=True):
    if re.findall('python', a['href']):
        print("Python URL:", a['href'])
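The regular-expression test can also be pushed into BeautifulSoup itself: find_all() accepts a compiled pattern as an attribute filter. A small offline sketch with a hypothetical three-link document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links standing in for a real page's content.
html_doc = """<a href='https://docs.python.org'></a>
<a href='https://www.google.com'></a>
<a href='/events/python-events'></a>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Passing a compiled pattern as the href filter keeps only the
# tags whose href matches the pattern, replacing the explicit loop.
matches = soup.find_all('a', href=re.compile('python'))
for a in matches:
    print("Python URL:", a['href'])
```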

Result

Python URL: https://docs.python.org
Python URL: https://pypi.python.org/
Python URL: https://www.facebook.com/pythonlang?fref=ts
Python URL: http://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: http://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: //docs.python.org/3/tutorial/controlflow.html#defining-functions
Python URL: //docs.python.org/3/tutorial/introduction.html#lists
Python URL: http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
Python URL: //docs.python.org/3/tutorial/
Python URL: //docs.python.org/3/tutorial/controlflow.html
Python URL: /downloads/release/python-373/
Python URL: https://docs.python.org
Python URL: //jobs.python.org
Python URL: http://blog.python.org
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/Joo0vg55HKo/python-373-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/N5tvkDIQ47g/python-3410-is-now-available.html
Python URL: http://feedproxy.google.com/~r/PythonInsider/~3/n0mOibtx6_A/python-3.html
Python URL: /events/python-events/805/
Python URL: /events/python-events/817/
Python URL: /events/python-user-group/814/
Python URL: /events/python-events/789/
Python URL: /events/python-events/831/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: http://wiki.python.org/moin/TkInter
Python URL: http://www.wxpython.org/
Python URL: http://ipython.org
Python URL: #python-network
Python URL: http://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: http://wiki.python.org/moin/Languages
Python URL: http://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: http://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: https://devguide.python.org/
Python URL: https://bugs.python.org/
Python URL: https://mail.python.org/mailman/listinfo/python-dev
Python URL: #python-network
Python URL: https://github.com/python/pythondotorg/issues
Python URL: https://status.python.org/
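Notice that the result mixes absolute URLs with site-relative paths such as /events/python-events. If you need full URLs, the standard library's urllib.parse.urljoin can resolve each href against the page that was scraped. A sketch using a few of the values above:

```python
from urllib.parse import urljoin

# Base URL of the page the links were scraped from.
base = "http://www.python.org"
hrefs = [
    "https://docs.python.org",   # already absolute: left unchanged
    "/events/python-events",     # site-relative: joined to the base host
    "//jobs.python.org",         # protocol-relative: inherits the base scheme
]
for href in hrefs:
    print(urljoin(base, href))
```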

raja
Published on 02-May-2019 12:04:57