Python Web Scraping - Data Processing

In earlier chapters, we learned how to extract data from web pages, that is, web scraping, using various Python modules. In this chapter, let us look at various techniques for processing the data that has been scraped.

Introduction

To process the data that has been scraped, we must first store it on our local machine in a particular format such as a spreadsheet (CSV) or JSON, or sometimes in a database such as MySQL.

CSV and JSON Data Processing

First, we are going to write the information grabbed from a web page into a CSV file, which can be opened as a spreadsheet. Let us understand this through a simple example in which we first grab the information using the BeautifulSoup module, as we did earlier, and then use the Python csv module to write that textual information into a CSV file.

First, we need to import the necessary Python libraries as follows −

import requests
from bs4 import BeautifulSoup
import csv

In the following line of code, we use requests to make an HTTP GET request for the URL https://authoraditiagarwal.com/.

r = requests.get('https://authoraditiagarwal.com/')

Now, we need to create a Soup object as follows −

soup = BeautifulSoup(r.text, 'lxml')

Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.

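# The with block closes the file once the rows are written
with open('dataprocessing.csv', 'w', newline='') as csvfile:
   writer = csv.writer(csvfile)
   writer.writerow(['Title'])
   writer.writerow([soup.title.text])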

After running this script, the title of the web page will be saved in the above-mentioned CSV file on your local machine.
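
The same approach extends to more than one field per page. The following is a minimal sketch, assuming a hypothetical list of already-scraped records (the URLs and titles are made up for illustration) and a hypothetical helper named write_rows. It uses csv.DictWriter so that each value is written under a named column.

```python
import csv
import io

# Hypothetical records standing in for data scraped from several pages.
records = [
    {'url': 'https://example.com/a', 'title': 'Page A'},
    {'url': 'https://example.com/b', 'title': 'Page B, with a comma'},
]

def write_rows(rows, fileobj):
    """Write a list of dicts to an open text file as CSV with a header row."""
    writer = csv.DictWriter(fileobj, fieldnames=['url', 'title'])
    writer.writeheader()
    writer.writerows(rows)

# An in-memory buffer is used here so the sketch leaves no file behind;
# for a real file, pass open('dataprocessing.csv', 'w', newline='') instead.
buffer = io.StringIO()
write_rows(records, buffer)
print(buffer.getvalue())
```

Note that the csv module quotes the second title automatically because it contains a comma, so the file still has exactly two columns.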

Similarly, we can save the collected information in a JSON file. The following easy-to-understand Python script grabs the same information as the last script, but this time saves it in JSONFile.txt by using the Python json module.

import requests
from bs4 import BeautifulSoup
import json
r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
# json.dump() serializes the title itself; calling json.dumps() first would encode it twice
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(soup.title.text, outfile)

After running this script, the grabbed information, i.e. the title of the web page, will be saved in the above-mentioned text file on your local machine.
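
A scraped page usually yields more than one value, and JSON can keep each value labeled. The following is a minimal sketch, assuming a hypothetical record (the URL and title are made up for illustration). It writes a dictionary with json.dump and reads it back with json.loads to show that the data survives the round trip.

```python
import io
import json

# Hypothetical record standing in for data scraped from one page.
record = {'url': 'https://example.com/', 'title': 'Example Domain'}

# An in-memory buffer is used here; for a real file, use
# open('JSONFile.txt', 'wt') instead.
buffer = io.StringIO()
json.dump(record, buffer, indent=2)

# Parsing the saved text gives back the original dictionary.
restored = json.loads(buffer.getvalue())
print(restored['title'])
```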

Data Processing using AWS S3

Sometimes we may want to save scraped data in our local storage for archival purposes. But what if we need to store and analyze this data at a massive scale? The answer is the cloud storage service named Amazon S3 or AWS S3 (Simple Storage Service). Basically, AWS S3 is an object storage service built to store and retrieve any amount of data from anywhere.

We can use the following steps to store data in AWS S3 −

Step 1 − First, we need an AWS account, which will provide us with the secret keys to use in our Python script while storing the data. The data itself is stored in an S3 bucket, which we will create from the script.

Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the help of the following command −

pip install boto3

Step 3 − Next, we can use the following Python script for scraping data from a web page and saving it to an AWS S3 bucket.

First, we need to import the Python libraries: here we are working with requests for scraping and boto3 for saving data to the S3 bucket.

import requests
import boto3

Now we can scrape the data from our URL.

data = requests.get("Enter the URL").text

Now, to store the data in an S3 bucket, we need to create an S3 client as follows −

s3 = boto3.client('s3')
bucket_name = "our-content"

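The next lines of code create the S3 bucket and upload the scraped data to it as an object −

s3.create_bucket(Bucket = bucket_name, ACL = 'public-read')
# Key is the object's name inside the bucket and must not be empty
s3.put_object(Bucket = bucket_name, Key = 'Enter the object key', Body = data, ACL = "public-read")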

Now you can check the bucket named our-content from your AWS account.
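
The upload step can also be wrapped in a small function. The following is a minimal sketch under stated assumptions: save_page, RecordingClient and the object key pages/index.html are hypothetical names introduced for illustration, and the stand-in client only records the call so that the sketch runs without AWS access. The put_object parameters used (Bucket, Key, Body, ContentType) are the ones boto3 accepts.

```python
class RecordingClient:
    """Stand-in that records calls, so the sketch runs without AWS access."""
    def __init__(self):
        self.calls = []

    def put_object(self, **kwargs):
        self.calls.append(kwargs)
        return {}

def save_page(s3_client, bucket_name, key, html):
    """Upload scraped HTML text to a bucket under the given object key."""
    return s3_client.put_object(
        Bucket=bucket_name,
        Key=key,                                  # object name, must be non-empty
        Body=html.encode('utf-8'),                # send the page as UTF-8 bytes
        ContentType='text/html; charset=utf-8',
    )

fake = RecordingClient()
save_page(fake, 'our-content', 'pages/index.html', '<html><title>Demo</title></html>')
print(fake.calls[0]['Key'])

# With a real client (requires boto3 and configured AWS credentials):
#   import boto3
#   save_page(boto3.client('s3'), 'our-content', 'pages/index.html', data)
```

Passing the client in as an argument keeps the function easy to try out locally before pointing it at a real bucket.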

Data processing using MySQL

Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link https://www.tutorialspoint.com/mysql/.

With the help of the following steps, we can scrape data and store it in a MySQL table −

Step 1 − First, using MySQL, we need to create a database and a table in which to save our scraped data. For example, we create the table with the following query −

CREATE TABLE Scrap_pages (id BIGINT(7) NOT NULL AUTO_INCREMENT,
title VARCHAR(200), content VARCHAR(10000),PRIMARY KEY(id));

Step 2 − Next, we need to deal with Unicode. Note that MySQL versions before 8.0 do not use a full Unicode character set by default. We can turn this on with the following commands, which change the default character set for the database, for the table and for both of the columns −

ALTER DATABASE scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE
utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;

Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the help of the following command −

pip install PyMySQL

Step 4 − Now, our database named scrap, created earlier, is ready to store the data scraped from the web in the table named Scrap_pages. In our example, we are going to scrape data from Wikipedia and save it into our database.

First, we need to import the required Python modules.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import pymysql
import re

Now, make a connection to the database from Python.

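conn = pymysql.connect(host='127.0.0.1',user='root', passwd = None, db = 'mysql',
charset = 'utf8mb4')
cur = conn.cursor()
cur.execute("USE scrap")
# Seed with a number; newer Python versions reject a datetime object as a seed
random.seed(datetime.datetime.now().timestamp())
def store(title, content):
   # Placeholders are left unquoted; PyMySQL quotes and escapes the values itself
   cur.execute('INSERT INTO Scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   cur.connection.commit()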
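
Passing the values as a separate argument to execute() lets the database driver handle quoting, which matters because scraped text often contains quotation marks. The following is a minimal sketch of that idea. As an assumption made so that it runs without a MySQL server, it swaps in the sqlite3 module from the standard library, which uses ? placeholders where PyMySQL uses %s. The sample title and content are made up for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in here so the example runs without a MySQL server.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE scrap_pages ('
            'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)')

def store(title, content):
    # sqlite3 uses ? placeholders; PyMySQL uses %s in the same position.
    cur.execute('INSERT INTO scrap_pages (title, content) VALUES (?, ?)',
                (title, content))
    conn.commit()

# Text containing both kinds of quotes is stored unchanged.
store('He said "hello"', "It's a test")
cur.execute('SELECT title, content FROM scrap_pages')
row = cur.fetchone()
print(row)
conn.close()
```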

Now, connect with Wikipedia and get data from it.

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org'+articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id':'mw-content-text'}).find('p').get_text()
   store(title, content)
   return bs.find('div', {'id':'bodyContent'}).findAll('a',href=re.compile('^(/wiki/)((?!:).)*$'))
links = getLinks('/wiki/Kevin_Bacon')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links)-1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)

Lastly, we need to close both cursor and connection.

finally:
   cur.close()
   conn.close()

This will save the data gathered from Wikipedia into the table named Scrap_pages. If you are familiar with MySQL and web scraping, the above code should not be hard to understand.
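
The regular expression passed to findAll() decides which links the crawler follows: the href must start with /wiki/ and must not contain a colon, which excludes namespace pages such as Category: and File: pages. The following sketch applies the same pattern to a few hypothetical href values, chosen for illustration, to show which ones are kept.

```python
import re

# Same pattern as in getLinks(): the path must start with /wiki/
# and must not contain a colon anywhere.
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Hypothetical href values of the kind found on a Wikipedia page.
hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:American_actors',
    '/wiki/File:Kevin_Bacon.jpg',
    '#cite_note-1',
    '/wiki/Footloose_(1984_film)',
]

# Keep only the hrefs that the pattern accepts.
matches = [h for h in hrefs if article_link.search(h)]
print(matches)
```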

Data processing using PostgreSQL

PostgreSQL, developed by a worldwide team of volunteers, is an open source relational database management system (RDBMS). Processing the scraped data using PostgreSQL is similar to doing so with MySQL. There are two changes: first, the commands differ from those of MySQL, and second, we use the psycopg2 Python library to integrate it with Python.

If you are not familiar with PostgreSQL, you can learn it at https://www.tutorialspoint.com/postgresql/. We can install the psycopg2 Python library with the following command −

pip install psycopg2
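
To illustrate the two changes, the following is a minimal sketch, not a definitive implementation. It assumes a PostgreSQL table that mirrors the MySQL one, with BIGSERIAL taking the place of AUTO_INCREMENT, and a hypothetical store function that takes the connection as an argument. The connection values in the usage comment are placeholders, and the block itself does not connect to any server.

```python
# PostgreSQL version of the table; BIGSERIAL replaces MySQL's AUTO_INCREMENT.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS scrap_pages (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR(200),
    content VARCHAR(10000)
);
"""

# psycopg2 uses %s placeholders, as PyMySQL does.
INSERT_SQL = 'INSERT INTO scrap_pages (title, content) VALUES (%s, %s)'

def store(conn, title, content):
    """Insert one scraped page through a DB-API connection and commit."""
    cur = conn.cursor()
    try:
        cur.execute(INSERT_SQL, (title, content))
        conn.commit()
    finally:
        cur.close()

# Usage with a real server (connection values are placeholders):
#   import psycopg2
#   conn = psycopg2.connect(dbname='scrap', user='postgres')
#   cur = conn.cursor(); cur.execute(CREATE_TABLE_SQL); conn.commit(); cur.close()
#   store(conn, 'Some title', 'Some content')
```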
