Beautiful Soup - Parsing Tables



In addition to textual content, an HTML document may also hold structured data in the form of HTML tables. With Beautiful Soup, we can extract tabular data into Python objects such as lists or dictionaries, process it, and, if required, store it in databases or spreadsheets. In this chapter, we shall parse an HTML table using Beautiful Soup.

Although Beautiful Soup doesn't have any special function or method for extracting table data, we can achieve it with simple scraping techniques. Just like a table in SQL or a spreadsheet, an HTML table consists of rows and columns.

HTML has the <table> tag to build a tabular structure. It contains one or more nested <tr> tags, one for each row. Each row consists of <td> tags that hold the data in each cell of the row. The first row is usually used for column headings, which are placed in <th> tags instead of <td>.

The following HTML script renders a simple table in the browser window −

<html>
   <body>
   <h2>Beautiful Soup - Parse Table</h2>
      <table border="1">
         <tr>
            <th>Name</th>
            <th>Age</th>
            <th>Marks</th>
         </tr>
         <tr class='data'>
            <td>Ravi</td>
            <td>23</td>
            <td>67</td>
         </tr>
         <tr class='data'>
            <td>Anil</td>
            <td>27</td>
            <td>84</td>
         </tr>
      </table>
   </body>
</html>

Note that the data rows are marked with the CSS class data, in order to distinguish them from the header row.

We shall now see how to parse the table data. First, we obtain the document tree in a BeautifulSoup object. Then we collect all the column headers in a list.

Example

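from bs4 import BeautifulSoup

# markup is a string holding the HTML document shown above
soup = BeautifulSoup(markup, "html.parser")

# collect the text of every <th> cell of the first table
tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings:
   headers.append(h.string)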

The data rows, which carry the class='data' attribute and follow the header row, are then fetched. For each row, a dictionary with the column headers as keys and the corresponding cell values is formed and appended to a list of dict objects.

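# fetch only the rows of this table that are marked with class='data'
rows = tbltag.find_all('tr', {'class':'data'})
trows = []
for i in rows:
   row = {}
   data = i.find_all('td')
   # pair each cell with the header of its column
   for n, j in enumerate(data):
      row[headers[n]] = j.string
   trows.append(row)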
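
As an aside, the same selection can be written with CSS selectors through the select() method. The following is only a sketch of that alternative, not a replacement for the loop above; the names html and records are introduced here for illustration, and a shortened copy of the table is embedded so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample table (assumed here so the sketch is self-contained)
html = """
<table>
  <tr><th>Name</th><th>Age</th><th>Marks</th></tr>
  <tr class='data'><td>Ravi</td><td>23</td><td>67</td></tr>
  <tr class='data'><td>Anil</td><td>27</td><td>84</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# "table th" matches every <th> inside a <table>
headers = [th.get_text() for th in soup.select("table th")]

# "table tr.data" matches only the rows carrying class='data'
records = [
    dict(zip(headers, (td.get_text() for td in tr.select("td"))))
    for tr in soup.select("table tr.data")
]

print(records)
```

Here zip() pairs each header with the cell in the same position, which removes the need for a manual counter.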

A list of dictionary objects is collected in trows. You can then use it for different purposes, such as storing it in a SQL table or saving it as JSON or as a pandas DataFrame object.
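
As a minimal sketch of such a use, the rows can be serialized to JSON and CSV with the standard library alone. The trows list is re-created by hand below so that the snippet runs on its own, and an in-memory buffer stands in for a real file; both are assumptions made for illustration.

```python
import csv
import io
import json

# Sample rows in the same shape as trows (assumed, so the sketch is self-contained)
trows = [
    {'Name': 'Ravi', 'Age': '23', 'Marks': '67'},
    {'Name': 'Anil', 'Age': '27', 'Marks': '84'},
]

# Serialize the list of dicts to a JSON string
json_text = json.dumps(trows, indent=2)

# Write the same rows as CSV; the dictionary keys become the column names
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(trows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(trows)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

To write to disk instead, the buffer could be replaced by a file opened with open() in text mode and newline="".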

The complete code is given below −

markup = """
<html>
	<body>
	   <p>Beautiful Soup - Parse Table</p>
		<table>
			<tr>
				<th>Name</th>
				<th>Age</th>
				<th>Marks</th>
			</tr>
			<tr class='data'>
				<td>Ravi</td>
				<td>23</td>
				<td>67</td>
			</tr>
			<tr class='data'>
				<td>Anil</td>
				<td>27</td>
				<td>84</td>
			</tr>
		</table>
	</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html.parser")

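# collect the text of every <th> cell of the first table
tbltag = soup.find('table')
headers = []
headings = tbltag.find_all('th')
for h in headings:
   headers.append(h.string)
print (headers)

# fetch only the rows of this table that are marked with class='data'
rows = tbltag.find_all('tr', {'class':'data'})
trows = []
for i in rows:
   row = {}
   data = i.find_all('td')
   # pair each cell with the header of its column
   for n, j in enumerate(data):
      row[headers[n]] = j.string
   trows.append(row)

print (trows)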

Output

['Name', 'Age', 'Marks']
[{'Name': 'Ravi', 'Age': '23', 'Marks': '67'}, {'Name': 'Anil', 'Age': '27', 'Marks': '84'}]
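
The example above relies on the class='data' attribute and on the .string property, which returns None when a cell contains more than one child node. For tables without such a marker, the steps can be wrapped in a small helper. The sketch below is one possible way to do this; the function name table_to_dicts and the sample markup are hypothetical, and it assumes a single, non-nested table whose header cells all sit in <th> tags.

```python
from bs4 import BeautifulSoup

def table_to_dicts(html):
    """Return the first table in html as a list of {header: cell text} dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    # get_text(strip=True) trims surrounding whitespace and also reads nested tags
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            # the header row holds only <th> cells, so it is skipped
            continue
        values = [td.get_text(strip=True) for td in cells]
        records.append(dict(zip(headers, values)))
    return records

# Hypothetical sample: no class attribute, stray whitespace, one nested tag
sample = """
<table>
  <tr><th>Name</th><th>Age</th><th>Marks</th></tr>
  <tr><td> Ravi </td><td>23</td><td><b>67</b></td></tr>
  <tr><td>Anil</td><td>27</td><td>84</td></tr>
</table>
"""

print(table_to_dicts(sample))
```

Because data rows are recognized by the presence of <td> cells, the helper does not depend on any class name in the markup.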