Beautiful Soup - Extract Email IDs



Extracting email addresses from a web page is a common application of a web scraping library such as BeautifulSoup. On a web page, an email ID usually appears in the href attribute of an anchor <a> tag, written using the mailto URL scheme. Often, though, the email address is present in the page content as plain text, without any hyperlink. In this chapter, we shall use the BeautifulSoup library to fetch email IDs from an HTML page using simple techniques.

A typical use of an email ID in the href attribute is shown below −

<a href = "mailto:xyz@abc.com">test link</a>

Example - Checking href for mailto

Here's the Python code that finds the email IDs. We collect all the <a> tags in the document and check whether each tag has an href attribute whose value starts with mailto:. If it does, the part of the value after the 7th character (the length of the mailto: prefix) is the email ID.

from bs4 import BeautifulSoup

html = """
<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li><a href = "mailto:sales@company.com">Sales Enquiries</a></li>
      <li><a href = "mailto:careers@company.com">Careers</a></li>
      <li><a href = "mailto:partner@company.com">Partner with us</a></li>
      </ul>
   </body>
</html>
"""

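soup = BeautifulSoup(html, 'html.parser')

# Collect every anchor tag, then keep only those whose href uses the mailto scheme
tags = soup.find_all("a")
for tag in tags:
   if tag.has_attr("href") and tag['href'].startswith('mailto:'):
      # Drop the 7-character "mailto:" prefix to leave the address
      print(tag['href'][len('mailto:'):])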

Output

For the given HTML content, the Email IDs will be extracted as follows −

sales@company.com
careers@company.com
partner@company.com
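
Real pages do not always write mailto links this neatly. The scheme name may be in upper case, and the link may carry a query string such as ?subject=Quote or several comma-separated addresses. The following is a minimal sketch, not part of the example above, that parses each href with urlsplit() from Python's standard urllib.parse module. The function name mailto_addresses and the sample HTML are assumptions made for illustration.

```python
from urllib.parse import urlsplit, unquote
from bs4 import BeautifulSoup

# Sample markup (assumed for illustration): one link with a query string,
# one with an upper-case scheme, and one ordinary web link.
html = """
<ul>
  <li><a href="mailto:sales@company.com?subject=Quote">Sales</a></li>
  <li><a href="MAILTO:careers@company.com">Careers</a></li>
  <li><a href="https://company.com/about">About</a></li>
</ul>
"""

def mailto_addresses(markup):
    """Hypothetical helper: return the addresses found in mailto links."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    # href=True skips anchors that have no href attribute at all
    for tag in soup.find_all("a", href=True):
        parts = urlsplit(tag["href"].strip())
        # urlsplit lower-cases the scheme, so MAILTO: is handled too
        if parts.scheme != "mailto":
            continue
        # The path holds the address part; the query string is split off
        for addr in unquote(parts.path).split(","):
            addr = addr.strip()
            if addr:
                found.append(addr)
    return found

print(mailto_addresses(html))
```

For this sample the sketch prints ['sales@company.com', 'careers@company.com']; the ordinary https link is ignored.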

In the second example, we assume that the email IDs may appear anywhere in the text. To extract them, we use regular expression (regex) searching. A regex is a pattern of characters that describes the set of strings to be matched, and Python's re module provides functions for working with such patterns. The following regex pattern is used to search for email addresses −

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
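
To see what this pattern matches, it can be tried on a plain string before any HTML is involved. The sketch below is only illustrative: the helper name extract_emails and the sample sentence are assumptions. Because the last character class also accepts a dot, an address at the end of a sentence picks up the trailing full stop, so the helper strips it.

```python
import re

pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'

def extract_emails(text):
    """Hypothetical helper: find email-like substrings in plain text."""
    # Strip a trailing '.' or '-' that belongs to the sentence, not the address
    return [match.rstrip('.-') for match in re.findall(pat, text)]

text = "Write to sales@company.com or careers@company.co.in."

# Raw matches: the second one keeps the sentence's final full stop
print(re.findall(pat, text))

# Cleaned matches
print(extract_emails(text))
```

The first print shows ['sales@company.com', 'careers@company.co.in.'] and the second shows the same list without the trailing dot. The pattern is a simple approximation and does not validate addresses against the full email syntax.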

Using the email regex, we look for matches of the pattern in the text of each <li> tag. Here is the Python code −

Example - Usage of email regex

from bs4 import BeautifulSoup
import re

html = """
<html>
   <head>
      <title>BeautifulSoup - Scraping Email IDs</title>
   </head>
   <body>
      <h2>Contact Us</h2>
      <ul>
      <li>Sales Enquiries: sales@company.com</li>
      <li>Careers: careers@company.com</li>
      <li>Partner with us: partner@company.com</li>
      </ul>
   </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

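def find_emails(s):
   # Return a list of every substring of s that matches the email pattern
   pat = r'[\w.+-]+@[\w-]+\.[\w.-]+'
   return re.findall(pat, s)

tags = soup.find_all('li')

for tag in tags:
   # get_text() returns a string even when the <li> has several children,
   # a case in which tag.string would be None
   emails = find_emails(tag.get_text())
   if emails:
      print(emails)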

Output

['sales@company.com']
['careers@company.com']
['partner@company.com']
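
The two techniques can also be combined to scan a whole page rather than only the <li> tags. The following is a rough sketch under stated assumptions: the function name all_emails and the sample HTML are made up for illustration. It searches the visible text and every href value with the same pattern, then removes duplicates while keeping the order in which the addresses were first seen.

```python
import re
from bs4 import BeautifulSoup

EMAIL_PAT = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

# Sample markup (assumed for illustration): plain-text addresses,
# a repeated address, and one mailto link.
html = """
<body>
  <p>Sales: sales@company.com, support: <b>help@company.com</b></p>
  <p>Again, write to sales@company.com</p>
  <a href="mailto:careers@company.com">Careers</a>
</body>
"""

def all_emails(markup):
    """Hypothetical helper: collect unique addresses from text and hrefs."""
    soup = BeautifulSoup(markup, "html.parser")
    # A space separator keeps text from adjacent tags from running together
    candidates = EMAIL_PAT.findall(soup.get_text(" "))
    # Also search href values, which covers mailto links
    for tag in soup.find_all("a", href=True):
        candidates += EMAIL_PAT.findall(tag["href"])
    # dict.fromkeys drops duplicates and preserves first-seen order
    return list(dict.fromkeys(candidates))

print(all_emails(html))
```

For this sample the sketch prints ['sales@company.com', 'help@company.com', 'careers@company.com'].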

Using the simple techniques described above, we can use BeautifulSoup to extract Email IDs from web pages.
