Beautiful Soup - Get Tag Position



A Tag object in Beautiful Soup has two properties that report where the tag was found in the HTML document. They are −

sourceline − the line number on which the tag is found, counting from 1.

sourcepos − the starting index of the tag within that line, counting from 0.

These properties are supported by html.parser, which is Python's built-in parser, and by the html5lib parser. They are not available when you use the lxml parser.

In the following example, an HTML string is parsed with html.parser, and we find the line number and position of each <p> tag in the string.

Example

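from bs4 import BeautifulSoup

# The markup is not indented, so every tag starts at index 0 of its line.
# The string opens with a newline, so <html> is on line 2.
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Print the line number, position, and text of every <p> tag
p_tags = soup.find_all('p')
for p in p_tags:
    print(p.sourceline, p.sourcepos, p.string)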

Output

4 0 Web frameworks
9 0 GUI frameworks

With html.parser, these numbers represent the position of the tag's initial less-than sign, which is at index 0 in this example. The result is slightly different when the html5lib parser is used.
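The effect of indentation on sourcepos can be seen with a smaller input. The following is a minimal sketch, assuming html.parser and a made-up two-line string (the names markup and positions are illustrative only); the second <p> tag is indented by four spaces, so its less-than sign sits at index 4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <p> is indented by four spaces
markup = "<p>first</p>\n    <p>second</p>"

soup = BeautifulSoup(markup, "html.parser")

# Collect (sourceline, sourcepos) for each <p>; with html.parser the
# second value is the index of the tag's '<' within its line
positions = [(p.sourceline, p.sourcepos) for p in soup.find_all("p")]
print(positions)
```

Any leading whitespace on a line therefore shifts sourcepos by the same amount.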

Example

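from bs4 import BeautifulSoup

# The markup is not indented, so every tag starts at index 0 of its line.
html = '''
<html>
<body>
<p>Web frameworks</p>
<ul>
<li>Django</li>
<li>Flask</li>
</ul>
<p>GUI frameworks</p>
<ol>
<li>Tkinter</li>
<li>PyQt</li>
</ol>
</body>
</html>
'''

# html5lib is a separate package and must be installed
soup = BeautifulSoup(html, 'html5lib')

# Print the line number, position, and text of every <li> tag
li_tags = soup.find_all('li')
for li in li_tags:
    print(li.sourceline, li.sourcepos, li.string)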

Output

6 3 Django
7 3 Flask
11 3 Tkinter
12 3 PyQt

When using html5lib, the sourcepos property returns the position of the final greater-than sign of the opening tag. For an <li> tag that starts at index 0, that sign is at index 3.
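Since the two parsers report different positions, the sketch below uses only html.parser. It shows one possible use of the two properties together: looking up the original line of a tag and marking the column where the tag begins. The locate() helper and the sample markup are assumptions made for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with list items indented by two spaces
markup = "<ul>\n  <li>Django</li>\n  <li>Flask</li>\n</ul>"
lines = markup.splitlines()

soup = BeautifulSoup(markup, "html.parser")

def locate(tag):
    """Return the source line holding `tag` and a caret under its '<'."""
    text = lines[tag.sourceline - 1]     # sourceline counts from 1
    caret = " " * tag.sourcepos + "^"    # sourcepos counts from 0
    return text, caret

for li in soup.find_all("li"):
    text, caret = locate(li)
    print(text)
    print(caret)
```

The subtraction of 1 is needed only for sourceline, because sourcepos is already a zero-based index into the line.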
