Beautiful Soup - diagnose() Method



Method Description

The diagnose() method in Beautiful Soup is a diagnostic tool for isolating common problems. If you have trouble understanding what Beautiful Soup is doing to a document, pass the document as an argument to the diagnose() function. It prints a report showing how the different parsers handle the document and tells you if a parser is missing.

Syntax

diagnose(data)

Parameters

  • data − the document string.

Return Value

The diagnose() method does not return a value. It prints the result of parsing the given document with each of the available parsers.
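Because the report is printed rather than returned, one way to keep it for later inspection is to redirect standard output while diagnose() runs. The following is a minimal sketch, assuming the report is written with ordinary print() calls, as it is in the 4.12 release shown on this page. The sample markup and the variable names are only illustrative.

```python
import io
from contextlib import redirect_stdout

from bs4.diagnose import diagnose

# Collect everything diagnose() prints into an in-memory buffer.
buffer = io.StringIO()
with redirect_stdout(buffer):
    result = diagnose("<p>Hello<b>World")

report = buffer.getvalue()
first_line = report.splitlines()[0]

# The report is now an ordinary string that can be searched or saved.
uses_builtin_parser = "Trying to parse your markup with html.parser" in report
```

The captured string can then be written to a file or attached to a bug report.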

Example

Let us take this simple document for our exercise −

<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>

The following code runs the diagnostics on the above HTML markup −

markup = '''
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

from bs4.diagnose import diagnose

diagnose(markup)

The diagnose() output starts with a message listing the available parsers −

Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1

If the document being diagnosed is well-formed HTML, all the parsers produce nearly the same result. Our example, however, contains several errors: an unclosed <h1>, an unclosed <b>, a stray </a> and an unclosed <p>.

First, the built-in html.parser is tried. Its report is as follows −

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <p>
      <b>
         Beautiful Soup
         <i>
            Tutorial
         </i>
         <p>
         </p>
      </b>
   </p>
</h1>

You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> at the end.
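The same observation can be checked programmatically instead of by reading the report. The following is a small sketch that parses the sample markup with html.parser and looks for the wrapper tags. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""

soup = BeautifulSoup(markup, "html.parser")

# html.parser builds the tree from exactly the tags it was given,
# so no <html> or <body> element is available.
has_html = soup.html is not None
has_body = soup.body is not None

# Names of the tags that sit directly under the document root.
top_level = [tag.name for tag in soup.find_all(recursive=False)]

print(has_html, has_body, top_level)
```

The only top-level tag is <h1>, which matches the report: everything else ends up nested inside it.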

Both the html5lib and lxml parsers complete the document by wrapping it in <html> and <body> tags. The html5lib parser also adds an empty <head>.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
   <head>
   </head>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
         <p>
            <b>
               Beautiful Soup
               <i>
                  Tutorial
               </i>
            </b>
         </p>
         <p>
            <b>
            </b>
         </p>
      </h1>
   </body>
</html>

With the lxml parser, note where the closing </h1> is inserted. The unclosed <b> tag is also closed, and the dangling </a> is removed.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
   <body>
      <h1>
         Hello World
         <b>
            Welcome
         </b>
      </h1>
      <p>
         <b>
            Beautiful Soup
            <i>
               Tutorial
            </i>
         </b>
      </p>
      <p>
      </p>
   </body>
</html>
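One way to compare where each parser closes the <h1> is to ask for the parent of the first <p> tag. The sketch below does this for every parser that happens to be installed and skips the others. The helper first_p_parent() is a hypothetical name introduced for this illustration. Going by the reports above, html.parser and html5lib should give h1, while lxml should give body, although the lxml result may depend on the installed version.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = """
<h1>Hello World
<b>Welcome</b>
<P><b>Beautiful Soup</a> <i>Tutorial</i><p>
"""

def first_p_parent(parser):
    """Return the name of the first <p> tag's parent, or None if the parser is missing."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return None
    return soup.p.parent.name

parents = {name: first_p_parent(name) for name in ("html.parser", "html5lib", "lxml")}

for name, parent in parents.items():
    print(name, "->", parent)
```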

The diagnose() method also parses the document as XML, which is probably superfluous in our case.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
   Hello World
   <b>
      Welcome
   </b>
   <P>
      <b>
         Beautiful Soup
      </b>
      <i>
         Tutorial
      </i>
      <p/>
   </P>
</h1>

Let us give the diagnose() method an XML document instead of an HTML document.

<?xml version="1.0" ?>
   <books>
      <book>
         <title>Python</title>
         <author>TutorialsPoint</author>
         <price>400</price>
      </book>
   </books>

Now if we run the diagnostics, the HTML parsers are still applied even though the document is XML.

Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.

With html.parser, a warning message is displayed. With html5lib, the first line, which contains the XML version information, is turned into a comment, and the rest of the document is parsed as if it were HTML.

Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
   <head>
   </head>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml HTML parser doesn't turn the XML declaration into a comment, but it still parses the document as HTML.

Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
   <body>
      <books>
         <book>
            <title>
               Python
            </title>
            <author>
               TutorialsPoint
            </author>
            <price>
               400
            </price>
         </book>
      </books>
   </body>
</html>

The lxml-xml parser parses the document as XML.

Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
   <books>
      <book>
         <title>
            Python
         </title>
         <author>
            TutorialsPoint
         </author>
         <price>
            400
         </price>
      </book>
   </books>

The diagnostics report can be useful for finding errors in HTML/XML documents.
