Beautiful Soup - diagnose() Method
Method Description
The diagnose() function in Beautiful Soup is a diagnostic suite for isolating common problems. If you have difficulty understanding what Beautiful Soup is doing to a document, pass the document as an argument to diagnose(). It prints a report showing how the different parsers handle the document, and tells you if a parser is missing.
Syntax
diagnose(data)
Parameters
data − the document string.
Return Value
The diagnose() function does not return a value. It prints the result of parsing the given document with each of the available parsers.
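Because the report is printed rather than returned, one way to keep it for later inspection is to redirect standard output while diagnose() runs. The following is a minimal sketch of that idea, assuming the report is written with ordinary print() calls; the names buffer and report are illustrative and not part of Beautiful Soup.

```python
import io
from contextlib import redirect_stdout

from bs4.diagnose import diagnose

markup = "<h1>Hello World <b>Welcome</b> <P><b>Beautiful Soup</a> <i>Tutorial</i><p>"

# Collect everything diagnose() prints into an in-memory text buffer.
buffer = io.StringIO()
with redirect_stdout(buffer):
    result = diagnose(markup)

report = buffer.getvalue()

print(result)  # None

# Show only the line that introduces each parser's section of the report.
for line in report.splitlines():
    if line.startswith("Trying to parse your markup with"):
        print(line)
```

The captured text can then be saved to a file or searched like any other string.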
Example
Let us take this simple document for our exercise −
<h1>Hello World <b>Welcome</b> <P><b>Beautiful Soup</a> <i>Tutorial</i><p>
The following code runs the diagnostics on the above HTML markup −
from bs4.diagnose import diagnose

markup = '''
<h1>Hello World <b>Welcome</b> <P><b>Beautiful Soup</a> <i>Tutorial</i><p>
'''

diagnose(markup)
The diagnose() output starts with a message showing which parsers are available −
Diagnostic running on Beautiful Soup 4.12.2
Python version 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
Found lxml version 4.9.2.0
Found html5lib version 1.1
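If you only want to know which parsers are installed, without running the full diagnostics, a rough check is to ask Python whether the corresponding modules can be found. This is a sketch under the assumption that the optional parsers are importable as lxml and html5lib; diagnose() itself may detect parsers differently, and the parser_modules mapping is purely illustrative.

```python
from importlib.util import find_spec

# Parser name as used by Beautiful Soup -> module that provides it.
# html.parser ships with Python; the other two are optional installs.
parser_modules = {
    "html.parser": "html.parser",
    "lxml": "lxml",
    "html5lib": "html5lib",
}

# find_spec() returns None when a top-level module is not installed.
available = {
    name: find_spec(module) is not None
    for name, module in parser_modules.items()
}

for name, installed in available.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```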
If the document being diagnosed is well-formed HTML, all parsers produce nearly identical results. Our example, however, contains several errors.
The report begins with the built-in html.parser −
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<h1>
 Hello World
 <b>
  Welcome
 </b>
 <p>
  <b>
   Beautiful Soup
   <i>
    Tutorial
   </i>
   <p>
   </p>
  </b>
 </p>
</h1>
You can see that Python's built-in parser doesn't insert the <html> and <body> tags. The unclosed <h1> tag is given a matching </h1> at the end of the document.
Both the html5lib and lxml parsers complete the document by wrapping it in <html> and <body> tags; html5lib also adds an empty <head>.
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <h1>
   Hello World
   <b>
    Welcome
   </b>
   <p>
    <b>
     Beautiful Soup
     <i>
      Tutorial
     </i>
    </b>
   </p>
   <p>
    <b>
    </b>
   </p>
  </h1>
 </body>
</html>
With the lxml parser, the closing </h1> is inserted before the first <P>, so the paragraphs become siblings of the heading rather than its children. lxml also closes the unclosed <b> tag and drops the stray </a>.
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <h1>
   Hello World
   <b>
    Welcome
   </b>
  </h1>
  <p>
   <b>
    Beautiful Soup
    <i>
     Tutorial
    </i>
   </b>
  </p>
  <p>
  </p>
 </body>
</html>
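The differences between the parsers can also be checked programmatically instead of by reading the report. The sketch below parses the same markup with each parser that happens to be installed and records two facts: whether an <html> tag was added, and whether a <p> ended up nested inside the <h1>. The summary dictionary and its keys are hypothetical names chosen for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<h1>Hello World <b>Welcome</b> <P><b>Beautiful Soup</a> <i>Tutorial</i><p>"

summary = {}
for parser in ("html.parser", "html5lib", "lxml"):
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # The parser is not installed; skip it, much as the report notes it.
        continue
    h1 = soup.find("h1")
    summary[parser] = {
        "adds_html": soup.find("html") is not None,
        "p_inside_h1": h1 is not None and h1.find("p") is not None,
    }

for parser, facts in summary.items():
    print(parser, facts)
```

Checks like these are useful when code relies on a particular tree shape, since the same navigation expression may succeed with one parser and fail with another.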
The diagnose() function also parses the document as XML, which is probably superfluous in our case.
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<h1>
 Hello World
 <b>
  Welcome
 </b>
 <P>
  <b>
   Beautiful Soup
  </b>
  <i>
   Tutorial
  </i>
  <p/>
 </P>
</h1>
Let us now give the diagnose() function an XML document instead of an HTML document.
<?xml version="1.0" ?>
<books>
   <book>
      <title>Python</title>
      <author>TutorialsPoint</author>
      <price>400</price>
   </book>
</books>
When we run the diagnostics on this document, the HTML parsers are still applied even though it is XML.
Trying to parse your markup with html.parser

Warning (from warnings module):
  File "C:\Users\mlath\OneDrive\Documents\Feb23 onwards\BeautifulSoup\Lib\site-packages\bs4\builder\__init__.py", line 545
    warnings.warn(
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
With html.parser, a warning message is displayed. With html5lib, the first line, which contains the XML version information, is turned into a comment and the rest of the document is parsed as if it were HTML.
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<!--?xml version="1.0" ?-->
<html>
 <head>
 </head>
 <body>
  <books>
   <book>
    <title>
     Python
    </title>
    <author>
     TutorialsPoint
    </author>
    <price>
     400
    </price>
   </book>
  </books>
 </body>
</html>
The lxml HTML parser keeps the XML declaration instead of turning it into a comment, but still parses the rest of the document as HTML.
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<?xml version="1.0" ?>
<html>
 <body>
  <books>
   <book>
    <title>
     Python
    </title>
    <author>
     TutorialsPoint
    </author>
    <price>
     400
    </price>
   </book>
  </books>
 </body>
</html>
The lxml-xml parser parses the document as XML.
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" ?>
<books>
 <book>
  <title>
   Python
  </title>
  <author>
   TutorialsPoint
  </author>
  <price>
   400
  </price>
 </book>
</books>
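Outside of diagnose(), the same XML parser can be requested by passing features="xml" to the BeautifulSoup constructor, as the warning text suggests. The sketch below assumes lxml may or may not be installed and falls back gracefully when it is missing; the books list is a hypothetical structure used only to show the extracted values.

```python
from bs4 import BeautifulSoup, FeatureNotFound

xml_doc = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Python</title>
    <author>TutorialsPoint</author>
    <price>400</price>
  </book>
</books>"""

try:
    # features="xml" selects the lxml-based XML parser.
    soup = BeautifulSoup(xml_doc, features="xml")
except FeatureNotFound:
    soup = None
    print("lxml is not installed, so the XML parser is unavailable.")

# Collect (title, price) pairs from every <book> element.
books = []
if soup is not None:
    for book in soup.find_all("book"):
        title = book.find("title").get_text()
        price = book.find("price").get_text()
        books.append((title, price))

print(books)
```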
The diagnostics report may prove to be useful in finding errors in HTML/XML documents.