Beautiful Soup - Remove All Scripts
The <script> tag is a frequently used tag in HTML. It embeds a client-side script, such as JavaScript code, in an HTML document. In this chapter, we will use Beautiful Soup to remove script tags from an HTML document.
The <script> tag has a corresponding </script> tag. The opening tag can reference an external JavaScript file through its src attribute, or the JavaScript code can be written inline between the two tags.
To include an external JavaScript file, the syntax used is:
<head> <script src="javascript.js"></script> </head>
You can then invoke the functions defined in this file from inside HTML.
Instead of referring to an external file, you can put JavaScript code inside the HTML between the <script> and </script> tags. If the script is placed in the <head> section of the HTML document, its functionality is available throughout the document tree. If it is placed anywhere in the <body> section, the JavaScript functions are available from that point on.
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
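As a rough illustration of the two forms described above, Beautiful Soup can tell them apart: an external script carries a src attribute, while an inline script carries its code as text between the tags. The sample markup and the variable names external and inline are made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with one external and one inline script
sample = '''
<html>
<head><script src="javascript.js"></script></head>
<body>
<p>Hello World</p>
<script>alert("Hello World")</script>
</body>
</html>
'''

soup = BeautifulSoup(sample, "html.parser")

external = []   # values of the src attribute
inline = []     # JavaScript source found between the tags

for tag in soup.find_all("script"):
    src = tag.get("src")          # None when the attribute is absent
    if src is not None:
        external.append(src)
    else:
        # .string is the single text child of the tag, or None if it is empty
        inline.append((tag.string or "").strip())

print(external)
print(inline)
```

Both kinds are matched by find_all('script'), which is why a single loop is enough to remove them.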
Removing all script tags with Beautiful Soup is easy. You collect the list of all script tags from the parsed tree and extract them one by one.
Example - Removing Scripts from HTML Content
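from bs4 import BeautifulSoup

html = '''
<html>
<head>
<script src="javascript.js"></script>
</head>
<body>
<p>Hello World</p>
<script>
alert("Hello World")
</script>
</body>
</html>
'''

soup = BeautifulSoup(html, "html.parser")

# Collect every <script> tag and detach it from the parsed tree
for tag in soup.find_all('script'):
    tag.extract()

print(soup)
Output
The script tags are gone. The blank lines left in their place are omitted here.
<html>
<head>
</head>
<body>
<p>Hello World</p>
</body>
</html>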
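The same steps can be wrapped in a small helper so that they are reusable. This is only a sketch: the function name remove_scripts and its return shape (cleaned markup plus a count of removed tags) are assumptions made for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

def remove_scripts(markup):
    """Hypothetical helper: return (cleaned_markup, removed_count)."""
    soup = BeautifulSoup(markup, "html.parser")
    tags = soup.find_all("script")   # collect first, then remove
    for tag in tags:
        tag.extract()
    return str(soup), len(tags)

cleaned, removed = remove_scripts(
    '<body><p>Hello World</p><script>alert("Hello World")</script></body>'
)
print(removed)
print(cleaned)
```

Returning the count makes it easy to check whether the document contained any scripts at all.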
You can also use the decompose() method instead of extract(). The difference is that extract() returns the tag that was removed, whereas decompose() simply destroys it. For more concise code, you may also use list comprehension syntax to remove the script tags from the soup object, as follows:
[tag.decompose() for tag in soup.find_all('script')]
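To make the difference between the two methods concrete, the following sketch removes one script with each of them. The markup and the variable names are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two inline scripts
markup = '<div><script>one()</script><script>two()</script><p>text</p></div>'
soup = BeautifulSoup(markup, "html.parser")
first, second = soup.find_all("script")

removed = first.extract()      # detached from the tree, but still usable
result = second.decompose()    # destroyed; the method returns None

print(removed)   # the extracted tag can still be inspected or re-inserted
print(result)
print(soup)
```

Use extract() when the removed script is still needed, for example to log it or move it elsewhere, and decompose() when it can simply be discarded.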