Difference between BeautifulSoup and Scrapy Crawler


Beautiful Soup and Scrapy are both used for web scraping in Python. They serve the same general purpose but offer different functionality: Beautiful Soup is a parsing library, while Scrapy is a complete crawling framework. Web scraping is useful for data collection and analysis in fields such as research, marketing, and business intelligence. In this article, we will look at the differences between Beautiful Soup and Scrapy and how each is used in web scraping.

| Feature | Beautiful Soup | Scrapy |
| --- | --- | --- |
| Parsing | Parses HTML and XML documents. | Combines crawling and parsing to extract data from websites. |
| Ease of use | Simple library to use. | More complex to use; the user should have a good command of programming. |
| Concurrency | No built-in concurrency; it parses one page at a time, and fetching pages is left to another library such as requests. | Supports concurrent requests and can scrape multiple pages at the same time, making it faster and more efficient for large-scale web scraping projects. |
| Middleware | Does not provide a middleware system. | Provides a middleware system that lets developers customize the behavior of the spider at different stages of the scraping process. |
| Data storage | Does not provide built-in data storage; the developer has to handle storage manually. | Provides built-in support for exporting data in formats such as CSV, JSON, and XML; databases such as MySQL and MongoDB can be integrated through item pipelines. |
| Robustness | Less robust and fault-tolerant compared to Scrapy; error handling is left to the developer. | More robust, with built-in error handling such as retrying failed requests, handling timeouts, and filtering out unsuccessful responses such as 404s and 403s before they reach the spider. |
| Community | Has a community, but it is not as large and active as Scrapy's. | Has a large and active community of developers and contributors who continuously improve and update the framework. |
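
To make the middleware row more concrete, here is a minimal sketch of what a Scrapy downloader middleware can look like. The class name, the `X-Crawl-Purpose` header, and the `FakeRequest` stand-in are all hypothetical and exist only for illustration; the sketch assumes the commonly documented `process_request(request, spider)` hook and is not tied to any particular Scrapy version.

```python
class CrawlPurposeHeaderMiddleware:
    """Hypothetical downloader middleware that tags every outgoing request."""

    def __init__(self, purpose="research"):
        self.purpose = purpose

    def process_request(self, request, spider):
        # Scrapy calls this hook for each request before it is downloaded.
        if "X-Crawl-Purpose" not in request.headers:
            request.headers["X-Crawl-Purpose"] = self.purpose
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# Stand-in request object (an assumption) so the sketch runs without Scrapy.
class FakeRequest:
    def __init__(self):
        self.headers = {}


request = FakeRequest()
CrawlPurposeHeaderMiddleware().process_request(request, spider=None)
print(request.headers)
```

In a real project, a middleware like this would typically be enabled through the `DOWNLOADER_MIDDLEWARES` setting, which maps the class path (for example a hypothetical `myproject.middlewares.CrawlPurposeHeaderMiddleware`) to an order number.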

Beautiful Soup

Beautiful Soup is an open-source Python library used to parse HTML and XML pages. Parsing a page makes it possible to extract data from it. The library provides functions for searching for specific tags, links, and other attributes in an HTML document. If the data to be scraped is available on a single page, Beautiful Soup is a good fit.
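
As a small illustration of these search functions, the sketch below parses an in-memory HTML string rather than a live page. The HTML snippet and the variable names are made up for this example; `find()`, `find_all()`, and `select()` are standard Beautiful Soup methods.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used here instead of a live page.
html = """
<html><body>
  <h1 id="title">Sample page</h1>
  <p class="intro">First paragraph.</p>
  <a href="/about">About</a>
  <a href="https://example.com/contact" class="external">Contact</a>
  <a>No href here</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
title = soup.find("h1", id="title").get_text(strip=True)

# find_all() with href=True keeps only <a> tags that carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# select() accepts CSS selectors.
external = [a.get_text() for a in soup.select("a.external")]

print(title)
print(hrefs)
print(external)
```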

Example

The example below prints all the links present on a web page using Beautiful Soup and the requests library. First, import requests and Beautiful Soup. Then make a GET request to the URL of the page and parse the HTML content of the response with Beautiful Soup. Once the page is parsed, you can find all the links on it using Beautiful Soup's methods.

import requests
from bs4 import BeautifulSoup

# Make a request to the webpage
url = 'https://www.tutorialspoint.com/index.htm'
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find all links on the page
links = soup.find_all('a')

# Print the links
for link in links:
   print(link.get('href'))

Output

https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
https://www.tutorialspoint.com/whiteboard.htm
https://www.tutorialspoint.com/online_dev_tools.htm
https://www.tutorialspoint.com/business/index.asp
https://www.tutorialspoint.com/market/teach_with_us.jsp
https://www.facebook.com/tutorialspointindia
https://www.instagram.com/tutorialspoint_/
https://twitter.com/tutorialspoint
https://www.youtube.com/channel/UCVLbzhxVTiTLiVKeGV7WEBg
https://www.linkedin.com/authwall?trk=bf&trkInfo=AQEkqX2eckF__gAAAX-wMwEYvrsjBVbEtWQd4pgEdVSzkL22Nik1KEpY_ECWLKDGc41z8IOZWr2Bb0fvJplT60NPBtSw87J6QCpc7wD4qQ3iU13n6xJtBxME5o05Wmpg5JPm5YY=&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Ftutorialspoint
index.htm
None
https://www.tutorialspoint.com/categories/development
https://www.tutorialspoint.com/categories/it_and_software
https://www.tutorialspoint.com/categories/data_science_and_ai_ml
https://www.tutorialspoint.com/categories/cyber_security
https://www.tutorialspoint.com/categories/marketing
https://www.tutorialspoint.com/categories/office_productivity
https://www.tutorialspoint.com/categories/business
https://www.tutorialspoint.com/categories/lifestyle
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/articles/index.php
https://www.tutorialspoint.com/market/login.asp
https://www.tutorialspoint.com/latest/prime-packs
https://www.tutorialspoint.com/market/index.asp
https://www.tutorialspoint.com/latest/ebooks
https://www.tutorialspoint.com/tutorialslibrary.htm
https://www.tutorialspoint.com/articles/index.php
https://www.tutorialspoint.com/codingground.htm
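
The output above contains a relative link (`index.htm`), a `None` entry for an anchor without an `href`, and several duplicates. One possible way to clean such a list is sketched below using only the standard library; the helper name `normalize_links` and the sample input are assumptions made for this example.

```python
from urllib.parse import urljoin


def normalize_links(base_url, hrefs):
    """Drop missing hrefs, resolve relative ones, and remove duplicates in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # skips None and empty strings
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = "https://www.tutorialspoint.com/index.htm"
raw = [
    "index.htm",
    None,
    "https://www.tutorialspoint.com/codingground.htm",
    "https://www.tutorialspoint.com/codingground.htm",
]
print(normalize_links(base, raw))
```

In the scraping example, the same helper could be applied to the values returned by `link.get('href')` before printing them.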

Scrapy

Scrapy is a Python framework used for web crawling and web scraping. It is used when data needs to be scraped at scale, and it provides functionality to extract, store, and process that data. When you need to scrape complex data spread across multiple pages, Scrapy is the better option.
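
Much of Scrapy's large-scale behavior is controlled through settings. The dictionary below is a sketch of settings related to concurrency, retries, and timeouts; the setting names are real Scrapy settings, but the values and the name `CRAWL_SETTINGS` are assumptions chosen purely for illustration.

```python
# Hypothetical values chosen for illustration; tune them for the target site.
CRAWL_SETTINGS = {
    "CONCURRENT_REQUESTS": 8,   # maximum number of requests handled in parallel
    "DOWNLOAD_DELAY": 0.5,      # seconds to wait between requests to the same site
    "RETRY_ENABLED": True,      # retry requests that fail
    "RETRY_TIMES": 3,           # retries per request, in addition to the first attempt
    "DOWNLOAD_TIMEOUT": 30,     # seconds before a download is abandoned
    "ROBOTSTXT_OBEY": True,     # respect the site's robots.txt rules
}

for name, value in CRAWL_SETTINGS.items():
    print(f"{name} = {value}")
```

A dictionary like this can be passed as `CrawlerProcess(settings=CRAWL_SETTINGS)` or assigned to a spider's `custom_settings` attribute.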

Example

In the example below, we scrape data from multiple pages of a quotes website using Scrapy. To do this, define a Scrapy spider that requests the first page of the website, parses it and extracts the data, and then follows the next-page link until there are no more pages to scrape.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract the text, author, and tags of every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow the "next" link, if there is one, and parse that page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


# Create a Scrapy process
process = CrawlerProcess(get_project_settings())

# Schedule the spider
process.crawl(QuotesSpider)

# Run the crawl; this call blocks until the spider has finished.
# Scraped items appear in Scrapy's log output at DEBUG level.
process.start()

Output

2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/9/> (referer: http://quotes.toscrape.com/page/8/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Anyone who has never made a mistake has never tried anything new.”', 'author': 'Albert Einstein', 'tags': ['mistakes']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': "“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”", 'author': 'Jane Austen', 'tags': ['humor', 'love', 'romantic', 'women']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Remember, if the time should come when you have to make a choice between what is right and what is easy, remember what happened to a boy who was good, and kind, and brave, because he strayed across the path of Lord Voldemort. Remember Cedric Diggory.”', 'author': 'J.K. Rowling', 'tags': ['integrity']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I declare after all there is no enjoyment like reading! How much sooner one tires of any thing than of a book! -- When I have a house of my own, I shall be miserable if I have not an excellent library.”', 'author': 'Jane Austen', 'tags': ['books', 'library', 'reading']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.”', 'author': 'Jane Austen', 'tags': ['elizabeth-bennet', 'jane-austen']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“Some day you will be old enough to start reading fairy tales again.”', 'author': 'C.S. Lewis', 'tags': ['age', 'fairytales', 'growing-up']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“We are not necessarily doubting that God will do the best for us; we are wondering how painful the best will turn out to be.”', 'author': 'C.S. Lewis', 'tags': ['god']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“The fear of death follows from the fear of life. A man who lives fully is prepared to die at any time.”', 'author': 'Mark Twain', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“A lie can travel half way around the world while the truth is putting on its shoes.”', 'author': 'Mark Twain', 'tags': ['misattributed-mark-twain', 'truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'text': '“I believe in Christianity as I believe that the sun has risen: not only because I see it, but because by it I see everything else.”', 'author': 'C.S. Lewis', 'tags': ['christianity', 'faith', 'religion', 'sun']}
2023-04-17 00:53:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible thing, and should therefore be treated with great caution.”', 'author': 'J.K. Rowling', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“I'm the one that's got to die when it's time for me to die, so let me live my life the way I want to.”", 'author': 'Jimi Hendrix', 'tags': ['death', 'life']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“To die will be an awfully big adventure.”', 'author': 'J.M. Barrie', 'tags': ['adventure', 'love']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“It takes courage to grow up and become who you really are.”', 'author': 'E.E. Cummings', 'tags': ['courage']}     
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“But better to get hurt by the truth than comforted with a lie.”', 'author': 'Khaled Hosseini', 'tags': ['life']}  
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You never really understand a person until you consider things from his point of view... Until you climb inside of his skin and walk around in it.”', 'author': 'Harper Lee', 'tags': ['better-life-empathy']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“You have to write the book that wants to be written. And if the book will be too difficult for grown-ups, then you write it for children.”', 'author': "Madeleine L'Engle", 'tags': ['books', 'children', 'difficult', 'grown-ups', 'write', 'writers', 'writing']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“Never tell the truth to people who are not worthy of it.”', 'author': 'Mark Twain', 'tags': ['truth']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': "“A person's a person, no matter how small.”", 'author': 'Dr. Seuss', 'tags': ['inspirational']}
2023-04-17 00:53:00 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2023-04-17 00:53:00 [scrapy.core.engine] INFO: Closing spider (finished)
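
The items logged above can also be persisted instead of only being printed to the log. One option is an item pipeline; the sketch below writes each item as one JSON line. The class name `JsonLinesPipeline`, the file name, and the sample item are hypothetical, and the pipeline is driven by hand here so the example runs without Scrapy or network access.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical item pipeline that writes each scraped item as one JSON line."""

    def __init__(self, path="quotes.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields; the item is passed along unchanged.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Drive the pipeline by hand with a made-up item to show the call order.
path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"text": "sample quote", "author": "Someone", "tags": ["demo"]}, spider=None
)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

In a Scrapy project, a pipeline like this is enabled through the `ITEM_PIPELINES` setting. For plain file export, Scrapy's built-in feed exports (the `FEEDS` setting) are usually enough without any custom code.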

Conclusion

In this article, we discussed the differences between Beautiful Soup and Scrapy in Python. Although both are used for web scraping, they offer different functionality. Beautiful Soup is used when we need to scrape data from a single page, and Scrapy is used when we need to scrape large amounts of data from multiple pages.

Updated on: 06-Jul-2023
