Scrapy - First Spider



Description

A spider is a class that defines the initial URLs to extract data from, how to follow pagination links, and how to extract and parse the fields defined in items.py. Scrapy provides several types of spiders, each serving a specific purpose.

Create a file called "first_spider.py" under the first_scrapy/spiders directory. This is where you tell Scrapy how to find the exact data you are looking for. To do this, you must define the following attributes and method −

  • name − It defines the unique name for the spider.

  • allowed_domains − A list of the domains that the spider is allowed to crawl (domain names, not full URLs).

  • start_urls − A list of URLs from where the spider starts crawling.

  • parse() − The method called with the response downloaded for each start URL; it extracts and parses the scraped data.
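
To make the role of allowed_domains more concrete, here is a minimal sketch of the kind of check it implies. The helper name is_allowed is hypothetical and is not part of Scrapy. The logic is a simplified approximation that assumes a URL is in scope when its host equals an allowed domain or is a subdomain of one; it is not Scrapy's actual implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed
    domain or a subdomain of one (simplified approximation)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed_domains = ["dmoz.org"]
print(is_allowed("http://www.dmoz.org/Computers/", allowed_domains))  # in scope
print(is_allowed("http://example.com/", allowed_domains))             # out of scope
```

Note that the list holds bare domain names such as "dmoz.org", without a scheme or path.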

The following code shows what a spider looks like −

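import scrapy


class FirstSpider(scrapy.Spider):
    # Unique name used to refer to this spider
    name = "first"
    # Domains the spider is allowed to crawl
    allowed_domains = ["dmoz.org"]

    # URLs where crawling starts
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Name the file after the last path segment of the URL
        filename = response.url.split("/")[-2] + ".html"
        # response.body is bytes, so the file is opened in binary mode
        with open(filename, "wb") as f:
            f.write(response.body)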
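
The filename expression used in parse() can be tried on its own, without running a crawl. The sketch below wraps it in a helper called page_filename, a hypothetical name introduced only for illustration, and applies it to the two start URLs from the spider.

```python
def page_filename(url):
    # Same expression as in parse(): take the second-to-last
    # "/"-separated piece of the URL and append ".html".
    return url.split("/")[-2] + ".html"


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    print(page_filename(url))
# Books.html
# Resources.html
```

This works because both URLs end with a trailing slash, so the last piece after splitting is an empty string and the second-to-last is the final path segment. For a URL without the trailing slash, the same expression would pick the parent segment instead.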