Deploying a Scrapy spider on Scrapinghub


Scrapy spider

A Scrapy spider is a class that defines how to follow the links of a website and how to extract information from its pages.

scrapy.Spider is the base class from which every other spider must inherit.

Scrapinghub

Scrapinghub is a hosted cloud platform for running Scrapy spiders. It turns web content into structured, usable data, and it can extract data even from complex webpages.

We are going to use Scrapinghub to deploy a Scrapy spider to the cloud and run it there.

Steps to deploy a spider on Scrapinghub −

Step 1 −

Create a Scrapy project −

After installing Scrapy, run the following command in your terminal −

$ scrapy startproject <project_name>

Change into the new project directory (project_name).

Step 2 −

Write a Scrapy spider for your target website. Here we use a common practice site, "quotes.toscrape.com".

Below is a very simple Scrapy spider −

Code −

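# Import the Scrapy library
import scrapy


class AllSpider(scrapy.Spider):
    # Spider name, used with "scrapy crawl"
    name = 'all'
    # Keep the crawl on the target site
    allowed_domains = ['quotes.toscrape.com']
    # Starting URL
    start_urls = ['http://quotes.toscrape.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup first
        super().__init__(*args, **kwargs)
        self.links = []

    def parse(self, response):
        # Record the visited URL and emit it as an item for the output file
        self.links.append(response.url)
        yield {'url': response.url}
        # Follow every link found on the page
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)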
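
The spider hands every href to response.follow, which resolves relative links against the URL of the current page. As a rough illustration of that step only, the sketch below does something similar with the standard library. The helper name resolve_links and the sample hrefs are made up for this example, and dropping fragments and de-duplicating here is a simplification, not a description of Scrapy's internals.

```python
from urllib.parse import urljoin, urldefrag


def resolve_links(page_url, hrefs):
    """Resolve hrefs against page_url, drop #fragments, keep first occurrences."""
    seen = set()
    resolved = []
    for href in hrefs:
        # urljoin turns relative hrefs into absolute URLs
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            resolved.append(absolute)
    return resolved


# Hypothetical hrefs, as they might appear on a listing page
links = resolve_links(
    'http://quotes.toscrape.com/page/1/',
    ['/page/2/', '/page/2/#top', 'http://quotes.toscrape.com/login'],
)
print(links)
```

Both relative hrefs resolve to the same page, so it is listed only once.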

Step 3 −

Run your spider and save the output to a links.json file −

$ scrapy crawl all -o links.json

After running the spider, the scraped links are saved in the links.json file. A single run does not take long, but to keep the spider running around the clock (24/7) without leaving your own machine on, we need to deploy it on Scrapinghub.
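
Once links.json exists, a quick check of its contents can be done with the standard json module. This is a minimal sketch that assumes the export is a JSON array of objects that each carry a url key. That field name is an assumption, so adjust it to whatever your spider actually yields. The helper name unique_urls and the sample text are made up for illustration.

```python
import json


def unique_urls(json_text):
    """Return the sorted, de-duplicated 'url' values from exported items."""
    items = json.loads(json_text)
    # Skip any item that has no 'url' field (assumed field name)
    return sorted({item['url'] for item in items if 'url' in item})


# Sample text standing in for the contents of links.json
sample = (
    '[{"url": "http://quotes.toscrape.com/"},'
    ' {"url": "http://quotes.toscrape.com/page/2/"},'
    ' {"url": "http://quotes.toscrape.com/"}]'
)
urls = unique_urls(sample)
print(len(urls), 'unique URLs')
```

For a real file, pass in the text read from links.json instead of the sample string.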

Step 4 −

Create an account on Scrapinghub

Sign in on the Scrapinghub login page with either your Google account or your GitHub account. You are then redirected to the dashboard.

Click Create project and enter a name for the project. The project code can be added to the cloud either from the command line (CLI) or through GitHub. Here we deploy the code with the shub CLI, so first install shub −

$ pip install shub

After installing shub, log in with the API key that was generated when you created your account (you can find it at https://app.scrapinghub.com/account/apikey) −

$ shub login

If the API key is valid, you are now logged in. Next, deploy the project using the deploy ID, the 6-digit number shown in the command line part of the "Deploy your code" section −

$ shub deploy <deploy_id>

That's it for the command line. Now go back to the Spiders section of the dashboard, where the deployed spider is listed. Click the spider name and then the Run button to start it.

The dashboard shows the progress of the running job, and you no longer need to keep your local machine running 24/7.

Jennifer Nicholas

Updated on: 2019-07-30T22:30:25+05:30
