How to Scrape a Website using Puppeteer.js?


Web scraping is one of the best ways to automate the collection of data from the web. A web scraper (often loosely called a "crawler") simply visits the pages we select and returns the data extracted from them. Automating this process is far easier than manually copying data from different web pages, and scraping is also a very good solution when the page we want to extract data from does not provide any API.

In this tutorial, we will see how we can create a web scraper in Node.js using Puppeteer.js.

There are different stages in which we will build this scraper −

  • In the first step, we will code the app to open Chromium and load the URL of the website that we will be scraping.

  • In the next step, we will learn how we can scrape the details of all the books that are present on a single page.

  • Finally, we will learn how we can scrape the details of all the books that are present across multiple pages.

Prerequisites

There are two prerequisites for building this Node.js app:

  • Node.js − Have the latest stable version of Node.js installed on your local machine.

  • Code Editor/IDE − Have a decent code editor or IDE of your choice.

Now that we are done with the theory and the prerequisites, it's time to focus on the main steps of building the web scraper.

Project Setup

The first thing that we need to do is set up the project for the web scraper. Once we have Node.js installed, the next step is to create the project's root directory and install the required dependencies in it.

We will use "npm" to install the dependencies.

I am naming the project "book-scraper-app"; I have created a folder with that name, and inside it I will run the command shown below.

npm init

Once we run the above command, we will get several prompts in the terminal; feel free to enter the values of your choice. Once you are done, a file named "package.json" will be created.
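If you would rather skip the prompts entirely, npm can fill in the default values for you; this is just an optional shortcut, not something the tutorial requires −

npm init -y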

package.json

In my case, the package.json file looks something like this.

{
   "name": "book-scraper-app",
   "version": "0.1.0",
   "description": "",
   "main": "index.js",
   "scripts": {
      "test": "echo \"Error: no test specified\" && exit 1"
   },
   "author": "Mukul Latiyan",
   "license": "ISC"
}

The next step is to add Puppeteer to our project, and we do that by using the command shown below −

npm install --save-dev puppeteer

The above command installs Puppeteer along with a version of Chromium that Puppeteer uses internally.

Once the command finishes, we can verify whether Puppeteer was installed successfully by checking the package.json.

{
   "name": "book-scraper-app",
   "version": "0.1.0",
   "description": "",
   "main": "index.js",
   "scripts": {
      "test": "echo \"Error: no test specified\" && exit 1"
   },
   "author": "Mukul Latiyan",
   "license": "ISC",
   "devDependencies": {
      "puppeteer": "^15.5.0"
   }
}
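As an optional extra check, you can also ask npm directly which version of Puppeteer ended up in the project; the exact version you see will depend on when you install −

npm ls puppeteer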

Now with Puppeteer installed, the next step is to add a start script so that we can run our app with npm.

Just add the following entry to the "scripts" section of your "package.json".

"scripts": {
   "start": "node index.js",
   "test": "echo \"Error: no test specified\" && exit 1"
}

Browser Instance Setup

We are done with the project setup. Now let's focus on the browser instance setup, in which we want to make sure that when we run our app, a specific URL gets opened in the Chromium browser.

In total, the business logic of our scraper will live in four files. These will be −

  • browser.js − used to start the browser instance via puppeteer

  • index.js − starting point of our web app

  • pageController.js − a simple controller to start the scraper

  • pageScraper.js − the entire scraping logic will be present here.

So, let's first create the browser.js file, and put the following code inside it.

browser.js

const puppeteer = require('puppeteer');

async function browserInit() {
   let browser;
   try {
      console.log("Opening the browser......");
      browser = await puppeteer.launch({
         headless: false,
         ignoreDefaultArgs: ['--disable-extensions'],
         args: ["--disable-setuid-sandbox"], 'ignoreHTTPSErrors': true
      });
   } catch (err) {
      console.log("Could not create a browser instance => : ", err);
   }
   return browser;
}
module.exports = {
   browserInit
};

In the above code, we are using the launch() method, which launches an instance of the Chromium browser. launch() returns a Promise, so we must handle it either with a "then" callback or with the "await" keyword.

We are using the "await" keyword and then we are wrapping the entire code in a "try catch" block.

It should also be noted that we are passing a few options in the object that we hand to the launch() method. These mainly are −

  • headless − setting this to false runs the browser with a visible interface, which lets us watch our script execute (a headless variant for servers is sketched after this list).

  • ignoreHTTPSErrors − allows us to visit websites that are not served over a secure (HTTPS) connection.

  • ignoreDefaultArgs − here we stop Puppeteer from passing the --disable-extensions flag to Chromium, so extensions are not disabled when the browser opens.
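If you later want to run the scraper on a server without a display, you could swap the launch() options in browser.js for something like the sketch below; headless: true is the only meaningful change, the other options mirror the ones above −

// Sketch: launch options for running without a visible browser window
browser = await puppeteer.launch({
   headless: true,
   ignoreDefaultArgs: ['--disable-extensions'],
   args: ['--disable-setuid-sandbox'],
   ignoreHTTPSErrors: true
});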

Now that we are done with the browser.js file, the next step is to create the index.js file.

index.js

Consider the code shown below.

const browserObject = require('./browser');
const scraperController = require('./pageController');

let browserInstance = browserObject.browserInit();

scraperController(browserInstance)

In the above code, we import the "browser.js" and "pageController.js" files, initialize the browser, and then pass the browser instance to a function that we will write inside the "pageController.js" file.

pageController.js

Now let's create a file named "pageController.js" and then put the following code inside it.

const pageScraper = require('./pageScraper');
async function scrapeAll(browserInstance){
   let browser;
   try{
      browser = await browserInstance;
      await pageScraper.scraper(browser);
   }
   catch(err){
      console.log("Could not resolve the browser instance => ", err);
   }
}
module.exports = (browserInstance) => scrapeAll(browserInstance)

In the above code, we export a function that takes the browser instance and passes it to scrapeAll(). scrapeAll() awaits the instance and then hands the resolved browser over to pageScraper.scraper().
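The tutorial leaves Chromium open once the scraping finishes. If you prefer the browser to close automatically, one possible variant (a sketch, not part of the original flow) is to close it in scrapeAll() after the scraper returns −

async function scrapeAll(browserInstance){
   let browser;
   try{
      browser = await browserInstance;
      await pageScraper.scraper(browser);
   }
   catch(err){
      console.log("Could not resolve the browser instance => ", err);
   }
   finally{
      // Sketch: close Chromium once scraping is finished
      if (browser) {
         await browser.close();
      }
   }
}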

pageScraper.js

Finally, we need to write the "pageScraper.js" file. Consider the code of the pageScraper shown below.

const scraperObject = {
   url: 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
   async scraper(browser){
      let page = await browser.newPage();
      console.log(`Navigating to ${this.url}...`);
      await page.goto(this.url);
   }
}
module.exports = scraperObject;

In the above code, we have a fixed URL to which we navigate once Chromium starts: we open a new page with browser.newPage() and then call page.goto(), awaiting it until the navigation completes.
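By default, goto() resolves once the page fires its load event. If you want to be explicit about when navigation counts as finished, goto() also accepts a waitUntil option; the choice of 'domcontentloaded' below is just an example, not something the tutorial requires −

// Sketch: resolve as soon as the DOM is parsed instead of waiting for the full load event
await page.goto(this.url, { waitUntil: 'domcontentloaded' });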

With this, we are done with the first stage: opening Chromium and navigating to the target URL.

The Directory Structure

Your directory structure for the above code should look something like this.

├── browser.js
├── index.js
├── package-lock.json
├── package.json
├── pageController.js
└── pageScraper.js

0 directories, 6 files

Start the Project

To start the project, we need to run the command shown below −

npm run start

Once we run the above command, the Chromium browser will open and navigate to the URL that we specified inside the "pageScraper.js" file.

Scraping the Data

Now let's focus on how to extract the details of the different books present on that page with the help of CSS selectors.

Consider the updated pageScraper.js file code shown below.

const scraperObject = {
   url: 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
   async scraper(browser) {
      let page = await browser.newPage();
      console.log(`Navigating to ${this.url}...`);
      await page.goto(this.url);
      await page.waitForSelector('.page_inner');
    
      // extracting the links of all the required books
      let urls = await page.$$eval('section ol > li', links => {
         links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
         links = links.map(el => el.querySelector('h3 > a').href)
         return links;
      });
      console.log(urls);
   }
}
module.exports = scraperObject;

In the above code, we first call page.waitForSelector() to wait until the main div containing the book information has been rendered in the DOM. Next, we use the page.$$eval() method, which runs a callback in the page context over all elements matching the selector section ol > li; inside it we filter out books that are not in stock and map each list item to the href of its title link.
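As a smaller illustration of how $$eval() works, the sketch below collects only the book titles from the same listing; it relies on the title attribute that books.toscrape.com puts on each link, and the variable names are mine, not part of the tutorial −

// Sketch: the callback runs inside the page and must return a serializable value
let titles = await page.$$eval('section ol > li h3 > a', anchors =>
   anchors.map(a => a.getAttribute('title'))
);
console.log(titles);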

Output

Now if we re-run the application, a new Chromium window will open and the URLs of the books present on that page will be printed in the terminal.

> book-scraper-app@0.1.0 start
> node index.js

Opening the browser......
Navigating to http://books.toscrape.com/catalogue/category/books/childrens_11/index.html...
[
   'http://books.toscrape.com/catalogue/birdsong-a-story-in-pictures_975/index.html',
   'http://books.toscrape.com/catalogue/the-bear-and-the-piano_967/index.html',
   'http://books.toscrape.com/catalogue/the-secret-of-dreadwillow-carse_944/index.html',
   'http://books.toscrape.com/catalogue/the-white-cat-and-the-monk-a-retelling-of-the-poem-pangur-ban_865/index.html',
   'http://books.toscrape.com/catalogue/little-red_817/index.html',
   'http://books.toscrape.com/catalogue/walt-disneys-alice-in-wonderland_777/index.html',
   'http://books.toscrape.com/catalogue/twenty-yawns_773/index.html',
   'http://books.toscrape.com/catalogue/rain-fish_728/index.html',
   'http://books.toscrape.com/catalogue/once-was-a-time_724/index.html',
   'http://books.toscrape.com/catalogue/luis-paints-the-world_714/index.html',
   'http://books.toscrape.com/catalogue/nap-a-roo_567/index.html',
   'http://books.toscrape.com/catalogue/the-whale_501/index.html',
   'http://books.toscrape.com/catalogue/shrunken-treasures-literary-classics-short-sweet-and-silly_484/index.html',
   'http://books.toscrape.com/catalogue/raymie-nightingale_482/index.html',
   'http://books.toscrape.com/catalogue/playing-from-the-heart_481/index.html',
   'http://books.toscrape.com/catalogue/maybe-something-beautiful-how-art-transformed-a-neighborhood_386/index.html',
   'http://books.toscrape.com/catalogue/the-wild-robot_288/index.html',
   'http://books.toscrape.com/catalogue/the-thing-about-jellyfish_283/index.html',
   'http://books.toscrape.com/catalogue/the-lonely-ones_261/index.html',
   'http://books.toscrape.com/catalogue/the-day-the-crayons-came-home-crayons_241/index.html'
]

Now let's make changes to our pageScraper.js so that, for each of the scraped links, we open a new page instance and retrieve the relevant book data.

pageScraper.js

Consider the updated pageScraper.js file code shown below.

const scraperObject = {
   url: 'http://books.toscrape.com/catalogue/category/books/childrens_11/index.html',
   async scraper(browser) {
      let page = await browser.newPage();
      console.log(`Navigating to ${this.url}...`);
      await page.goto(this.url);
      // Wait for the required DOM to be rendered
      await page.waitForSelector('.page_inner');
    
      // Get the link to all the required books
      let urls = await page.$$eval('section ol > li', links => {
    
         // Make sure the book to be scraped is in stock
         links = links.filter(link => link.querySelector('.instock.availability > i').textContent !== "In stock")
       
         // Extract the links from the data
         links = links.map(el => el.querySelector('h3 > a').href)
         return links;
      });
    
      let pagePromise = (link) => new Promise(async(resolve, reject) => {
         let sampleObj = {};
         let newPage = await browser.newPage();
         await newPage.goto(link);
         sampleObj['bookTitle'] = await newPage.$eval('.product_main > h1', text => text.textContent);
         sampleObj['bookPrice'] = await newPage.$eval('.price_color', text => text.textContent);
         sampleObj['noAvailable'] = await newPage.$eval('.instock.availability', text => {

            // Strip new line and tab spaces
            text = text.textContent.replace(/(\r\n\t|\n|\r|\t)/gm, "");

            // Get the number of stock available
            let regexp = /^.*\((.*)\).*$/i;
            let stockAvailable = regexp.exec(text)[1].split(' ')[0];
            return stockAvailable;
         });
         sampleObj['imageUrl'] = await newPage.$eval('#product_gallery img', img => img.src);
         sampleObj['bookDescription'] = await newPage.$eval('#product_description', div => div.nextSibling.nextSibling.textContent);
         sampleObj['upc'] = await newPage.$eval('.table.table-striped > tbody > tr > td', table => table.textContent);
         resolve(sampleObj);
         await newPage.close();
      });

      for(link in urls){
         let currentPageData = await pagePromise(urls[link]);
         // scrapedData.push(currentPageData);
         console.log(currentPageData);
      }
   }
}
module.exports = scraperObject;

Now, as we loop through the array of book URLs, a new page is opened for each URL, the data on that page is scraped, the page is closed, and a new page is opened for the next URL in the array.
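The loop above only logs each result; the commented-out scrapedData.push() line hints at collecting them instead. A minimal sketch of that variant is shown below (returning the array from scraper() is my assumption, not something the tutorial does) −

// Sketch: collect every scraped book into an array and hand it back to the caller
let scrapedData = [];
for(link in urls){
   let currentPageData = await pagePromise(urls[link]);
   scrapedData.push(currentPageData);
}
console.log(`${scrapedData.length} books scraped`);
return scrapedData; // pageController.js could then save or further process the data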

Output

Now you can re-run the application and you should see the following output −

> book-scraper-app@0.1.0 start
> node index.js

Opening the browser......
Navigating to http://books.toscrape.com/catalogue/category/books/childrens_11/index.html...
{
   bookTitle: 'Birdsong: A Story in Pictures',
   bookPrice: '£54.64',
   noAvailable: '19',
   imageUrl: 'http://books.toscrape.com/media/cache/af/2f/af2fe2419ea136f2cd567aa92082c3ae.jpg',
   bookDescription: "Bring the thrilling story of one red bird to life. When an innocent bird meets two cruel kids, their world is forever changed. But exactly how that change unfolds is up to you, in the tradition of Kamishibai—Japanese paper theater. The wordless story by master cartoonist James Sturm is like a haiku—the elegant images leave space for children to inhabit this timeless tale—a Bring the thrilling story of one red bird to life. ...more",
   upc: '9528d0948525bf5f'
}
{
   bookTitle: 'The Bear and the Piano',
   bookPrice: '£36.89',
   noAvailable: '18',
   imageUrl: 'http://books.toscrape.com/media/cache/d0/87/d0876dcd1a6530a4cb54903aad7a3e28.jpg',
   bookDescription: 'One day, a young bear stumbles upon something he has never seen before in the forest. As time passes, he teaches himself how to play the strange instrument, and eventually the beautiful sounds are heard by a father and son who are picnicking in the woods....more',
   upc: '9f6568e9c95f60b0'
}
….

Conclusion

In this tutorial, we learned how to create a web scraper in Node.js using Puppeteer.js.
