In this tutorial we are going to develop a small Node.js application that scrapes paginated content and exports the data to a JSON file. The full source code for the tutorial can be found here.
We will be scraping a list website, saving ten lists per page from the “new lists” section. The final application can be seen below.
In this tutorial we will cover the following concepts: making HTTP requests, parsing HTML with jQuery-style selectors, recursion and writing data to a JSON file.
We will use the following packages: Axios, Cheerio, Chalk, Nodemon and the core fs module.
Let's get started by opening a Terminal window and running the following commands in your project folder:
mkdir cheerio-pagination-tutorial
cd cheerio-pagination-tutorial
npm init
Follow the prompts to set up the project; entering the default information will suffice. Once the project has been initiated, a package.json file will have been created in the project directory.
npm i axios chalk cheerio nodemon --save
You may have noticed that we are not installing the fs package, as this is a core Node module which is installed by default. Below is a short description of what each package does and how we will use it:
Axios: This is a JavaScript library that we use to make HTTP requests from the application. It can also be used to make XMLHttpRequests from the browser. It addresses some of the shortcomings of the native Fetch API, like transforming JSON data automatically.
Chalk: This has been included to add some spice to the application, but it may come in handy if you ever need to develop an app which has a lot of interaction with the terminal. It allows a user to add styling to information that is logged to the console. For this app we are going to use Chalk to add some font colors, bolding, underlines and background colors.
Cheerio: Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does: it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. We are going to use it to traverse the HTML that we retrieved using Axios.
fs: The Node.js file system module allows you to work with the file system on your computer. We will be using it to save the data to a JSON file.
Nodemon: Nodemon monitors for any changes in your Node.js application and automatically restarts the server.
Before we start with the actual development of our app, we have a bit more boilerplate to do. Let's start by opening the package.json file and adding the following code:
"nodemonConfig": {
"ignore": [
"*.json"
]
}
This tells Nodemon to ignore changes to any .json files, which will prevent our project from restarting when we export our JSON data.
Add the following line to the scripts property:
"start": "./node_modules/nodemon/bin/nodemon.js ./src/index.js",
This allows you to run the application by entering npm start into the Terminal from the root directory of the app, as it points to the Nodemon script and the index.js script (which we will create shortly).
If you install Nodemon globally, by running sudo npm i -g nodemon, then the app can be started by running nodemon; it will then look for whatever file has been specified as the main file in the package.json.
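For reference, here is a rough sketch of what the relevant parts of package.json might look like at this point. The name and version are just the npm init defaults, the main entry assumes you point it at src/index.js so that a global nodemon picks it up, and the dependencies section is omitted since the exact versions will differ:
{
  "name": "cheerio-pagination-tutorial",
  "version": "1.0.0",
  "main": "src/index.js",
  "scripts": {
    "start": "./node_modules/nodemon/bin/nodemon.js ./src/index.js"
  },
  "nodemonConfig": {
    "ignore": [
      "*.json"
    ]
  }
}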
You can have a look at your package.json file to make sure everything is correct. That is it; we are now ready to start developing.
Create a new file at src/index.js, then let's require our external dependencies by adding the following code to the file:
// External dependencies
const axios = require('axios')
const cheerio = require('cheerio')
const fs = require('fs')
const chalk = require('chalk')
You can already run npm start, which will reload the app every time a file changes. Next, let's set up some global variables that our app will use:
const url = 'http://listverse.com/'
const outputFile = 'data.json'
const parsedResults = []
const pageLimit = 10
let pageCounter = 0
let resultCount = 0
The url variable is the website that we will be scraping, the outputFile variable is the .json file the app will output, and the parsedResults variable is an empty array where we will insert each result. The pageLimit is used to limit the number of pages scraped, and lastly the pageCounter and resultCount are used to keep track of the number of pages and results that have been scraped.
Let's create our async function to retrieve our data and load it into Cheerio, which will make it accessible to us via jQuery-style selectors:
console.log(chalk.yellow.bgBlue(`\n Scraping of ${chalk.underline.bold(url)} initiated...\n`))
const getWebsiteContent = async (url) => {
try {
const response = await axios.get(url)
const $ = cheerio.load(response.data)
} catch (error) {
console.error(error)
}
}
getWebsiteContent(url)
Firstly we print to the console that the scraping has been initiated. Here you can see that we are using Chalk to change the text color to yellow, add a blue background and also bold and underline the url. As stated, it is just to add some spice, but if you do have a console-intensive app it could come in handy for readability.
Next we create an async function expression, getWebsiteContent, which uses Axios to retrieve the website url. A try and catch statement has been implemented, which will handle any errors and print them to the console. With the $ we load the retrieved webpage into Cheerio.
One has to keep in mind that when scraping websites, each website is unique and there is no single solution which can be applied to all of them. With that said, with a little jQuery knowledge and the skeleton of this app, you can easily adjust this script to scrape almost any website.
When inspecting the source of 'http://listverse.com/', we can see that there are some clear selectors which we can use to get the new lists. Below where we load Cheerio in our getWebsiteContent function, add the following code:
// New Lists
$('.wrapper .main .new article').map((i, el) => {
const count = resultCount++
const title = $(el).find('h3').text()
const url = $(el).find('a').attr('href')
const metadata = {
count: count,
title: title,
url: url
}
parsedResults.push(metadata)
})
The above code maps through all the new articles and pushes their count, title and url to the metadata object, which is then pushed to the parsedResults array.
The code used to select the elements is basically jQuery and you can get a list of all available selectors from the Cheerio website.
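As a rough standalone illustration (the markup below is made up and not taken from Listverse), this is how those jQuery-style selectors behave once markup has been loaded into Cheerio:
const cheerio = require('cheerio')

// A made-up snippet of markup, purely to demonstrate the selectors
const $ = cheerio.load(`
  <div class="new">
    <article><h3>First list</h3><a href="/first">Read</a></article>
    <article><h3>Second list</h3><a href="/second">Read</a></article>
  </div>
`)

$('.new article').each((i, el) => {
  console.log($(el).find('h3').text())      // "First list", then "Second list"
  console.log($(el).find('a').attr('href')) // "/first", then "/second"
})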
Now that we have some results in our parsedResults array, let's create a function expression that will save these results to the outputFile:
const exportResults = (parsedResults) => {
fs.writeFile(outputFile, JSON.stringify(parsedResults, null, 4), (err) => {
if (err) {
console.log(err)
return
}
console.log(chalk.yellow.bgBlue(`\n ${chalk.underline.bold(parsedResults.length)} Results exported successfully to ${chalk.underline.bold(outputFile)}\n`))
})
}
The above function takes one parameter, our parsedResults array, and uses it to create the outputFile. If there are no errors, we use Chalk to print out a fancy line saying that our results have been exported successfully.
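To give an idea of the shape of the output, an entry in data.json will look roughly like this (the title and url shown are illustrative placeholders, not actual scrape results):
[
    {
        "count": 0,
        "title": "An example list title",
        "url": "http://listverse.com/an-example-list/"
    }
]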
Now we can get a single website page, extract data from it and export it, but how would we go about getting the rest of the pages from the paginated list?
Recursion
The process in which a function calls itself directly or indirectly is called recursion, and the corresponding function is called a recursive function. Using a recursive algorithm, certain problems can be solved quite easily.
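As a quick standalone illustration, unrelated to the scraper itself, a recursive countdown function in JavaScript looks like this:
const countdown = (n) => {
  if (n === 0) return // base case: stop the recursion
  console.log(n)
  countdown(n - 1)    // the function calls itself with a smaller input
}

countdown(3) // logs 3, 2, 1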
We are going to call our getWebsiteContent function within itself, each time passing through a new link as a parameter. Inside the getWebsiteContent function, below the map function, add the following code:
// Pagination Elements Link
const nextPageLink = $('.pagination').find('.curr').parent().next().find('a').attr('href')
console.log(chalk.cyan(` Scraping: ${nextPageLink}`))
pageCounter++
if (pageCounter === pageLimit) {
exportResults(parsedResults)
return false
}
getWebsiteContent(nextPageLink)
Firstly we get our nextPageLink by using Cheerio to find the parent of the element with the class .curr, and then we get the element next to it. We log the progress with the Chalk plugin and also keep increasing the pageCounter variable to enforce the limit which we specified at the start of the app.
This limit can be handy if a website only allows a certain number of calls per minute; using the log, the scraper can be run again with the last url it scraped as a starting point. Of course, there are also programmatic ways to do this, such as using timeout functions to create delays.
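As a rough sketch of that timeout approach (not part of the tutorial code), the recursive call inside getWebsiteContent could be wrapped in a small delay helper; the delayMs value is an assumption and would need to be tuned to the site's limits:
// Hypothetical helper: resolves after the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Inside the async getWebsiteContent function, instead of recursing immediately,
// wait a few seconds between pages to stay under a rate limit
const delayMs = 3000 // assumed value
await delay(delayMs)
getWebsiteContent(nextPageLink)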
We use an if statement to run the exportResults function and end the current function once the pageLimit has been reached. If the pageLimit has not been reached, we execute getWebsiteContent(nextPageLink) once more, this time passing in the new url.
Now our scraper is running through each page, getting the results and pushing them to our ‘parsedResults’ array.
Finally, we also want to add the exportResults function to the catch block; the reason being that if an error is encountered after thousands of results have been scraped, those results are still exported.
} catch (error) {
exportResults(parsedResults)
console.error(error)
}
That is it, we now have a fully functional scraper that can be used to traverse any DOM and find and store the results.
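For convenience, here is the full src/index.js assembled from the snippets above (including the minor fixes discussed along the way); treat it as a sketch to compare against your own file rather than the canonical source:
// External dependencies
const axios = require('axios')
const cheerio = require('cheerio')
const fs = require('fs')
const chalk = require('chalk')

const url = 'http://listverse.com/'
const outputFile = 'data.json'
const parsedResults = []
const pageLimit = 10
let pageCounter = 0
let resultCount = 0

console.log(chalk.yellow.bgBlue(`\n Scraping of ${chalk.underline.bold(url)} initiated...\n`))

// Writes the collected results to the output .json file
const exportResults = (parsedResults) => {
  fs.writeFile(outputFile, JSON.stringify(parsedResults, null, 4), (err) => {
    if (err) {
      console.log(err)
      return
    }
    console.log(chalk.yellow.bgBlue(`\n ${chalk.underline.bold(parsedResults.length)} Results exported successfully to ${chalk.underline.bold(outputFile)}\n`))
  })
}

// Retrieves a page, extracts the new lists and recurses to the next page
const getWebsiteContent = async (url) => {
  try {
    const response = await axios.get(url)
    const $ = cheerio.load(response.data)

    // New Lists
    $('.wrapper .main .new article').map((i, el) => {
      const count = resultCount++
      const title = $(el).find('h3').text()
      const url = $(el).find('a').attr('href')
      const metadata = {
        count: count,
        title: title,
        url: url
      }
      parsedResults.push(metadata)
    })

    // Pagination Elements Link
    const nextPageLink = $('.pagination').find('.curr').parent().next().find('a').attr('href')
    console.log(chalk.cyan(` Scraping: ${nextPageLink}`))

    pageCounter++
    if (pageCounter === pageLimit) {
      exportResults(parsedResults)
      return false
    }

    getWebsiteContent(nextPageLink)
  } catch (error) {
    exportResults(parsedResults)
    console.error(error)
  }
}

getWebsiteContent(url)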