In this tutorial we are going to develop a small Node.js application that scrapes paginated content and exports the data to a JSON file. The full source code for the tutorial can be found here.
We will be scraping a list website, saving ten lists per page from the “new lists” section. The final application can be seen below.
In this tutorial we will cover the following concepts: making HTTP requests, parsing HTML with jQuery-style selectors, recursion and writing data to a JSON file.
We will use the following packages: Axios, Cheerio, Chalk, Nodemon and the core fs module.
Let's get started by opening a Terminal window and running the following commands in your project folder:
mkdir cheerio-pagination-tutorial
cd cheerio-pagination-tutorial
npm init
Follow the prompts to set up the project; entering the default information will suffice. Once the project has been initiated, a package.json file will have been created in the project directory.
npm i axios chalk cheerio nodemon --save
You may have noticed that we are not installing the fs package, as this is a core Node module which is installed by default. Below is a short description of what each package does and how we will use it:
Axios: This is a JavaScript library that we use to make HTTP requests from the application. It can also be used to make XMLHttpRequests from the browser. It addresses some of the shortcomings of the native Fetch API, like transforming JSON data automatically.
Chalk: This has been included to add some spice to the application, but it may come in handy if you ever need to develop an app which has a lot of interaction with the terminal. It allows a user to add styling to information that is logged to the console. For this app we are going to use Chalk to add some font colors, bolding, underlines and background colors.
Cheerio: Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does: it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. We are going to use it to traverse the HTML that we retrieved using Axios.
fs: The Node.js file system module allows you to work with the file system on your computer. We will be using it to save the data to a JSON file.
Nodemon: Nodemon monitors for any changes in your Node.js application and automatically restarts the server.
Before we start with the actual development of our app, we have a bit more boilerplate to do. Let's start by opening the package.json file and adding the following code:
"nodemonConfig": {
"ignore": [
"*.json"
]
}
This tells Nodemon to ignore changes to any .json files, which will prevent our project from restarting when we export our JSON data.
Add the following line to the scripts property:
"start": "./node_modules/nodemon/bin/nodemon.js ./src/index.js",
This allows you to run the application by entering npm start into the Terminal from the root directory of the app, as it points to the Nodemon script and the index.js script (which we will create shortly).
If you install Nodemon globally, by running sudo npm i -g nodemon, then the app can be started by running nodemon; it will then look for whatever file has been specified as the main file in the package.json.
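For reference, here is a rough sketch of what the relevant parts of package.json might look like at this point. The name and version are just the npm init defaults, the main entry assumes you point it at src/index.js so that a global nodemon picks it up, and the dependencies section is omitted since the exact versions will differ:
{
  "name": "cheerio-pagination-tutorial",
  "version": "1.0.0",
  "main": "src/index.js",
  "scripts": {
    "start": "./node_modules/nodemon/bin/nodemon.js ./src/index.js"
  },
  "nodemonConfig": {
    "ignore": [
      "*.json"
    ]
  }
}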
You can have a look at your package.json file to make sure everything is correct. That is it; we are now ready to start developing.
Create a new file at src/index.js, then let's require our external dependencies by adding the following code to the file:
// External dependencies
const axios = require('axios')
const cheerio = require('cheerio')
const fs = require('fs')
const chalk = require('chalk')
You can already run npm start, which will reload the app every time a file changes. Next, let's set up some global variables that our app will use:
const url = 'http://listverse.com/'
const outputFile = 'data.json'
const parsedResults = []
const pageLimit = 10
let pageCounter = 0
let resultCount = 0
The url variable is the website that we will be scraping, the outputFile variable is the .json file the app will output, and the parsedResults variable is an empty array where we will insert each result. The pageLimit is used to limit the number of pages scraped, and lastly the pageCounter and resultCount are used to keep track of the number of pages and results that have been scraped.
Let's create our async function to retrieve our data and load it into Cheerio, which will make it accessible to us via jQuery-style selectors:
console.log(chalk.yellow.bgBlue(`\n Scraping of ${chalk.underline.bold(url)} initiated...\n`))
const getWebsiteContent = async (url) => {
try {
const response = await axios.get(url)
const $ = cheerio.load(response.data)
} catch (error) {
console.error(error)
}
}
getWebsiteContent(url)
Firstly we print to the console that the scraping has been initiated. Here you can see that we are using Chalk to change the text color to yellow, add a blue background and also bold and underline the url. As stated, it is just to add some spice, but if you do have a console-intensive app it could come in handy for readability.
Next we create an async function expression, getWebsiteContent, which uses Axios to retrieve the website url. A try and catch statement has been implemented, which will handle any errors and print them to the console. With the $ we load the retrieved webpage into Cheerio.
One has to keep in mind that when scraping websites, each website is unique and there is no single solution which can be applied to all of them. With that said, with a little jQuery knowledge and the skeleton of this app, you can easily adjust this script to scrape almost any website.
When inspecting the source of 'http://listverse.com/', we can see that there are some clear selectors which we can use to get the new lists. Below where we load Cheerio in our getWebsiteContent function, add the following code:
// New Lists
$('.wrapper .main .new article').map((i, el) => {
const count = resultCount++
const title = $(el).find('h3').text()
const url = $(el).find('a').attr('href')
const metadata = {
count: count,
title: title,
url: url
}
parsedResults.push(metadata)
})
The above code maps through all the new articles and pushes their count, title and url to the metadata object, which is then pushed to the parsedResults array.
The code used to select the elements is basically jQuery and you can get a list of all available selectors from the Cheerio website.
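As a rough standalone illustration (the markup below is made up and not taken from Listverse), this is how those jQuery-style selectors behave once markup has been loaded into Cheerio:
const cheerio = require('cheerio')

// A made-up snippet of markup, purely to demonstrate the selectors
const $ = cheerio.load(`
  <div class="new">
    <article><h3>First list</h3><a href="/first">Read</a></article>
    <article><h3>Second list</h3><a href="/second">Read</a></article>
  </div>
`)

$('.new article').each((i, el) => {
  console.log($(el).find('h3').text())      // "First list", then "Second list"
  console.log($(el).find('a').attr('href')) // "/first", then "/second"
})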
Now that we have some results in our parsedResults array, let's create a function expression that will save these results to the outputFile:
const exportResults = (parsedResults) => {
fs.writeFile(outputFile, JSON.stringify(parsedResults, null, 4), (err) => {
if (err) {
console.log(err)
return
}
console.log(chalk.yellow.bgBlue(`\n ${chalk.underline.bold(parsedResults.length)} Results exported successfully to ${chalk.underline.bold(outputFile)}\n`))
})
}
The above function takes one parameter, our parsedResults array, and uses it to create the outputFile. If there are no errors, we use Chalk to print out a fancy line saying that our results have been exported successfully.
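To give an idea of the shape of the output, an entry in data.json will look roughly like this (the title and url shown are illustrative placeholders, not actual scrape results):
[
    {
        "count": 0,
        "title": "An example list title",
        "url": "http://listverse.com/an-example-list/"
    }
]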
Now we can get a single website page, extract data from it and export it, but how would we go about getting the rest of the pages from the paginated list?
Recursion
The process in which a function calls itself directly or indirectly is called recursion, and the corresponding function is called a recursive function. Using a recursive algorithm, certain problems can be solved quite easily.
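As a quick standalone illustration, unrelated to the scraper itself, a recursive countdown function in JavaScript looks like this:
const countdown = (n) => {
  if (n === 0) return // base case: stop the recursion
  console.log(n)
  countdown(n - 1)    // the function calls itself with a smaller input
}

countdown(3) // logs 3, 2, 1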
We are going to call our getWebsiteContent function within itself, each time passing through a new link as a parameter. Inside the getWebsiteContent function, below the map function, add the following code:
// Pagination Elements Link
const nextPageLink = $('.pagination').find('.curr').parent().next().find('a').attr('href')
console.log(chalk.cyan(` Scraping: ${nextPageLink}`))
pageCounter++
if (pageCounter === pageLimit) {
exportResults(parsedResults)
return false
}
getWebsiteContent(nextPageLink)
Firstly we get our nextPageLink by using Cheerio to find the parent of the element with the class .curr, and then we get the element next to it. We log the progress with the Chalk plugin and also keep increasing the pageCounter variable to enforce the limit which we specified at the start of the app.
This limit can be handy if a website only allows a certain number of calls per minute; using the log, the scraper can be run again with the last url it scraped as a starting point. Of course, there are also programmatic ways to do this, such as using timeout functions to create delays.
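As a rough sketch of that timeout approach (not part of the tutorial code), the recursive call inside getWebsiteContent could be wrapped in a small delay helper; the delayMs value is an assumption and would need to be tuned to the site's limits:
// Hypothetical helper: resolves after the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Inside the async getWebsiteContent function, instead of recursing immediately,
// wait a few seconds between pages to stay under a rate limit
const delayMs = 3000 // assumed value
await delay(delayMs)
getWebsiteContent(nextPageLink)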
We use an if statement to run the exportResults function and end the current function once the pageLimit has been reached. If the pageLimit has not been reached, we execute getWebsiteContent(nextPageLink) once more, this time passing in the new url.
Now our scraper is running through each page, getting the results and pushing them to our ‘parsedResults’ array.
Finally, we also want to add the exportResults function to the catch block; the reason being that if an error is encountered after thousands of results have been scraped, those results are still exported.
} catch (error) {
exportResults(parsedResults)
console.error(error)
}
That is it, we now have a fully functional scraper that can be used to traverse any DOM and find and store the results.
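For convenience, here is the full src/index.js assembled from the snippets above (including the minor fixes discussed along the way); treat it as a sketch to compare against your own file rather than the canonical source:
// External dependencies
const axios = require('axios')
const cheerio = require('cheerio')
const fs = require('fs')
const chalk = require('chalk')

const url = 'http://listverse.com/'
const outputFile = 'data.json'
const parsedResults = []
const pageLimit = 10
let pageCounter = 0
let resultCount = 0

console.log(chalk.yellow.bgBlue(`\n Scraping of ${chalk.underline.bold(url)} initiated...\n`))

// Writes the collected results to the output .json file
const exportResults = (parsedResults) => {
  fs.writeFile(outputFile, JSON.stringify(parsedResults, null, 4), (err) => {
    if (err) {
      console.log(err)
      return
    }
    console.log(chalk.yellow.bgBlue(`\n ${chalk.underline.bold(parsedResults.length)} Results exported successfully to ${chalk.underline.bold(outputFile)}\n`))
  })
}

// Retrieves a page, extracts the new lists and recurses to the next page
const getWebsiteContent = async (url) => {
  try {
    const response = await axios.get(url)
    const $ = cheerio.load(response.data)

    // New Lists
    $('.wrapper .main .new article').map((i, el) => {
      const count = resultCount++
      const title = $(el).find('h3').text()
      const url = $(el).find('a').attr('href')
      const metadata = {
        count: count,
        title: title,
        url: url
      }
      parsedResults.push(metadata)
    })

    // Pagination Elements Link
    const nextPageLink = $('.pagination').find('.curr').parent().next().find('a').attr('href')
    console.log(chalk.cyan(` Scraping: ${nextPageLink}`))

    pageCounter++
    if (pageCounter === pageLimit) {
      exportResults(parsedResults)
      return false
    }

    getWebsiteContent(nextPageLink)
  } catch (error) {
    exportResults(parsedResults)
    console.error(error)
  }
}

getWebsiteContent(url)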