Building a website isn’t a difficult task in a lot of circumstances, but maintaining a website is a totally different story. When it comes to larger scale websites or blogs such as The Polyglot Developer, content can become out of date at any time, and it’s more likely to happen the longer the content appears on the site.
Stale content and content that doesn’t work, whether that be through links, formatting, or something else, can severely damage how you rank in search results and the overall growth of your website.
Let’s dig a little deeper, using links as an example.
You’re probably going to have links on your website, whether they be internal or external in terms of where they route to. As your website evolves, or as the external websites evolve, those links might change and break. Broken links create a poor experience, something that Google and other search engines take into consideration when it comes to search engine optimization (SEO).
In this tutorial, we’re going to use simple JavaScript to find all of the broken links on an entire website, not just the current page.
Just to throw it out there, I am currently using the strategies in this tutorial to scan for broken links on The Polyglot Developer. I don’t look for broken links every day, but a few times a year I check to make sure everything is operating smoothly.
There are quite a few options when it comes to checking links, and even more if you’d prefer a standalone application over JavaScript code. For this example, we’re going to use broken-link-checker, an NPM package found on GitHub.
To make use of this package, install it with the following command, assuming the project is your current working directory:
npm install broken-link-checker --save-dev
Remember, we plan to use it in our JavaScript project, not as a shell application.
Rather than trying to integrate this package into an already existing JavaScript file in our project, we’re going to create a fresh one. Create a scanner.js file with the following code:
const { SiteChecker } = require("broken-link-checker");

const siteChecker = new SiteChecker(
    {
        // Check both internal and external links
        excludeInternalLinks: false,
        excludeExternalLinks: false,
        // 0 = clickable links only
        filterLevel: 0,
        acceptedSchemes: ["http", "https"],
        // Skip URLs containing these keywords
        excludedKeywords: ["linkedin"]
    },
    {
        "error": (error) => {
            console.error(error);
        },
        "link": (result, customData) => {
            if(result.broken) {
                // Only report broken links whose status code is not on the ignore list
                if(result.http.response && ![undefined, 200].includes(result.http.response.statusCode)) {
                    console.log(`${result.http.response.statusCode} => ${result.url.original}`);
                }
            }
        },
        "end": () => {
            console.log("COMPLETED!");
        }
    }
);

// Start crawling from the site root
siteChecker.enqueue("http://localhost:1313/");
So let’s break down everything that’s happening in the above code.
After importing the SiteChecker class, we create a new instance of it with some configuration information. For this particular example, we’re saying that we want to include internal links as defined by the domain that we later plan to enqueue, as well as links that are not on our local domain. In this example, localhost links would be considered internal links.
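There are various levels of filtering when it comes to links that should be checked. Per the documentation, the filter options are as follows:
0: clickable links
1: clickable links, media, iframes, meta refreshes
2: clickable links, media, iframes, meta refreshes, stylesheets, scripts, forms
3: clickable links, media, iframes, meta refreshes, stylesheets, scripts, forms, metadata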
Because this is a programming blog and there will likely be links inside code blocks, I didn’t want to chance any false positives. For this reason, I only wanted clickable links to be checked.
There are many different link schemes on this particular site. Rather than checking my Bitcoin, email, and other random schemes, I defined a short list of what I actually wanted to scan. I also know that there are some sites I don’t want to scan because they will return an anti-crawl rejection error. LinkedIn, for example, will not let you check links for validity with this tool. You can add anything to this excludedKeywords list; it’s entirely up to you.
So that’s the configuration out of the way.
Next, we want to define listener functions for different events. There are quite a few events available, but we’re paying attention to error, link, and end for this particular example.
The link event is triggered for every result, whether the result is a broken link or not. We can use this event to figure out what links were checked and the resulting information that came with them. So for example:
"link": (result, customData) => {
    if(result.broken) {
        if(result.http.response && ![undefined, 200].includes(result.http.response.statusCode)) {
            console.log(`${result.http.response.statusCode} => ${result.url.original}`);
        }
    }
},
The first thing we’re doing is checking to see if the link had a broken status. For a lot of people, just knowing whether or not a link is broken is enough. I need a little more information because there are some broken scenarios that I don’t care too much about.
If the link is considered broken, we check to see if the status code from the response matches any status codes on our ignore list. You’ll likely never end up with a 200 code on a broken link, but undefined is fair game. You might also choose to ignore 401 or 403 codes if you’re not too interested in unauthorized or forbidden statuses. A lot of people are mostly interested in 404 and 500 error codes because those pages can’t be reached by anyone.
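If the ignore list starts to grow, it may be easier to pull the check into its own function. The following is only a sketch: the shouldReport function and the IGNORED_STATUS_CODES constant are hypothetical names that are not part of broken-link-checker, and adding 401 and 403 to the list is just one possible choice.

```javascript
// Hypothetical ignore list; undefined covers responses without a status code.
const IGNORED_STATUS_CODES = [undefined, 200, 401, 403];

// Returns true when a result from the "link" event should be printed.
// Mirrors the inline check used in scanner.js, with a configurable list.
function shouldReport(result, ignored = IGNORED_STATUS_CODES) {
    if (!result.broken) {
        return false;
    }
    const response = result.http && result.http.response;
    if (!response) {
        return false;
    }
    return !ignored.includes(response.statusCode);
}
```

Inside the link listener, the nested conditions could then be replaced with a single if(shouldReport(result)) check.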
If the status code is not on the list, I print out the status code and the original URL that was checked. There are other URLs that are part of the result as well. For example, maybe you want the URL which the broken link resided on as well. Or if there was a redirect, maybe you want the URL that was in your HTML as well as the new URL. You’ll likely want to print out the full result to see what data is available.
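As a rough sketch of pulling those extra URLs into the output, something like the following might work. The describeLink function is a hypothetical helper, and the base.original and url.redirected fields are assumed from the 0.7.x documentation of broken-link-checker, so print a full result first to confirm the fields exist in the version you have installed.

```javascript
// Hypothetical helper that builds one line of output for a link result.
// Assumed fields: url.original (the URL as written in the HTML),
// base.original (the page the link was found on), and url.redirected
// (the final URL if a redirect happened).
function describeLink(result) {
    const status = result.http?.response?.statusCode ?? "n/a";
    const parts = [`${status} => ${result.url.original}`];
    if (result.base?.original) {
        parts.push(`found on ${result.base.original}`);
    }
    if (result.url.redirected) {
        parts.push(`redirected to ${result.url.redirected}`);
    }
    return parts.join(" | ");
}
```

The returned string could be passed to console.log in place of the template string used in the link listener.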
If I wanted to run the above file, I could do the following:
node scanner.js
It might take a while depending on how many links are being checked, but along the way you should see output of all the broken links. Fixing these broken links will be good for your users as well as your search engine optimization (SEO).
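On a large site the output can scroll by quickly, so an optional extra is to collect the broken links and print them grouped by status code once the scan ends. This is a minimal sketch and not something scanner.js does; groupByStatus and printSummary are hypothetical helper names.

```javascript
// Groups collected entries of the form { statusCode, url } by status code.
function groupByStatus(brokenLinks) {
    const groups = new Map();
    for (const { statusCode, url } of brokenLinks) {
        if (!groups.has(statusCode)) {
            groups.set(statusCode, []);
        }
        groups.get(statusCode).push(url);
    }
    return groups;
}

// Prints each status code followed by the URLs that returned it.
function printSummary(groups) {
    for (const [statusCode, urls] of groups) {
        console.log(`${statusCode}: ${urls.length} link(s)`);
        urls.forEach((url) => console.log(`  ${url}`));
    }
}

// Possible wiring inside scanner.js (assumes an array named found):
//   "link" listener: found.push({ statusCode: result.http.response.statusCode, url: result.url.original });
//   "end" listener:  printSummary(groupByStatus(found));
```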
While you could run the code and file as is, I actually made some modifications to include it as part of my Gulp pipeline.
If you’re using Gulp like I am, you can do something like the following in your gulpfile.js file:
const gulp = require("gulp");
const { SiteChecker } = require("broken-link-checker");

gulp.task("check-links", function(done) {
    const siteChecker = new SiteChecker(
        {
            excludeInternalLinks: true,
            excludeExternalLinks: false,
            filterLevel: 0,
            acceptedSchemes: ["http", "https"],
            excludedKeywords: ["linkedin", "facebook", "twitter", "reddit", "youtube", "ycombinator", "namecheap"],
            excludeLinksToSamePage: true
        },
        {
            "error": (error) => {
                console.error(error);
            },
            "link": (result, customData) => {
                if(result.broken) {
                    if(result.http.response && ![undefined, 200].includes(result.http.response.statusCode)) {
                        console.log(`${result.http.response.statusCode} => ${result.url.original}`);
                    }
                }
            },
            "end": () => {
                done();
            }
        }
    );
    siteChecker.enqueue("http://localhost:1313/");
});
In the above code, take note of the done function used in the end event. You need to be able to tell your Gulp task when your asynchronous activity has completed. By making use of the done function, we can execute it when the broken link check has completed. If we wanted to, we could run this task with the following command:
gulp check-links
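Gulp also treats a task as complete when a returned promise resolves, so an alternative to the done callback is to wrap the scan in a promise. The sketch below is one way that might look. The scanSite function is a hypothetical helper, and the checker class is passed in as a parameter so the wrapper itself has no dependencies.

```javascript
// Hypothetical wrapper: resolves with every broken result once the
// "end" event fires. SiteCheckerClass is expected to behave like the
// SiteChecker class from broken-link-checker.
function scanSite(SiteCheckerClass, url, options = {}) {
    return new Promise((resolve) => {
        const broken = [];
        const checker = new SiteCheckerClass(options, {
            error: (error) => console.error(error),
            link: (result) => {
                if (result.broken) {
                    broken.push(result);
                }
            },
            end: () => resolve(broken)
        });
        checker.enqueue(url);
    });
}

// Possible use in gulpfile.js:
//   const { SiteChecker } = require("broken-link-checker");
//   gulp.task("check-links", () =>
//       scanSite(SiteChecker, "http://localhost:1313/", { filterLevel: 0 })
//           .then((broken) => broken.forEach((r) => console.log(r.url.original))));
```

Unlike the task above, this sketch collects every broken result without filtering on status code, so any ignore list would have to be applied to the resolved array.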
While this Gulp task isn’t any more useful than the stand-alone JavaScript file, it could be if you had a more sophisticated build and deployment pipeline. For more information on Gulp and the pipelines you can create with it, check out my previous tutorial titled, Getting Familiar with Gulp for Workflow Automation.
If you want to periodically check your website for broken or unreachable links, and this is something you should do, you have plenty of options. Some of those options, like the one demonstrated in this tutorial, use JavaScript to recursively scan the links on a website.
It’s probably not a good idea to check for broken links too frequently. External web hosts might block your IP thinking you’re a bot, and it does take time and network resources to complete the task, more so if you’ve got a lot of content with links.
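If blocked requests are a concern, the checker can also be slowed down through its options. The option names below are taken from the 0.7.x documentation of broken-link-checker and the values are purely illustrative, so treat this as a starting point and confirm the names against the version you have installed.

```javascript
// Illustrative "polite" configuration; politeOptions is a hypothetical name.
const politeOptions = {
    filterLevel: 0,
    acceptedSchemes: ["http", "https"],
    // Skip pages that robots rules ask crawlers not to follow
    honorRobotExclusions: true,
    // Keep to one concurrent connection per host
    maxSocketsPerHost: 1,
    // Wait this many milliseconds before each request
    rateLimit: 500,
    // Reuse responses for URLs that appear more than once
    cacheResponses: true
};
```

This object would be passed as the first argument to the SiteChecker constructor in place of the inline options.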
A video version of this tutorial can be found below.