Web Scraping Using JavaScript and Node.js

The internet is a source of all kinds of useful (and useless) data. Most people access that data manually through a web browser. You can visit a website using a web browser to do things like checking social media, reading the latest news, or looking up stock and cryptocurrency prices.

Another way of accessing data is to use an API. API is short for Application Programming Interface. A web API defines how we can programmatically access and interact with a remote resource. This way, we can consume data on the web without using a web browser. For example, we can use the API of an exchange to programmatically fetch the latest price of a stock without visiting its website.

Web scraping is the act of extracting data from a website, either manually or by automated means. Manually extracting data can be time-consuming given how much data is out there, and unfortunately not every online resource has an API that we can interact with. For those cases, we can automate a browser to access a website programmatically.

We can control a browser programmatically by using JavaScript. Automating our interactions with the web by programming a browser enables us to build tools that can scrape data from websites, fill out forms for us, take screenshots or download files with a single command.

There are many libraries in the JavaScript ecosystem that would allow us to control a browser programmatically. The package that we will be using for this purpose is called Puppeteer. It is a well-maintained library that is developed by the teams at Google.

Puppeteer allows us to control a Chrome (or Chromium) browser programmatically. When we control a browser without any graphical user interface (UI), it is said to be running in headless mode.

This post assumes that you are comfortable with the JavaScript async-await pattern for writing asynchronous programs. JavaScript has a few patterns for dealing with asynchronous program flow, such as callback functions and Promises. async-await was introduced into the language after Promises and is built on top of them. It lets us write asynchronous code that reads almost like synchronous code, which makes working with Puppeteer much easier.
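
As a quick refresher, here is a minimal sketch of the same asynchronous steps written first with a Promise chain and then with async-await. The fetchTitle function is a hypothetical stand-in for any asynchronous operation; it is not part of Puppeteer.

```javascript
// Hypothetical async operation: resolves with a title after a short delay.
function fetchTitle() {
  return new Promise((resolve) => {
    setTimeout(() => resolve("Example Domain"), 10);
  });
}

// Promise style: each step is chained with .then().
function getTitleWithPromises() {
  return fetchTitle().then((title) => title.toUpperCase());
}

// async-await style: the same steps read from top to bottom.
async function getTitleWithAsyncAwait() {
  const title = await fetchTitle();
  return title.toUpperCase();
}
```

Both functions return a Promise that resolves to the same value. The async-await version simply lets us write the steps in the order they happen.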

This post also assumes a basic knowledge of Node.js, HTML, CSS, and the JavaScript DOM APIs. If you are not comfortable with any of these subjects, make sure to check out my book Awesome Coding, which teaches you these fundamentals and a lot more! You can find the source code for the program we write here at: https://github.com/hibernationTheory/awesome-coding/tree/master/sections/05-puppeteer/puppeteer-03-project-wiki

Prerequisite Skills

JavaScript
Node.js (Beginner Level)
HTML and CSS (Beginner Level)
JavaScript DOM APIs (Beginner Level)

Getting Started with Puppeteer

Let's install Puppeteer to start working with it. This post assumes that you have Node.js and npm installed on your machine. We will begin by creating a new folder for our project and running the npm init command in that folder to create a package.json file.

Now that we have created the package.json file, we can install the puppeteer library by running this command:

npm install --save puppeteer@3.0.4

This installation might take a while since it downloads a version of the Chromium browser compatible with this library.

After the installation finishes, we can create a file called main.js and start coding inside it.

Here is an example of a Puppeteer program that programmatically launches a headless browser to visit a website and then takes a screenshot of that site to save it onto the computer.

const puppeteer = require("puppeteer");

async function main() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  const filePath = "example.png";
  await page.screenshot({ path: filePath });

  await browser.close();
}

main();

We start our code by importing the puppeteer library. After that, we define an async function called main and then call it at the end of our program. The main logic of our program resides inside the main function.

Inside the function body, we first launch a browser instance by calling puppeteer.launch(). Whenever we launch a browser, we should remember to close it so that our program does not leak resources. If we leave the browser open, its process can keep running and consuming memory after our program has finished its work. We close the browser by calling browser.close().

We open a new page inside that browser by calling browser.newPage(). We then visit the example.com domain inside that page by using the page.goto method. We take a screenshot of the page by using the page.screenshot method and save that screenshot into the folder that we ran the program from. Finally, we close the browser, and the program exits.
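
In the example above, browser.close() is skipped if any earlier step throws an error, for instance when the page fails to load. One way to guard against that is a try/finally block. The sketch below is an illustration rather than part of the original program; withBrowser is a hypothetical helper name, and it receives the launch function as an argument so that it does not depend on how the browser is created.

```javascript
// Runs `task` with a freshly launched browser and always closes the browser,
// even when `task` throws. `launchBrowser` is any function that resolves to
// an object with a close() method, for example () => puppeteer.launch().
async function withBrowser(launchBrowser, task) {
  const browser = await launchBrowser();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage with Puppeteer (assumes the puppeteer package is installed):
// const puppeteer = require("puppeteer");
// withBrowser(() => puppeteer.launch(), async (browser) => {
//   const page = await browser.newPage();
//   await page.goto("https://example.com");
//   await page.screenshot({ path: "example.png" });
// });
```

The finally block runs whether the task succeeds or fails, so the browser process does not linger after an error.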

Now that we know the basics of Puppeteer, let's build a simple project to put our knowledge to use.

Project: Get Random Wikipedia Articles

Using our Puppeteer knowledge, we will build a program that will fetch a random Wikipedia article every time it runs.

Let's look at how we would manually perform such a task to understand how we would automate it. In this case, we need to visit the website for Wikipedia (https://en.wikipedia.org) and click on the link named Random Article to take us to a random article page. On each article page, there is a title and an introductory paragraph.

We will need to follow the same steps with our Puppeteer program. We will visit the URL for random results and fetch the HTML elements that hold the title and the description. We will then display these results on the screen.

The URL for the Random Article page is https://en.wikipedia.org/wiki/Special:Random. We can get this value by right-clicking on this link and selecting Copy Link Address. We will start by writing a program that will visit this URL and take a screenshot.

const puppeteer = require("puppeteer");

async function main() {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();
  const urlPath = "https://en.wikipedia.org/wiki/Special:Random";
  await page.goto(urlPath);
  const filePath = "example.png";
  await page.screenshot({ path: filePath });

  await browser.close();
}

main();

Every time we run this program, we capture a new screenshot of the visited URL.

We can inspect the HTML structure of an article page in a Chrome browser by clicking View > Developer > Inspect Elements. We would see that the title of the article is defined inside an h1 tag. This means that we can get the title data by running the code below inside the developer console when we are on an article page.

const title = document.querySelector("h1");
const titleText = title.innerText;

We can use Puppeteer to execute this code in the context of a webpage. We can use the page.evaluate function for this purpose. page.evaluate takes a callback function as an argument that gets evaluated in the current web page context. What we return from this callback function can be used in the Puppeteer application.

const puppeteer = require("puppeteer");

async function main() {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();
  const urlPath = "https://en.wikipedia.org/wiki/Special:Random";
  await page.goto(urlPath);
  const filePath = "example.png";
  await page.screenshot({ path: filePath });

  const title = await page.evaluate(() => {
    const title = document.querySelector("h1");
    const titleText = title.innerText;

    return titleText;
  });

  console.log(title);

  await browser.close();
}

main();

Here we are capturing the value of the h1 tag in the webpage context and returning that value to the Puppeteer context.

const title = await page.evaluate(() => {
  const title = document.querySelector("h1");
  const titleText = title.innerText;

  return titleText;
});

page.evaluate can be a little unintuitive since its callback function can't refer to any value in the Puppeteer context. For example, we can't do something like the following example when using the page.evaluate function:

const selector = "h1";
const title = await page.evaluate(() => {
  const title = document.querySelector(selector);
  const titleText = title.innerText;

  return titleText;
});

console.log(title);

This program would throw an error. The selector variable doesn't exist inside the webpage context, so we can't refer to it from there. To pass data to the webpage context, we provide it as an extra argument to page.evaluate, which hands it to the callback function as a parameter.

const selector = "h1";
const title = await page.evaluate((selector) => {
  const title = document.querySelector(selector);
  const titleText = title.innerText;

  return titleText;
}, selector);

console.log(title);

In this example, we pass the selector variable as the second argument to page.evaluate and receive it as a parameter of the callback function.
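
The reason behind this behavior is that the callback does not run in our Node.js program. Its source code is sent to the browser and executed there, so the variables around it are left behind. The sketch below imitates that effect in plain Node.js, without Puppeteer. runInOtherContext is a hypothetical helper written only for this illustration; it is not a Puppeteer API and does not claim to show how Puppeteer is implemented.

```javascript
// Hypothetical stand-in for page.evaluate: rebuilds the callback from its
// source text, so variables from the enclosing scope do not come along.
function runInOtherContext(callback, ...args) {
  const source = callback.toString();
  const rebuilt = new Function(`return (${source});`)();
  return rebuilt(...args);
}

function demo() {
  const selector = "h1";

  let closureResult;
  try {
    // Relies on the outer `selector` variable, which is lost in transit.
    closureResult = runInOtherContext(() => selector.toUpperCase());
  } catch (error) {
    closureResult = error.name;
  }

  // Receives the value as an explicit argument instead.
  const argumentResult = runInOtherContext(
    (sel) => sel.toUpperCase(),
    selector,
  );

  return { closureResult, argumentResult };
}
```

Calling demo() shows the two outcomes side by side: the first callback fails because selector is unknown where it runs, while the second one works because the value travels with the call.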

In our program, let's get the first paragraph of the article as well. Looking at the HTML structure, it seems like the p element we are looking for is inside an element with the class value mw-parser-output. That element, in turn, is inside the element with the id value mw-content-text. This CSS selector matches the p elements inside that container: #mw-content-text .mw-parser-output p. Since document.querySelector returns only the first matching element, it gives us the first paragraph.

const [title, description] = await page.evaluate(() => {
  const title = document.querySelector("h1");
  const titleText = title.innerText;

  const description = document.querySelector(
    "#mw-content-text .mw-parser-output p",
  );
  const descriptionText = description.innerText;

  return [titleText, descriptionText];
});

We are now getting both the title and the first paragraph from the article page. We return them to the Puppeteer context as an array and use array destructuring to unpack the values. Let's also get the URL of the current page by using the window.location.href property.

const [title, description, url] = await page.evaluate(() => {
  const title = document.querySelector("h1");
  const titleText = title.innerText;

  const description = document.querySelector(
    "#mw-content-text .mw-parser-output p",
  );
  const descriptionText = description.innerText;

  const url = window.location.href;

  return [titleText, descriptionText, url];
});

This is looking pretty great. We can format these values that we are capturing using a template literal and display them on the screen using console.log.

const puppeteer = require("puppeteer");

async function main() {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();
  const urlPath = "https://en.wikipedia.org/wiki/Special:Random";
  await page.goto(urlPath);
  const filePath = "example.png";
  await page.screenshot({ path: filePath });

  const [title, description, url] = await page.evaluate(() => {
    const title = document.querySelector("h1");
    const titleText = title.innerText;

    const description = document.querySelector(
      "#mw-content-text .mw-parser-output p",
    );
    const descriptionText = description.innerText;

    const url = window.location.href;

    return [titleText, descriptionText, url];
  });

  console.log(`
Title: ${title}
Description: ${description}
Read More at: ${url}
`);

  await browser.close();
}

main();
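
As the list of captured values grows, positional array destructuring becomes easy to get out of order. One possible alternative, offered here as a suggestion rather than the approach this post follows, is to return a plain object from page.evaluate and format it in a small helper. formatArticle is a hypothetical name chosen for this sketch.

```javascript
// Builds the text shown in the terminal from the scraped values.
// Expects a plain object with title, description, and url properties.
function formatArticle({ title, description, url }) {
  return [
    `Title: ${title}`,
    `Description: ${description}`,
    `Read More at: ${url}`,
  ].join("\n");
}
```

With this helper, the callback would end with return { title: titleText, description: descriptionText, url }, and the program would print the result with console.log(formatArticle(article)).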

This code works great so far, but I am noticing that the description text is sometimes empty. Looking at the article page, this seems to happen when the first p element has a class called mw-empty-elt. Let's update our code to check whether the first element's class name is mw-empty-elt. If so, we will use the second p element instead. We can use the document.querySelectorAll function to get a list of all HTML elements that match the given CSS selector. This list is a NodeList, which we can index like an array.

const puppeteer = require("puppeteer");

async function main() {
  const browser = await puppeteer.launch();

  const page = await browser.newPage();
  const urlPath = "https://en.wikipedia.org/wiki/Special:Random";
  await page.goto(urlPath);
  const filePath = "example.png";
  await page.screenshot({ path: filePath });

  const [title, description, url] = await page.evaluate(() => {
    const title = document.querySelector("h1");
    const titleText = title.innerText;

    let descriptionParagraph;
    const descriptionParagraphs = document.querySelectorAll(
      "#mw-content-text .mw-parser-output p",
    );
    const firstDescriptionParagraph = descriptionParagraphs[0];
    if (firstDescriptionParagraph.className === "mw-empty-elt") {
      descriptionParagraph = descriptionParagraphs[1];
    } else {
      descriptionParagraph = descriptionParagraphs[0];
    }

    const descriptionText = descriptionParagraph.innerText;

    const url = window.location.href;

    return [titleText, descriptionText, url];
  });

  console.log(`
Title: ${title}
Description: ${description}
Read More at: ${url}
`);

  await browser.close();
}

main();

This program is now in a pretty good spot! We have added the logic to choose the second paragraph if the first one has the class name mw-empty-elt.

let descriptionParagraph;
const descriptionParagraphs = document.querySelectorAll(
  "#mw-content-text .mw-parser-output p",
);
const firstDescriptionParagraph = descriptionParagraphs[0];
if (firstDescriptionParagraph.className === "mw-empty-elt") {
  descriptionParagraph = descriptionParagraphs[1];
} else {
  descriptionParagraph = descriptionParagraphs[0];
}

const descriptionText = descriptionParagraph.innerText;
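
The check above assumes that mw-empty-elt is the only class on the element and that at most one empty paragraph comes first. A slightly more defensive variation, sketched below as an assumption about what pages might contain rather than something observed in this post, walks the list and returns the first paragraph that has text. pickDescription is a hypothetical helper name.

```javascript
// Returns the text of the first paragraph that is not marked as empty and
// actually contains text, or an empty string when there is none.
// `paragraphs` is any array-like of elements, such as the NodeList returned
// by document.querySelectorAll.
function pickDescription(paragraphs) {
  for (const paragraph of Array.from(paragraphs)) {
    // classList.contains also matches when the element has other classes.
    const isMarkedEmpty = paragraph.classList.contains("mw-empty-elt");
    const text = paragraph.innerText.trim();
    if (!isMarkedEmpty && text !== "") {
      return text;
    }
  }
  return "";
}
```

To use this with Puppeteer, the function would have to be defined inside the page.evaluate callback, since the callback cannot see functions that live in the Puppeteer context.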

And that is pretty much it for this project! One thing to note is how we rely on specific ID and class names being present on the webpage for our program to work. If the HTML and CSS structure of the website we are scraping changes, we will need to update our program as well.

Things to Keep in Mind with Web Scraping

Performing manual operations in a programmatic way gives us a lot of leverage. If we have a program that can access a single website, it can be a simple matter to scale it to access thousands of websites.

This can be problematic when interacting with the web. If we were to load thousands of pages from a single domain in a short amount of time, we could overwhelm the servers hosting those pages. The website might even interpret the traffic as an attack, and our IP address could get temporarily blocked or even banned from accessing its resources. We need to be mindful when using websites programmatically. We might want to add artificial delays between our operations to slow our program down. We also need to be careful about what data we access programmatically. Some websites try to limit programmatic access to protect their data, and there can be legal implications for accessing and storing certain kinds of data.
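
An artificial delay can be as simple as a Promise that resolves after a timeout. The sketch below visits a list of URLs one at a time and pauses between visits. The names sleep and visitPolitely are hypothetical, the visit argument stands for whatever async scraping function you write, and the one-second default is an arbitrary choice, not a recommended value.

```javascript
// Resolves after roughly `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Calls `visit` for each URL one at a time, pausing between calls.
// `visit` is a hypothetical async function, for example one that opens
// the URL with page.goto and extracts data from the page.
async function visitPolitely(urls, visit, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    results.push(await visit(urls[i]));
    // Pause only when there is another URL left to visit.
    if (i < urls.length - 1) {
      await sleep(delayMs);
    }
  }
  return results;
}
```

Because each visit is awaited before the next one starts, the program never has more than one request in flight, which keeps the load on the target website low.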

Summary

Automating a browser is not the only way to access data on a webpage. There are many web applications out there that expose an API to connect developers with their resources. An API is an Application Programming Interface that we can use to interface with a resource programmatically. Using APIs, developers can build applications on top of popular services such as Twitter, Facebook, Google, or Spotify.

In this post, we used Puppeteer in Node.js to scrape data from a website. We used the JavaScript async-await structure to manage the asynchronous program flow. We also used CSS selectors to grab data from the HTML structure of a web page through DOM API methods such as document.querySelectorAll.

Web scraping is the act of using programs like Puppeteer to access and harvest data from websites programmatically. There can be legal implications for web scraping, so you should do your own research before engaging in such an action.