Web data scraping: Puppeteer

What is Puppeteer?

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium browsers over the DevTools Protocol (a set of APIs that lets you interact programmatically with Chrome or Chromium — inspect, debug, profile, and otherwise control the browser).

We can install it by running - npm install puppeteer

Examples of tasks it can perform -

  1. Web scraping

  2. Automated testing

  3. Generating screenshots and PDFs

  4. Mimicking different environments, and much more...
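As a quick sketch of point 3, here is one way a PDF of a page could be generated (savePdf and buildPdfOptions are hypothetical helpers made up for this example; page.pdf is the actual Puppeteer call):

```javascript
// Sketch: saving a page as a PDF (assumes `npm install puppeteer` has been run).
// buildPdfOptions is a hypothetical helper, not part of Puppeteer's API.
function buildPdfOptions(path) {
  return { path, format: 'A4', printBackground: true };
}

async function savePdf(url, path) {
  // require lazily so the helper above works even without puppeteer installed
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.pdf(buildPdfOptions(path));
  await browser.close();
}

// savePdf('https://example.com', 'example.pdf');
```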

Basic functions -

 // Launching a Browser
const browser = await puppeteer.launch();

// Opening a New Page 
const page = await browser.newPage();

// Navigate to a URL
await page.goto('url');

// take screenshot 
await page.screenshot({ path: 'ss.png' });

// close the browser 
await browser.close();

Query selectors -

Query selectors are used to select and interact with elements on a web page. They are the same CSS selectors you would pass to document.querySelector in the browser.

// selects the first element that matches the specified CSS selector
const element = await page.$('selector');

// selects all elements that match the specified CSS selector
const elements = await page.$$('selector');

// find child element 
const innerElement = await element.$('inner-selector');

// find multiple child elements of same type 
const innerElements = await element.$$('inner-selector');

// find a deep descendant (the >>> combinator pierces shadow DOM)
const deep = await element.$('div >>> a');
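
Once elements are selected, their contents can be read back out. A hedged sketch of that (collectLinks and uniqueHrefs are illustrative names; page.$$eval is the real Puppeteer call, which runs a callback in the page over every match):

```javascript
// Sketch: reading data out of matched elements (assumes `npm install puppeteer`).
// uniqueHrefs is a hypothetical post-processing helper for this example.
function uniqueHrefs(links) {
  return [...new Set(links.map((l) => l.href))];
}

async function collectLinks(page) {
  // the callback runs inside the page, once over the array of matched <a> elements
  return page.$$eval('a', (anchors) =>
    anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
  );
}

// const links = await collectLinks(page);
// console.log(uniqueHrefs(links));
```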

P-elements -

P-elements are pseudo-elements with a -p vendor prefix. They let you enhance your selectors with Puppeteer-specific query engines such as XPath, text queries, and ARIA.

// text
const txtEl = await element.$('div ::-p-text(hello)');

// using xpath (XPath expressions start with //)
const xpathEl = await element.$('::-p-xpath(//h1)');

// using aria 
const ariaEl = await element.$('::-p-aria(Submit)');

Locators -

Locators enable automatic retries for failed actions, which makes automation scripts more reliable and less flaky, and they come with built-in waiting and action functionality.

// waiting for button to be enabled 
const btn = await page.locator('button').wait();

// clicking an element 
await page.locator('button').click();

// fill value inside input
await page.locator('input').fill('value');

// hovering over an element 
await page.locator('button').hover();

// scroll through the page
await page.locator('div').scroll({
  scrollTop: 0,
});

// listen for locator events (LocatorEmittedEvents is exported by the puppeteer package)
await page
  .locator('button')
  .on(LocatorEmittedEvents.Action, () => {
    console.log("clicked");
  })
  .click();

Evaluating JavaScript -

This gives us the option to run and evaluate JavaScript in the context of the page Puppeteer has loaded.

// simple function 
const sum = await page.evaluate(() => {
    return 1 + 2;
  });
console.log(sum); // gives 3

// passing arguments to our evaluation function
// (extra arguments to evaluate are forwarded to the callback; the result is returned)
const sum2 = await page.evaluate((a, b) => {
    return a + b;
  }, 1, 1);
console.log(sum2); // gives 2

One common use is scraping important information with the help of selectors and then running some function over it to make it useful to store or to display on your own webpage.
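A minimal sketch of that pattern (the '.price' selector and the parsePrice helper are illustrative assumptions, not part of any real site or API; page.$$eval is the real Puppeteer call):

```javascript
// Sketch: scrape raw text with a selector, then post-process it in plain JS.
// parsePrice is a hypothetical helper that turns '$1,299.00' into the number 1299.
function parsePrice(text) {
  return Number(text.replace(/[^0-9.]/g, ''));
}

async function scrapePrices(page) {
  // grab the raw text of every matching element...
  const raw = await page.$$eval('.price', (els) => els.map((el) => el.textContent));
  // ...then run a plain function over it to make it useful
  return raw.map(parsePrice);
}
```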


Puppeteer and Cheerio -

While Puppeteer handles browser automation, interaction, and rendering of JavaScript-heavy sites, Cheerio offers a fast, lightweight way to parse and traverse the DOM with a jQuery-like syntax. We can use Puppeteer to load the pages we want to scrape, rather than simply requesting them with a GET or POST request, and then use Cheerio to extract the important information easily.

If we only use Cheerio, we have to take care of every request we make, along with its payload and headers. Even though that is not especially difficult, it can mean multiple debugging sessions for a beginner.

Below is a simple example of their combined usage -

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');
const content = await page.content();

const $ = cheerio.load(content);
$('selector').each((index, element) => {
    // Extract data using jQuery-like syntax
    console.log($(element).text());
});

await browser.close();

Follow up -

If you have any questions, you can comment below. I will try to come up with more interesting things 😄
