Cheerio & Moment.js: Web Data Scraping

What is Cheerio?

Cheerio is a JavaScript package for parsing and manipulating HTML and XML documents with a jQuery-like API, without the overhead of running a browser. A simple example -

const cheerio = require('cheerio');
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

const txt = $('h2.title').text();
console.log(txt);   // outputs Hello world

Basic functions -

  1. Loading HTML Content:

     const cheerio = require('cheerio');
    
     const htmlContent = '<div class="greeting">Hello, Cheerio!</div>';
     const $ = cheerio.load(htmlContent);
    
  2. Selecting Elements (similar to jQuery):

     const greeting = $('.greeting');
     console.log(greeting.text());  // Outputs: "Hello, Cheerio!"
    
  3. Iterating Over Elements:

     const listHtml = `
     <ul>
        <li>Apple</li>
        <li>Banana</li>
        <li>Cherry</li>
     </ul>
     `;
     const $ = cheerio.load(listHtml);
    
     $('li').each((index, element) => {
        console.log($(element).text());
     });
     // Outputs:
     // Apple
     // Banana
     // Cherry
    
  4. Manipulating Elements:

     $('.greeting').text('Hello, World!');
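     // In Cheerio 1.x, load() wraps the markup in <html>/<head>/<body>, so print only the body's contents
     console.log($('body').html());  // Outputs: '<div class="greeting">Hello, World!</div>'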
    
  5. Attributes:

     const linkHtml = '<a href="https://www.example.com">Example</a>';
     const $ = cheerio.load(linkHtml);
    
     const link = $('a');
     console.log(link.attr('href'));  // Outputs: "https://www.example.com"
    
  6. Class Manipulation:

     const divHtml = '<div class="oldClass">Hello</div>';
     const $ = cheerio.load(divHtml);
    
     $('div').addClass('newClass').removeClass('oldClass');
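     // $.html() would also print the <html>/<head>/<body> wrapper that load() adds in Cheerio 1.x
     console.log($('body').html());  // Outputs: '<div class="newClass">Hello</div>'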
    
  7. Inserting Content:

     const divHtml = '<div>Hello</div>';
     const $ = cheerio.load(divHtml);
    
     $('div').append(' World').prepend('Say: ');
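     // $.html() would also print the <html>/<head>/<body> wrapper that load() adds in Cheerio 1.x
     console.log($('body').html());  // Outputs: '<div>Say: Hello World</div>'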
    

What is Moment.js?

Moment.js is a JavaScript library that simplifies working with dates and times. It provides an API to parse, validate, manipulate, and display dates and times in JavaScript. Example -

const moment = require("moment");
const date = "30-08-2023";

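// "30-08-2023" is not an ISO 8601 string, so tell Moment which input format to expect
const formattedDate = moment(date, "DD-MM-YYYY").format("DD MMM YYYY");
console.log(formattedDate);  // outputs 30 Aug 2023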

Together, these libraries cover two common scraping tasks: Cheerio extracts the data from the HTML, and Moment.js parses and formats any dates found in it.

Basic functions -

  1. Getting the Current Date and Time:

     const moment = require('moment');
    
     const now = moment();
     console.log("Current Date and Time:", now.format());
    
  2. Formatting a Date:

     console.log("Formatted Date:", now.format('MMMM Do YYYY, h:mm:ss a'));  // e.g., "August 29th 2023, 3:24:10 pm"
    
  3. Parsing a Date from a String:

     const parsedDate = moment("2023-12-25", "YYYY-MM-DD");
     console.log("Parsed Date:", parsedDate.format('MMMM Do YYYY'));  // Outputs: "December 25th 2023"
    
  4. Manipulating Dates (add/subtract days, months, etc.):

     // add() and subtract() mutate the moment they are called on, so work on clones of now
     const futureDate = now.clone().add(7, 'days');
     console.log("Date after 7 days:", futureDate.format('MMMM Do YYYY'));
    
     const pastDate = now.clone().subtract(5, 'months');
     console.log("Date 5 months ago:", pastDate.format('MMMM Do YYYY'));
    
  5. Difference Between Two Dates:

     const start = moment("2023-01-01");
     const end = moment("2023-12-31");
     const daysDifference = end.diff(start, 'days');
     console.log(`Difference in days: ${daysDifference} days`);  // Outputs: "Difference in days: 364 days"
    
  6. Working with Timezones (requires moment-timezone package):

     const momentTz = require('moment-timezone');
     const newYorkTime = momentTz.tz("America/New_York");
     console.log("Time in New York:", newYorkTime.format());
    
  7. Localization: Moment.js can also display dates in other languages.

     moment.locale('fr');  // Set the global locale to French
     // Moments created before the locale change keep their old locale, so create a new one
     console.log("Date in French:", moment().format('LLLL'));  // e.g., "mardi 29 août 2023 15:24"
    

Scraping example -

In the example below, we scrape the holiday names and links from timeanddate.com's list of holidays in India for 2023.

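const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');
const app = express();

app.listen(3000, () => {
    console.log("listening on " + 3000);
});

app.get("/scrape", async (req, res) => {
    try {
        // Fetch the page and hand its HTML to Cheerio
        const response = await axios.get("https://www.timeanddate.com/holidays/india/2023");
        const $ = cheerio.load(response.data);
        const results = [];

        // The callback does no asynchronous work, so it does not need to be async
        $('tr').each((i, elem) => {
            const link = $(elem).find('td a').attr('href');
            const name = $(elem).find('td a').text().trim();
            // Skip header rows and any row without a holiday link
            if (link && name !== "") {
                results.push({ name, link });
            }
        });

        return res.status(200).json({ data: results });
    } catch (err) {
        // Network errors and non-2xx responses from axios end up here
        return res.status(500).json({ error: "Failed to scrape the page" });
    }
});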

This is a basic Express server with a /scrape endpoint that scrapes the data from the website. First, we request the URL and pass the HTML response to Cheerio's load() function, which parses the markup and gives us a queryable representation of the content.

You can see how the data is structured with the browser's Inspect Element tool. It looks something like this -

<tr id="tr1" data-mask="134217728" data-date="1672531200000" class="showrow">
    <th class="nw">1 Jan</th>
    <td class="nw">Sunday</td>
    <td>
        <a href="/holidays/india/new-year-day">New Year's Day</a>
    </td>
    <td>Restricted Holiday</td>
</tr>
<tr id="tr2" data-mask="135266304" data-date="1673654400000" class="showrow">
    <th class="nw">14 Jan</th>
    <td class="nw">Saturday</td>
    <td>
        <a href="/holidays/india/makar-sankranti">Makar Sankranti</a>
    </td>
    <td>Restricted Holiday</td>
</tr>

To select all the rows, we need a property they have in common. As this structure is not that complex, we can simply use the tr tag.
After selecting the rows, we need to extract two things from each one: the name and the link. Both live in the <a></a> tag inside a <td></td>. The name is the text of that tag, which we get with the .text() function.

const name = $(elem).find('td a').text().trim();

For the link, we select the same element and read its href attribute with the .attr() function.

const link = $(elem).find('td a').attr('href');
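
The href values read this way are relative to the site root (for example /holidays/india/new-year-day), so a client of the endpoint cannot open them directly. Here is a minimal sketch of one way to turn them into absolute URLs, using the URL class that is built into Node.js rather than anything from Cheerio. The helper name withAbsoluteLinks and the BASE_URL constant are made up for this illustration; the base is assumed to be the page URL used in the scraper above.

```javascript
// Assumption: the base is the same page URL that the scraper requests
const BASE_URL = 'https://www.timeanddate.com/holidays/india/2023';

// Hypothetical helper: resolve every scraped href against the page URL
function withAbsoluteLinks(results, baseUrl = BASE_URL) {
    return results.map(({ name, link }) => ({
        name,
        // new URL(relative, base) resolves the way a browser would;
        // hrefs that are already absolute pass through unchanged
        link: new URL(link, baseUrl).href,
    }));
}

// Sample data shaped like the objects pushed into the results array
const sample = [
    { name: "New Year's Day", link: '/holidays/india/new-year-day' },
    { name: 'Makar Sankranti', link: '/holidays/india/makar-sankranti' },
];

console.log(withAbsoluteLinks(sample));
// The first link becomes https://www.timeanddate.com/holidays/india/new-year-day
```

In the endpoint, this could be applied to results just before sending the JSON response.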

Now, if both exist, we push them into the results array, and once all rows are processed we return the array in the response.
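
The sample rows shown earlier also carry a data-date attribute, which could be read with $(elem).attr('data-date'). Judging only by the two values in the sample, it looks like a Unix timestamp in milliseconds at midnight UTC; that is an assumption, not something the site documents. Under that assumption, here is a small sketch that converts the attribute into a calendar date. It uses the built-in Date object to stay dependency-free, and the helper name dataDateToIso is hypothetical.

```javascript
// Hypothetical helper: convert a data-date attribute value to YYYY-MM-DD.
// Assumes the value is a Unix timestamp in milliseconds, interpreted as UTC.
function dataDateToIso(dataDate) {
    // .attr() returns undefined when the attribute is missing;
    // also reject anything that is not a plain run of digits
    if (typeof dataDate !== 'string' || !/^\d{1,15}$/.test(dataDate)) {
        return null;
    }
    // toISOString() always reports UTC; keep only the date part
    return new Date(Number(dataDate)).toISOString().slice(0, 10);
}

console.log(dataDateToIso('1672531200000'));  // 2023-01-01
console.log(dataDateToIso('1673654400000'));  // 2023-01-14
console.log(dataDateToIso(undefined));        // null
```

Both results match the "1 Jan" and "14 Jan" cells of the sample rows. With Moment.js, moment.utc(Number(dataDate)).format('D MMM YYYY') would give a display-ready string from the same value.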


Follow up -

If you have any questions, leave a comment below. I will try to come up with more interesting things 😄