Building a Web Scraper in Next.js

Web scraping is essentially extracting data from other websites.
Some websites or services may not have a public API that gives developers direct access to their data. In such scenarios, web scraping can be used as an alternative. The main limitation is that, unlike APIs, web scraping may require more maintenance, as websites change their structure or content over time.

In this example, we'll get the schedule for the Formula 1 2023 season from the official website: https://www.formula1.com/en/racing/2023.html

First, create the Next.js app: npx create-next-app

Install axios and jsdom: npm i axios jsdom
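
In the Next.js pages router, an API route is a file inside pages/api that exports a default handler function. The fetching and parsing snippets below are assumed to run inside such a handler; a minimal skeleton might look like this:

// pages/api/schedule.ts: API route skeleton
import type { NextApiRequest, NextApiResponse } from "next";

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  // the scraping logic from the snippets below goes here
  res.status(200).json({ ok: true });
}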

// pages/api/schedule.ts
import axios from "axios";
import { JSDOM } from "jsdom";

const BASE_URL = `https://www.formula1.com/en/racing/2023.html`;

// Fetch the raw HTML of the schedule page
const { data } = await axios.get(BASE_URL, {
  headers: {
    Accept: `< get the headers from network tab >`,
    Host: `www.formula1.com`,
    "User-Agent": `< get the headers from network tab >`,
  },
});
// Parse the HTML string into a DOM we can query
const dom = new JSDOM(data);

Axios is used to make an HTTP GET request to the specified URL and fetch the HTML content of the web page. The obtained HTML is then passed to the JSDOM constructor, which creates a new DOM (Document Object Model) object from it.
Using this new DOM we can retrieve specific elements, modify them, or add new ones.
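
For example, once the DOM is created it can be queried much like document in the browser. The title lookup below is just an illustration, not part of the scraper:

// Illustration only: read the page <title> from the parsed DOM
const pageTitle = dom.window.document.querySelector("title")?.textContent;
console.log(pageTitle);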

Next, we need to open the browser's Inspect tab and find the class or id of the element that needs to be retrieved.

The div with the class name "race-card" had the info I needed.

import axios from "axios";
import { JSDOM } from "jsdom";

const BASE_URL = `https://www.formula1.com/en/racing/2023.html`;

// Fetch the raw HTML of the schedule page
const { data } = await axios.get(BASE_URL, {
  headers: {
    Accept: `< get the headers from network tab >`,
    Host: `www.formula1.com`,
    "User-Agent": `< get the headers from network tab >`,
  },
});
const dom = new JSDOM(data);
// Select every element with the "race-card" class
const raceCards = dom.window.document.querySelectorAll(".race-card");

The raceCards variable will hold a NodeList of all the elements that match the specified CSS selector.

const schedule = Array.from(raceCards, (raceCard) => {
  // textContent can be null, so fall back to an empty string
  const raceInfo = raceCard.textContent ?? "";
  const raceInfoArr = raceInfo.split(" ");
  // The first two tokens make up the date; the rest is the venue
  const date = raceInfoArr[0] + " " + raceInfoArr[1];
  const venue = raceInfo.replace(date + " ", "");
  return {
    date,
    venue,
  };
});

The Array.from method creates a new array from the NodeList of elements matching the ".race-card" selector, mapping each element to an object with a date and a venue.

Further, you can modify and use the data according to your requirements.
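
Putting the snippets together, a minimal sketch of the complete API route could look like this (the header values and error handling are assumptions; adjust them as needed):

// pages/api/schedule.ts: a minimal sketch combining the snippets above
import type { NextApiRequest, NextApiResponse } from "next";
import axios from "axios";
import { JSDOM } from "jsdom";

const BASE_URL = `https://www.formula1.com/en/racing/2023.html`;

type Race = { date: string; venue: string };

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<Race[] | { error: string }>
) {
  try {
    // Fetch the page HTML and parse it into a DOM
    const { data } = await axios.get(BASE_URL, {
      headers: {
        Host: `www.formula1.com`,
        // copy Accept and User-Agent from the browser's network tab
      },
    });
    const dom = new JSDOM(data);

    // Extract a date and venue from every race card
    const raceCards = dom.window.document.querySelectorAll(".race-card");
    const schedule: Race[] = Array.from(raceCards, (raceCard) => {
      const raceInfo = raceCard.textContent ?? "";
      const raceInfoArr = raceInfo.split(" ");
      const date = raceInfoArr[0] + " " + raceInfoArr[1];
      const venue = raceInfo.replace(date + " ", "");
      return { date, venue };
    });

    res.status(200).json(schedule);
  } catch (error) {
    res.status(500).json({ error: "Failed to scrape the schedule" });
  }
}

The front end can then request /api/schedule with fetch and render the returned list of races.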

Demo Link: https://f1calender.vercel.app/
Github Repo: https://github.com/gaurishxjfk/web-scrapping-next-js