[Node.js] Scraping SEO, h1, h2, canonical: SEO TDK Scraper

Peter Chang
3 min read · Oct 25, 2023


In my everyday work I have been asked countless times to extract SEO-related data from web pages, but I found it challenging because different sites use different rendering methods.

So I wrote a script that scrapes SEO metadata through Selenium- and Puppeteer-driven browsers. Equipped with efficient scraping methods, this tool lets you extract vital information such as the title, description, and heading tags from websites, regardless of their rendering technique.

[Node.js] Scraping SEO, h1, h2, canonical with the SEO TDK Scraper, saved in CSV format

Overview of Selenium SEO TDK Scraper

Selenium SEO TDK Scraper is a versatile tool that supports two scraping methods: Server-Side Rendering (SSR) and Client-Side Rendering (CSR). These methods enable you to extract SEO data from websites regardless of how they render their content.

Server-Side Rendering (SSR) Method

The SSR method utilizes Selenium WebDriver to automate a headless Chrome browser. It creates a browser instance, navigates to the specified URL, and retrieves the page source containing the Title, Description, and Heading tags (H1, H2, H3). This method is ideal for websites that generate HTML content on the server side before sending it to the client.
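
To make the idea concrete, here is a minimal sketch of what an SSR-style scrape can look like with the selenium-webdriver package. This is my own illustration, not the repository's actual code: the scrapeSSR name, the selectors, and the returned shape are all assumptions.

// Minimal sketch of an SSR-style scrape with selenium-webdriver.
// Assumptions: chromedriver is installed; scrapeSSR, the selectors,
// and the return shape are illustrative, not the repo's code.
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scrapeSSR(url) {
  const options = new chrome.Options().addArguments('--headless');
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();
  try {
    await driver.get(url);
    const title = await driver.getTitle();
    // The meta description may be absent, so fall back to an empty string.
    const description = await driver
      .findElement(By.css('meta[name="description"]'))
      .getAttribute('content')
      .catch(() => '');
    const h1 = await Promise.all(
      (await driver.findElements(By.css('h1'))).map((el) => el.getText())
    );
    const h2 = await Promise.all(
      (await driver.findElements(By.css('h2'))).map((el) => el.getText())
    );
    return { url, title, description, h1, h2 };
  } finally {
    await driver.quit();
  }
}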

Client-Side Rendering (CSR) Method

The CSR method leverages Puppeteer, a powerful Node.js library for controlling a headless Chrome browser. It launches a headless browser instance, sets up request interception to mimic the behavior of a CSR application, and navigates to the target URL. It then retrieves the rendered page content, including the SEO-related data. This method is suitable for websites that rely on client-side JavaScript to dynamically generate content.
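
Again as a hand-written illustration rather than the repository's code, a CSR-style scrape with Puppeteer could look roughly like this. Blocking image and font requests during interception is just one example of what that step might do, and the scrapeCSR name and the selectors are my own assumptions.

// Sketch of a CSR-style scrape with Puppeteer (illustrative only).
const puppeteer = require('puppeteer');

async function scrapeCSR(url) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  try {
    // Intercept requests; here we skip images and fonts to speed up rendering.
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (['image', 'font'].includes(req.resourceType())) req.abort();
      else req.continue();
    });
    await page.goto(url, { waitUntil: 'networkidle0' });
    // Read the SEO fields from the fully rendered DOM.
    const data = await page.evaluate(() => ({
      title: document.title,
      description:
        document.querySelector('meta[name="description"]')?.content || '',
      canonical: document.querySelector('link[rel="canonical"]')?.href || '',
      h1: [...document.querySelectorAll('h1')].map((el) => el.textContent.trim()),
      h2: [...document.querySelectorAll('h2')].map((el) => el.textContent.trim()),
    }));
    return { url, ...data };
  } finally {
    await browser.close();
  }
}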

How to Use Selenium SEO TDK Scraper

Using Selenium SEO TDK Scraper is straightforward. Just follow these steps:

  1. Clone the repository: git clone https://github.com/wahengchang/selecnium-seo-tdk-scraper.git
  2. Install the dependencies: yarn install or npm install.
  3. Prepare a urls.txt file containing the URLs you wish to scrape, one per line (see the example just after this list).
  4. Run the script: node index.js.
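
For reference, urls.txt is just a plain text list with one URL per line; the addresses below are placeholders:

https://example.com
https://example.org/blog/some-post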

Installation

Clone the repository:


$ git clone https://github.com/wahengchang/selecnium-seo-tdk-scraper.git

Install the dependencies:

$ yarn install
# or: npm install

Run

Paste your target URLs into the urls.txt file; the script reads the URLs from there. Then run the scraper with node index.js.

The scraper will read the URLs from the urls.txt file and automatically employ the appropriate scraping method (SSR or CSR) based on each website's rendering technique. The extracted SEO data, including the Title, Description, and Heading tags, will be saved in a CSV format.
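
The post does not spell out how the method is chosen, but one plausible heuristic, sketched here purely as an illustration on top of the scrapeSSR and scrapeCSR examples above, is to fetch the static HTML first and fall back to the Puppeteer path when the markup of interest is missing:

// Hypothetical dispatcher: if the static HTML already contains an <h1>
// and a <title>, treat the page as server-rendered; otherwise use Puppeteer.
// scrapeSSR and scrapeCSR are the illustrative helpers sketched earlier.
async function scrapeAuto(url) {
  const res = await fetch(url); // global fetch, Node 18+
  const html = await res.text();
  const looksServerRendered = /<h1[\s>]/i.test(html) && /<title>/i.test(html);
  return looksServerRendered ? scrapeSSR(url) : scrapeCSR(url);
}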

The final result will be saved as a CSV file located in the ./2023-10-25_audit directory. Additionally, the HTML content of each scraped page will be saved under this directory, allowing you to further analyze and process the data if needed.
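
As a rough picture of that output step, here is a sketch using only Node's built-in fs and path modules; the actual repository may use a dedicated CSV library, different columns, or different file names:

const fs = require('fs');
const path = require('path');

// Write one CSV row per result, plus the raw HTML of each page if collected,
// into a date-stamped audit directory such as ./2023-10-25_audit.
function saveAudit(results, date = new Date().toISOString().slice(0, 10)) {
  const dir = `./${date}_audit`;
  fs.mkdirSync(dir, { recursive: true });

  const esc = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`;
  const header = 'url,title,description,canonical,h1,h2';
  const rows = results.map((r) =>
    [r.url, r.title, r.description, r.canonical,
     (r.h1 || []).join(' | '), (r.h2 || []).join(' | ')]
      .map(esc)
      .join(',')
  );
  fs.writeFileSync(path.join(dir, 'result.csv'), [header, ...rows].join('\n'));

  for (const r of results) {
    if (r.html) {
      fs.writeFileSync(path.join(dir, encodeURIComponent(r.url) + '.html'), r.html);
    }
  }
}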

Conclusion

Selenium SEO TDK Scraper is a valuable tool for scraping SEO-related data from web pages. With its support for both Server-Side Rendering (SSR) and Client-Side Rendering (CSR) methods, it provides a versatile solution for extracting SEO data regardless of the rendering approach used by the target websites. Whether you need to analyze title tags, meta descriptions, or heading tags, this tool simplifies the process and saves you valuable time.

To get started with Selenium SEO TDK Scraper, clone the repository, follow the installation steps, and begin extracting SEO data from your target websites effortlessly. Happy scraping!
