[Node.js] Scraping SEO Data (Title, Description, H1, H2, Canonical): SEO TDK Scraper
In my everyday work, I have received more than 1,000 requests to extract SEO-related data from web pages, but I found it challenging because different sites use different rendering methods.
I have written a script that scrapes SEO metadata through Selenium- and Puppeteer-driven headless browsers. Equipped with efficient scraping methods, this tool lets you extract vital information such as the title, description, and heading tags from websites, regardless of their rendering technique.
Overview of Selenium SEO TDK Scraper
Selenium SEO TDK Scraper is a versatile tool that supports two scraping methods: Server-Side Rendering (SSR) and Client-Side Rendering (CSR). These methods enable you to extract SEO data from websites regardless of how they render their content.
Server-Side Rendering (SSR) Method
The SSR method utilizes Selenium WebDriver to automate a headless Chrome browser. It creates a browser instance, navigates to the specified URL, and retrieves the page source containing the Title, Description, and Heading tags (H1, H2, H3). This method is ideal for websites that generate HTML content on the server side before sending it to the client.
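To make this concrete, here is a minimal sketch of the SSR approach using the `selenium-webdriver` package. The function name `scrapeSSR` and the exact selectors are assumptions made for this example, not code taken from the repository.

```js
// Minimal SSR sketch, assuming the `selenium-webdriver` package.
// Names and selectors are illustrative, not the repository's own code.
const { Builder, By } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');

async function scrapeSSR(url) {
  // Headless Chrome instance driven by Selenium WebDriver.
  const options = new chrome.Options().addArguments('--headless');
  const driver = await new Builder()
    .forBrowser('chrome')
    .setChromeOptions(options)
    .build();

  try {
    await driver.get(url);

    // Title comes straight from the driver; description from the meta tag.
    const title = await driver.getTitle();
    const metas = await driver.findElements(By.css('meta[name="description"]'));
    const description = metas.length ? await metas[0].getAttribute('content') : '';

    // Collect the text of every H1, H2, and H3 on the page.
    const headings = {};
    for (const tag of ['h1', 'h2', 'h3']) {
      const elements = await driver.findElements(By.css(tag));
      headings[tag] = await Promise.all(elements.map((el) => el.getText()));
    }

    return { url, title, description, ...headings };
  } finally {
    await driver.quit();
  }
}
```

Calling `scrapeSSR('https://example.com')` would resolve to an object containing the URL, title, description, and arrays of H1/H2/H3 texts.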
Client-Side Rendering (CSR) Method
The CSR method leverages Puppeteer, a powerful Node.js library for controlling a headless Chrome browser. It launches a headless browser instance, sets up request interception to mimic the behavior of a CSR application, and navigates to the target URL. It then retrieves the rendered page content, including the SEO-related data. This method is suitable for websites that rely on client-side JavaScript to dynamically generate content.
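Here is a comparable sketch of the CSR method using Puppeteer. Again, `scrapeCSR`, the blocked resource types, and the `networkidle2` wait condition are assumptions for illustration rather than the repository's actual code.

```js
// Minimal CSR sketch, assuming the `puppeteer` package.
// Names and the interception rules are illustrative.
const puppeteer = require('puppeteer');

async function scrapeCSR(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();

    // Intercept requests and skip heavy assets; only the markup matters here.
    await page.setRequestInterception(true);
    page.on('request', (req) => {
      if (['image', 'font', 'media'].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });

    // Wait until network activity settles so client-side JS has rendered.
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    const title = await page.title();
    const description = await page
      .$eval('meta[name="description"]', (el) => el.content)
      .catch(() => '');
    const [h1, h2, h3] = await Promise.all(
      ['h1', 'h2', 'h3'].map((tag) =>
        page.$$eval(tag, (els) => els.map((el) => el.textContent.trim()))
      )
    );

    return { url, title, description, h1, h2, h3 };
  } finally {
    await browser.close();
  }
}
```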
How to Use Selenium SEO TDK Scraper
Using Selenium SEO TDK Scraper is straightforward. Just follow these steps:
- Prepare a `urls.txt` file containing the list of URLs you wish to scrape, one URL per line (see the example after this list).
- Clone the repository: `git clone https://github.com/wahengchang/selecnium-seo-tdk-scraper.git`.
- Install the dependencies: `yarn install` or `npm install`.
- Paste your target URLs into the `urls.txt` file.
- Run the script: `node index.js`.
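A `urls.txt` file is simply a plain-text list with one URL per line, for example (placeholder URLs):

```
https://example.com
https://example.org/blog
https://example.net/products
```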
Installation
Clone the repository:
$ git clone https://github.com/wahengchang/selecnium-seo-tdk-scraper.git
Install the dependencies:
$ yarn install
# or: npm install
Run
Paste your target URLs into the `urls.txt` file; the script will read the URLs from there and run the scraper. It automatically employs the appropriate scraping method (SSR or CSR) based on each website's rendering technique, and the extracted SEO data, including the Title, Description, and Heading tags, is saved in CSV format.
The final result will be saved as a CSV file located in the ./2023-10-25_audit
directory. Additionally, the HTML content of each scraped page will be saved under this directory, allowing you to further analyze and process the data if needed.
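To show how the pieces might fit together, below is a hedged sketch of an entry point in the spirit of `index.js`: it reads `urls.txt`, picks between the SSR and CSR methods, and writes a CSV into a date-stamped audit directory. The fallback heuristic (use CSR when the server-rendered HTML yields no H1s), the `./scrapers` module, and all other names here are assumptions, not the repository's actual implementation; saving each page's HTML is omitted for brevity.

```js
// Illustrative orchestration sketch using only Node.js built-in modules.
// `./scrapers` is a hypothetical module exporting the scrapeSSR/scrapeCSR
// functions sketched above.
const fs = require('fs');
const path = require('path');
const { scrapeSSR, scrapeCSR } = require('./scrapers');

async function main() {
  const urls = fs
    .readFileSync('urls.txt', 'utf8')
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean);

  // Date-stamped output directory, e.g. ./2023-10-25_audit
  const outDir = `./${new Date().toISOString().slice(0, 10)}_audit`;
  fs.mkdirSync(outDir, { recursive: true });

  const rows = [['url', 'title', 'description', 'h1', 'h2', 'h3']];

  for (const url of urls) {
    // Assumed heuristic: try SSR first, fall back to CSR if the
    // server-rendered HTML yields no H1 headings.
    let result = await scrapeSSR(url);
    if (!result.h1 || result.h1.length === 0) {
      result = await scrapeCSR(url);
    }

    rows.push([
      result.url,
      result.title,
      result.description,
      result.h1.join(' | '),
      result.h2.join(' | '),
      result.h3.join(' | '),
    ]);
  }

  // Naive CSV serialization: quote every field and escape embedded quotes.
  const csv = rows
    .map((row) => row.map((v) => `"${String(v).replace(/"/g, '""')}"`).join(','))
    .join('\n');
  fs.writeFileSync(path.join(outDir, 'result.csv'), csv);
}

main().catch(console.error);
```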
Conclusion
Selenium SEO TDK Scraper is a valuable tool for scraping SEO-related data from web pages. With its support for both Server-Side Rendering (SSR) and Client-Side Rendering (CSR) methods, it provides a versatile solution for extracting SEO data regardless of the rendering approach used by the target websites. Whether you need to analyze title tags, meta descriptions, or heading tags, this tool simplifies the process and saves you valuable time.
To get started with Selenium SEO TDK Scraper, clone the repository, follow the installation steps, and begin extracting SEO data from your target websites effortlessly. Happy scraping!