Internal Link Scraping with a Node.js URL Scraper: Automation and Design (State Machine)

Peter Chang
May 3, 2023


Are you tired of manually scraping internal links for SEO or data analysis purposes? Look no further! I am excited to announce the open-sourcing of a “persistent internal URL scraper” built with Node.js.
The GitHub source code is here: https://github.com/wahengchang/urlExtactor

(Cover image generated with Midjourney)

While there are many existing tools, such as Python’s BeautifulSoup and Scrapy, or SaaS platforms, none quite fit my needs as a Node.js developer with APIs and a frontend to consider.

This scraper automation tool is designed for small to mid-sized websites, ideally those with fewer than 1,000 pages, using a single Axios/Puppeteer scraper to fetch the entire site.

If you are looking for a tool that generates a list of URL links from a single website, this is the right tool.
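To make the fetching step concrete, here is a minimal sketch of how a single page could be opened with Puppeteer and its internal links collected. The function name collectLinks and the filtering details are my own illustration, not code taken from the repository.

```javascript
// Minimal sketch: open one page with Puppeteer and collect its internal links.
// Assumes `npm install puppeteer`; collectLinks is an illustrative name only.
const puppeteer = require('puppeteer');

async function collectLinks(pageUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'domcontentloaded' });

  // The browser resolves every href to an absolute URL for us.
  const links = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));

  await browser.close();

  // Keep only internal links: same origin as the page we just scraped.
  const origin = new URL(pageUrl).origin;
  return [...new Set(links)].filter((link) => link.startsWith(origin));
}

// Usage: collectLinks('https://abc.com').then(console.log);
```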

Designing a Persistent Internal URL Scraper for Seamless Scraping

To ensure that our internal URL scraper is robust and reliable, we need to design it with persistence in mind. There are a number of factors that can cause the scraper to fail, such as poor internet connectivity, exceptions thrown by Node.js, or other unforeseen issues. To ensure that we can always pick up where we left off, we need to implement a persistent design.

The key to this design is to use two separate files to store two types of links: “to-do” links and “done” links. Whenever we start the scraper, it reads from the to-do list and begins opening the pages at those URLs.
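As a minimal sketch of this persistence layer (the file names todo.txt and done.txt follow the article, while the helper names readList, nextUrl, and markDone are illustrative):

```javascript
// Minimal persistence sketch, assuming two plain-text files with one URL per line.
// The file names follow the article; the helper names are illustrative only.
const fs = require('fs');

const TODO_FILE = 'todo.txt';
const DONE_FILE = 'done.txt';

// Read a file into an array of URLs, tolerating a missing file on the first run.
function readList(file) {
  if (!fs.existsSync(file)) return [];
  return fs.readFileSync(file, 'utf8').split('\n').filter(Boolean);
}

// Peek at the next URL waiting in the to-do queue (or null if the queue is empty).
function nextUrl() {
  const todo = readList(TODO_FILE);
  return todo.length > 0 ? todo[0] : null;
}

// After a page has been scraped, remove it from todo.txt and record it in done.txt,
// so an interrupted run can resume exactly where it stopped.
function markDone(url) {
  const rest = readList(TODO_FILE).filter((u) => u !== url);
  fs.writeFileSync(TODO_FILE, rest.length ? rest.join('\n') + '\n' : '');
  fs.appendFileSync(DONE_FILE, url + '\n');
}
```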


As we do this, we will filter out any duplicate links and check them against the done list. If a link is already in the done list, we can ignore it. If a link is already in the to-do list, we can ignore it as well. However, if we discover a new link, we will append it to the to-do list for future processing.

This design ensures that even if the scraper is interrupted, we can easily pick up where we left off by simply reading from the to-do list and continuing with the remaining links. With this persistent design, we can be confident that our scraper will continue to run smoothly and efficiently, even in the face of unexpected challenges.

For example, https://abc.com/8 will be added to todo.txt because it does not appear in either todo.txt or done.txt.
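Sticking with that example, the dedup check might look like the following sketch, reusing readList from above; enqueueIfNew is an illustrative name, not the repository's:

```javascript
// Sketch of the dedup check: a discovered link is queued only if it appears in
// neither done.txt nor todo.txt. Reuses the readList helper from the sketch above.
function enqueueIfNew(url) {
  const done = new Set(readList(DONE_FILE));
  const todo = new Set(readList(TODO_FILE));

  if (done.has(url) || todo.has(url)) return false; // already known, ignore it
  fs.appendFileSync(TODO_FILE, url + '\n');         // new link, add it to the queue
  return true;
}

// e.g. enqueueIfNew('https://abc.com/8') appends the URL because it is not
// present in either file yet.
```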


Building a State Machine for the Internal URL Scraper: Automation and Persistence

With our persistent design in place, our internal URL scraper is now stateful instead of stateless.


This means we can manage the scraper’s progress more effectively, and handle different scenarios that may occur during scraping. There are four possible states the scraper can be in:

  1. The newly discovered URL is completely new and needs to be added to todo.txt for later scraping.
  2. The newly discovered URL is already in done.txt or todo.txt, so we ignore it and extract the first URL from the todo.txt queue.
  3. There are no more newly discovered URLs, so we extract the first URL from the todo.txt queue.
  4. There are no more URLs to scrape and nothing left in the todo.txt queue, so the run is complete.

To handle these different states, we can implement a state machine pattern rather than a plain switch-case function. This will make it easier to manage the scraper’s progress and ensure that it runs smoothly.

By using a state machine, we can ensure that the scraper always knows what to do next, even if it encounters unexpected issues.
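To make this concrete, here is a minimal state machine sketch that ties the earlier helpers together. The state names mirror the four cases above, but the structure is my own illustration of the pattern rather than the repository's exact implementation.

```javascript
// Minimal state-machine sketch for the scraper loop. State names mirror the four
// cases above; collectLinks/enqueueIfNew/nextUrl/markDone are the sketches from
// earlier, and the overall structure is illustrative rather than the repo's code.
const states = {
  // Pull the next URL from todo.txt; stop when the queue is empty.
  PICK_NEXT: async (ctx) => {
    ctx.current = nextUrl();
    return ctx.current ? 'SCRAPE_PAGE' : 'FINISHED';
  },

  // Open the page and queue every internal link we have not seen before.
  SCRAPE_PAGE: async (ctx) => {
    const links = await collectLinks(ctx.current);
    links.forEach((link) => enqueueIfNew(link));
    return 'MARK_DONE';
  },

  // Record the page in done.txt so an interrupted run can resume safely.
  MARK_DONE: async (ctx) => {
    markDone(ctx.current);
    return 'PICK_NEXT';
  },

  FINISHED: async () => null,
};

async function run(startUrl) {
  enqueueIfNew(startUrl);             // seed the queue on the very first run
  let state = 'PICK_NEXT';
  const ctx = {};
  while (state) {
    state = await states[state](ctx); // each state returns the name of the next state
  }
}

// run('https://abc.com');
```

Each state returns the name of the next state, so extending the scraper (for example, with a retry state for failed fetches) only means adding one more entry to the states table.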

Conclusion

In conclusion, the “persistent internal URL scraper” tool we have designed is a reliable and efficient solution for automating internal link scraping.

The implementation of a state machine pattern allows for effective management of the scraper’s progress, even in unexpected scenarios.

Whether you run a small or mid-sized website, this tool is perfect for generating a list of URL links from your site. So, why not try it out today and take the hassle out of manual internal link scraping? And if you have any suggestions or improvements, feel free to send a pull request and contribute to making this tool even better.

Reference:

https://github.com/wahengchang/urlExtactor
