The provided code is a multi-threaded web crawler designed to traverse and collect information from web pages, starting from a base URL. The crawler uses the producer-consumer pattern and breadth-first search to extract URLs from each page.
- The program starts with a seed_url that is provided to the crawler.
- The main thread seeds the URL queue with the initial seed_url. The crawl_url method acts as a producer: when it finds new links, it adds them to the queue for further crawling via self.queue.put(link). Multiple worker threads (consumers) are started in the start_crawl method, each running the worker method. Every worker repeatedly takes a URL from the queue and processes it by calling crawl_url. If the queue stays empty for 5 seconds, the worker thread exits, which is typical consumer behavior: wait for data, and stop when there is none.
- The links extracted from each page are first checked against the rules in robots.txt to determine whether they may be crawled further.
- If a link is allowed by robots.txt, the HTML content at that URL is fetched by the HTML downloader. The downloader also handles rate limiting and slows down when requests are throttled.
- After the HTML is downloaded, it is parsed by the Content Parser, which extracts the links from the page.
- If a link is valid, i.e. it belongs to the same domain and is not spam, the crawler checks whether the URL has been visited before.
- If the URL has not been visited before, it is added to the queue.
- After the program completes, a sitemap.json file is generated containing the sitemap of the crawled website.
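The producer-consumer worker loop described above can be sketched with the standard library. Here `crawl_url` is a stand-in (the real method downloads and parses pages), and the names `url_queue` and `visited` are illustrative assumptions; the 5-second queue timeout matches the exit behavior described:

```python
import queue
import threading

url_queue = queue.Queue()
visited = set()
visited_lock = threading.Lock()

def crawl_url(url):
    """Producer side (stand-in): pretend only the seed page links anywhere."""
    if url == "https://example.com":
        discovered = {url + "/about", url + "/contact"}
    else:
        discovered = set()
    for link in discovered:
        with visited_lock:
            if link not in visited:
                visited.add(link)
                url_queue.put(link)

def worker():
    """Consumer side: pull URLs until the queue stays empty for 5 seconds."""
    while True:
        try:
            url = url_queue.get(timeout=5)
        except queue.Empty:
            return  # no work for 5 seconds -> thread exits
        crawl_url(url)
        url_queue.task_done()

url_queue.put("https://example.com")
visited.add("https://example.com")
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(visited))
```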
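The robots.txt check can be done with `urllib.robotparser` from the standard library; whether the project uses this module is an assumption, and the rules below are a made-up example (real code would `read()` the site's live robots.txt):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Feed example rules directly; a crawler would instead call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```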
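A minimal sketch of a downloader that slows down when rate limited, assuming the server signals throttling with HTTP 429; the retry count and exponential backoff schedule are assumptions, not the project's actual values:

```python
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base_delay=1.0):
    """Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ..."""
    return base_delay * (2 ** attempt)

def download(url, max_retries=3):
    """Fetch a URL, sleeping between retries if the server rate-limits us."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as e:
            if e.code == 429:  # Too Many Requests: back off and retry
                time.sleep(backoff_delay(attempt))
                continue
            raise
    return None  # gave up after max_retries rate-limited attempts
```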
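Link extraction of the kind the Content Parser performs can be sketched with the standard library's `html.parser`; the real project may use a different parsing library, so this is only an illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="/docs">Docs</a>'
            '<a href="https://example.com">Home</a></body></html>')
print(parser.links)  # ['/docs', 'https://example.com']
```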
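The validity and visited checks described above (same domain, not previously crawled) can be sketched as follows; `is_valid` is a hypothetical helper, and the project's spam check is omitted because its criteria are project-specific:

```python
from urllib.parse import urljoin, urlparse

def is_valid(link, base_url, visited):
    """Return the absolute URL if it is same-domain and unvisited, else None."""
    absolute = urljoin(base_url, link)
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None  # different domain: skip
    if absolute in visited:
        return None  # already crawled: skip
    return absolute

visited = {"https://example.com/"}
print(is_valid("/about", "https://example.com/", visited))
print(is_valid("https://other.org/x", "https://example.com/", visited))  # None
print(is_valid("/", "https://example.com/", visited))  # None (already visited)
```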
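Writing the sitemap might look like the sketch below; the page-to-outgoing-links structure shown here is an assumption, since the source does not specify the exact JSON layout:

```python
import json

# Assumed structure: each crawled page maps to the links found on it.
sitemap = {
    "https://example.com/": [
        "https://example.com/about",
        "https://example.com/contact",
    ],
    "https://example.com/about": [],
}

with open("sitemap.json", "w") as f:
    json.dump(sitemap, f, indent=2)
```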
- It is recommended that you use a virtual environment for running the development server. If you don't want to use one, skip directly to step 4.
```
python -m pip install --user virtualenv
```
- If you're running the project for the first time, create a virtual environment.
```
virtualenv venv
```
- Activate the virtual environment.
```
source venv/bin/activate
```
- Install all the relevant dependencies by running this from the root directory of the project.
```
pip install -r requirements.txt
```
- Run the web crawler.
```
python web_crawler.py
```
- Check the sitemap.json generated.
To run tests, run the following command:
```
python web_crawler_test.py
```
Note: I did not have time to write tests for each file and use-case. In production code, I would write more unit and integration tests.
- Atibhi Agrawal
