The provided code is a multi-threaded web crawler designed to traverse and collect information from web pages, starting from a base URL. The crawler uses the producer-consumer pattern and breadth-first search to extract URLs from each page.
- The program starts with a seed_url that is provided to the crawler.
- The main thread seeds the URL queue with the initial seed_url. The crawl_url method acts as a producer: when it finds new links, it adds them to the queue for further crawling via self.queue.put(link). Multiple worker threads (consumers) are started in the start_crawl method, each running the worker method. Every worker repeatedly takes a URL from the queue and processes it by calling crawl_url. If the queue stays empty for 5 seconds, the worker thread exits, which is typical consumer behavior: wait for data, and stop when there is none.
- The links extracted from each page are first checked against the rules in robots.txt to determine whether they may be crawled further.
- If a link is allowed by robots.txt, the HTML content at that URL is fetched by the HTML downloader. The downloader also handles rate limiting and slows down when requests are throttled.
- After the HTML is downloaded, it is parsed by the Content Parser, which extracts the links from the page.
- If a link is valid, i.e. it belongs to the same domain and is not spam, the crawler checks whether the URL has been visited before.
- If the URL has not been visited before, it is added to the queue.
- After the program completes, a sitemap.json file is generated containing the sitemap of the crawled website.
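The producer-consumer worker loop described above can be sketched with the standard library. Here `crawl_url` is a stand-in (the real method downloads and parses pages), and the names `url_queue` and `visited` are illustrative assumptions; the 5-second queue timeout matches the exit behavior described:

```python
import queue
import threading

url_queue = queue.Queue()
visited = set()
visited_lock = threading.Lock()

def crawl_url(url):
    """Producer side (stand-in): pretend only the seed page links anywhere."""
    if url == "https://example.com":
        discovered = {url + "/about", url + "/contact"}
    else:
        discovered = set()
    for link in discovered:
        with visited_lock:
            if link not in visited:
                visited.add(link)
                url_queue.put(link)

def worker():
    """Consumer side: pull URLs until the queue stays empty for 5 seconds."""
    while True:
        try:
            url = url_queue.get(timeout=5)
        except queue.Empty:
            return  # no work for 5 seconds -> thread exits
        crawl_url(url)
        url_queue.task_done()

url_queue.put("https://example.com")
visited.add("https://example.com")
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(visited))
```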
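The robots.txt check can be done with `urllib.robotparser` from the standard library; whether the project uses this module is an assumption, and the rules below are a made-up example (real code would `read()` the site's live robots.txt):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Feed example rules directly; a crawler would instead call
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/public/page"))   # True
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```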
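A minimal sketch of a downloader that slows down when rate limited, assuming the server signals throttling with HTTP 429; the retry count and exponential backoff schedule are assumptions, not the project's actual values:

```python
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base_delay=1.0):
    """Exponential backoff: 1s, 2s, 4s, ... for attempts 0, 1, 2, ..."""
    return base_delay * (2 ** attempt)

def download(url, max_retries=3):
    """Fetch a URL, sleeping between retries if the server rate-limits us."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as e:
            if e.code == 429:  # Too Many Requests: back off and retry
                time.sleep(backoff_delay(attempt))
                continue
            raise
    return None  # gave up after max_retries rate-limited attempts
```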
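Link extraction of the kind the Content Parser performs can be sketched with the standard library's `html.parser`; the real project may use a different parsing library, so this is only an illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="/docs">Docs</a>'
            '<a href="https://example.com">Home</a></body></html>')
print(parser.links)  # ['/docs', 'https://example.com']
```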
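The validity and visited checks described above (same domain, not previously crawled) can be sketched as follows; `is_valid` is a hypothetical helper, and the project's spam check is omitted because its criteria are project-specific:

```python
from urllib.parse import urljoin, urlparse

def is_valid(link, base_url, visited):
    """Return the absolute URL if it is same-domain and unvisited, else None."""
    absolute = urljoin(base_url, link)
    if urlparse(absolute).netloc != urlparse(base_url).netloc:
        return None  # different domain: skip
    if absolute in visited:
        return None  # already crawled: skip
    return absolute

visited = {"https://example.com/"}
print(is_valid("/about", "https://example.com/", visited))
print(is_valid("https://other.org/x", "https://example.com/", visited))  # None
print(is_valid("/", "https://example.com/", visited))  # None (already visited)
```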
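Writing the sitemap might look like the sketch below; the page-to-outgoing-links structure shown here is an assumption, since the source does not specify the exact JSON layout:

```python
import json

# Assumed structure: each crawled page maps to the links found on it.
sitemap = {
    "https://example.com/": [
        "https://example.com/about",
        "https://example.com/contact",
    ],
    "https://example.com/about": [],
}

with open("sitemap.json", "w") as f:
    json.dump(sitemap, f, indent=2)
```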
- It is recommended that you use a virtual environment for running the development server. If you don't want to use one, skip directly to step 4.
```
python -m pip install --user virtualenv
```
- If you're running the project for the first time, create a virtual environment.
```
virtualenv venv
```
- Activate the virtual environment.
```
source venv/bin/activate
```
- Install all the relevant dependencies by running this from the root directory of the project.
```
pip install -r requirements.txt
```
- Run the web crawler.
```
python web_crawler.py
```
- Check the sitemap.json generated.
To run tests, run the following command:
```
python web_crawler_test.py
```
Note: I did not have time to write tests for each file and use-case. In production code, I would write more unit and integration tests.
- Atibhi Agrawal
