A powerful, Flask-powered web crawler built with Scrapy for efficient data extraction
Neural Crawler is a modern web scraping application that combines the power of Scrapy with a user-friendly Flask web interface. It enables users to crawl websites and extract structured data with just a single click.
The default configuration targets Books to Scrape, extracting book titles, prices, and availability information, but can be easily customized to crawl any website.
- Web Interface - Simple, intuitive Flask-based UI for triggering crawls
- Fast Crawling - Powered by Scrapy's asynchronous architecture
- JSON Output - Structured data export in JSON format
- Configurable - Easily customizable spider rules and settings
- Robots.txt Compliant - Respects website crawling policies
- Timeout Protection - Built-in safeguards against runaway crawls
- Page Limit Control - Configurable page count limits
| Technology | Purpose |
|---|---|
| Python 3.8+ | Core programming language |
| Scrapy | Web crawling framework |
| Flask | Web application framework |
| HTML/CSS | Frontend interface |
```
Neural-Crawler-Web-Crawler/
├── app.py                 # Flask application entry point
├── crawler_runner.py      # Spider execution handler
├── output.json            # Crawled data output
├── scrapy.cfg             # Scrapy configuration
├── neuralcrawling/        # Scrapy project module
│   ├── __init__.py
│   ├── items.py           # Data models
│   ├── middlewares.py     # Spider/Downloader middlewares
│   ├── pipelines.py       # Data processing pipelines
│   ├── settings.py        # Scrapy settings
│   └── spiders/
│       ├── __init__.py
│       └── crawling_spider.py  # Main crawler spider
├── static/
│   └── style.css          # Application styles
├── templates/
│   └── index.html         # Web interface template
└── images/
    └── crawler-2.jpg      # Banner image
```
- Python 3.8 or higher
- pip (Python package manager)
1. Clone the repository

   ```bash
   git clone https://github.com/yourusername/Neural-Crawler-Web-Crawler.git
   cd Neural-Crawler-Web-Crawler
   ```

2. Create a virtual environment (recommended)

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # macOS/Linux
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install flask scrapy
   ```

4. Start the Flask server

   ```bash
   python app.py
   ```

5. Access the interface

   Open your browser and navigate to http://127.0.0.1:5000

6. Start crawling

   Click the "Start Crawling" button to begin data extraction.
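For orientation, `app.py` might look roughly like the sketch below. This is a hypothetical minimal version: the real project delegates to `crawler_runner.py`, and the route name, inline template, and 120-second timeout here are all assumptions.

```python
# Hypothetical minimal app.py: a Flask app whose /crawl route shells
# out to Scrapy with a timeout guard. Illustrative only; the real
# project routes spider execution through crawler_runner.py.
import subprocess
from flask import Flask, render_template_string

app = Flask(__name__)

PAGE = "<h1>Neural Crawler</h1><p>{{ message }}</p>"

@app.route("/")
def index():
    return render_template_string(PAGE, message="Ready to crawl.")

@app.route("/crawl")
def crawl():
    # The timeout protects against runaway crawls (assumed value: 120 s).
    result = subprocess.run(
        ["scrapy", "crawl", "mycrawler", "-o", "output.json"],
        timeout=120,
    )
    message = "Crawl finished." if result.returncode == 0 else "Crawl failed."
    return render_template_string(PAGE, message=message)

if __name__ == "__main__":
    app.run(debug=True)
```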
You can also run the Scrapy spider directly from the command line:

```bash
scrapy crawl mycrawler -o output.json
```

Modify `neuralcrawling/spiders/crawling_spider.py` to customize:
| Setting | Default | Description |
|---|---|---|
| `allowed_domains` | `books.toscrape.com` | Domains the spider can crawl |
| `start_urls` | `http://books.toscrape.com/` | Starting URLs for crawling |
| `CLOSESPIDER_PAGECOUNT` | `10` | Maximum pages to crawl |
Edit `neuralcrawling/settings.py` to adjust:

- `ROBOTSTXT_OBEY` - Respect robots.txt rules (default: `True`)
- `DOWNLOAD_DELAY` - Delay between requests (default: `1` second)
- `CONCURRENT_REQUESTS_PER_DOMAIN` - Parallel requests per domain (default: `1`)
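Put together, the corresponding fragment of `settings.py` might look like this (values mirror the defaults listed above; exact layout in the project may differ):

```python
# Illustrative fragment of neuralcrawling/settings.py.
ROBOTSTXT_OBEY = True                # honour robots.txt rules
DOWNLOAD_DELAY = 1                   # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain
CLOSESPIDER_PAGECOUNT = 10           # stop the spider after 10 pages
```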
Crawled data is saved to `output.json` in the following structure:

```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In"
  },
  {
    "title": "Tipping the Velvet",
    "price": "£53.74",
    "availability": "In"
  }
]
```

To point the crawler at a different site:

- Update `allowed_domains` and `start_urls` in `crawling_spider.py`
- Modify the `rules` to match the target site's URL patterns
- Update `parse_item()` to extract the desired data fields
- Adjust the HTML template in `templates/index.html` to display new fields
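Once a crawl has finished, the output file can be consumed with nothing but the standard library. A quick sketch (the `summarize` helper and its return shape are illustrative, not part of the project):

```python
# Consume output.json after a crawl and report what was collected.
import json

def summarize(path="output.json"):
    """Load crawl results and return (item count, list of titles)."""
    with open(path, encoding="utf-8") as f:
        books = json.load(f)
    return len(books), [b["title"] for b in books]
```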
```python
def parse_item(self, response):
    yield {
        "title": response.css("h1::text").get(),
        "price": response.css(".price::text").get(),
        "description": response.css(".description p::text").get(),
        "url": response.url,
    }
```

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Scrapy - The fast, high-level web crawling framework
- Flask - The lightweight WSGI web application framework
- Books to Scrape - Test website for web scraping

