πŸ•·οΈ Neural Crawler - Web Crawler

![Neural Crawler Banner](images/crawler-2.jpg)

*A powerful web crawler built with Scrapy and Flask for efficient data extraction*



## 📖 Overview

Neural Crawler is a modern web scraping application that combines the power of Scrapy with a user-friendly Flask web interface. It enables users to crawl websites and extract structured data with just a single click.

The default configuration targets Books to Scrape, extracting book titles, prices, and availability information, but it can be easily customized to crawl any website.

## ✨ Features

- 🌐 **Web Interface** - Simple, intuitive Flask-based UI for triggering crawls
- ⚡ **Fast Crawling** - Powered by Scrapy's asynchronous architecture
- 📊 **JSON Output** - Structured data export in JSON format
- 🔧 **Configurable** - Easily customizable spider rules and settings
- 🤖 **Robots.txt Compliant** - Respects website crawling policies
- ⏱️ **Timeout Protection** - Built-in safeguards against runaway crawls
- 🎯 **Page Limit Control** - Configurable page count limits


πŸ› οΈ Tech Stack

| Technology | Purpose |
| --- | --- |
| Python 3.8+ | Core programming language |
| Scrapy | Web crawling framework |
| Flask | Web application framework |
| HTML/CSS | Frontend interface |

πŸ“ Project Structure

```
Neural-Crawler-Web-Crawler/
├── app.py                  # Flask application entry point
├── crawler_runner.py       # Spider execution handler
├── output.json             # Crawled data output
├── scrapy.cfg              # Scrapy configuration
├── neuralcrawling/         # Scrapy project module
│   ├── __init__.py
│   ├── items.py            # Data models
│   ├── middlewares.py      # Spider/Downloader middlewares
│   ├── pipelines.py        # Data processing pipelines
│   ├── settings.py         # Scrapy settings
│   └── spiders/
│       ├── __init__.py
│       └── crawling_spider.py  # Main crawler spider
├── static/
│   └── style.css           # Application styles
├── templates/
│   └── index.html          # Web interface template
└── images/
    └── crawler-2.jpg       # Banner image
```

## 🚀 Installation

### Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

### Setup Steps

1. **Clone the repository**

   ```bash
   git clone https://github.com/yourusername/Neural-Crawler-Web-Crawler.git
   cd Neural-Crawler-Web-Crawler
   ```

2. **Create a virtual environment** (recommended)

   ```bash
   python -m venv venv

   # Windows
   venv\Scripts\activate

   # macOS/Linux
   source venv/bin/activate
   ```

3. **Install dependencies**

   ```bash
   pip install flask scrapy
   ```

## 💻 Usage

### Running the Web Application

1. **Start the Flask server**

   ```bash
   python app.py
   ```

2. **Access the interface**

   Open your browser and navigate to http://127.0.0.1:5000

3. **Start crawling**

   Click the "Start Crawling" button to begin data extraction.
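Behind the button, the Flask app has to launch the spider out of process. The sketch below shows one way `crawler_runner.py` could do this; the helper name `run_spider` and the timeout value are assumptions, not the project's actual code, but the subprocess-with-timeout pattern matches the "Timeout Protection" feature described above:

```python
import json
import subprocess


def run_spider(timeout_seconds: int = 120) -> list:
    """Run the Scrapy spider as a subprocess and return the parsed items.

    subprocess.TimeoutExpired is raised if the crawl exceeds the budget,
    which guards against runaway crawls (hypothetical helper; the real
    crawler_runner.py may differ).
    """
    subprocess.run(
        ["scrapy", "crawl", "mycrawler", "-o", "output.json"],
        check=True,
        timeout=timeout_seconds,
    )
    with open("output.json", encoding="utf-8") as f:
        return json.load(f)
```

The Flask route can call `run_spider()`, catch `subprocess.TimeoutExpired` / `subprocess.CalledProcessError`, and render the returned items in `templates/index.html`.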

### Running the Spider Directly

You can also run the Scrapy spider directly from the command line:

```bash
scrapy crawl mycrawler -o output.json
```

βš™οΈ Configuration

### Spider Settings

Modify `neuralcrawling/spiders/crawling_spider.py` to customize:

| Setting | Default | Description |
| --- | --- | --- |
| `allowed_domains` | `books.toscrape.com` | Domains the spider can crawl |
| `start_urls` | `http://books.toscrape.com/` | Starting URLs for crawling |
| `CLOSESPIDER_PAGECOUNT` | `10` | Maximum pages to crawl |

### Global Settings

Edit `neuralcrawling/settings.py` to adjust:

- `ROBOTSTXT_OBEY` - Respect robots.txt rules (default: `True`)
- `DOWNLOAD_DELAY` - Delay between requests (default: 1 second)
- `CONCURRENT_REQUESTS_PER_DOMAIN` - Parallel requests per domain (default: 1)
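Spelled out as a settings excerpt (values taken from the defaults listed above, so this is a sketch of what the relevant lines in `settings.py` look like, not the whole file):

```python
# neuralcrawling/settings.py (excerpt) -- polite-crawling defaults
ROBOTSTXT_OBEY = True                # honour robots.txt before fetching
DOWNLOAD_DELAY = 1                   # wait 1 second between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain
```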

## 📤 Output Format

Crawled data is saved to `output.json` in the following structure:

```json
[
  {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In"
  },
  {
    "title": "Tipping the Velvet",
    "price": "£53.74",
    "availability": "In"
  }
]
```
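Because the output is plain JSON, it can be consumed with the standard library alone. A small sketch of reading a result back (field names taken from the sample above; `summarize` is a hypothetical helper):

```python
import json


def summarize(raw_json: str) -> str:
    """Return a one-line summary of a crawl result in the format above."""
    items = json.loads(raw_json)
    return f"{len(items)} item(s); first title: {items[0]['title']}"


sample = '[{"title": "A Light in the Attic", "price": "£51.77", "availability": "In"}]'
print(summarize(sample))  # -> 1 item(s); first title: A Light in the Attic
```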

## 🎨 Customization

### Crawling a Different Website

1. Update `allowed_domains` and `start_urls` in `crawling_spider.py`
2. Modify the `rules` to match the target site's URL patterns
3. Update `parse_item()` to extract the desired data fields
4. Adjust the HTML template in `templates/index.html` to display new fields

### Example: Custom Data Extraction

```python
def parse_item(self, response):
    yield {
        "title": response.css("h1::text").get(),
        "price": response.css(".price::text").get(),
        "description": response.css(".description p::text").get(),
        "url": response.url,
    }
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## Acknowledgments

- [Scrapy](https://scrapy.org/) - The fast, high-level web crawling framework
- [Flask](https://flask.palletsprojects.com/) - The lightweight WSGI web application framework
- [Books to Scrape](http://books.toscrape.com/) - Test website for web scraping
