
🚀 ArQiv Search Engine

High-performance semantic search for ArXiv research papers


Features · Getting Started · Usage · Architecture · Benchmarks · Live Demo

Built with Python, Streamlit, NumPy, scikit-learn, NLTK, PyTorch, Pandas, and Plotly.


📖 Overview

ArQiv is a state-of-the-art search engine designed specifically for the ArXiv research corpus. It combines multiple indexing strategies and ranking algorithms to deliver lightning-fast, relevant results that help researchers discover papers more efficiently.

By default only the first 1,000 documents are loaded; change the sample_size parameter in loader.py to load more documents or the full dataset.

✨ Key Features

🔍 Optimized Data Structures

  • Inverted Index: Rapid term lookups with positional data
  • Trie (Prefix Tree): Instant autocomplete and fuzzy matching
  • Bitmap Index: Ultra-fast Boolean operations with vectorization
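As a rough illustration (a minimal sketch, not the project's actual implementation), an inverted index with positional postings can be built in a few lines:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for fast term lookups."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "deep learning for search", 2: "search engines and ranking"}
index = build_inverted_index(docs)
print(index["search"])  # {1: [3], 2: [0]}
```

The positional lists are what make phrase queries and proximity scoring possible on top of plain term lookups.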

📊 Advanced Ranking Algorithms

  • BM25 Ranking: Precise probabilistic relevance scoring
  • TF-IDF Ranking: Robust vectorized similarity computation
  • Fast Vector Ranking: Real-time ranking via NearestNeighbors
  • Optional BERT Ranking: Deep semantic ranking with transformers
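For reference, the probabilistic score BM25 produces can be sketched as a simplified standalone function (assumed parameter defaults k1=1.5, b=0.75; not the repository's tuned implementation):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

corpus = [["neural", "nets"], ["bm25", "ranking", "works"]]
print(bm25_score(["ranking"], corpus[1], corpus))
```

The length-normalization term (controlled by b) is what keeps long abstracts from dominating the results.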

🖥️ Multiple Interfaces

  • Rich CLI: Colorful terminal interface with detailed results
  • Streamlit Web App: Interactive web UI with visualizations
  • In-memory Query Caching: Near-instant response on repeated queries
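The query-caching idea can be sketched with the standard library alone (a stand-in ranking call, not ArQiv's actual cache):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_search(query):
    """Stand-in for a ranking call; repeated queries are served from memory."""
    return tuple(sorted(query.split()))

cached_search("bm25 ranking")  # computed on first call
cached_search("bm25 ranking")  # served from the in-memory cache
print(cached_search.cache_info().hits)  # 1
```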

⚡ Performance & Scalability

  • Parallel Processing: Multi-core indexing for large datasets
  • Modular Design: Extensible architecture for new features
  • Memory Efficient: Optimized for performance on standard hardware

🚀 Getting Started

Prerequisites

  • Python 3.7+
  • 4GB+ RAM (8GB recommended for full dataset)
  • Internet connection for initial dataset download

Installation

  1. Clone the repository:

    git clone https://github.com/tejas242/ArQiv.git
    cd ArQiv
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download NLTK resources (one-time):

    import nltk
    nltk.download('stopwords')

📋 Usage

Command Line Interface

Launch the rich CLI interface:

python cli.py

Streamlit Web Interface

Start the interactive web application:

cd streamlit
streamlit run streamlit_app.py

Or use the hosted version:

https://arqiv-search.streamlit.app/


🏗️ Architecture

ArQiv employs a layered architecture with the following components:

┌─────────────────────────────────┐
│          User Interfaces        │
│    CLI Interface  │  Streamlit  │
├─────────────────────────────────┤
│         Ranking Algorithms      │
│  BM25  │ TF-IDF │ Vector │ BERT │
├─────────────────────────────────┤
│          Index Structures       │
│ Inverted Index │ Trie │ Bitmap  │
├─────────────────────────────────┤
│             Data Layer          │
│ Document Model │ Preprocessing  │
└─────────────────────────────────┘

📊 Benchmarks

  Task                    Performance
  ----------------------  -----------
  Index 1,000 documents   0.8 seconds
  Boolean search          < 5 ms
  BM25 ranking            ~50-100 ms
  TF-IDF ranking          < 5 ms
  Fast Vector ranking     < 5 ms
  BERT ranking            ~200 ms

Measurements taken on a Ryzen 3 CPU with 8 GB RAM.
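Numbers like these can be reproduced with a simple wall-clock harness (a sketch; search_fn stands in for any of the rankers above):

```python
import time

def time_query(search_fn, query, runs=5):
    """Average latency of search_fn in seconds over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        search_fn(query)
    return (time.perf_counter() - start) / runs

# Trivial stand-in search function for demonstration:
latency = time_query(lambda q: q.split(), "graph neural networks")
print(f"~{latency * 1000:.3f} ms per query")
```

Averaging over several runs smooths out first-call effects such as cold caches.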

⚠️ Limitations & Challenges

While ArQiv is designed to be powerful and efficient, users should be aware of these limitations:

  • Memory Usage: The in-memory index requires approximately 50MB per 1,000 documents
  • Scalability Ceiling: Performance may degrade with datasets beyond 100,000 documents without distributed architecture
  • Language Support: Currently optimized for English text only
  • Neural Features: BERT-based ranking requires substantial CPU resources (GPU recommended)
  • Cold Start: First-time initialization has higher latency while building indices
  • Preprocessing Effects: Stemming may occasionally lead to unexpected term matches

For large-scale deployment scenarios, consider implementing a sharded index architecture or using a database backend for the inverted index.

📁 Project Structure

arqiv/
├── data/               # Data handling components
│   ├── document.py     # Document model
│   ├── preprocessing.py # Text processing utilities
│   └── loader.py       # Dataset loading
├── index/              # Indexing structures
│   ├── inverted_index.py # Main indexing engine
│   ├── trie.py         # Prefix tree for autocomplete
│   └── bitmap_index.py # Bitmap for fast boolean ops
├── ranking/            # Ranking algorithms
│   ├── bm25.py         # BM25 implementation
│   ├── tfidf.py        # TF-IDF with scikit-learn
│   ├── fast_vector_ranker.py # Vector-based ranking
│   └── bert_ranker.py  # Neural ranking with BERT
├── search/             # Search functionalities
│   └── fuzzy_search.py # Approximate string matching
├── streamlit/          # Web interface
│   └── streamlit_app.py # Streamlit application
├── docs/               # Documentation
├── tests/              # Test suite
├── cli.py              # Command-line interface
└── README.md           # This file

🔧 Troubleshooting

Common issues and solutions

Dataset Download Issues

  • Ensure Kaggle API credentials are set up correctly
  • For manual download: Place the arxiv-metadata-oai-snapshot.json in the data/ directory

Performance Problems

  • For slow search: Use TF-IDF or Fast Vector ranking, which are faster than BM25 and BERT
  • If indexing is slow: Increase the number of worker processes in the parallel option

Memory Usage

  • If experiencing memory issues: Reduce sample_size parameter in load_arxiv_dataset()

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ⚡ by Tejas Mahajan
