Indoor Navigation System for Blind People via Cognitive Architectures, Augmented Reality and LLMs

A multimodal assistive navigation system combining Augmented Reality (AR), Object Detection (OD), and Large Language Models (LLM) to enable indoor navigation for blind and visually impaired users through a cognitive architecture framework.

Overview

This system provides real-time, conversational navigation assistance for blind users exploring indoor environments. By integrating computer vision, spatial anchoring, natural language understanding, and cognitive architecture principles, the system enables users to:

  • Navigate complex indoor spaces using conversational commands ("Take me to the kitchen")
  • Understand their surroundings through object detection and scene descriptions
  • Interact naturally with an AI assistant that understands navigation intents and context
  • Track objects and locations using spatial memory and visual recognition
  • Receive multimodal feedback through audio cues and verbal guidance

Key Technologies

  • iOS AR Application: ARKit-based mobile client with cognitive architecture controllers
  • Spatial Anchoring: Microsoft Azure Spatial Anchors for persistent waypoint tracking
  • Object Detection: Google MLKit for real-time visual recognition
  • LLM Integration: Multiple backends (Ollama, Mistral, Llama2, GPT-2) for natural language understanding
  • Vision-Language Models: LLaVA for multimodal scene understanding
  • Path Planning: A* algorithm for optimal route calculation
  • Speech Recognition: Real-time voice input processing on iOS

Cognitive Architecture

The system implements a classical cognitive architecture with the following components:

System Architecture

Figure 1: Modular client-server architecture. On the client side (smartphone), boxes represent modules running asynchronously on separate threads. The working memory integrates information from the other modules and produces a speech output. A cognitive cycle begins when the perception module processes the internal and external information stored in working memory. The procedural memory then decides what to do next from the contents of working memory, which in turn draws on declarative knowledge about the user and the world. The cycle ends with a conversational action executed by the motor module. Blue modules can be turned on or off according to the assessment setting; green modules denote cognitive modules proposed by the CMC. Arrows indicate the flow of information.
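The cycle in Figure 1 can be sketched as a simple perceive-decide-act loop. This is an illustrative Python rendering under assumed data shapes; the actual modules are implemented in Swift under client/CAControllers/:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Central store that the other modules read from and write to."""
    contents: dict = field(default_factory=dict)

def perceive(wm: WorkingMemory, observation: dict) -> None:
    # Perception: merge new internal/external information into working memory.
    wm.contents.update(observation)

def decide(wm: WorkingMemory, declarative: dict) -> str:
    # Procedural memory: pick the next action from working-memory contents,
    # enriched with declarative knowledge about the user and the world.
    if wm.contents.get("intent") == "request-route":
        dest = wm.contents.get("destination", "unknown place")
        known = declarative.get("waypoints", [])
        return f"Routing to {dest}" if dest in known else f"I don't know {dest}"
    return "Listening"

def act(utterance: str) -> str:
    # Motor module: the cycle ends with a conversational action (speech output).
    return utterance

# One cognitive cycle (hypothetical waypoint knowledge):
wm = WorkingMemory()
perceive(wm, {"intent": "request-route", "destination": "kitchen"})
print(act(decide(wm, {"waypoints": ["kitchen", "lab"]})))  # Routing to kitchen
```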


Repository Structure

ar-od-llm-indoor-navigation/
├── client/                 # iOS AR application (Swift, Xcode)
├── server/                 # Python backend services (Flask, LLMs)
├── datasets/               # Training and dialogue datasets
├── LICENSE                 # MIT License
└── README.md              # This file

Folder Descriptions

📱 /client - iOS AR Application

Platform: iOS (minimum deployment target: iOS 11.0)
Language: Swift
Frameworks: ARKit, UIKit

The client folder contains an Xcode project (prototype3) implementing the mobile AR navigation interface for blind users.

Key Components:

Cognitive Architecture Controllers (CAControllers/):

  • DeclarativeMemoryController.swift - Stores factual knowledge (waypoints, object locations, room information)
  • ProceduralMemoryController.swift - Manages navigation procedures and how-to knowledge
  • WorkingMemoryController.swift - Maintains active navigation state and temporary information
  • PerceptionController.swift - Processes sensory inputs (camera frames, spatial data)
  • MotorController.swift - Executes outputs (audio feedback, haptic cues)
  • PathPlannerController.swift - Computes optimal routes using pathfinding algorithms

View Controllers (ViewControllers/):

  • WayfindingViewController.swift - Main navigation interface
  • BaseAnchorsViewController.swift - Manages spatial anchors and waypoints
  • MainMenuViewController.swift - App entry point and scenario selection
  • AnchorInfoViewController.swift - Displays detailed waypoint information
  • GlobalSettingsViewController.swift - Configuration and feature toggles

Data Models (Models/):

  • AnchorData.swift - Represents navigation waypoints with coordinates and properties
  • PathData.swift - Route and path information structures
  • SessionData.swift - Navigation session state management
  • RootAnchors.swift - Hierarchical anchor organization

Utilities (Utils/):

  • WiFiConnection.swift - Network connectivity management
  • CustomLogger.swift - Logging and instrumentation
  • GUIHelper.swift - UI rendering utilities
  • TextObservationTracker.swift - OCR result tracking

Dependencies (CocoaPods):

pod 'AzureSpatialAnchors', '2.13.0'     # Persistent spatial anchoring
pod 'GoogleMLKit/ObjectDetection'       # Real-time object detection
pod 'GoogleMLKit/TextRecognition'       # OCR capabilities
pod 'DropDown'                          # UI components
pod 'SwiftySound'                       # Audio playback

Features:

  • Real-time object detection with toggle on/off
  • Conversational AI interaction via voice commands
  • People detection in rooms
  • Microsoft Azure Spatial Anchors for persistent waypoints
  • 8 research scenarios for systematic evaluation
  • Audio guidance system with directional cues
  • Session logging for research analysis

🖥️ /server - Python Backend Services

Platform: Python 3.x
Framework: Flask
LLM engines: Ollama, Mistral, Llama2, GPT-2, LLaVA

The server folder contains the backend services responsible for natural language understanding, pathfinding, and LLM integration.

Core Services:

Flask Application (application_spatial_anchors.py):

  • RESTful API for client-server communication
  • Spatial anchor description management
  • Session and anchor data retrieval
  • Conversational interface endpoints

API Endpoints:

Endpoint              Method  Purpose
/add-mapping          POST    Store anchor descriptions
/get-description      GET     Retrieve visual descriptions for anchors
/get-reply            POST    Process user utterances and return AI responses
/get-root-anchors     GET     Retrieve all waypoint anchors
/get-nlu-examples     GET     Get intent/entity training examples
/get-session-names    GET     List available navigation sessions
/get-anchors-session  GET     Get anchors for a specific session
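A client-side call to the conversational endpoint might be assembled as below. This is a sketch: the payload fields (utterance, session) and the server address are assumptions, not the exact schema used by application_spatial_anchors.py.

```python
import json
from urllib import request

def build_reply_request(server: str, utterance: str, session: str) -> request.Request:
    """Build a POST request for the /get-reply endpoint (hypothetical payload)."""
    payload = json.dumps({"utterance": utterance, "session": session}).encode()
    return request.Request(
        f"{server}/get-reply",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (assumes the Flask server is reachable at this address):
# with request.urlopen(build_reply_request("http://192.168.1.10:5000",
#                                          "Take me to the kitchen", "demo")) as resp:
#     reply = json.load(resp)
```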

LLM Integration:

  • ollama.py - Ollama engine with RAG (Retrieval-Augmented Generation)
  • ollama_multimodal.py - Vision + language model integration
  • mistral_conversational.py - Mistral model wrapper for dialogue
  • gpt2_conversational.py - GPT-2 model wrapper
  • llava_vision.py - LLaVA vision-language model for scene understanding

Navigation & Planning:

  • planner.py - A* pathfinding algorithm implementation
  • transit.py - Movement and transit logic
  • simply_geojson.py - GeoJSON map data processing
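The A* search in planner.py can be illustrated with a minimal graph version. This is a generic sketch over a hypothetical waypoint graph, not the repository's actual code:

```python
import heapq

def a_star(graph, start, goal, heuristic):
    """A* over a weighted graph: graph[node] -> list of (neighbor, cost) pairs."""
    frontier = [(heuristic(start), 0, start, [start])]
    best = {start: 0}  # cheapest known cost to each node
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for nbr, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                # Priority = cost so far + heuristic estimate to the goal.
                heapq.heappush(frontier,
                               (new_cost + heuristic(nbr), new_cost, nbr, path + [nbr]))
    return None, float("inf")

# Toy waypoint graph (hypothetical room layout):
graph = {
    "entrance": [("hallway", 1)],
    "hallway": [("kitchen", 2), ("lab", 4)],
    "lab": [("kitchen", 1)],
}
path, cost = a_star(graph, "entrance", "kitchen", heuristic=lambda n: 0)
print(path, cost)  # ['entrance', 'hallway', 'kitchen'] 3
```

With a zero heuristic this degenerates to Dijkstra's algorithm; a real deployment would use a distance estimate between waypoint coordinates.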

Machine Learning:

  • gpt2_train.py - Fine-tuning script for GPT-2 on navigation dialogues
  • run_clm.py - Causal language modeling utilities
  • data_generator.py - Synthetic training data generation
  • dataset_generator.py - Dialogue dataset creation

Data Directory (data/):

  • root_anchors.json - Complete waypoint definitions with coordinates
  • navigation.json - Route network graph representation
  • nlu_examples.json - Intent and entity training examples
  • markers.json - Geographic marker data (~4.8MB)
  • reformatted_navcog.json - NavCog-compatible format (~6.8MB)
  • template*.txt - LLM prompt templates for various tasks

Conversational Intents:

The NLU system recognizes the following navigation intents:

  • request-route - "How do I get to the kitchen?"
  • request-replan - "Find me another way"
  • find-object - "Where is the coffee maker?"
  • is-room-occupied - "Is anyone in the conference room?"
  • recall-object - "Where did I see the printer?"
  • look-around - "What's around me?"
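As a deliberately simplified stand-in for the LLM-based NLU, intent recognition over examples like those in nlu_examples.json can be sketched as bag-of-words matching (illustrative only; the real system prompts an LLM with these examples):

```python
def classify_intent(utterance: str, examples: dict) -> str:
    """Pick the intent whose example shares the most words with the utterance."""
    words = set(utterance.lower().split())
    def overlap(intent):
        return max(len(words & set(ex.lower().split())) for ex in examples[intent])
    return max(examples, key=overlap)

# Hypothetical subset of nlu_examples.json:
examples = {
    "request-route": ["How do I get to the kitchen?"],
    "find-object": ["Where is the coffee maker?"],
    "look-around": ["What's around me?"],
}
print(classify_intent("How do I get to the lab?", examples))  # request-route
```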

Technical Stack:

# Core Dependencies
flask                    # Web framework
langchain               # LLM orchestration
transformers            # Hugging Face models
torch                   # PyTorch deep learning
ollama                  # Local LLM execution

📊 /datasets - Training and Dialogue Data

Total size: ~12.5 MB
Formats: JSON, TXT, CSV
Domain: Indoor navigation dialogues and object detection

The datasets folder contains training corpora for fine-tuning LLMs and evaluating the navigation system.

Files:

Main Dialogue Dataset:

  • dialogue_dataset.json (~8.4 MB) - Primary conversational training corpus
  • dialogue_dataset.txt (~4.2 MB) - Text format version for language model training

Evaluation Data:

  • dialogue_results.json (~150 KB) - System performance evaluation results
  • test_ubicomp.json (~7.4 KB) - Test set for ubicomp scenarios
  • examples.csv (~3.6 KB) - Example conversations in CSV format

Generation Scripts:

  • gpt2_dataset_gen.py (~6.8 KB) - Automated dialogue generation script

Dataset Structure:

Each conversation in dialogue_dataset.json follows this format:

{
  "conversation-id": "conversation-0",
  "turns": [
    {
      "speaker": "user",
      "utterance": "How can I get to the kitchen?",
      "utt-delex": "How can I get to the $destination$?",
      "intent": "request-route",
      "slots": {
        "$destination$": "kitchen"
      },
      "state": {
        "$destination$": "kitchen"
      }
    },
    {
      "speaker": "system",
      "api-call": {
        "api": "request-route",
        "parameter": "$destination$",
        "value": "kitchen"
      },
      "results": {
        "$route-list$": "[...]"
      },
      "intent": "offer-routes",
      "utterance": "I found two routes to the kitchen. Route 1: ..."
    }
  ]
}
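Given that schema, turns can be iterated over directly; the snippet below pulls the user intents from an in-memory conversation (field names taken from the example above):

```python
import json

def user_intents(conversation: dict) -> list:
    """Collect the intent of every user turn in a conversation."""
    return [t["intent"] for t in conversation["turns"] if t["speaker"] == "user"]

# A trimmed-down conversation in the dataset's format:
conversation = json.loads("""
{
  "conversation-id": "conversation-0",
  "turns": [
    {"speaker": "user", "utterance": "How can I get to the kitchen?",
     "intent": "request-route", "slots": {"$destination$": "kitchen"}},
    {"speaker": "system", "intent": "offer-routes",
     "utterance": "I found two routes to the kitchen."}
  ]
}
""")
print(user_intents(conversation))  # ['request-route']
```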

Slot Types:

  • $destination$ - Target location for navigation
  • $route$ - Selected route identifier
  • $object$ - Object name for detection/recall
  • $room$ - Room identifier
  • $position$ - Spatial position information
  • $distance$ - Distance measurements

Data Generation:

The dataset generator creates synthetic dialogues with:

  • 1000+ conversations per generation run
  • Randomized intent combinations
  • Realistic slot value variations
  • Multi-turn dialogue flows
  • State tracking across turns
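A generator along these lines might look like the following sketch; the templates and slot values here are made up for illustration, and the repository's data_generator.py will differ:

```python
import random

# Hypothetical utterance templates and slot fillers:
INTENT_TEMPLATES = {
    "request-route": "How do I get to the {destination}?",
    "find-object": "Where is the {object}?",
}
SLOT_VALUES = {
    "destination": ["kitchen", "lab", "conference room"],
    "object": ["printer", "coffee maker", "fire extinguisher"],
}

def generate_conversation(conv_id: int, rng: random.Random) -> dict:
    """Build one synthetic user turn with a randomized intent and slot value."""
    intent = rng.choice(list(INTENT_TEMPLATES))
    slot = "destination" if intent == "request-route" else "object"
    value = rng.choice(SLOT_VALUES[slot])
    return {
        "conversation-id": f"conversation-{conv_id}",
        "turns": [{
            "speaker": "user",
            "utterance": INTENT_TEMPLATES[intent].format(**{slot: value}),
            "intent": intent,
            "slots": {f"${slot}$": value},
        }],
    }

rng = random.Random(0)  # seeded for reproducibility
dataset = [generate_conversation(i, rng) for i in range(1000)]
print(len(dataset))  # 1000
```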

Installation & Setup

Client Setup (iOS)

  1. Prerequisites:

    # Install CocoaPods
    sudo gem install cocoapods
    
    # Xcode 12+ required
  2. Install Dependencies:

    cd client/prototype3
    pod install
  3. Open Workspace:

    open prototype3.xcworkspace
  4. Configure Azure Spatial Anchors:

    • Add your Azure Spatial Anchors credentials in project settings
    • Update Info.plist with required permissions
  5. Build and Run:

    • Select target device (iOS 11.0+)
    • Build and deploy to device (ARKit requires physical device)

Server Setup (Python)

  1. Prerequisites:

    # Python 3.8+ required
    python3 --version
  2. Install Dependencies:

    cd server
    pip install -r requirements.txt  # If available
    
    # Or install manually:
    pip install flask torch transformers langchain
  3. Install Ollama (for local LLM):

    # macOS
    brew install ollama
    
    # Pull models
    ollama pull mistral
    ollama pull llama2
    ollama pull llava
  4. Start Server:

    python application_spatial_anchors.py
  5. Configure Client Connection:

    • Update client WiFiConnection settings to point to server IP
    • Ensure both devices are on the same network

Usage

Running Navigation Scenarios

The system includes 8 pre-configured research scenarios for evaluation:

  1. Scenario 1: Basic navigation to single destination
  2. Scenario 2: Multi-step navigation with waypoints
  3. Scenario 3: Object detection during navigation
  4. Scenario 4: Room occupancy detection
  5. Scenario 5: Object recall from memory
  6. Scenario 6: Dynamic replanning
  7. Scenario 7: Conversational interaction focus
  8. Scenario 8: Combined multimodal assistance

Conversational Examples

User: "Where am I?"
System: "You're in the main hallway, near the entrance."

User: "Take me to the kitchen."
System: "I found a route to the kitchen. Head forward 10 feet, then turn right."

User: "Is there anyone in the conference room?"
System: "Let me check... Yes, I detect 3 people in the conference room."

User: "Where did I see the fire extinguisher?"
System: "You saw a fire extinguisher near the east stairwell entrance."

Feature Toggles

  • Object Detection: Enable/disable real-time object detection
  • Conversational Mode: Toggle between conversational and command-based interaction
  • Audio Feedback: Configure verbal guidance and audio cues
  • Instrumentation: Enable detailed logging for research analysis

Research & Citation

This system is based on research published at the AAAI Conference on Artificial Intelligence.

Citation

If you use this system in your research, please cite:

@inproceedings{romero2025navigation,
  title={Navigation and Interaction for Blind Users via a Cognitive Architecture},
  author={Romero, Oscar J. and Tomasic, Anthony and Carter, Elizabeth and Zimmerman, John and Steinfeld, Aaron},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026},
  organization={AAAI}
}

Research Contributions

  • Cognitive Architecture for Accessibility: Novel application of cognitive architectures to assistive navigation
  • Multimodal Integration: Combining AR, object detection, and LLMs for comprehensive environmental understanding
  • Conversational Navigation: Natural language interface for blind users to navigate complex indoor spaces
  • Spatial Memory System: Integration of declarative and procedural memory for context-aware guidance
  • Open-Source Implementation: Complete end-to-end system for research reproducibility

System Architecture

Data Flow

User Voice Input
    ↓
iOS Speech Recognition
    ↓
Working Memory Controller
    ↓
Server API (/get-reply)
    ↓
LLM Intent/Entity Recognition
    ↓
Path Planning (A*) / Object Detection
    ↓
Response Generation
    ↓
Motor Controller (Audio Output)
    ↓
User Receives Guidance

Communication Protocol

  1. Client → Server: JSON payload with user utterance and context
  2. Server Processing:
    • Intent classification
    • Entity extraction
    • API routing (pathfinding, detection, memory recall)
  3. Server → Client: JSON response with system utterance and actions
  4. Client Execution: Motor controller renders audio/haptic feedback
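Concretely, a round trip might carry payloads like these (the field names are illustrative assumptions; the actual schema lives in application_spatial_anchors.py and WiFiConnection.swift):

```python
import json

# Client -> Server: user utterance plus context (illustrative fields).
request_payload = {
    "utterance": "Take me to the kitchen",
    "session": "session-42",
    "position": {"x": 1.2, "y": 0.0, "z": -3.4},
}

# Server -> Client: system utterance plus an action for the motor controller.
response_payload = {
    "intent": "offer-routes",
    "utterance": "I found a route to the kitchen. Head forward 10 feet.",
    "action": {"type": "speak", "audio-cue": "route-found"},
}

# Both sides exchange these as JSON strings over HTTP:
wire = json.dumps(request_payload)
assert json.loads(wire)["utterance"] == "Take me to the kitchen"
```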

Development

Project Structure (Client)

prototype3/
├── CAControllers/          # Cognitive architecture implementation
├── ViewControllers/        # UI layer (50+ view controllers)
├── Models/                 # Data structures and state management
├── Utils/                  # Helper classes and utilities
├── Resources/              # ML models (.tflite files)
├── Sounds/                 # Audio assets for guidance
└── Base.lproj/            # Storyboard UI definitions

Project Structure (Server)

server/
├── Core Services/          # Flask app, LLM integration
├── Machine Learning/       # Model training and generation
├── Navigation/             # Pathfinding algorithms
├── Data Processing/        # Dataset utilities
├── data/                   # Configuration and training data
└── scripts/                # Execution scripts

Adding New Intents

  1. Update NLU Examples (server/data/nlu_examples.json):

    {
      "intent": "request-weather",
      "examples": ["What's the temperature?", "Is it cold outside?"],
      "slots": ["$location$", "$time$"]
    }
  2. Add API Handler (server/application_spatial_anchors.py):

    @app.route('/get-weather', methods=['POST'])
    def get_weather():
        location = request.json.get('location')
        # Implementation
        return jsonify({'temperature': 72, 'condition': 'sunny'})
  3. Update Client Working Memory (WorkingMemoryController.swift):

    func handleWeatherIntent(_ response: [String: Any]) {
        let temperature = response["temperature"] as? Int
        // Process and present to user
    }

Testing

  • iOS Unit Tests: Run via Xcode Test Navigator
  • Server Testing: Use the provided test data in datasets/test_ubicomp.json
  • Integration Testing: Run through 8 research scenarios
  • Session Logging: Enable instrumentation for detailed analysis

Performance

System Requirements

Client (iOS):

  • iPhone 6s or newer (ARKit support)
  • iOS 11.0 or later
  • 500MB free storage
  • WiFi or cellular connectivity

Server:

  • Python 3.8+
  • 4GB RAM minimum (8GB recommended for LLM inference)
  • 10GB storage (for models and datasets)
  • CUDA-capable GPU (optional, for faster LLM inference)

Benchmarks

  • Object Detection: ~10-15 FPS on iPhone 11
  • Path Planning: <100ms for typical routes
  • LLM Response Time: 1-3 seconds (Ollama/Mistral on CPU)
  • Speech Recognition: Real-time (<200ms latency)

Troubleshooting

Common Issues

iOS App Crashes:

  • Ensure Azure Spatial Anchors credentials are valid
  • Check camera and location permissions in Settings
  • Verify ARKit compatibility on device

Server Connection Failed:

  • Confirm server is running: python application_spatial_anchors.py
  • Check firewall settings allow Flask port (default: 5000)
  • Verify client and server on same network

LLM Not Responding:

  • Ensure Ollama is running: ollama serve
  • Check model is downloaded: ollama list
  • Review server logs for errors

Object Detection Not Working:

  • Verify GoogleMLKit pod installed correctly
  • Check lighting conditions (requires adequate illumination)
  • Ensure camera feed is active

License

This project is licensed under the MIT License. See LICENSE file for details.

MIT License

Copyright (c) 2025 TBD Lab (Transportation, Bots, and Disability)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

Acknowledgments

This research was conducted by the TBD Lab (Transportation, Bots, and Disability) and presented at the AAAI Conference on Artificial Intelligence.

Technologies Used:

  • Microsoft Azure Spatial Anchors
  • Google MLKit
  • Apple ARKit
  • Ollama
  • Hugging Face Transformers
  • Langchain

Special Thanks:

  • AAAI reviewers and community
  • Accessibility research participants
  • Open-source contributors

Contributing

Contributions are welcome! Please follow these guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Code Style

  • Swift: Follow Apple's Swift style guide
  • Python: PEP 8 compliance
  • Comments: Document complex algorithms and cognitive architecture decisions
  • Testing: Include unit tests for new features

Additional Resources

Related Papers

  • NavCog: Accessible Indoor Navigation System
  • Cognitive Architectures for Assistive Technologies
  • Vision-Language Models for Scene Understanding
  • LLMs for Natural Language Interfaces

Datasets

  • Indoor Navigation Corpus (12.5MB)
  • Intent/Entity Training Examples
  • Spatial Anchor Configurations

Version: 1.0
Last updated: 2025
Maintained by: TBD Lab
