A multimodal assistive navigation system combining Augmented Reality (AR), Object Detection (OD), and Large Language Models (LLMs) to enable indoor navigation for blind and visually impaired users through a cognitive architecture framework.
This system provides real-time, conversational navigation assistance for blind users exploring indoor environments. By integrating computer vision, spatial anchoring, natural language understanding, and cognitive architecture principles, the system enables users to:
- Navigate complex indoor spaces using conversational commands ("Take me to the kitchen")
- Understand their surroundings through object detection and scene descriptions
- Interact naturally with an AI assistant that understands navigation intents and context
- Track objects and locations using spatial memory and visual recognition
- Receive multimodal feedback through audio cues and verbal guidance
The system brings together the following components:
- iOS AR Application: ARKit-based mobile client with cognitive architecture controllers
- Spatial Anchoring: Microsoft Azure Spatial Anchors for persistent waypoint tracking
- Object Detection: Google MLKit for real-time visual recognition
- LLM Integration: Multiple backends (Ollama, Mistral, Llama2, GPT-2) for natural language understanding
- Vision-Language Models: LLaVA for multimodal scene understanding
- Path Planning: A* algorithm for optimal route calculation
- Speech Recognition: Real-time voice input processing on iOS
The system implements a classical cognitive architecture with the following components:
Figure 1: Modular client-server architecture. On the client side (smartphone), boxes represent modules running asynchronously on separate threads. The working memory component integrates information from other modules and produces a speech output. A cognitive cycle initiates when the perception module processes internal and external information stored in the working memory. The procedural memory decides what to do next by retrieving the contents of the working memory, which in turn retrieves knowledge about the user and the world. Finally, the cycle ends with a conversational action processed by the motor module. Blue-colored modules can be turned on/off according to the assessment setting. Green-colored modules denote cognitive modules proposed by the CMC (Common Model of Cognition). Arrows indicate the flow of information.
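To make the cycle concrete, below is a minimal Python sketch of one perception-decide-act iteration. All class and method names are illustrative stand-ins; the actual modules are the Swift controllers under client/prototype3/CAControllers/.

```python
# Minimal, illustrative sketch of the cognitive cycle described above.
# All names are hypothetical; the real modules are Swift controllers in
# client/prototype3/CAControllers/.

class WorkingMemory:
    """Transient state shared by all modules during one cycle."""
    def __init__(self):
        self.percepts = {}   # e.g., detected objects, recognized speech
        self.goal = None     # e.g., ("navigate-to", "kitchen")

class PerceptionModule:
    def sense(self, external_input):
        # Fold external information (speech, camera) into a percept dict.
        return {"utterance": external_input}

class ProceduralMemory:
    def select_action(self, wm):
        # Decide what to do next based on the working memory contents.
        if wm.percepts.get("utterance", "").startswith("take me to"):
            return ("plan-route", wm.percepts["utterance"])
        return ("describe-scene", None)

class DeclarativeMemory:
    def ground(self, action):
        # Attach knowledge about the user and the world (waypoints, rooms).
        return {"action": action, "known_waypoints": ["kitchen", "lab"]}

class MotorModule:
    def execute(self, grounded_action):
        # The cycle ends with a conversational action (speech output).
        print(f"Speaking: executing {grounded_action['action'][0]}")

def cognitive_cycle(external_input, wm, perception, procedural, declarative, motor):
    wm.percepts.update(perception.sense(external_input))  # perception
    action = procedural.select_action(wm)                  # procedural memory
    grounded = declarative.ground(action)                  # declarative memory
    motor.execute(grounded)                                # motor output

cognitive_cycle("take me to the kitchen", WorkingMemory(),
                PerceptionModule(), ProceduralMemory(),
                DeclarativeMemory(), MotorModule())
```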
ar-od-llm-indoor-navigation/
├── client/ # iOS AR application (Swift, Xcode)
├── server/ # Python backend services (Flask, LLMs)
├── datasets/ # Training and dialogue datasets
├── LICENSE # MIT License
└── README.md # This file
Platform: iOS (minimum deployment: iOS 11.0)
Language: Swift
Frameworks: ARKit, UIKit
The client folder contains an Xcode project (prototype3) implementing the mobile AR navigation interface for blind users.
Cognitive Architecture Controllers (CAControllers/):
- DeclarativeMemoryController.swift: Stores factual knowledge (waypoints, object locations, room information)
- ProceduralMemoryController.swift: Manages navigation procedures and how-to knowledge
- WorkingMemoryController.swift: Maintains active navigation state and temporary information
- PerceptionController.swift: Processes sensory inputs (camera frames, spatial data)
- MotorController.swift: Executes outputs (audio feedback, haptic cues)
- PathPlannerController.swift: Computes optimal routes using pathfinding algorithms
View Controllers (ViewControllers/):
- WayfindingViewController.swift: Main navigation interface
- BaseAnchorsViewController.swift: Manages spatial anchors and waypoints
- MainMenuViewController.swift: App entry point and scenario selection
- AnchorInfoViewController.swift: Displays detailed waypoint information
- GlobalSettingsViewController.swift: Configuration and feature toggles
Data Models (Models/):
- AnchorData.swift: Represents navigation waypoints with coordinates and properties
- PathData.swift: Route and path information structures
- SessionData.swift: Navigation session state management
- RootAnchors.swift: Hierarchical anchor organization
Utilities (Utils/):
- WiFiConnection.swift: Network connectivity management
- CustomLogger.swift: Logging and instrumentation
- GUIHelper.swift: UI rendering utilities
- TextObservationTracker.swift: OCR result tracking
pod 'AzureSpatialAnchors', '2.13.0' # Persistent spatial anchoring
pod 'GoogleMLKit/ObjectDetection' # Real-time object detection
pod 'GoogleMLKit/TextRecognition' # OCR capabilities
pod 'DropDown' # UI components
pod 'SwiftySound' # Audio playback

- Real-time object detection with toggle on/off
- Conversational AI interaction via voice commands
- People detection in rooms
- Microsoft Spatial Anchors for persistent waypoints
- 8 research scenarios for systematic evaluation
- Audio guidance system with directional cues
- Session logging for research analysis
Platform: Python 3.x
Framework: Flask
LLM Engines: Ollama, Mistral, Llama2, GPT-2, LLaVA
The server folder contains the backend services responsible for natural language understanding, pathfinding, and LLM integration.
Flask Application (application_spatial_anchors.py):
- RESTful API for client-server communication
- Spatial anchor description management
- Session and anchor data retrieval
- Conversational interface endpoints
API Endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /add-mapping | POST | Store anchor descriptions |
| /get-description | GET | Retrieve visual descriptions for anchors |
| /get-reply | POST | Process user utterances and return AI responses |
| /get-root-anchors | GET | Retrieve all waypoint anchors |
| /get-nlu-examples | GET | Get intent/entity training examples |
| /get-session-names | GET | List available navigation sessions |
| /get-anchors-session | GET | Get anchors for a specific session |
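As a quick illustration, the endpoints can be exercised with plain HTTP requests. The host address and the payload field names below are assumptions for the sketch, not a documented contract; the exact schema is defined in application_spatial_anchors.py.

```python
# Hypothetical client-side calls; host/port and payload field names are
# assumptions (check application_spatial_anchors.py for the exact contract).
import requests

BASE = "http://192.168.1.10:5000"  # server address on the shared network

# Retrieve all waypoint anchors.
anchors = requests.get(f"{BASE}/get-root-anchors").json()

# Send a user utterance and read back the system reply.
reply = requests.post(
    f"{BASE}/get-reply",
    json={"utterance": "How can I get to the kitchen?", "session": "demo"},
).json()
print(reply)
```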
LLM Integration:
- ollama.py: Ollama engine with RAG (Retrieval-Augmented Generation)
- ollama_multimodal.py: Vision + language model integration
- mistral_conversational.py: Mistral model wrapper for dialogue
- gpt2_conversational.py: GPT-2 model wrapper
- llava_vision.py: LLaVA vision-language model for scene understanding
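For orientation, here is a minimal sketch of querying a local Ollama model for a navigation reply. The prompt, model choice, and helper function are illustrative; ollama.py layers retrieval-augmented generation over the anchor descriptions on top of a call like this.

```python
# Minimal sketch of querying a local Ollama model; the actual ollama.py wraps
# this kind of call with RAG over anchor descriptions.
import ollama

def ask_navigation_assistant(utterance, context_snippets):
    # Retrieved context (e.g., anchor descriptions) is prepended to the prompt.
    prompt = (
        "You are an indoor navigation assistant for blind users.\n"
        "Known locations:\n" + "\n".join(context_snippets) + "\n"
        f"User: {utterance}\nAssistant:"
    )
    response = ollama.generate(model="mistral", prompt=prompt)
    return response["response"]

print(ask_navigation_assistant(
    "How do I get to the kitchen?",
    ["kitchen: end of the main hallway, second door on the right"],
))
```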
Navigation & Planning:
- planner.py: A* pathfinding algorithm implementation
- transit.py: Movement and transit logic
- simply_geojson.py: GeoJSON map data processing
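The following is a condensed sketch of the kind of A* search planner.py performs over the waypoint graph; the graph representation, coordinates, and cost function here are simplified placeholders rather than the real route network from navigation.json.

```python
# Simplified A* over a toy waypoint graph; planner.py operates on the real
# route network, so treat this only as an illustration of the algorithm.
import heapq
import math

def a_star(graph, coords, start, goal):
    """graph: {node: {neighbor: edge_cost}}, coords: {node: (x, y)}."""
    def heuristic(n):
        # Straight-line distance to the goal (admissible for metric graphs).
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)

    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(neighbor),
                                          new_cost, neighbor, path + [neighbor]))
    return None, float("inf")

graph = {"entrance": {"hallway": 5}, "hallway": {"kitchen": 10, "lab": 8}}
coords = {"entrance": (0, 0), "hallway": (5, 0), "kitchen": (15, 0), "lab": (5, 8)}
print(a_star(graph, coords, "entrance", "kitchen"))  # (['entrance', 'hallway', 'kitchen'], 15)
```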
Machine Learning:
- gpt2_train.py: Fine-tuning script for GPT-2 on navigation dialogues
- run_clm.py: Causal language modeling utilities
- data_generator.py: Synthetic training data generation
- dataset_generator.py: Dialogue dataset creation
Data Directory (data/):
- root_anchors.json: Complete waypoint definitions with coordinates
- navigation.json: Route network graph representation
- nlu_examples.json: Intent and entity training examples
- markers.json: Geographic marker data (~4.8 MB)
- reformatted_navcog.json: NavCog-compatible format (~6.8 MB)
- template*.txt: LLM prompt templates for various tasks
The NLU system recognizes the following navigation intents:
- request-route: "How do I get to the kitchen?"
- request-replan: "Find me another way"
- find-object: "Where is the coffee maker?"
- is-room-occupied: "Is anyone in the conference room?"
- recall-object: "Where did I see the printer?"
- look-around: "What's around me?"
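Once an intent and its slots are recognized, they can be routed to the corresponding backend capability. The sketch below illustrates that dispatch step; the handler names are hypothetical and do not necessarily match the server's actual functions.

```python
# Illustrative intent dispatch; handler names are hypothetical.
def request_route(slots):
    return f"Planning a route to {slots.get('$destination$')}"

def find_object(slots):
    return f"Searching for the {slots.get('$object$')}"

def look_around(slots):
    return "Describing the surrounding scene"

INTENT_HANDLERS = {
    "request-route": request_route,
    "find-object": find_object,
    "look-around": look_around,
    # request-replan, is-room-occupied, and recall-object follow the same pattern.
}

def dispatch(intent, slots):
    handler = INTENT_HANDLERS.get(intent)
    return handler(slots) if handler else "Sorry, I did not understand that."

print(dispatch("request-route", {"$destination$": "kitchen"}))
```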
# Core Dependencies
flask # Web framework
langchain # LLM orchestration
transformers # Hugging Face models
torch # PyTorch deep learning
ollama # Local LLM execution

Total Size: ~12.5 MB
Format: JSON, TXT, CSV
Domain: Indoor navigation dialogues and object detection
The datasets folder contains training corpora for fine-tuning LLMs and evaluating the navigation system.
Main Dialogue Dataset:
- dialogue_dataset.json (~8.4 MB): Primary conversational training corpus
- dialogue_dataset.txt (~4.2 MB): Text format version for language model training
Evaluation Data:
- dialogue_results.json (~150 KB): System performance evaluation results
- test_ubicomp.json (~7.4 KB): Test set for ubicomp scenarios
- examples.csv (~3.6 KB): Example conversations in CSV format
Generation Scripts:
- gpt2_dataset_gen.py (~6.8 KB): Automated dialogue generation script
Each conversation in dialogue_dataset.json follows this format:
{
"conversation-id": "conversation-0",
"turns": [
{
"speaker": "user",
"utterance": "How can I get to the kitchen?",
"utt-delex": "How can I get to the $destination$?",
"intent": "request-route",
"slots": {
"$destination$": "kitchen"
},
"state": {
"$destination$": "kitchen"
}
},
{
"speaker": "system",
"api-call": {
"api": "request-route",
"parameter": "$destination$",
"value": "kitchen"
},
"results": {
"$route-list$": "[...]"
},
"intent": "offer-routes",
"utterance": "I found two routes to the kitchen. Route 1: ..."
}
]
}

The dialogues use the following slot types:
- $destination$: Target location for navigation
- $route$: Selected route identifier
- $object$: Object name for detection/recall
- $room$: Room identifier
- $position$: Spatial position information
- $distance$: Distance measurements
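The utt-delex field is the utterance with slot values replaced by their placeholders. A minimal sketch of that step, assuming slot values appear verbatim in the utterance (the real pipeline may handle matching more carefully):

```python
# Sketch of delexicalization as used for the "utt-delex" field; assumes slot
# values appear verbatim in the utterance.
def delexicalize(utterance, slots):
    for placeholder, value in slots.items():
        utterance = utterance.replace(value, placeholder)
    return utterance

turn = {"utterance": "How can I get to the kitchen?",
        "slots": {"$destination$": "kitchen"}}
print(delexicalize(turn["utterance"], turn["slots"]))
# -> "How can I get to the $destination$?"
```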
The dataset generator creates synthetic dialogues with:
- 1000+ conversations per generation run
- Randomized intent combinations
- Realistic slot value variations
- Multi-turn dialogue flows
- State tracking across turns
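A rough sketch of how such synthetic dialogues could be assembled in the documented format is shown below; the templates and slot-value pools are invented for illustration, while the actual logic lives in gpt2_dataset_gen.py and dataset_generator.py.

```python
# Rough sketch of synthetic dialogue generation in the documented format.
# Templates and slot-value pools are invented here; the real logic lives in
# gpt2_dataset_gen.py and dataset_generator.py.
import json
import random

DESTINATIONS = ["kitchen", "conference room", "east stairwell"]

def generate_conversation(conv_id):
    dest = random.choice(DESTINATIONS)
    user_turn = {
        "speaker": "user",
        "utterance": f"How can I get to the {dest}?",
        "utt-delex": "How can I get to the $destination$?",
        "intent": "request-route",
        "slots": {"$destination$": dest},
        "state": {"$destination$": dest},
    }
    system_turn = {
        "speaker": "system",
        "api-call": {"api": "request-route", "parameter": "$destination$", "value": dest},
        "intent": "offer-routes",
        "utterance": f"I found two routes to the {dest}.",
    }
    return {"conversation-id": f"conversation-{conv_id}", "turns": [user_turn, system_turn]}

dataset = [generate_conversation(i) for i in range(1000)]  # 1000+ conversations per run
print(json.dumps(dataset[0], indent=2))
```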
Client (iOS) setup:

- Prerequisites:
  # Install CocoaPods
  sudo gem install cocoapods
  # Xcode 12+ required
- Install Dependencies:
  cd client/prototype3
  pod install
- Open Workspace:
  open prototype3.xcworkspace
- Configure Azure Spatial Anchors:
  - Add your Azure Spatial Anchors credentials in project settings
  - Update Info.plist with required permissions
- Build and Run:
  - Select target device (iOS 11.0+)
  - Build and deploy to device (ARKit requires a physical device)
Server setup:

- Prerequisites:
  # Python 3.8+ required
  python3 --version
- Install Dependencies:
  cd server
  pip install -r requirements.txt  # If available
  # Or install manually:
  pip install flask torch transformers langchain
- Install Ollama (for local LLM):
  # macOS
  brew install ollama
  # Pull models
  ollama pull mistral
  ollama pull llama2
  ollama pull llava
- Start Server:
  python application_spatial_anchors.py
- Configure Client Connection:
  - Update client WiFiConnection settings to point to the server IP
  - Ensure both devices are on the same network
The system includes 8 pre-configured research scenarios for evaluation:
- Scenario 1: Basic navigation to single destination
- Scenario 2: Multi-step navigation with waypoints
- Scenario 3: Object detection during navigation
- Scenario 4: Room occupancy detection
- Scenario 5: Object recall from memory
- Scenario 6: Dynamic replanning
- Scenario 7: Conversational interaction focus
- Scenario 8: Combined multimodal assistance
User: "Where am I?"
System: "You're in the main hallway, near the entrance."
User: "Take me to the kitchen."
System: "I found a route to the kitchen. Head forward 10 feet, then turn right."
User: "Is there anyone in the conference room?"
System: "Let me check... Yes, I detect 3 people in the conference room."
User: "Where did I see the fire extinguisher?"
System: "You saw a fire extinguisher near the east stairwell entrance."
- Object Detection: Enable/disable real-time object detection
- Conversational Mode: Toggle between conversational and command-based interaction
- Audio Feedback: Configure verbal guidance and audio cues
- Instrumentation: Enable detailed logging for research analysis
This system is based on research published in the AAAI Conference on Artificial Intelligence.
If you use this system in your research, please cite:
@inproceedings{romero2025navigation,
title={Navigation and Interaction for Blind Users via a Cognitive Architecture},
  author={Romero, Oscar J. and Tomasic, Anthony and Carter, Elizabeth and Zimmerman, John and Steinfeld, Aaron},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026},
organization={AAAI}
}

- Cognitive Architecture for Accessibility: Novel application of cognitive architectures to assistive navigation
- Multimodal Integration: Combining AR, object detection, and LLMs for comprehensive environmental understanding
- Conversational Navigation: Natural language interface for blind users to navigate complex indoor spaces
- Spatial Memory System: Integration of declarative and procedural memory for context-aware guidance
- Open-Source Implementation: Complete end-to-end system for research reproducibility
User Voice Input
↓
iOS Speech Recognition
↓
Working Memory Controller
↓
Server API (/get-reply)
↓
LLM Intent/Entity Recognition
↓
Path Planning (A*) / Object Detection
↓
Response Generation
↓
Motor Controller (Audio Output)
↓
User Receives Guidance
- Client → Server: JSON payload with user utterance and context
- Server Processing:
- Intent classification
- Entity extraction
- API routing (pathfinding, detection, memory recall)
- Server → Client: JSON response with system utterance and actions
- Client Execution: Motor controller renders audio/haptic feedback
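A sketch of what one round trip might look like, with field names borrowed from the dialogue dataset format shown earlier; the real payload schema is defined by the client and application_spatial_anchors.py and may differ.

```python
# Hypothetical round-trip payloads; field names follow the dialogue dataset
# format and may not match the actual implementation exactly.

client_to_server = {
    "utterance": "Take me to the kitchen",
    "context": {
        "session": "demo-session",
        "current-anchor": "hallway-entrance",
    },
}

server_to_client = {
    "intent": "request-route",
    "slots": {"$destination$": "kitchen"},
    "utterance": "I found a route to the kitchen. Head forward 10 feet, then turn right.",
    "actions": [{"type": "speak"}, {"type": "start-guidance", "route": "route-1"}],
}
```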
prototype3/
├── CAControllers/ # Cognitive architecture implementation
├── ViewControllers/ # UI layer (50+ view controllers)
├── Models/ # Data structures and state management
├── Utils/ # Helper classes and utilities
├── Resources/ # ML models (.tflite files)
├── Sounds/ # Audio assets for guidance
└── Base.lproj/ # Storyboard UI definitions
server/
├── Core Services/ # Flask app, LLM integration
├── Machine Learning/ # Model training and generation
├── Navigation/ # Pathfinding algorithms
├── Data Processing/ # Dataset utilities
├── data/ # Configuration and training data
└── scripts/ # Execution scripts
To add a new capability (for example, a weather query), three components need updating:

- Update NLU Examples (server/data/nlu_examples.json):
  {
    "intent": "request-weather",
    "examples": ["What's the temperature?", "Is it cold outside?"],
    "slots": ["$location$", "$time$"]
  }
- Add API Handler (server/application_spatial_anchors.py):
  @app.route('/get-weather', methods=['POST'])
  def get_weather():
      location = request.json.get('location')
      # Implementation
      return jsonify({'temperature': 72, 'condition': 'sunny'})
- Update Client Working Memory (WorkingMemoryController.swift):
  func handleWeatherIntent(_ response: [String: Any]) {
      let temperature = response["temperature"] as? Int
      // Process and present to user
  }
- iOS Unit Tests: Run via Xcode Test Navigator
- Server Testing: Use the provided test set in datasets/test_ubicomp.json
- Integration Testing: Run through the 8 research scenarios
- Session Logging: Enable instrumentation for detailed analysis
Client (iOS):
- iPhone 6s or newer (ARKit support)
- iOS 11.0 or later
- 500MB free storage
- WiFi or cellular connectivity
Server:
- Python 3.8+
- 4GB RAM minimum (8GB recommended for LLM inference)
- 10GB storage (for models and datasets)
- CUDA-capable GPU (optional, for faster LLM inference)
- Object Detection: ~10-15 FPS on iPhone 11
- Path Planning: <100ms for typical routes
- LLM Response Time: 1-3 seconds (Ollama/Mistral on CPU)
- Speech Recognition: Real-time (<200ms latency)
iOS App Crashes:
- Ensure Azure Spatial Anchors credentials are valid
- Check camera and location permissions in Settings
- Verify ARKit compatibility on device
Server Connection Failed:
- Confirm server is running: python application_spatial_anchors.py
- Check firewall settings allow the Flask port (default: 5000)
- Verify client and server on same network
LLM Not Responding:
- Ensure Ollama is running: ollama serve
- Check the model is downloaded: ollama list
- Review server logs for errors
Object Detection Not Working:
- Verify GoogleMLKit pod installed correctly
- Check lighting conditions (requires adequate illumination)
- Ensure camera feed is active
This project is licensed under the MIT License. See LICENSE file for details.
MIT License
Copyright (c) 2025 TBD Lab (Transportation, Bots, and Disability)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
This research was conducted by the TBD Lab (Transportation, Bots, and Disability) and presented at the AAAI Conference on Artificial Intelligence.
Technologies Used:
- Microsoft Azure Spatial Anchors
- Google MLKit
- Apple ARKit
- Ollama
- Hugging Face Transformers
- Langchain
Special Thanks:
- AAAI reviewers and community
- Accessibility research participants
- Open-source contributors
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Swift: Follow Apple's Swift style guide
- Python: PEP 8 compliance
- Comments: Document complex algorithms and cognitive architecture decisions
- Testing: Include unit tests for new features
- Client README - Detailed iOS app documentation
- Scenario Documentation - Research scenario descriptions
- Demo Videos - System demonstration videos
- NavCog: Accessible Indoor Navigation System
- Cognitive Architectures for Assistive Technologies
- Vision-Language Models for Scene Understanding
- LLMs for Natural Language Interfaces
- Indoor Navigation Corpus (12.5MB)
- Intent/Entity Training Examples
- Spatial Anchor Configurations
Version: 1.0 Last Updated: 2025 Maintained by: TBD Lab
