A multimodal assistive navigation system combining Augmented Reality (AR), Object Detection (OD), and Large Language Models (LLMs) to enable indoor navigation for blind and visually impaired users through a cognitive architecture framework.
This system provides real-time, conversational navigation assistance for blind users exploring indoor environments. By integrating computer vision, spatial anchoring, natural language understanding, and cognitive architecture principles, the system enables users to:
- Navigate complex indoor spaces using conversational commands ("Take me to the kitchen")
- Understand their surroundings through object detection and scene descriptions
- Interact naturally with an AI assistant that understands navigation intents and context
- Track objects and locations using spatial memory and visual recognition
- Receive multimodal feedback through audio cues and verbal guidance
The system brings together the following components:
- iOS AR Application: ARKit-based mobile client with cognitive architecture controllers
- Spatial Anchoring: Microsoft Azure Spatial Anchors for persistent waypoint tracking
- Object Detection: Google MLKit for real-time visual recognition
- LLM Integration: Multiple backends (Ollama, Mistral, Llama2, GPT-2) for natural language understanding
- Vision-Language Models: LLaVA for multimodal scene understanding
- Path Planning: A* algorithm for optimal route calculation
- Speech Recognition: Real-time voice input processing on iOS
The system implements a classical cognitive architecture with the following components:
Figure 1: Modular client-server architecture. On the client side (smartphone), boxes represent modules running asynchronously on separate threads. The working memory component integrates information from other modules and produces a speech output. A cognitive cycle initiates when the perception module processes internal and external information stored in the working memory. The procedural memory decides what to do next by retrieving the contents of the working memory, which in turn retrieves knowledge about the user and the world. Finally, the cycle ends with a conversational action processed by the motor module. Blue-colored modules can be turned on/off according to the assessment setting. Green-colored modules denote cognitive modules proposed by the CMC (Common Model of Cognition). Arrows indicate the flow of information.
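To make the cycle concrete, below is a minimal Python sketch of one perception-decide-act iteration. All class and method names are illustrative stand-ins; the actual modules are the Swift controllers under client/prototype3/CAControllers/.

```python
# Minimal, illustrative sketch of the cognitive cycle described above.
# All names are hypothetical; the real modules are Swift controllers in
# client/prototype3/CAControllers/.

class WorkingMemory:
    """Transient state shared by all modules during one cycle."""
    def __init__(self):
        self.percepts = {}   # e.g., detected objects, recognized speech
        self.goal = None     # e.g., ("navigate-to", "kitchen")

class PerceptionModule:
    def sense(self, external_input):
        # Fold external information (speech, camera) into a percept dict.
        return {"utterance": external_input}

class ProceduralMemory:
    def select_action(self, wm):
        # Decide what to do next based on the working memory contents.
        if wm.percepts.get("utterance", "").startswith("take me to"):
            return ("plan-route", wm.percepts["utterance"])
        return ("describe-scene", None)

class DeclarativeMemory:
    def ground(self, action):
        # Attach knowledge about the user and the world (waypoints, rooms).
        return {"action": action, "known_waypoints": ["kitchen", "lab"]}

class MotorModule:
    def execute(self, grounded_action):
        # The cycle ends with a conversational action (speech output).
        print(f"Speaking: executing {grounded_action['action'][0]}")

def cognitive_cycle(external_input, wm, perception, procedural, declarative, motor):
    wm.percepts.update(perception.sense(external_input))  # perception
    action = procedural.select_action(wm)                  # procedural memory
    grounded = declarative.ground(action)                  # declarative memory
    motor.execute(grounded)                                # motor output

cognitive_cycle("take me to the kitchen", WorkingMemory(),
                PerceptionModule(), ProceduralMemory(),
                DeclarativeMemory(), MotorModule())
```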
ar-od-llm-indoor-navigation/
├── client/ # iOS AR application (Swift, Xcode)
├── server/ # Python backend services (Flask, LLMs)
├── datasets/ # Training and dialogue datasets
├── LICENSE # MIT License
└── README.md # This file
Platform: iOS (minimum deployment: iOS 11.0)
Language: Swift
Frameworks: ARKit, UIKit
The client folder contains an Xcode project (prototype3) implementing the mobile AR navigation interface for blind users.
Cognitive Architecture Controllers (CAControllers/):
- DeclarativeMemoryController.swift: Stores factual knowledge (waypoints, object locations, room information)
- ProceduralMemoryController.swift: Manages navigation procedures and how-to knowledge
- WorkingMemoryController.swift: Maintains active navigation state and temporary information
- PerceptionController.swift: Processes sensory inputs (camera frames, spatial data)
- MotorController.swift: Executes outputs (audio feedback, haptic cues)
- PathPlannerController.swift: Computes optimal routes using pathfinding algorithms
View Controllers (ViewControllers/):
- WayfindingViewController.swift: Main navigation interface
- BaseAnchorsViewController.swift: Manages spatial anchors and waypoints
- MainMenuViewController.swift: App entry point and scenario selection
- AnchorInfoViewController.swift: Displays detailed waypoint information
- GlobalSettingsViewController.swift: Configuration and feature toggles
Data Models (Models/):
- AnchorData.swift: Represents navigation waypoints with coordinates and properties
- PathData.swift: Route and path information structures
- SessionData.swift: Navigation session state management
- RootAnchors.swift: Hierarchical anchor organization
Utilities (Utils/):
- WiFiConnection.swift: Network connectivity management
- CustomLogger.swift: Logging and instrumentation
- GUIHelper.swift: UI rendering utilities
- TextObservationTracker.swift: OCR result tracking
pod 'AzureSpatialAnchors', '2.13.0' # Persistent spatial anchoring
pod 'GoogleMLKit/ObjectDetection' # Real-time object detection
pod 'GoogleMLKit/TextRecognition' # OCR capabilities
pod 'DropDown' # UI components
pod 'SwiftySound' # Audio playback

- Real-time object detection with toggle on/off
- Conversational AI interaction via voice commands
- People detection in rooms
- Microsoft Spatial Anchors for persistent waypoints
- 8 research scenarios for systematic evaluation
- Audio guidance system with directional cues
- Session logging for research analysis
Platform: Python 3.x
Framework: Flask
LLM Engines: Ollama, Mistral, Llama2, GPT-2, LLaVA
The server folder contains the backend services responsible for natural language understanding, pathfinding, and LLM integration.
Flask Application (application_spatial_anchors.py):
- RESTful API for client-server communication
- Spatial anchor description management
- Session and anchor data retrieval
- Conversational interface endpoints
API Endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /add-mapping | POST | Store anchor descriptions |
| /get-description | GET | Retrieve visual descriptions for anchors |
| /get-reply | POST | Process user utterances and return AI responses |
| /get-root-anchors | GET | Retrieve all waypoint anchors |
| /get-nlu-examples | GET | Get intent/entity training examples |
| /get-session-names | GET | List available navigation sessions |
| /get-anchors-session | GET | Get anchors for a specific session |
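As a quick illustration, the endpoints can be exercised with plain HTTP requests. The host address and the payload field names below are assumptions for the sketch, not a documented contract; the exact schema is defined in application_spatial_anchors.py.

```python
# Hypothetical client-side calls; host/port and payload field names are
# assumptions (check application_spatial_anchors.py for the exact contract).
import requests

BASE = "http://192.168.1.10:5000"  # server address on the shared network

# Retrieve all waypoint anchors.
anchors = requests.get(f"{BASE}/get-root-anchors").json()

# Send a user utterance and read back the system reply.
reply = requests.post(
    f"{BASE}/get-reply",
    json={"utterance": "How can I get to the kitchen?", "session": "demo"},
).json()
print(reply)
```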
LLM Integration:
- ollama.py: Ollama engine with RAG (Retrieval-Augmented Generation)
- ollama_multimodal.py: Vision + language model integration
- mistral_conversational.py: Mistral model wrapper for dialogue
- gpt2_conversational.py: GPT-2 model wrapper
- llava_vision.py: LLaVA vision-language model for scene understanding
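For orientation, here is a minimal sketch of querying a local Ollama model for a navigation reply. The prompt, model choice, and helper function are illustrative; ollama.py layers retrieval-augmented generation over the anchor descriptions on top of a call like this.

```python
# Minimal sketch of querying a local Ollama model; the actual ollama.py wraps
# this kind of call with RAG over anchor descriptions.
import ollama

def ask_navigation_assistant(utterance, context_snippets):
    # Retrieved context (e.g., anchor descriptions) is prepended to the prompt.
    prompt = (
        "You are an indoor navigation assistant for blind users.\n"
        "Known locations:\n" + "\n".join(context_snippets) + "\n"
        f"User: {utterance}\nAssistant:"
    )
    response = ollama.generate(model="mistral", prompt=prompt)
    return response["response"]

print(ask_navigation_assistant(
    "How do I get to the kitchen?",
    ["kitchen: end of the main hallway, second door on the right"],
))
```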
Navigation & Planning:
- planner.py: A* pathfinding algorithm implementation
- transit.py: Movement and transit logic
- simply_geojson.py: GeoJSON map data processing
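The following is a condensed sketch of the kind of A* search planner.py performs over the waypoint graph; the graph representation, coordinates, and cost function here are simplified placeholders rather than the real route network from navigation.json.

```python
# Simplified A* over a toy waypoint graph; planner.py operates on the real
# route network, so treat this only as an illustration of the algorithm.
import heapq
import math

def a_star(graph, coords, start, goal):
    """graph: {node: {neighbor: edge_cost}}, coords: {node: (x, y)}."""
    def heuristic(n):
        # Straight-line distance to the goal (admissible for metric graphs).
        (x1, y1), (x2, y2) = coords[n], coords[goal]
        return math.hypot(x2 - x1, y2 - y1)

    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for neighbor, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(neighbor),
                                          new_cost, neighbor, path + [neighbor]))
    return None, float("inf")

graph = {"entrance": {"hallway": 5}, "hallway": {"kitchen": 10, "lab": 8}}
coords = {"entrance": (0, 0), "hallway": (5, 0), "kitchen": (15, 0), "lab": (5, 8)}
print(a_star(graph, coords, "entrance", "kitchen"))  # (['entrance', 'hallway', 'kitchen'], 15)
```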
Machine Learning:
- gpt2_train.py: Fine-tuning script for GPT-2 on navigation dialogues
- run_clm.py: Causal language modeling utilities
- data_generator.py: Synthetic training data generation
- dataset_generator.py: Dialogue dataset creation
Data Directory (data/):
- root_anchors.json: Complete waypoint definitions with coordinates
- navigation.json: Route network graph representation
- nlu_examples.json: Intent and entity training examples
- markers.json: Geographic marker data (~4.8 MB)
- reformatted_navcog.json: NavCog-compatible format (~6.8 MB)
- template*.txt: LLM prompt templates for various tasks
The NLU system recognizes the following navigation intents:
- request-route: "How do I get to the kitchen?"
- request-replan: "Find me another way"
- find-object: "Where is the coffee maker?"
- is-room-occupied: "Is anyone in the conference room?"
- recall-object: "Where did I see the printer?"
- look-around: "What's around me?"
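Once an intent and its slots are recognized, they can be routed to the corresponding backend capability. The sketch below illustrates that dispatch step; the handler names are hypothetical and do not necessarily match the server's actual functions.

```python
# Illustrative intent dispatch; handler names are hypothetical.
def request_route(slots):
    return f"Planning a route to {slots.get('$destination$')}"

def find_object(slots):
    return f"Searching for the {slots.get('$object$')}"

def look_around(slots):
    return "Describing the surrounding scene"

INTENT_HANDLERS = {
    "request-route": request_route,
    "find-object": find_object,
    "look-around": look_around,
    # request-replan, is-room-occupied, and recall-object follow the same pattern.
}

def dispatch(intent, slots):
    handler = INTENT_HANDLERS.get(intent)
    return handler(slots) if handler else "Sorry, I did not understand that."

print(dispatch("request-route", {"$destination$": "kitchen"}))
```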
# Core Dependencies
flask # Web framework
langchain # LLM orchestration
transformers # Hugging Face models
torch # PyTorch deep learning
ollama # Local LLM execution

Total Size: ~12.5 MB
Format: JSON, TXT, CSV
Domain: Indoor navigation dialogues and object detection
The datasets folder contains training corpora for fine-tuning LLMs and evaluating the navigation system.
Main Dialogue Dataset:
- dialogue_dataset.json (~8.4 MB): Primary conversational training corpus
- dialogue_dataset.txt (~4.2 MB): Text format version for language model training
Evaluation Data:
- dialogue_results.json (~150 KB): System performance evaluation results
- test_ubicomp.json (~7.4 KB): Test set for ubicomp scenarios
- examples.csv (~3.6 KB): Example conversations in CSV format
Generation Scripts:
- gpt2_dataset_gen.py (~6.8 KB): Automated dialogue generation script
Each conversation in dialogue_dataset.json follows this format:
{
"conversation-id": "conversation-0",
"turns": [
{
"speaker": "user",
"utterance": "How can I get to the kitchen?",
"utt-delex": "How can I get to the $destination$?",
"intent": "request-route",
"slots": {
"$destination$": "kitchen"
},
"state": {
"$destination$": "kitchen"
}
},
{
"speaker": "system",
"api-call": {
"api": "request-route",
"parameter": "$destination$",
"value": "kitchen"
},
"results": {
"$route-list$": "[...]"
},
"intent": "offer-routes",
"utterance": "I found two routes to the kitchen. Route 1: ..."
}
]
}

The dialogues use the following slot types:
- $destination$: Target location for navigation
- $route$: Selected route identifier
- $object$: Object name for detection/recall
- $room$: Room identifier
- $position$: Spatial position information
- $distance$: Distance measurements
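The utt-delex field is the utterance with slot values replaced by their placeholders. A minimal sketch of that step, assuming slot values appear verbatim in the utterance (the real pipeline may handle matching more carefully):

```python
# Sketch of delexicalization as used for the "utt-delex" field; assumes slot
# values appear verbatim in the utterance.
def delexicalize(utterance, slots):
    for placeholder, value in slots.items():
        utterance = utterance.replace(value, placeholder)
    return utterance

turn = {"utterance": "How can I get to the kitchen?",
        "slots": {"$destination$": "kitchen"}}
print(delexicalize(turn["utterance"], turn["slots"]))
# -> "How can I get to the $destination$?"
```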
The dataset generator creates synthetic dialogues with:
- 1000+ conversations per generation run
- Randomized intent combinations
- Realistic slot value variations
- Multi-turn dialogue flows
- State tracking across turns
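A rough sketch of how such synthetic dialogues could be assembled in the documented format is shown below; the templates and slot-value pools are invented for illustration, while the actual logic lives in gpt2_dataset_gen.py and dataset_generator.py.

```python
# Rough sketch of synthetic dialogue generation in the documented format.
# Templates and slot-value pools are invented here; the real logic lives in
# gpt2_dataset_gen.py and dataset_generator.py.
import json
import random

DESTINATIONS = ["kitchen", "conference room", "east stairwell"]

def generate_conversation(conv_id):
    dest = random.choice(DESTINATIONS)
    user_turn = {
        "speaker": "user",
        "utterance": f"How can I get to the {dest}?",
        "utt-delex": "How can I get to the $destination$?",
        "intent": "request-route",
        "slots": {"$destination$": dest},
        "state": {"$destination$": dest},
    }
    system_turn = {
        "speaker": "system",
        "api-call": {"api": "request-route", "parameter": "$destination$", "value": dest},
        "intent": "offer-routes",
        "utterance": f"I found two routes to the {dest}.",
    }
    return {"conversation-id": f"conversation-{conv_id}", "turns": [user_turn, system_turn]}

dataset = [generate_conversation(i) for i in range(1000)]  # 1000+ conversations per run
print(json.dumps(dataset[0], indent=2))
```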
Client (iOS) setup:

- Prerequisites:
  # Install CocoaPods
  sudo gem install cocoapods
  # Xcode 12+ required
- Install Dependencies:
  cd client/prototype3
  pod install
- Open Workspace:
  open prototype3.xcworkspace
- Configure Azure Spatial Anchors:
  - Add your Azure Spatial Anchors credentials in project settings
  - Update Info.plist with required permissions
- Build and Run:
  - Select target device (iOS 11.0+)
  - Build and deploy to device (ARKit requires a physical device)
Server setup:

- Prerequisites:
  # Python 3.8+ required
  python3 --version
- Install Dependencies:
  cd server
  pip install -r requirements.txt  # If available
  # Or install manually:
  pip install flask torch transformers langchain
- Install Ollama (for local LLM):
  # macOS
  brew install ollama
  # Pull models
  ollama pull mistral
  ollama pull llama2
  ollama pull llava
- Start Server:
  python application_spatial_anchors.py
- Configure Client Connection:
  - Update client WiFiConnection settings to point to the server IP
  - Ensure both devices are on the same network
The system includes 8 pre-configured research scenarios for evaluation:
- Scenario 1: Basic navigation to single destination
- Scenario 2: Multi-step navigation with waypoints
- Scenario 3: Object detection during navigation
- Scenario 4: Room occupancy detection
- Scenario 5: Object recall from memory
- Scenario 6: Dynamic replanning
- Scenario 7: Conversational interaction focus
- Scenario 8: Combined multimodal assistance
User: "Where am I?"
System: "You're in the main hallway, near the entrance."
User: "Take me to the kitchen."
System: "I found a route to the kitchen. Head forward 10 feet, then turn right."
User: "Is there anyone in the conference room?"
System: "Let me check... Yes, I detect 3 people in the conference room."
User: "Where did I see the fire extinguisher?"
System: "You saw a fire extinguisher near the east stairwell entrance."
- Object Detection: Enable/disable real-time object detection
- Conversational Mode: Toggle between conversational and command-based interaction
- Audio Feedback: Configure verbal guidance and audio cues
- Instrumentation: Enable detailed logging for research analysis
This system is based on research published in the AAAI Conference on Artificial Intelligence.
If you use this system in your research, please cite:
@inproceedings{romero2025navigation,
title={Navigation and Interaction for Blind Users via a Cognitive Architecture},
  author={Romero, Oscar J. and Tomasic, Anthony and Carter, Elizabeth and Zimmerman, John and Steinfeld, Aaron},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026},
organization={AAAI}
}

- Cognitive Architecture for Accessibility: Novel application of cognitive architectures to assistive navigation
- Multimodal Integration: Combining AR, object detection, and LLMs for comprehensive environmental understanding
- Conversational Navigation: Natural language interface for blind users to navigate complex indoor spaces
- Spatial Memory System: Integration of declarative and procedural memory for context-aware guidance
- Open-Source Implementation: Complete end-to-end system for research reproducibility
User Voice Input
↓
iOS Speech Recognition
↓
Working Memory Controller
↓
Server API (/get-reply)
↓
LLM Intent/Entity Recognition
↓
Path Planning (A*) / Object Detection
↓
Response Generation
↓
Motor Controller (Audio Output)
↓
User Receives Guidance
- Client → Server: JSON payload with user utterance and context
- Server Processing:
- Intent classification
- Entity extraction
- API routing (pathfinding, detection, memory recall)
- Server → Client: JSON response with system utterance and actions
- Client Execution: Motor controller renders audio/haptic feedback
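A sketch of what one round trip might look like, with field names borrowed from the dialogue dataset format shown earlier; the real payload schema is defined by the client and application_spatial_anchors.py and may differ.

```python
# Hypothetical round-trip payloads; field names follow the dialogue dataset
# format and may not match the actual implementation exactly.

client_to_server = {
    "utterance": "Take me to the kitchen",
    "context": {
        "session": "demo-session",
        "current-anchor": "hallway-entrance",
    },
}

server_to_client = {
    "intent": "request-route",
    "slots": {"$destination$": "kitchen"},
    "utterance": "I found a route to the kitchen. Head forward 10 feet, then turn right.",
    "actions": [{"type": "speak"}, {"type": "start-guidance", "route": "route-1"}],
}
```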
prototype3/
├── CAControllers/ # Cognitive architecture implementation
├── ViewControllers/ # UI layer (50+ view controllers)
├── Models/ # Data structures and state management
├── Utils/ # Helper classes and utilities
├── Resources/ # ML models (.tflite files)
├── Sounds/ # Audio assets for guidance
└── Base.lproj/ # Storyboard UI definitions
server/
├── Core Services/ # Flask app, LLM integration
├── Machine Learning/ # Model training and generation
├── Navigation/ # Pathfinding algorithms
├── Data Processing/ # Dataset utilities
├── data/ # Configuration and training data
└── scripts/ # Execution scripts
To add a new capability (for example, a weather query), three components need updating:

- Update NLU Examples (server/data/nlu_examples.json):
  {
    "intent": "request-weather",
    "examples": ["What's the temperature?", "Is it cold outside?"],
    "slots": ["$location$", "$time$"]
  }
- Add API Handler (server/application_spatial_anchors.py):
  @app.route('/get-weather', methods=['POST'])
  def get_weather():
      location = request.json.get('location')
      # Implementation
      return jsonify({'temperature': 72, 'condition': 'sunny'})
- Update Client Working Memory (WorkingMemoryController.swift):
  func handleWeatherIntent(_ response: [String: Any]) {
      let temperature = response["temperature"] as? Int
      // Process and present to user
  }
- iOS Unit Tests: Run via Xcode Test Navigator
- Server Testing: Use the provided test set in datasets/test_ubicomp.json
- Integration Testing: Run through the 8 research scenarios
- Session Logging: Enable instrumentation for detailed analysis
Client (iOS):
- iPhone 6s or newer (ARKit support)
- iOS 11.0 or later
- 500MB free storage
- WiFi or cellular connectivity
Server:
- Python 3.8+
- 4GB RAM minimum (8GB recommended for LLM inference)
- 10GB storage (for models and datasets)
- CUDA-capable GPU (optional, for faster LLM inference)
- Object Detection: ~10-15 FPS on iPhone 11
- Path Planning: <100ms for typical routes
- LLM Response Time: 1-3 seconds (Ollama/Mistral on CPU)
- Speech Recognition: Real-time (<200ms latency)
iOS App Crashes:
- Ensure Azure Spatial Anchors credentials are valid
- Check camera and location permissions in Settings
- Verify ARKit compatibility on device
Server Connection Failed:
- Confirm server is running: python application_spatial_anchors.py
- Check firewall settings allow the Flask port (default: 5000)
- Verify client and server on same network
LLM Not Responding:
- Ensure Ollama is running: ollama serve
- Check the model is downloaded: ollama list
- Review server logs for errors
Object Detection Not Working:
- Verify GoogleMLKit pod installed correctly
- Check lighting conditions (requires adequate illumination)
- Ensure camera feed is active
This project is licensed under the MIT License. See LICENSE file for details.
MIT License
Copyright (c) 2025 TBD Lab (Transportation, Bots, and Disability)
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
This research was conducted by the TBD Lab (Transportation, Bots, and Disability) and presented at the AAAI Conference on Artificial Intelligence.
Technologies Used:
- Microsoft Azure Spatial Anchors
- Google MLKit
- Apple ARKit
- Ollama
- Hugging Face Transformers
- Langchain
Special Thanks:
- AAAI reviewers and community
- Accessibility research participants
- Open-source contributors
Contributions are welcome! Please follow these guidelines:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Swift: Follow Apple's Swift style guide
- Python: PEP 8 compliance
- Comments: Document complex algorithms and cognitive architecture decisions
- Testing: Include unit tests for new features
- Client README - Detailed iOS app documentation
- Scenario Documentation - Research scenario descriptions
- Demo Videos - System demonstration videos
- NavCog: Accessible Indoor Navigation System
- Cognitive Architectures for Assistive Technologies
- Vision-Language Models for Scene Understanding
- LLMs for Natural Language Interfaces
- Indoor Navigation Corpus (12.5MB)
- Intent/Entity Training Examples
- Spatial Anchor Configurations
Version: 1.0 Last Updated: 2025 Maintained by: TBD Lab
