# A Scalable Two-Stage Reinforcement Learning Framework for Multi-Agent Budget-Constrained POMDPs
- [Overview](#overview)
- [Method](#method)
- [Results](#results)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Experiments](#experiments)
- [Application Domains](#application-domains)
- [Citation](#citation)
## Overview

Oracle-Guided Meta-PPO addresses the challenge of training reinforcement learning policies for large-scale multi-agent systems with:
- Budget constraints across all agents
- Partial observability (POMDP setting)
- Heterogeneous agent dynamics
The key insight is to leverage computationally tractable oracle policies (computed via value iteration on a surrogate MDP) to guide the training of a meta-policy that generalizes across diverse agent configurations.
| Feature | Description |
|---|---|
| Scalability | Efficiently handles up to 1,000 heterogeneous agents |
| Generalization | A meta-policy trained on a small subset of agents generalizes to unseen configurations |
| Oracle Guidance | MDP-based oracles accelerate POMDP policy learning |
| Two-Stage Design | Decouples budget allocation from policy learning |
## Method

The proposed approach consists of three main stages:
*Overview of the Oracle-Guided Meta-PPO pipeline: (1) Random Forest predicts optimal budget allocation, (2) Value iteration generates oracle policies for each agent-budget pair, (3) Meta-PPO learns when to follow the oracle vs. gather information.*
### Stage 1: Budget Allocation via Random Forest

A Random Forest regressor learns to predict optimal per-agent budget allocations based on agent-specific features (degradation dynamics, costs, etc.).
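For intuition, here is a minimal sketch of this stage using scikit-learn. The feature and target arrays below are random placeholders standing in for agent descriptors and budget-allocation parameters; the repository's actual training code is in `infra_env/pomdp_solver/random_forest.py`.

```python
# Illustrative only: features and targets are random placeholders, not the
# repository's agent data (see infra_env/pomdp_solver/random_forest.py).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical per-agent features: e.g., degradation rate, inspection cost, repair cost
X = rng.random((500, 3))
# Hypothetical target: the budget-allocation parameter found offline for each agent
y = rng.random(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.3f}")
```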
### Stage 2: Oracle Policy Generation via Value Iteration

For each agent-budget pair, an oracle policy is computed via value iteration on a surrogate MDP with full state observability.
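As a rough illustration of this step, the sketch below runs generic tabular value iteration on a toy MDP with random dynamics. The transition and reward arrays are placeholders; the actual component MDPs and budget handling live in `generate_oracle_policies.py`.

```python
# Generic value iteration on a toy MDP (placeholder dynamics, not the component MDPs).
import numpy as np

n_states, n_actions, gamma = 10, 3, 0.95
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[a, s, s']
P /= P.sum(axis=2, keepdims=True)                 # normalize rows into transition probabilities
R = rng.random((n_states, n_actions))             # R[s, a]

V = np.zeros(n_states)
for _ in range(10_000):
    Q = R + gamma * np.einsum("asn,n->sa", P, V)  # Bellman backup: Q[s,a] = R[s,a] + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

oracle_policy = Q.argmax(axis=1)  # greedy action for each (fully observed) state
print(oracle_policy)
```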
### Stage 3: Oracle-Guided Meta-PPO Training

A PPO-based meta-policy is trained with a hierarchical action space:
- Action 0: Follow the oracle policy's recommendation
- Action 1: Take an inspection action to reduce uncertainty
This design allows the policy to focus on the core POMDP challenge: when to gather information.
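Conceptually, the meta-action can be dispatched as in the sketch below. The attribute names (`estimated_state`, `INSPECT`) and the oracle lookup are hypothetical placeholders rather than the repository's API; see `infra_env/env/meta_ppo_env.py` for the actual environment wrapper.

```python
# Conceptual sketch of the two-action meta interface; attribute names below
# (estimated_state, INSPECT) are hypothetical, not the repository's actual API.
import gymnasium as gym


class OracleGuidedWrapper(gym.Wrapper):
    """Meta-policy chooses between deferring to the oracle and inspecting."""

    META_FOLLOW, META_INSPECT = 0, 1

    def __init__(self, env, oracle_policy):
        super().__init__(env)
        self.oracle_policy = oracle_policy          # maps estimated state -> oracle action
        self.action_space = gym.spaces.Discrete(2)  # 0: follow oracle, 1: inspect

    def step(self, meta_action):
        if meta_action == self.META_FOLLOW:
            # Execute the oracle's recommendation for the current state estimate
            base_action = self.oracle_policy[self.env.unwrapped.estimated_state]
        else:
            # Pay the inspection cost to reduce uncertainty about the true state
            base_action = self.env.unwrapped.INSPECT
        return self.env.step(base_action)
```

Because the meta-policy only ever chooses between these two actions, PPO's exploration burden is reduced to the information-gathering decision.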
## Results

Our Oracle-Guided Meta-PPO achieves near-oracle performance while operating under partial observability:
*Comparison across three metrics: (a) Maximum lifetime achieved, (b) Number of repair actions, (c) Total cost incurred. Oracle-Guided Meta-PPO (orange) closely tracks the oracle policy (blue dashed) and significantly outperforms Vanilla Meta-PPO (red) and rule-based baselines (green).*
The Random Forest regressor accurately predicts optimal budget allocation parameters:
*Random Forest predictions closely match ground-truth parameters obtained via non-linear least squares optimization.*
The framework demonstrates practical scalability:
*Total computation time vs. number of components (log-log scale). The algorithm efficiently scales to 1,000 agents.*
## Repository Structure

```
Oracle-Guided-Meta-PPO/
│
├── infra_env/                     # Infrastructure Management Scenario
│   ├── env/                       # Environment definitions
│   │   ├── component_mdp_repair.py      # MDP formulation for components
│   │   ├── component_pomdp_repair.py    # POMDP formulation with belief tracking
│   │   ├── meta_ppo_env.py              # Meta-PPO environment wrapper
│   │   └── baseline_env.py              # Baseline environment
│   │
│   └── pomdp_solver/              # Core algorithms
│       ├── random_forest.py                      # RF model training
│       ├── random_forest_budget_split.py         # Budget allocation via RF
│       ├── generate_oracle_policies.py           # Value iteration oracle generation
│       ├── oracle_guided_meta_ppo_train.py       # Oracle-Guided Meta-PPO training
│       ├── oracle_guided_meta_ppo_test.py        # Oracle-Guided Meta-PPO testing
│       ├── oracle_guided_meta_ppo_optimal_budget_split.py  # Full pipeline
│       ├── vanilla_meta_ppo_train.py             # Baseline: Vanilla Meta-PPO
│       ├── vanilla_meta_ppo_test.py              # Baseline: Vanilla Meta-PPO testing
│       ├── realistic_baseline.py                 # Baseline: Rule-based policy
│       ├── oracle_policy_test.py                 # Baseline: Oracle-only policy
│       └── time_complexity.py                    # Scalability experiments
│
├── etf_env/                       # ETF Risk Capital Management Scenario
│   ├── env/                       # Environment definitions
│   │   ├── etf_env.py             # Multi-asset ETF environment
│   │   └── sub_etf_env.py         # Sub-environment definitions
│   │
│   ├── models/                    # Machine learning models
│   │   ├── random_forest.py                  # Budget split model
│   │   └── random_forest_budget_split.py     # RF training for budget allocation
│   │
│   ├── etf_oracle_guided_meta_ppo.py         # Oracle-Guided Meta-PPO for ETF
│   ├── etf_oracle_policy.py                  # Oracle policy generation
│   ├── oracle_guided_meta_ppo_train_refactored.py
│   ├── oracle_guided_meta_ppo_test_refactored.py
│   ├── vanilla_meta_ppo_train.py             # Baseline comparison
│   ├── vanilla_meta_ppo_test.py
│   ├── generate_sp500_data.py                # Data generation utilities
│   ├── generate_oracle_policy.py             # Oracle generation
│   └── baselinefin.py                        # Financial baseline configuration
│
├── assets/                        # Images for README
├── requirements.txt               # Python dependencies
├── .gitignore                     # Git ignore rules
└── README.md                      # This file
```
## Installation

- Python 3.8+
- pip or conda
```bash
# Clone the repository
git clone https://github.com/Manavvora/Oracle-Guided-Meta-PPO.git
cd Oracle-Guided-Meta-PPO

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Infrastructure Management Scenario

```bash
cd infra_env/pomdp_solver

# Step 1: Train Random Forest for budget prediction
python random_forest.py
# Step 2: Compute budget split
python random_forest_budget_split.py --num_components 1000
# Step 3: Generate oracle policies
python generate_oracle_policies_optimal_budget_split.py --num_components 1000
# Step 4: Train Oracle-Guided Meta-PPO
python oracle_guided_meta_ppo_train.py
# Step 5: Evaluate
python oracle_guided_meta_ppo_optimal_budget_split.py --num_components 1000
```

### ETF Risk Capital Management Scenario

```bash
cd etf_env

# Train
python etf_oracle_guided_meta_ppo.py --timesteps 100000
# Test
python oracle_guided_ppo_test.py
```

## Experiments

### Scalability

```bash
cd infra_env/pomdp_solver
python time_complexity.py
```

### Baselines

| Method | Description |
|---|---|
| Oracle Policy | Upper bound - MDP policy with full observability |
| Vanilla Meta-PPO | Standard meta-PPO without oracle guidance |
| Realistic Baseline | Rule-based inspection/replacement policy |
| Equal Budget Split | Uniform budget allocation |
### Evaluation Metrics

- Time-to-Failure (TTF): Average operational lifetime
- Total Cost Incurred: Cumulative maintenance costs
- Action Distribution: Frequency of inspect/replace/no-action
## Application Domains

### Infrastructure Management

| Aspect | Details |
|---|---|
| Problem | Managing degradation of infrastructure components |
| State | Component health (partially observable) + budget consumed |
| Actions | No-op, Inspect, Replace |
| Objective | Maximize lifetime within budget |
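A skeletal Gymnasium-style rendering of this setup is sketched below; all names, dynamics, and cost values are illustrative placeholders, not the interface of `infra_env/env/component_pomdp_repair.py`.

```python
# Skeletal illustration of the maintenance POMDP summarized in the table above.
# All names, dynamics, and costs are placeholders, not the repository's code.
import gymnasium as gym
import numpy as np


class ToyComponentEnv(gym.Env):
    NO_OP, INSPECT, REPLACE = 0, 1, 2

    def __init__(self, budget=10.0, inspect_cost=0.5, replace_cost=3.0):
        self.action_space = gym.spaces.Discrete(3)
        # Observation: noisy health reading plus budget consumed so far
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
        self.budget, self.inspect_cost, self.replace_cost = budget, inspect_cost, replace_cost

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.health, self.spent = 1.0, 0.0
        return self._obs(noiseless=True), {}

    def step(self, action):
        if action == self.REPLACE:
            self.health, self.spent = 1.0, self.spent + self.replace_cost
        elif action == self.INSPECT:
            self.spent += self.inspect_cost
        self.health -= self.np_random.uniform(0.0, 0.1)           # stochastic degradation
        failed = self.health <= 0.0 or self.spent > self.budget   # failure or budget exhausted
        reward = 0.0 if failed else 1.0                           # +1 per surviving step = lifetime
        return self._obs(noiseless=(action == self.INSPECT)), reward, failed, False, {}

    def _obs(self, noiseless=False):
        # Inspection yields an exact health reading; otherwise the reading is noisy
        noise = 0.0 if noiseless else self.np_random.normal(0.0, 0.05)
        return np.array([max(self.health, 0.0) + noise, self.spent], dtype=np.float32)
```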
### ETF Risk Capital Management

| Aspect | Details |
|---|---|
| Problem | Managing risk capital across financial assets |
| State | Asset prices (observable) + risk levels (partial) |
| Actions | No-op, Inspect, Recapitalize |
| Objective | Maximize portfolio survival |
## Citation

If you find this work useful, please cite our paper:
```bibtex
@article{vora2024solving,
  title={Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning},
  author={Vora, Manav and Liang, Jonas and Grussing, Michael N and Ornik, Melkior},
  journal={arXiv preprint arXiv:2408.07192},
  year={2024}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

This work builds upon:
- Stable-Baselines3 for RL algorithms
- Gymnasium for environment interfaces
If you find this work useful, please consider giving it a star!



