# A Scalable Two-Stage Reinforcement Learning Framework for Multi-Agent Budget-Constrained POMDPs
- [Overview](#overview)
- [Method](#method)
- [Results](#results)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Usage](#usage)
- [Experiments](#experiments)
- [Application Domains](#application-domains)
- [Citation](#citation)
## Overview

Oracle-Guided Meta-PPO addresses the challenge of training reinforcement learning policies for large-scale multi-agent systems with:
- Budget constraints across all agents
- Partial observability (POMDP setting)
- Heterogeneous agent dynamics
The key insight is to leverage computationally tractable oracle policies (computed via value iteration on a surrogate MDP) to guide the training of a meta-policy that generalizes across diverse agent configurations.
| Feature | Description |
|---|---|
| Scalability | Efficiently handles up to 1,000 heterogeneous agents |
| Generalization | A meta-policy trained on a small subset of agents generalizes to unseen configurations |
| Oracle Guidance | MDP-based oracles accelerate POMDP policy learning |
| Two-Stage Design | Decouples budget allocation from policy learning |
## Method

The proposed approach consists of three main stages:
*Overview of the Oracle-Guided Meta-PPO pipeline: (1) Random Forest predicts optimal budget allocation, (2) Value iteration generates oracle policies for each agent-budget pair, (3) Meta-PPO learns when to follow the oracle vs. gather information.*
### Stage 1: Budget Allocation via Random Forest

A Random Forest regressor learns to predict optimal per-agent budget allocations based on agent-specific features (degradation dynamics, costs, etc.).
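For intuition, here is a minimal sketch of this stage using scikit-learn. The feature and target arrays below are random placeholders standing in for agent descriptors and budget-allocation parameters; the repository's actual training code is in `infra_env/pomdp_solver/random_forest.py`.

```python
# Illustrative only: features and targets are random placeholders, not the
# repository's agent data (see infra_env/pomdp_solver/random_forest.py).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical per-agent features: e.g., degradation rate, inspection cost, repair cost
X = rng.random((500, 3))
# Hypothetical target: the budget-allocation parameter found offline for each agent
y = rng.random(500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.3f}")
```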
### Stage 2: Oracle Policy Generation via Value Iteration

For each agent-budget pair, an oracle policy is computed via value iteration on a surrogate MDP with full state observability.
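As a rough illustration of this step, the sketch below runs generic tabular value iteration on a toy MDP with random dynamics. The transition and reward arrays are placeholders; the actual component MDPs and budget handling live in `generate_oracle_policies.py`.

```python
# Generic value iteration on a toy MDP (placeholder dynamics, not the component MDPs).
import numpy as np

n_states, n_actions, gamma = 10, 3, 0.95
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))   # P[a, s, s']
P /= P.sum(axis=2, keepdims=True)                 # normalize rows into transition probabilities
R = rng.random((n_states, n_actions))             # R[s, a]

V = np.zeros(n_states)
for _ in range(10_000):
    Q = R + gamma * np.einsum("asn,n->sa", P, V)  # Bellman backup: Q[s,a] = R[s,a] + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

oracle_policy = Q.argmax(axis=1)  # greedy action for each (fully observed) state
print(oracle_policy)
```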
### Stage 3: Oracle-Guided Meta-PPO Training

A PPO-based meta-policy is trained with a hierarchical action space:
- Action 0: Follow the oracle policy's recommendation
- Action 1: Take an inspection action to reduce uncertainty
This design allows the policy to focus on the core POMDP challenge: when to gather information.
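Conceptually, the meta-action can be dispatched as in the sketch below. The attribute names (`estimated_state`, `INSPECT`) and the oracle lookup are hypothetical placeholders rather than the repository's API; see `infra_env/env/meta_ppo_env.py` for the actual environment wrapper.

```python
# Conceptual sketch of the two-action meta interface; attribute names below
# (estimated_state, INSPECT) are hypothetical, not the repository's actual API.
import gymnasium as gym


class OracleGuidedWrapper(gym.Wrapper):
    """Meta-policy chooses between deferring to the oracle and inspecting."""

    META_FOLLOW, META_INSPECT = 0, 1

    def __init__(self, env, oracle_policy):
        super().__init__(env)
        self.oracle_policy = oracle_policy          # maps estimated state -> oracle action
        self.action_space = gym.spaces.Discrete(2)  # 0: follow oracle, 1: inspect

    def step(self, meta_action):
        if meta_action == self.META_FOLLOW:
            # Execute the oracle's recommendation for the current state estimate
            base_action = self.oracle_policy[self.env.unwrapped.estimated_state]
        else:
            # Pay the inspection cost to reduce uncertainty about the true state
            base_action = self.env.unwrapped.INSPECT
        return self.env.step(base_action)
```

Because the meta-policy only ever chooses between these two actions, PPO's exploration burden is reduced to the information-gathering decision.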
## Results

Our Oracle-Guided Meta-PPO achieves near-oracle performance while operating under partial observability:
*Comparison across three metrics: (a) Maximum lifetime achieved, (b) Number of repair actions, (c) Total cost incurred. Oracle-Guided Meta-PPO (orange) closely tracks the oracle policy (blue dashed) and significantly outperforms Vanilla Meta-PPO (red) and rule-based baselines (green).*
The Random Forest regressor accurately predicts optimal budget allocation parameters:
*Random Forest predictions closely match ground-truth parameters obtained via non-linear least squares optimization.*
The framework demonstrates practical scalability:
*Total computation time vs. number of components (log-log scale). The algorithm efficiently scales to 1,000 agents.*
## Repository Structure

```
Oracle-Guided-Meta-PPO/
│
├── infra_env/                     # Infrastructure Management Scenario
│   ├── env/                       # Environment definitions
│   │   ├── component_mdp_repair.py      # MDP formulation for components
│   │   ├── component_pomdp_repair.py    # POMDP formulation with belief tracking
│   │   ├── meta_ppo_env.py              # Meta-PPO environment wrapper
│   │   └── baseline_env.py              # Baseline environment
│   │
│   └── pomdp_solver/              # Core algorithms
│       ├── random_forest.py                      # RF model training
│       ├── random_forest_budget_split.py         # Budget allocation via RF
│       ├── generate_oracle_policies.py           # Value iteration oracle generation
│       ├── oracle_guided_meta_ppo_train.py       # Oracle-Guided Meta-PPO training
│       ├── oracle_guided_meta_ppo_test.py        # Oracle-Guided Meta-PPO testing
│       ├── oracle_guided_meta_ppo_optimal_budget_split.py  # Full pipeline
│       ├── vanilla_meta_ppo_train.py             # Baseline: Vanilla Meta-PPO
│       ├── vanilla_meta_ppo_test.py              # Baseline: Vanilla Meta-PPO testing
│       ├── realistic_baseline.py                 # Baseline: Rule-based policy
│       ├── oracle_policy_test.py                 # Baseline: Oracle-only policy
│       └── time_complexity.py                    # Scalability experiments
│
├── etf_env/                       # ETF Risk Capital Management Scenario
│   ├── env/                       # Environment definitions
│   │   ├── etf_env.py             # Multi-asset ETF environment
│   │   └── sub_etf_env.py         # Sub-environment definitions
│   │
│   ├── models/                    # Machine learning models
│   │   ├── random_forest.py                  # Budget split model
│   │   └── random_forest_budget_split.py     # RF training for budget allocation
│   │
│   ├── etf_oracle_guided_meta_ppo.py         # Oracle-Guided Meta-PPO for ETF
│   ├── etf_oracle_policy.py                  # Oracle policy generation
│   ├── oracle_guided_meta_ppo_train_refactored.py
│   ├── oracle_guided_meta_ppo_test_refactored.py
│   ├── vanilla_meta_ppo_train.py             # Baseline comparison
│   ├── vanilla_meta_ppo_test.py
│   ├── generate_sp500_data.py                # Data generation utilities
│   ├── generate_oracle_policy.py             # Oracle generation
│   └── baselinefin.py                        # Financial baseline configuration
│
├── assets/                        # Images for README
├── requirements.txt               # Python dependencies
├── .gitignore                     # Git ignore rules
└── README.md                      # This file
```
## Installation

- Python 3.8+
- pip or conda
```bash
# Clone the repository
git clone https://github.com/Manavvora/Oracle-Guided-Meta-PPO.git
cd Oracle-Guided-Meta-PPO

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Infrastructure Management Scenario

```bash
cd infra_env/pomdp_solver

# Step 1: Train Random Forest for budget prediction
python random_forest.py
# Step 2: Compute budget split
python random_forest_budget_split.py --num_components 1000
# Step 3: Generate oracle policies
python generate_oracle_policies_optimal_budget_split.py --num_components 1000
# Step 4: Train Oracle-Guided Meta-PPO
python oracle_guided_meta_ppo_train.py
# Step 5: Evaluate
python oracle_guided_meta_ppo_optimal_budget_split.py --num_components 1000
```

### ETF Risk Capital Management Scenario

```bash
cd etf_env

# Train
python etf_oracle_guided_meta_ppo.py --timesteps 100000
# Test
python oracle_guided_ppo_test.py
```

## Experiments

### Scalability

```bash
cd infra_env/pomdp_solver
python time_complexity.py
```

### Baselines

| Method | Description |
|---|---|
| Oracle Policy | Upper bound - MDP policy with full observability |
| Vanilla Meta-PPO | Standard meta-PPO without oracle guidance |
| Realistic Baseline | Rule-based inspection/replacement policy |
| Equal Budget Split | Uniform budget allocation |
### Evaluation Metrics

- Time-to-Failure (TTF): Average operational lifetime
- Total Cost Incurred: Cumulative maintenance costs
- Action Distribution: Frequency of inspect/replace/no-action
## Application Domains

### Infrastructure Management

| Aspect | Details |
|---|---|
| Problem | Managing degradation of infrastructure components |
| State | Component health (partially observable) + budget consumed |
| Actions | No-op, Inspect, Replace |
| Objective | Maximize lifetime within budget |
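A skeletal Gymnasium-style rendering of this setup is sketched below; all names, dynamics, and cost values are illustrative placeholders, not the interface of `infra_env/env/component_pomdp_repair.py`.

```python
# Skeletal illustration of the maintenance POMDP summarized in the table above.
# All names, dynamics, and costs are placeholders, not the repository's code.
import gymnasium as gym
import numpy as np


class ToyComponentEnv(gym.Env):
    NO_OP, INSPECT, REPLACE = 0, 1, 2

    def __init__(self, budget=10.0, inspect_cost=0.5, replace_cost=3.0):
        self.action_space = gym.spaces.Discrete(3)
        # Observation: noisy health reading plus budget consumed so far
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
        self.budget, self.inspect_cost, self.replace_cost = budget, inspect_cost, replace_cost

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.health, self.spent = 1.0, 0.0
        return self._obs(noiseless=True), {}

    def step(self, action):
        if action == self.REPLACE:
            self.health, self.spent = 1.0, self.spent + self.replace_cost
        elif action == self.INSPECT:
            self.spent += self.inspect_cost
        self.health -= self.np_random.uniform(0.0, 0.1)           # stochastic degradation
        failed = self.health <= 0.0 or self.spent > self.budget   # failure or budget exhausted
        reward = 0.0 if failed else 1.0                           # +1 per surviving step = lifetime
        return self._obs(noiseless=(action == self.INSPECT)), reward, failed, False, {}

    def _obs(self, noiseless=False):
        # Inspection yields an exact health reading; otherwise the reading is noisy
        noise = 0.0 if noiseless else self.np_random.normal(0.0, 0.05)
        return np.array([max(self.health, 0.0) + noise, self.spent], dtype=np.float32)
```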
### ETF Risk Capital Management

| Aspect | Details |
|---|---|
| Problem | Managing risk capital across financial assets |
| State | Asset prices (observable) + risk levels (partial) |
| Actions | No-op, Inspect, Recapitalize |
| Objective | Maximize portfolio survival |
## Citation

If you find this work useful, please cite our paper:
```bibtex
@article{vora2024solving,
  title={Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning},
  author={Vora, Manav and Liang, Jonas and Grussing, Michael N and Ornik, Melkior},
  journal={arXiv preprint arXiv:2408.07192},
  year={2024}
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

This work builds upon:
- Stable-Baselines3 for RL algorithms
- Gymnasium for environment interfaces
If you find this work useful, please consider giving it a star!



