
🎯 Oracle-Guided Meta-PPO

A Scalable Two-Stage Reinforcement Learning Framework for Multi-Agent Budget-Constrained POMDPs

arXiv · Python 3.8+ · PyTorch · Stable-Baselines3 · License: MIT

[Paper] · [PDF]


🌟 Overview

Oracle-Guided Meta-PPO addresses the challenge of training reinforcement learning policies for large-scale multi-agent systems with:

  • 📊 Budget constraints shared across all agents
  • 👁️ Partial observability (POMDP setting)
  • 🔄 Heterogeneous agent dynamics

The key insight is to leverage computationally tractable oracle policies (computed via value iteration on a surrogate MDP) to guide the training of a meta-policy that generalizes across diverse agent configurations.

✨ Key Features

Feature | Description
🚀 Scalability | Efficiently handles up to 1,000 heterogeneous agents
🎓 Generalization | A meta-policy trained on a small subset of agents generalizes to unseen configurations
🔮 Oracle Guidance | MDP-based oracles accelerate POMDP policy learning
⚡ Two-Stage Design | Decouples budget allocation from policy learning

🔬 Method

The proposed approach consists of three main stages:

Architecture Overview

Overview of the Oracle-Guided Meta-PPO pipeline: (1) Random Forest predicts optimal budget allocation, (2) Value iteration generates oracle policies for each agent-budget pair, (3) Meta-PPO learns when to follow the oracle vs. gather information.

Stage 1: Budget Allocation via Random Forest

A Random Forest regressor learns to predict optimal per-agent budget allocations based on agent-specific features (degradation dynamics, costs, etc.).
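
As a rough sketch of this stage (the feature set, training targets, and normalization below are illustrative assumptions, not the repository's actual ones), a scikit-learn regressor can map per-agent features to budget shares that are then rescaled to respect a single global budget:

# Minimal Stage 1 sketch: Random Forest budget allocation.
# Feature names and the target function are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical training data: rows = agents,
# columns = [degradation_rate, repair_cost].
X = rng.uniform(size=(500, 2))
y = X[:, 0] / (X[:, 1] + 1.0)   # stand-in for the "optimal" budget target

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# Predict raw allocations for new agents, then normalize so the
# per-agent shares sum to the global budget B.
B = 1000.0
raw = rf.predict(rng.uniform(size=(10, 2)))
budgets = B * raw / raw.sum()
print(budgets)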

Stage 2: Oracle Policy Generation

For each agent-budget pair, an oracle policy is computed via value iteration on a surrogate MDP with full state observability.
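
For intuition, tabular value iteration on a toy fully observable surrogate MDP might look like the following (the transition and reward arrays are random placeholders, not the repository's component dynamics):

# Illustrative value iteration for one agent-budget pair.
import numpy as np

n_states, n_actions, gamma = 10, 3, 0.95
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.uniform(size=(n_states, n_actions))                       # immediate rewards

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

oracle_policy = Q.argmax(axis=1)   # greedy action per (fully observed) state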

Stage 3: Oracle-Guided Meta-PPO Training

A PPO-based meta-policy is trained with a hierarchical action space:

  • Action 0: Follow the oracle policy's recommendation
  • Action 1: Take an inspection action to reduce uncertainty

This design allows the policy to focus on the core POMDP challenge: when to gather information.
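
A hedged sketch of how such a binary meta-action could be dispatched at each step follows; the environment interface, oracle lookup, and belief handling are assumptions for illustration, not the repository's actual API:

# Hypothetical dispatch of the hierarchical meta-action for a single agent.
import numpy as np

def meta_step(env, belief, oracle_policy, meta_action):
    """Dispatch one meta-policy decision."""
    if meta_action == 0:
        # Follow the oracle: execute its action for the belief's most likely state.
        s_hat = int(np.argmax(belief))
        obs, reward, done, info = env.step(oracle_policy[s_hat])
        return obs, belief, reward, done, info
    # Inspect: pay a cost to observe the true state; the belief collapses
    # to a point mass on the observed state.
    true_state = env.inspect()              # hypothetical helper
    belief = np.eye(len(belief))[true_state]
    return None, belief, -env.inspection_cost, False, {}

In the actual framework, the inspection semantics and belief update come from the POMDP formulation in component_pomdp_repair.py.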


📈 Results

Performance Comparison

Our Oracle-Guided Meta-PPO achieves near-oracle performance while operating under partial observability:

Performance Metrics Comparison

Comparison across three metrics: (a) Maximum lifetime achieved, (b) Number of repair actions, (c) Total cost incurred. Oracle-Guided Meta-PPO (orange) closely tracks the oracle policy (blue dashed) and significantly outperforms Vanilla Meta-PPO (red) and rule-based baselines (green).

Budget Allocation via Random Forest

The Random Forest regressor accurately predicts optimal budget allocation parameters:

Random Forest vs NLLS

Random Forest predictions closely match ground-truth parameters obtained via non-linear least squares optimization.

Computational Scalability

The framework demonstrates practical scalability:

Computational Complexity

Total computation time vs. number of components (log-log scale). The algorithm efficiently scales to 1,000 agents.


🗂️ Repository Structure

Oracle-Guided-Meta-PPO/
│
├── 📁 infra_env/                          # Infrastructure Management Scenario
│   ├── 📁 env/                            # Environment definitions
│   │   ├── component_mdp_repair.py        # MDP formulation for components
│   │   ├── component_pomdp_repair.py      # POMDP formulation with belief tracking
│   │   ├── meta_ppo_env.py                # Meta-PPO environment wrapper
│   │   └── baseline_env.py                # Baseline environment
│   │
│   └── 📁 pomdp_solver/                   # Core algorithms
│       ├── random_forest.py               # RF model training
│       ├── random_forest_budget_split.py  # Budget allocation via RF
│       ├── generate_oracle_policies.py    # Value iteration oracle generation
│       ├── oracle_guided_meta_ppo_train.py    # Oracle-Guided Meta-PPO training
│       ├── oracle_guided_meta_ppo_test.py     # Oracle-Guided Meta-PPO testing
│       ├── oracle_guided_meta_ppo_optimal_budget_split.py  # Full pipeline
│       ├── vanilla_meta_ppo_train.py      # Baseline: Vanilla Meta-PPO
│       ├── vanilla_meta_ppo_test.py       # Baseline: Vanilla Meta-PPO testing
│       ├── realistic_baseline.py          # Baseline: Rule-based policy
│       ├── oracle_policy_test.py          # Baseline: Oracle-only policy
│       └── time_complexity.py             # Scalability experiments
│
├── 📁 etf_env/                            # ETF Risk Capital Management Scenario
│   ├── 📁 env/                            # Environment definitions
│   │   ├── etf_env.py                     # Multi-asset ETF environment
│   │   └── sub_etf_env.py                 # Sub-environment definitions
│   │
│   ├── 📁 models/                         # Machine learning models
│   │   ├── random_forest.py               # Budget split model
│   │   └── random_forest_budget_split.py  # RF training for budget allocation
│   │
│   ├── etf_oracle_guided_meta_ppo.py      # Oracle-Guided Meta-PPO for ETF
│   ├── etf_oracle_policy.py               # Oracle policy generation
│   ├── oracle_guided_meta_ppo_train_refactored.py
│   ├── oracle_guided_meta_ppo_test_refactored.py
│   ├── vanilla_meta_ppo_train.py          # Baseline comparison
│   ├── vanilla_meta_ppo_test.py
│   ├── generate_sp500_data.py             # Data generation utilities
│   ├── generate_oracle_policy.py          # Oracle generation
│   └── baselinefin.py                     # Financial baseline configuration
│
├── 📁 assets/                             # Images for README
├── 📄 requirements.txt                    # Python dependencies
├── 📄 .gitignore                          # Git ignore rules
└── 📄 README.md                           # This file

🚀 Installation

Prerequisites

  • Python 3.8+
  • pip or conda

Quick Start

# Clone the repository
git clone https://github.com/Manavvora/Oracle-Guided-Meta-PPO.git
cd Oracle-Guided-Meta-PPO

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

💻 Usage

Infrastructure Management Scenario

cd infra_env/pomdp_solver

# Step 1: Train Random Forest for budget prediction
python random_forest.py

# Step 2: Compute budget split
python random_forest_budget_split.py --num_components 1000

# Step 3: Generate oracle policies
python generate_oracle_policies.py --num_components 1000

# Step 4: Train Oracle-Guided Meta-PPO
python oracle_guided_meta_ppo_train.py

# Step 5: Evaluate
python oracle_guided_meta_ppo_optimal_budget_split.py --num_components 1000

ETF Risk Capital Management

cd etf_env

# Train
python etf_oracle_guided_meta_ppo.py --timesteps 100000

# Test
python oracle_guided_meta_ppo_test_refactored.py

Run Scalability Experiments

cd infra_env/pomdp_solver
python time_complexity.py

📊 Experiments

Baselines

Method | Description
Oracle Policy | Upper bound: MDP policy with full observability
Vanilla Meta-PPO | Standard Meta-PPO without oracle guidance
Realistic Baseline | Rule-based inspection/replacement policy
Equal Budget Split | Uniform budget allocation

Evaluation Metrics

  • Time-to-Failure (TTF): Average operational lifetime
  • Total Cost Incurred: Cumulative maintenance costs
  • Action Distribution: Frequency of inspect/replace/no-action
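
As a toy illustration (the trace format and field names are hypothetical), these metrics can be read directly off a logged episode:

# Toy computation of the three evaluation metrics from an episode trace.
from collections import Counter

episode = [
    {"action": "inspect", "cost": 5.0},
    {"action": "no-op",   "cost": 0.0},
    {"action": "replace", "cost": 50.0},
]

ttf = len(episode)                                  # steps survived before failure
total_cost = sum(step["cost"] for step in episode)  # cumulative maintenance cost
action_dist = Counter(step["action"] for step in episode)
print(ttf, total_cost, dict(action_dist))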

๐ŸŒ Application Domains

๐Ÿ—๏ธ Infrastructure Management

Aspect Details
Problem Managing degradation of infrastructure components
State Component health (partially observable) + budget consumed
Actions No-op, Inspect, Replace
Objective Maximize lifetime within budget

📈 ETF Risk Capital Management

Aspect | Details
Problem | Managing risk capital across financial assets
State | Asset prices (observable) + risk levels (partially observable)
Actions | No-op, Inspect, Recapitalize
Objective | Maximize portfolio survival

📄 Citation

If you find this work useful, please cite our paper:

@article{vora2024solving,
  title={Solving Truly Massive Budgeted Monotonic {POMDPs} with Oracle-Guided Meta-Reinforcement Learning},
  author={Vora, Manav and Liang, Jonas and Grussing, Michael N and Ornik, Melkior},
  journal={arXiv preprint arXiv:2408.07192},
  year={2024}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

This work builds upon PyTorch and Stable-Baselines3.


โญ If you find this work useful, please consider giving it a star! โญ
