
Commit 493d498

Comp-Coding Verifier (#5)

kbhardwaj-nvidia authored and bxyu-nvidia committed

Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Khushi Bhardwaj <kbhardwaj@nvidia.com>
Co-authored-by: bxyu-nvidia <bxyu@nvidia.com>
Signed-off-by: soares-f <soarescmsa@gmail.com>

1 parent 656911f commit 493d498

15 files changed

Lines changed: 1168 additions & 20 deletions

README.md

Lines changed: 18 additions & 18 deletions

@@ -450,7 +450,7 @@ source .venv/bin/activate
pytest
```

At some point, you will want to actually add data that can be used to query your server. Please follow the instructions for [How To: Prepare and validate data for PR submission or RL training](#how-to-prepare-and-validate-data-for-pr-submission-or-rl-training).

If you need some dataset preprocessing or formatting scripts, please place them in your resources server directory, e.g. `resources_servers/simple_weather/my_preprocess_script.py`.
@@ -501,7 +501,7 @@ Gitlab uses MLFlow to interface with its model artifact registry. You will need:
2. The URI will look something like `https://gitlab-master.nvidia.com/api/v4/projects/191584/ml/mlflow/`
2. Your Gitlab token. Your Gitlab token must have the `api` and `read_api` scopes.

Provide your MLFlow credentials in `env.yaml`.
```yaml
mlflow_tracking_uri: {your NeMo Gym Gitlab URI}
mlflow_tracking_token: {your Gitlab PAT}
@@ -700,7 +700,7 @@ More often than not, the SHA256 displayed by Github (SHA256:xxxx) should be the

For developers that sign commits via SSH keys, this is the configuration so that VSCode source control is able to sign commits properly!
```bash
git config gpg.format ssh
git config user.signingkey ~/.ssh/id_ed25519.pub
```
@@ -724,35 +724,35 @@ Tying back to NeMo Gym, NeMo Gym can be used to create synthetic data for SFT tr

# FAQ: Why NeMo Gym?

NeMo Gym is a large-scale collection of high-quality verifier environments for multi-verifier RL training.
To enable this, NeMo Gym provides infra support for the rollout server that runs 100+ verifiers in parallel.

The document below details why we designed NeMo Gym the way we did. It also includes a direct comparative study that clearly differentiates NeMo Gym from other environment frameworks.

\[Banghua\] As of Thu Aug 21:

1. Gym is completely different from any of the alternatives above in terms of data **coverage, quantity, and quality.** For example, for math alone, Gym contains a 1M+ example high-quality math-verifiable dataset curated by our internal team, with strong math-verify \+ LLM-as-a-judge support. In contrast, SkyRL and verifiers only have a small train subset of GSM8K and AIME. We also have close to 10k SWE development tasks, which require both high-quality data curation efforts and good infra support. In contrast, Aviary focuses only on scientific knowledge environments. **None of the existing frameworks support general multi-turn tool-use agents, with tools like search, code execution, and other synthetic tools.**
2. We will be a **superset** of all existing gym environments. We are already a superset of Sky RL Lab Gym and verifiers. We have integrated all GEM environments. We're working with Aviary to incorporate them as well.
3. As shown in Brian's comparison below, we have much **better infra support for scaling**, and the plan is to use NeMo Gym for 500B+ model training for quality improvement. This will make NeMo Gym battle-tested in frontier model training, while the other gyms are mostly for smaller-scale experiments.

Key use case requirements to avoid limitations in training environment scale, complexity, and diversity:

1. Can I easily build my environment without worrying about a training framework?
2. Can I easily call my model using OpenAI Responses and not worry about reasoning parsing?
3. Can I easily use your environment framework to build an agent application product?
4. Can I easily use your environment framework to build a simple multi-agent system?
5. Can I easily run individual SWE-bench task Docker containers?
6. Can I easily add an agent built with any agent framework?
7. Can I easily add any environment framework?
8. Can I easily simultaneously use math-verify==0.7.0 and math-verify==0.8.0 in 2 different environments?
9. Can I easily spin up multiple environments at once?

Key principles:

1. \[Reqs 1, 2\] Decoupled from training framework
2. \[Reqs 2, 3, 4, 6, 7\] Standardized behind OpenAI Responses
3. \[Reqs 3, 4, 6\] Explicit agent vs model abstraction
4. \[Reqs 3, 4, 5, 6, 7\] REST environment servers and container compatible
5. \[Reqs 8, 9\] Separate Python env per server at runtime

\[Brian note\] There are some rows yet to be filled in here.

pyproject.toml

Lines changed: 2 additions & 2 deletions

@@ -64,10 +64,10 @@ dependencies = [
# 1. Why this dependency is here in NeMo Gym
# 2. When this dependency was last updated
# 3. The license of the dependencies.
#
# If you are adding or removing dependencies, please do your due diligence to update this information. PRs to main that modify dependencies will not be accepted unless this information is provided.
# The licenses of the below dependencies include: Apache 2.0, MIT, and BSD 3-Clause
#
# By design, most (if not all) dependencies are unfrozen here to be easier to consume. The core pieces we need are server infra like FastAPI, etc.
########################################
Lines changed: 104 additions & 0 deletions

@@ -0,0 +1,104 @@
# Competitive Coding Resources Server

### Overview
Verifies competitive programming solutions by executing submitted code against unit tests. The server consumes agent trajectories and returns a reward based on whether the assistant's code produces the correct outputs for given test inputs.
Data source: [Filtered competitive programming dataset](https://huggingface.co/datasets/Nexusflow/comp_prog_filtered_no_function); split=`train`

### Input schema
- `responses_create_params`: OpenAI Responses create params
  - Use only a user message with the problem statement and instructions (e.g., "You are an expert competitive programmer...").
- `verifier_metadata` (required):
  - `unit_tests` (required): dict with `inputs` and `outputs` arrays containing test cases.
    - `inputs`: list of strings representing stdin input for each test case
    - `outputs`: list of strings representing expected stdout output for each test case
  - `problem_id` (optional): unique identifier for the problem

**Notes**
- All test cases must pass for a solution to receive a reward of 1.0
- Failed test cases result in a reward of 0.0 with detailed error information

### Test execution (for now)
- Code is executed using Python's `exec()` function in a controlled environment
- Each test case runs with redirected stdin/stdout:
  - `stdin` is populated with the test input
  - `stdout` is captured for comparison with expected output
- Available built-ins include common functions: `input`, `print`, `range`, `len`, `int`, `str`, `list`, etc.
- Newlines in test data are properly handled (converts `\\n` to actual newlines)
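The execution steps above can be sketched as a minimal stand-alone harness. This is an illustrative approximation, not the server's actual implementation: `run_solution` is a hypothetical name, and the real sandbox restricts the available built-ins rather than exposing the full environment.

```python
import contextlib
import io

def run_solution(code: str, stdin_text: str) -> str:
    """Exec solution code with redirected stdin/stdout; return captured stdout."""
    stdin = io.StringIO(stdin_text)
    stdout = io.StringIO()
    # Patch input() to read successive lines from the test case's stdin buffer.
    # (The real server also curates which built-ins the solution may use.)
    env = {"input": lambda: stdin.readline().rstrip("\n")}
    with contextlib.redirect_stdout(stdout):
        exec(code, env)
    return stdout.getvalue()

# A toy "solution" that reads stdin and writes stdout, as the prompt requires.
solution = "a, b = map(int, input().split())\nprint(a + b)"
print(run_solution(solution, "2 3\n"), end="")  # → 5
```

The captured stdout is then compared against the expected `outputs` entry for that test case.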

### Example dataset row
```json
{
  "responses_create_params": {
    "input": [
      {
        "role": "user",
        "content": "You are an expert competitive programmer. You will be given a problem statement and must output a complete Python solution that reads from stdin and writes to stdout.\n\nPolycarp has $n$ different binary words. A word called binary if it contains only characters '0' and '1'. For example, these words are binary: \"0001\", \"11\", \"0\" and \"0011100\".\n\nPolycarp wants to offer his set of $n$ binary words to play a game \"words\". In this game, players name words and each next word (starting from the second) must start with the last character of the previous word. The first word can be any. For example, these sequence of words can be named during the game: \"0101\", \"1\", \"10\", \"00\", \"00001\".\n\nWord reversal is the operation of reversing the order of the characters. For example, the word \"0111\" after the reversal becomes \"1110\", the word \"11010\" after the reversal becomes \"01011\".\n\nProbably, Polycarp has such a set of words that there is no way to put them in the order correspondent to the game rules. In this situation, he wants to reverse some words from his set so that: the final set of $n$ words still contains different words (i.e. all words are unique); there is a way to put all words of the final set of words in the order so that the final sequence of $n$ words is consistent with the game rules. \n\nPolycarp wants to reverse minimal number of words. Please, help him.\n\n\n-----Input-----\n\nThe first line of the input contains one integer $t$ ($1 \\le t \\le 10^4$) — the number of test cases in the input. Then $t$ test cases follow.\n\nThe first line of a test case contains one integer $n$ ($1 \\le n \\le 2\\cdot10^5$) — the number of words in the Polycarp's set. Next $n$ lines contain these words. All of $n$ words aren't empty and contains only characters '0' and '1'. The sum of word lengths doesn't exceed $4\\cdot10^6$. All words are different.\n\nGuaranteed, that the sum of $n$ for all test cases in the input doesn't exceed $2\\cdot10^5$. Also, guaranteed that the sum of word lengths for all test cases in the input doesn't exceed $4\\cdot10^6$.\n\n\n-----Output-----\n\nPrint answer for all of $t$ test cases in the order they appear.\n\nIf there is no answer for the test case, print -1. Otherwise, the first line of the output should contain $k$ ($0 \\le k \\le n$) — the minimal number of words in the set which should be reversed. The second line of the output should contain $k$ distinct integers — the indexes of the words in the set which should be reversed. Words are numerated from $1$ to $n$ in the order they appear. If $k=0$ you can skip this line (or you can print an empty line). If there are many answers you can print any of them.\n\n\n-----Example-----\nInput\n4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n\nOutput\n1\n3 \n-1\n0\n\n2\n1 2"
      }
    ]
  },
  "verifier_metadata": {
    "problem_id": "c69268d8bdb4da0685d7b187c88296c1",
    "unit_tests": {
      "inputs": ["4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n"],
      "outputs": ["1\n3 \n-1\n0\n\n2\n1 2 \n"]
    }
  }
}
```

### Example of rollouts and usage

```bash
config_paths="responses_api_agents/simple_agent/configs/simple_agent.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/comp_coding/configs/comp_coding.yaml"

# Running the server
ng_run "+config_paths=[$config_paths]" \
  +simple_agent.responses_api_agents.simple_agent.resources_server.name=comp_coding

# Prepare example data for validation
ng_prepare_data "+config_paths=[$config_paths]" \
  +output_dirpath=resources_servers/comp_coding/data/ \
  +mode=example_validation

# Download train data from gitlab model registry
ng_download_dataset_from_gitlab \
  +dataset_name=comp_coding \
  +version=0.0.1 \
  +run_id=5a1167ef-3533-486f-9c0e-49d1e97fc887 \
  +artifact_fpath=train.jsonl \
  +output_fpath=resources_servers/comp_coding/data/train.jsonl

# Collect rollouts from example problems
ng_collect_rollouts +agent_name=comp_coding_simple_agent \
  +input_jsonl_fpath=resources_servers/comp_coding/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/comp_coding/data/example_rollouts.jsonl \
  +limit=null
```

### Optional data preparation/validation scripts

```bash
# Build training dataset from collected examples
uv run python resources_servers/comp_coding/scripts/build_examples.py \
  --out resources_servers/comp_coding/data/train.jsonl \
  --split train[:5000]

# Validate and pre-process train dataset
uv run python resources_servers/comp_coding/scripts/validate_dataset.py \
  --in data/comp_coding/train.jsonl --fail-fast
```

### Error handling
The server provides specific error messages for different failure modes:
- `Empty model output`: No text found in the response
- `Missing verifier_metadata.unit_tests`: Required test data not provided
- `Invalid unit_tests`: Malformed test case data
- `Could not extract code`: No valid Python code found in response
- `INVALID_TEST_FORMAT`: Test inputs/outputs length mismatch or empty
- `TEST_CASE_N_FAILED`: Specific test case failed with expected vs actual output
- `TEST_CASE_N_ERROR`: Runtime error during test execution

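As a rough illustration of how the test-format and per-case failure codes above could be produced, here is a hedged sketch of a comparison loop. The function name, result payload shape, and exact-match comparison are assumptions for illustration, not the server's documented API.

```python
def verify(outputs_actual: list[str], outputs_expected: list[str]) -> dict:
    """Hypothetical sketch: map captured outputs to a reward and error code."""
    # Length mismatch or no tests at all → the format-level error.
    if len(outputs_actual) != len(outputs_expected) or not outputs_expected:
        return {"reward": 0.0, "error": "INVALID_TEST_FORMAT"}
    for i, (got, want) in enumerate(zip(outputs_actual, outputs_expected), start=1):
        if got != want:
            # First failing case short-circuits with expected vs actual detail.
            return {
                "reward": 0.0,
                "error": f"TEST_CASE_{i}_FAILED",
                "details": {"expected": want, "actual": got},
            }
    # All test cases passed → reward 1.0, matching the Notes section above.
    return {"reward": 1.0, "error": None}

print(verify(["5\n"], ["5\n"]))            # reward 1.0
print(verify(["4\n"], ["5\n"])["error"])   # TEST_CASE_1_FAILED
```

A real implementation would also catch runtime exceptions per test case and surface them as `TEST_CASE_N_ERROR`.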
## Licensing information
TODO: @kbhardwaj to confirm data/code licensing information with Vahid and team
