HF Dataset Format Checking & Migration, Auto summary table generation #81

@banghuaz-nvidia

3 changes on Gym side:

  1. Require domain and license in the config YAML (fixed enum)
  • Domain: [“Math”, “Coding”, “Agent”, “Knowledge”, “Instruction_following”, “Long_context”, “Safety”, “Games”, “Others”]
  • Generate a large summary table with domain, resource_server_name, license
  2. Implement ng_gitlab_to_hf_dataset
  • Creates the name “Nvidia/Nemo-Gym-[Domain]-[Name-of-resource-server]-[Dataset name]”. For example: “Nvidia/Nemo-Gym-Math-library_judge_math-dapo17k”.
  • Adds to the nemo-gym collection from Nvidia HF
  • Make sure to upload as a private repo
  3. Simple format checking for the dataset before uploading
  • Uploads to HF, [then delete gitlab dataset (optional)]
  • Prepare dataset from HF (as an alternative function to ng_download_dataset_from_gitlab, maybe just ng_download_dataset_from_huggingface)
  • Download from HF

Labels: usability (improvements to user experience)
