Here we provide an overview of how Airscape's phase-2 training is carried out and how it should be used.
See each subfolder for the exact dependencies it needs.
Our idea is to use rejection sampling in an iterative self-play training loop, so that the model obtained in phase-1 improves automatically under the guidance of an MoE teacher.
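The loop above can be sketched end to end. This is an illustrative outline only: function names such as `generate_prompts`, `run_inference`, and `teacher_score` are placeholders standing in for the real code in `prompts_generate`, `inference`, and `best_selection`, not the actual Airscape API.

```python
import random

def generate_prompts(frame: str, n: int = 8) -> list[str]:
    # Placeholder for the VLM prompt generator (see prompts_generate/).
    return [f"{frame}-prompt-{i}" for i in range(n)]

def run_inference(model: str, prompt: str, seed: int) -> dict:
    # Placeholder for phase-1 Airscape inference (see inference/).
    rng = random.Random(f"{model}/{prompt}/{seed}")
    return {"prompt": prompt, "seed": seed, "score_proxy": rng.random()}

def teacher_score(outcome: dict) -> float:
    # Placeholder for the MoE-teacher discriminator (see best_selection/).
    return outcome["score_proxy"]

def self_play_round(model: str, frames: list[str], seeds=(0, 1, 2, 3)) -> list[dict]:
    """One rejection-sampling round: generate prompts, sample outcomes,
    and keep only the teacher's top pick per frame."""
    kept = []
    for frame in frames:
        outcomes = [run_inference(model, p, s)
                    for p in generate_prompts(frame)
                    for s in seeds]
        kept.append(max(outcomes, key=teacher_score))
    return kept  # best sample per frame, fed back to fine-tune the model

best = self_play_round("airscape-phase1", ["frame_000", "frame_001"])
```

Running such a round repeatedly is what makes the loop "self-play": each round's selected samples train the next, stronger model.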
Use a mature VLM to generate 8 prompts from each given frame. These prompts are then used for model inference.
Open prompts_generate to see more details.
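The prompt-generation step might look like the following sketch. `query_vlm` is a hypothetical stand-in for whichever VLM `prompts_generate` actually calls, and the prompt template is invented for illustration.

```python
def query_vlm(frame_path: str, instruction: str) -> str:
    # Placeholder VLM call; returns a canned caption so the sketch runs.
    return f"caption of {frame_path}"

def prompts_for_frame(frame_path: str, n: int = 8) -> list[str]:
    """Ask the VLM for n diverse prompts describing the given frame."""
    caption = query_vlm(frame_path, "Describe this frame for video generation.")
    # Vary the phrasing per prompt so downstream inference sees n
    # distinct conditions rather than n copies of one caption.
    return [f"[variant {i}] {caption}" for i in range(n)]

prompts = prompts_for_frame("data/frames/000.png")
```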
Use the phase-1 Airscape model to generate outcomes under the different prompts and seeds. This boosts diversity, which leaves foreseeable room for capability evolution.
The details can be found in inference, which is basically the same as the phase-1 code. You can also open phase1 to learn more about the training and usage.
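Crossing the prompt set with several random seeds yields the candidate pool. A minimal sketch, assuming a hypothetical `phase1_generate` wrapper around the phase-1 sampler in `inference`:

```python
import itertools
import random

def phase1_generate(prompt: str, seed: int) -> dict:
    # Placeholder for the phase-1 Airscape sampler (see inference/).
    rng = random.Random(f"{prompt}-{seed}")
    return {"prompt": prompt, "seed": seed, "video": rng.random()}

def sample_pool(prompts: list[str], seeds) -> list[dict]:
    """Cross every prompt with every seed to build a diverse pool."""
    return [phase1_generate(p, s) for p, s in itertools.product(prompts, seeds)]

pool = sample_pool([f"p{i}" for i in range(8)], seeds=range(4))
```

With 8 prompts and 4 seeds the pool holds 32 candidates per frame, which is what gives the teacher meaningful choices to reject or keep.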
The discriminator acts as an MoE teacher that guides the model toward stronger outputs.
Open best_selection to see more details.
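The selection step reduces to scoring each candidate and keeping the top few, which is the "rejection" in rejection sampling. In this sketch `discriminator_score` is a hypothetical stand-in for the MoE-teacher scoring in `best_selection`:

```python
def discriminator_score(candidate: dict) -> float:
    # Placeholder MoE-teacher score; the real logic lives in best_selection/.
    return candidate["quality"]

def select_best(candidates: list[dict], top_k: int = 1) -> list[dict]:
    """Keep the top_k highest-scoring candidates; reject the rest."""
    ranked = sorted(candidates, key=discriminator_score, reverse=True)
    return ranked[:top_k]

cands = [{"id": i, "quality": q} for i, q in enumerate([0.2, 0.9, 0.5])]
best = select_best(cands)
```

The kept samples then become training data for the next self-play round.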