Augmented Imitation Learning#
This section describes how to combine Isaac Lab's imitation learning workflow with the visual augmentation capabilities of Cosmos models to generate demonstrations at scale and train visuomotor policies that are robust to visual variations.
Generating Demonstrations#
We use the Isaac Lab Mimic feature, which automatically generates additional demonstrations from a handful of annotated ones.
Note
This section assumes you already have an annotated dataset of collected demonstrations. If you don’t, you can follow the instructions in Teleoperation and Imitation Learning with Isaac Lab Mimic to collect and annotate your own demonstrations.
In the following example, we show how to use Isaac Lab Mimic with the Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-Mimic-v0 environment to generate additional demonstrations that can either be used to train a visuomotor policy directly or be augmented with visual variations using Cosmos.
Note
The Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-Mimic-v0 environment is similar to the standard visuomotor environment (Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Mimic-v0), but additionally records segmentation masks, depth maps, and normal maps in the generated dataset. These additional modalities are required to get the best results from the visual augmentation done using Cosmos.
./isaaclab.sh -p scripts/imitation_learning/isaaclab_mimic/generate_dataset.py \
--device cpu --enable_cameras --headless --num_envs 10 --generation_num_trials 1000 \
--input_file ./datasets/annotated_dataset.hdf5 --output_file ./datasets/mimic_dataset_1k.hdf5 \
--task Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-Mimic-v0 \
--rendering_mode performance
The number of demonstrations can be increased or decreased; 1000 demonstrations have been shown to provide good training results for this task.
Additionally, the number of environments in the --num_envs parameter can be adjusted to speed up data generation. The suggested value of 10 can be run on a moderate laptop CPU; on a more powerful desktop machine, use a larger number of environments for a significant speedup of this step.
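Before moving on, it can be useful to sanity-check the generated file. Below is a minimal sketch using h5py, assuming a robomimic-style data/demo_* layout; adjust the path and keys to your file's actual structure.

# Sketch: inspect the generated Mimic dataset before augmentation.
# Assumes a robomimic-style layout (data/demo_0, data/demo_1, ...); adjust if your file differs.
import h5py

with h5py.File("./datasets/mimic_dataset_1k.hdf5", "r") as f:
    demos = sorted(f["data"].keys())
    print(f"Number of demonstrations: {len(demos)}")
    # Inspect the first demo to see which observation modalities were recorded.
    first_demo = f["data"][demos[0]]
    print("Top-level keys:", list(first_demo.keys()))
    if "obs" in first_demo:
        for key, dset in first_demo["obs"].items():
            print(f"  obs/{key}: shape={dset.shape}, dtype={dset.dtype}")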
Cosmos Augmentation#
HDF5 to MP4 Conversion#
The hdf5_to_mp4.py script converts camera frames stored in HDF5 demonstration files to MP4 videos. It supports multiple camera modalities, including RGB, segmentation, depth, and normal maps. This conversion is necessary because Cosmos operates on video files rather than HDF5 data.
Required Arguments

- --input_file: Path to the input HDF5 file.
- --output_dir: Directory to save the output MP4 files.

Optional Arguments

- Input keys to process from the HDF5 file. (default: ["table_cam", "wrist_cam", "table_cam_segmentation", "table_cam_normals", "table_cam_shaded_segmentation", "table_cam_depth"])
- Height of the output video in pixels. (default: 704)
- Width of the output video in pixels. (default: 1280)
- Frames per second for the output video. (default: 30)
Note
The default input keys cover all camera modalities following the naming convention of the Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-Mimic-v0 environment. We include an additional modality, "table_cam_shaded_segmentation", which is not among the modalities generated by the simulation in the HDF5 data file. Instead, it is generated automatically by this script by combining the segmentation and normal maps into a pseudo-textured segmentation video that gives better control over the Cosmos augmentation.
Note
We recommend using the default values given above for the output video height, width and framerate for the best results with Cosmos augmentation.
Example usage for the cube stacking task:
python scripts/tools/hdf5_to_mp4.py \
--input_file datasets/mimic_dataset_1k.hdf5 \
--output_dir datasets/mimic_dataset_1k_mp4
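Under the hood, the conversion amounts to reading the per-frame camera arrays from the HDF5 file and encoding them as video. The following is only an illustrative sketch of that idea, using imageio and a hypothetical table_cam observation key; the bundled hdf5_to_mp4.py script is the supported path and also handles the remaining modalities and the shaded segmentation described in the note above.

# Illustrative sketch only: convert one camera stream of one demo to MP4.
# Key names and layout are assumptions; use hdf5_to_mp4.py for real conversions.
import h5py
import imageio.v2 as imageio
import numpy as np

input_file = "datasets/mimic_dataset_1k.hdf5"
demo, key, fps = "demo_0", "table_cam", 30

with h5py.File(input_file, "r") as f:
    frames = f[f"data/{demo}/obs/{key}"][:]  # assumed (T, H, W, 3) uint8 RGB

with imageio.get_writer(f"{demo}_{key}.mp4", fps=fps) as writer:
    for frame in frames:
        writer.append_data(np.asarray(frame, dtype=np.uint8))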
Running Cosmos for Visual Augmentation#
After converting the demonstrations to MP4 format, you can use a Cosmos model to visually augment the videos. Follow the Cosmos documentation for details on the augmentation process. Visual augmentation can include changes to lighting, textures, backgrounds, and other visual elements while preserving the essential task-relevant features.
We use the RGB, depth and shaded segmentation videos from the previous step as input to the Cosmos model as seen below:
[Video: RGB, depth, and shaded segmentation control inputs]
We provide an example augmentation output from Cosmos Transfer1 below:
[Video: example Cosmos Transfer1 augmentation output]
We recommend using the Cosmos Transfer1 model for visual augmentation, as we found it produces the best results: a highly diverse dataset with a wide range of visual variations. You can refer to the installation instructions here, the checkpoint download instructions here, and this example for reference on how to use Transfer1 for this use case. We further recommend the following settings for the Transfer1 model on this task:
Hyperparameters

- Negative prompt: "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic and appears cg. Plane background."
- sigma_max: 50
- Control weights (vis, edge, depth, seg): "0.3,0.3,0.6,0.7"
- Control modalities: "blur,canny,depth,segmentation"
Another crucial aspect of getting good augmentations is the set of prompts used to control the Cosmos generation. We provide a script, cosmos_prompt_gen.py, to construct prompts from a set of carefully chosen templates that handle various aspects of the augmentation process.
Required Arguments

- --templates_path: Path to the file containing templates for the prompts.

Optional Arguments

- --num_prompts: Number of prompts to generate. (default: 1)
- --output_path: Path to the output file to write the generated prompts. (default: prompts.txt)
python scripts/tools/cosmos/cosmos_prompt_gen.py \
--templates_path scripts/tools/cosmos/transfer1_templates.json \
--num_prompts 10 --output_path prompts.txt
If you want to create your own prompts, we suggest following these guidelines (a minimal prompt-generator sketch follows the list):
- Keep the prompts as detailed as possible. It is best to include some instruction on how the generation should handle each visible object or region of interest. For instance, the prompts we provide cover explicit details for the table, lighting, background, robot arm, cubes, and the general setting.
- Keep the augmentation instructions as realistic and coherent as possible. The more unrealistic or unconventional the prompt, the worse the model does at retaining key features of the input control video(s).
- Keep the augmentation instructions in sync across aspects: the augmentations for all objects and regions of interest should be coherent and conventional with respect to each other. For example, it is better to have a prompt such as "The table is of old dark wood with faded polish and food stains and the background consists of a suburban home" than "The table is of old dark wood with faded polish and food stains and the background consists of a spaceship hurtling through space".
- Include details on key aspects of the input control video(s) that should be retained or left unchanged. In our prompts, we very clearly state that the cube colors should be left unchanged, with the bottom cube blue, the middle red, and the top green. Note that we not only mention what should be left unchanged but also describe what form that aspect currently has.
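If you would rather build your own prompt generator, the core idea behind template-based generation is simple: keep a list of candidate phrases per aspect (background, table, lighting, robot arm, and so on) and sample one phrase from each to compose a full prompt. Below is a minimal sketch assuming a hypothetical JSON layout with one list of phrases per aspect; the bundled transfer1_templates.json may be structured differently.

# Sketch: compose Cosmos prompts by sampling one phrase per aspect.
# The template JSON layout here (aspect -> list of phrases) is hypothetical.
import json
import random

with open("my_templates.json") as f:
    templates = json.load(f)  # e.g. {"background": [...], "table": [...], "lighting": [...]}

def generate_prompt(rng: random.Random) -> str:
    # One phrase per aspect, joined into a single detailed prompt.
    return " ".join(rng.choice(phrases) for phrases in templates.values())

rng = random.Random(0)
with open("prompts.txt", "w") as f:
    for _ in range(10):
        f.write(generate_prompt(rng) + "\n")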
Example command to use the Cosmos Transfer1 model for this use case:
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/cosmos_dataset_1k_mp4 \
--controlnet_specs ./controlnet_specs/demo_0.json \
--offload_text_encoder_model \
--offload_guardrail_models \
--num_gpus $NUM_GPU
Example ./controlnet_specs/demo_0.json file to use with the above command:
{
"prompt": "A robotic arm is picking up and stacking cubes inside a foggy industrial scrapyard at dawn, surrounded by piles of old robotic parts and twisted metal. The background includes large magnetic cranes, rusted conveyor belts, and flickering yellow floodlights struggling to penetrate the fog. The robot arm is bright teal with a glossy surface and silver stripes on the outer edges; the joints rotate smoothly and the pistons reflect a pale cyan hue. The robot arm is mounted on a table that is light oak wood with a natural grain pattern and a glossy varnish that reflects overhead lights softly; small burn marks dot one corner. The arm is connected to the base mounted on the table. The bottom cube is deep blue, the second cube is bright red, and the top cube is vivid green, maintaining their correct order after stacking. Sunlight pouring in from a large, open window bathes the table and robotic arm in a warm golden light. The shadows are soft, and the scene feels natural and inviting with a slight contrast between light and shadow.",
"negative_prompt": "The video captures a game playing, with bad crappy graphics and cartoonish frames. It represents a recording of old outdated games. The images are very pixelated and of poor CG quality. There are many subtitles in the footage. Overall, the video is unrealistic and appears cg. Plane background.",
"input_video_path" : "mimic_dataset_1k_mp4/demo_0_table_cam.mp4",
"sigma_max": 50,
"vis": {
"input_control": "mimic_dataset_1k_mp4/demo_0_table_cam.mp4",
"control_weight": 0.3
},
"edge": {
"control_weight": 0.3
},
"depth": {
"input_control": "mimic_dataset_1k_mp4/demo_0_table_cam_depth.mp4",
"control_weight": 0.6
},
"seg": {
"input_control": "mimic_dataset_1k_mp4/demo_0_table_cam_shaded_segmentation.mp4",
"control_weight": 0.7
}
}
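When augmenting hundreds of demonstrations, writing these spec files by hand does not scale. Below is a hedged sketch of how one might pair each generated prompt with the corresponding control videos and emit one spec file per demo, mirroring the conventions of the example above; the paths and the demo-to-prompt pairing are assumptions to adapt to your layout.

# Sketch: emit one Cosmos Transfer1 controlnet spec per demo, mirroring the example above.
# Paths, prompt pairing, and the (truncated) negative prompt are placeholders to adapt.
import json
from pathlib import Path

video_dir = Path("mimic_dataset_1k_mp4")
spec_dir = Path("controlnet_specs")
spec_dir.mkdir(exist_ok=True)

prompts = Path("prompts.txt").read_text().splitlines()
negative_prompt = "The video captures a game playing, ..."  # use the full negative prompt shown above

for demo_id, prompt in enumerate(prompts):
    rgb = str(video_dir / f"demo_{demo_id}_table_cam.mp4")
    spec = {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "input_video_path": rgb,
        "sigma_max": 50,
        "vis": {"input_control": rgb, "control_weight": 0.3},
        "edge": {"control_weight": 0.3},
        "depth": {"input_control": str(video_dir / f"demo_{demo_id}_table_cam_depth.mp4"), "control_weight": 0.6},
        "seg": {"input_control": str(video_dir / f"demo_{demo_id}_table_cam_shaded_segmentation.mp4"), "control_weight": 0.7},
    }
    (spec_dir / f"demo_{demo_id}.json").write_text(json.dumps(spec, indent=2))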
MP4 to HDF5 Conversion#
The mp4_to_hdf5.py script converts the visually augmented MP4 videos back to HDF5 format for training. This step is crucial, as it ensures the augmented visual data is in the correct format for training visuomotor policies in Isaac Lab and pairs the videos with the corresponding demonstration data from the original dataset.
Required Arguments

- --input_file: Path to the input HDF5 file containing the original demonstrations.
- --videos_dir: Directory containing the visually augmented MP4 videos.
- --output_file: Path to save the new HDF5 file with the augmented videos.
Note
The input HDF5 file is used to preserve the non-visual data (such as robot states and actions) while replacing the visual data with the augmented versions.
Important
The visually augmented MP4 files must follow the naming convention demo_{demo_id}_*.mp4, where:
- demo_id matches the demonstration ID from the original MP4 file.
- * means the rest of the file name can be chosen freely by the user.
This naming convention is required for the script to correctly pair the augmented videos with their corresponding demonstrations.
Example usage for the cube stacking task:
python scripts/tools/mp4_to_hdf5.py \
--input_file datasets/mimic_dataset_1k.hdf5 \
--videos_dir datasets/cosmos_dataset_1k_mp4 \
--output_file datasets/cosmos_dataset_1k.hdf5
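For reference, the pairing described in the note above boils down to extracting the demo ID from each augmented file name and looking up the matching demonstration group in the original HDF5 file. Below is a minimal sketch of that matching step only; decoding the videos and writing them back into the dataset is left to mp4_to_hdf5.py.

# Sketch: match augmented videos to demos via the demo_{demo_id}_*.mp4 naming convention.
import re
from pathlib import Path

import h5py

videos_dir = Path("datasets/cosmos_dataset_1k_mp4")
pattern = re.compile(r"^demo_(\d+)_.+\.mp4$")

with h5py.File("datasets/mimic_dataset_1k.hdf5", "r") as f:
    demo_names = set(f["data"].keys())  # e.g. {"demo_0", "demo_1", ...}

for video in sorted(videos_dir.glob("*.mp4")):
    m = pattern.match(video.name)
    if m and f"demo_{m.group(1)}" in demo_names:
        print(f"{video.name} -> demo_{m.group(1)}")
    else:
        print(f"Skipping unmatched file: {video.name}")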
Pre-generated Dataset#
We provide a pre-generated dataset in HDF5 format containing visually augmented demonstrations for the cube stacking task. This dataset can be used if you do not wish to run Cosmos locally to generate your own augmented data. The dataset is available on Hugging Face and contains both the original and the augmented demonstrations (as separate dataset files), which can be used for training visuomotor policies.
Merging Datasets#
The merge_hdf5_datasets.py script combines multiple HDF5 datasets into a single file. This is useful when you want to combine the original demonstrations with the augmented ones to create a larger, more diverse training dataset.
Required Arguments

- --input_files: A list of paths to the HDF5 files to merge.

Optional Arguments

- --output_file: File path for the merged output. (default: merged_dataset.hdf5)
Tip
Merging datasets can help improve policy robustness by exposing the model to both original and augmented visual conditions during training.
Example usage for the cube stacking task:
python scripts/tools/merge_hdf5_datasets.py \
--input_files datasets/mimic_dataset_1k.hdf5 datasets/cosmos_dataset_1k.hdf5 \
--output_file datasets/mimic_cosmos_dataset.hdf5
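Conceptually, the merge copies every demonstration group from each input file into one output file, renumbering demos so the names do not collide. Below is a hedged sketch of that idea with h5py; dataset-level attributes and metadata are not handled here, so use merge_hdf5_datasets.py for real runs.

# Sketch: merge demos from multiple HDF5 files into one, assigning fresh demo indices.
# Dataset-level attributes/metadata are not copied here; use merge_hdf5_datasets.py for real runs.
import h5py

input_files = ["datasets/mimic_dataset_1k.hdf5", "datasets/cosmos_dataset_1k.hdf5"]

with h5py.File("datasets/merged_sketch.hdf5", "w") as out:
    data = out.create_group("data")
    idx = 0
    for path in input_files:
        with h5py.File(path, "r") as src:
            for name in sorted(src["data"].keys()):
                # Group.copy preserves the whole demo subtree (obs, actions, ...).
                src.copy(src[f"data/{name}"], data, name=f"demo_{idx}")
                idx += 1
    print(f"Merged {idx} demonstrations into datasets/merged_sketch.hdf5")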
Model Training and Evaluation#
Robomimic Setup#
As an example, we will train a BC agent implemented in Robomimic. Any other framework or training method could be used.
To install the robomimic framework, use the following commands:
# install the dependencies
sudo apt install cmake build-essential
# install python module (for robomimic)
./isaaclab.sh -i robomimic
Training an agent#
Using the generated data, we can now train a visuomotor BC agent for Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-v0:
./isaaclab.sh -p scripts/imitation_learning/robomimic/train.py \
--task Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-v0 --algo bc \
--dataset ./datasets/mimic_cosmos_dataset.hdf5 \
--name bc_rnn_image_franka_stack_mimic_cosmos
Note
By default, the trained models and logs will be saved to IsaacLab/logs/robomimic.
Evaluation#
The robust_eval.py script evaluates trained visuomotor policies in simulation. This evaluation helps assess how well the policy generalizes to different visual variations and whether the visually augmented data has improved the policy's robustness.
Below is an explanation of the different settings used for evaluation:
Evaluation Settings

- Default: Exactly the same setting as that used during Mimic data generation.
- Light intensity: The light intensity/brightness is varied; all other aspects remain the same.
- Light color: The light color is varied; all other aspects remain the same.
- Light texture (background): The light texture/background is varied; all other aspects remain the same.
- Table texture: The table's visual texture is varied; all other aspects remain the same.
- Robot arm texture: The robot arm's visual texture is varied; all other aspects remain the same.
Required Arguments

- --task: Name of the environment.
- --input_dir: Directory containing the model checkpoints to evaluate.

Optional Arguments

- Epoch of the checkpoint to start the evaluation from. (default: 100)
- Step horizon of each rollout. (default: 400)
- --num_rollouts: Number of rollouts per model per setting. (default: 15)
- Number of random seeds to evaluate. (default: 3)
- --seeds: List of specific seeds to use instead of random ones.
- --log_dir: Directory to write results to. (default: /tmp/policy_evaluation_results)
- --log_file: Name of the output file. (default: results)
- Minimum value of the action space normalization factor.
- Maximum value of the action space normalization factor.
- Whether to disable fabric and use USD I/O operations.
- Whether to enable Pinocchio for IK controllers.
Note
The evaluation results will help you understand if the visual augmentation has improved the policy’s performance and robustness. Compare these results with evaluations on the original dataset to measure the impact of augmentation.
Example usage for the cube stacking task:
./isaaclab.sh -p scripts/imitation_learning/robomimic/robust_eval.py \
--task Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-v0 \
--input_dir logs/robomimic/Isaac-Stack-Cube-Franka-IK-Rel-Visuomotor-Cosmos-v0/bc_rnn_image_franka_stack_mimic_cosmos/*/models \
--log_dir robust_results/bc_rnn_image_franka_stack_mimic_cosmos \
--log_file result \
--enable_cameras \
--seeds 0 \
--num_rollouts 15 \
--rendering_mode performance
Note
This script can take over a day or even longer to run (depending on the hardware being used). This behavior is expected.
We use the above script to compare models trained with 1000 Mimic-generated demonstrations, 2000 Mimic-generated demonstrations, and 2000 Cosmos-Mimic demonstrations (1000 original Mimic + 1000 Cosmos-augmented), respectively. We use the same seeds (0, 1000, and 5000) for all three models and report the metrics below (averaged across the best checkpoints for each seed):
Model Comparison

Evaluation Setting | Mimic 1k Baseline | Mimic 2k Baseline | Cosmos-Mimic 2k
Default | 62% | 96.6% | 86.6%
Light intensity | 11.1% | 20% | 62.2%
Light color | 24.6% | 30% | 77.7%
Light texture (background) | 16.6% | 20% | 68.8%
Table texture | 0% | 0% | 20%
Robot arm texture | 0% | 0% | 4.4%
The checkpoints for the above trained models can be accessed here if you wish to use the models directly.