First, thank you to the OpenOOD team for maintaining this benchmark — it has been invaluable for the OOD detection community.
We would like to kindly request an inspection of some possibly incorrect results on the leaderboard. While benchmarking for research purposes, we found a significant discrepancy in the reported Near-OOD results.
We reimplemented CombOOD as an OpenOOD postprocessor following the [official codebase](https://github.com/rmagesh148/combood) and evaluated it across the leaderboard benchmarks using the v1.5 evaluation pipeline.
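For transparency about how the per-dataset numbers below were obtained: we compute AUROC in the standard way, with ID samples as the positive class. A minimal self-contained sketch using scikit-learn (the helper name `ood_auroc` is ours, not part of OpenOOD):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(id_scores, ood_scores):
    """AUROC with ID samples as the positive class, so higher
    detector scores on ID data push the AUROC toward 1."""
    labels = np.concatenate([np.ones_like(id_scores), np.zeros_like(ood_scores)])
    scores = np.concatenate([id_scores, ood_scores])
    return roc_auc_score(labels, scores)

# Toy example: ID confidences tend to be higher than OOD confidences,
# so the AUROC should come out well above 0.5.
rng = np.random.default_rng(0)
id_scores = rng.normal(1.0, 0.5, 1000)
ood_scores = rng.normal(0.0, 0.5, 1000)
print(ood_auroc(id_scores, ood_scores))
```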
## Summary
CIFAR-10 and CIFAR-100 results match the leaderboard within ~1–4 points, confirming our implementation is sound.
However, the problem is limited to Near-OOD: for ImageNet-1K and ImageNet-200, the Near-OOD numbers diverge by ~13–22 points, while Far-OOD agrees within ~1 point.
One possible explanation is that the leaderboard entries were computed using the v1.0 Near-OOD split rather than the v1.5 split (SSB-hard, NINCO). We note that the official CombOOD codebase defines `nearood: [species, inaturalist, openimageo, imageneto]` and does not reference SSB-hard or NINCO, which may be related (issue #235).
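For context on why the split composition matters: the Near-OOD and Far-OOD rows in the tables below appear to be unweighted means of the per-dataset AUROCs, so changing which datasets make up the Near-OOD split directly changes the headline number. A quick check against the ImageNet-1K leaderboard figures:

```python
# Leaderboard per-dataset AUROCs taken from the ImageNet-1K table below.
near_ood = {"SSB-hard": 92.62, "NINCO": 97.82}
far_ood = {"iNaturalist": 87.13, "Textures": 97.01, "OpenImage-O": 86.59}

def mean(values):
    """Unweighted mean, rounded to two decimals as on the leaderboard."""
    values = list(values)
    return round(sum(values) / len(values), 2)

print(mean(near_ood.values()))  # 95.22, matches the Near-OOD row
print(mean(far_ood.values()))   # 90.24, matches the Far-OOD row
```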
## Results — ImageNet-1K (ResNet50, torchvision pretrained)

| Dataset | Leaderboard AUROC (issue #235) | Our Reproduced AUROC |
|---|---|---|
| SSB-hard | 92.62 | 64.94 |
| NINCO | 97.82 | 80.92 |
| Near-OOD | 95.22 | 72.93 |
| iNaturalist | 87.13 | 85.16 |
| Textures | 97.01 | 96.78 |
| OpenImage-O | 86.59 | 86.32 |
| Far-OOD | 90.24 | 89.42 |
## Results — ImageNet-200 (ResNet18, 3-seed average)

| Dataset | Leaderboard AUROC (issue #235) | Our Reproduced AUROC |
|---|---|---|
| SSB-hard | 93.66 | 78.09 ± 0.10 |
| NINCO | 97.81 | 86.60 ± 0.20 |
| Near-OOD | 95.74 ± 0.00 | 82.35 ± 0.10 |
| iNaturalist | 92.22 | 92.02 ± 0.50 |
| Textures | 96.18 | 95.75 ± 0.10 |
| OpenImage-O | 89.31 | 88.98 ± 0.40 |
| Far-OOD | 92.57 ± 0.00 | 92.25 ± 0.30 |
## Results — CIFAR-10 (ResNet18, 3-seed average)

| Split | Leaderboard AUROC | Our Reproduced AUROC |
|---|---|---|
| Near-OOD | 91.13 ± 0.00 | 90.81 ± 0.20 |
| Far-OOD | 94.65 ± 0.00 | 93.26 ± 0.20 |
## Results — CIFAR-100 (ResNet18, 3-seed average)

| Split | Leaderboard AUROC | Our Reproduced AUROC |
|---|---|---|
| Near-OOD | 78.77 ± 0.00 | 80.78 ± 0.10 |
| Far-OOD | 85.87 ± 0.00 | 81.74 ± 0.20 |
## Observations
- CIFAR-10/100: Our reproduction is within ~1–4 points of the leaderboard, validating the implementation.
- ImageNet-1K Near-OOD: 22-point gap (95.22 vs 72.93). Far-OOD agrees within ~1 point (90.24 vs 89.42).
- ImageNet-200 Near-OOD: 13-point gap (95.74 vs 82.35). Far-OOD agrees within ~0.3 points (92.57 vs 92.25).
- The pattern of Near-OOD diverging while Far-OOD matches is consistent with the leaderboard entries having been evaluated on a different Near-OOD split.
Could the ImageNet-1K and ImageNet-200 Near-OOD entries be inspected?
We are happy to share our reproduction code and OpenOOD postprocessor implementation.