
CombOOD Near-OOD results on ImageNet-1K / ImageNet-200 leaderboard appear incorrect #307

@JinMoYang

Description


First, thank you to the OpenOOD team for maintaining this benchmark — it has been invaluable for the OOD detection community.

We would like to request an inspection of possibly incorrect results on the leaderboard.
While benchmarking for research purposes, we found a significant discrepancy in the reported Near-OOD results.
We reimplemented CombOOD as an OpenOOD postprocessor following the [official codebase](https://github.com/rmagesh148/combood) and evaluated it across the leaderboard benchmarks using the v1.5 evaluation pipeline.
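For reference, our reproduction follows the usual two-step shape of an OpenOOD-style postprocessor (fit on ID data, then score test batches). The skeleton below is only an illustration of that interface: the class and method names are our own, not OpenOOD's API, and the scoring is a maximum-softmax placeholder, not the actual CombOOD score.

```python
import numpy as np

class ReproPostprocessor:
    """Illustrative skeleton of our reproduction's interface.

    The scoring here is a maximum-softmax placeholder, NOT the CombOOD
    score; the real fitting/scoring logic follows the official codebase.
    """

    def setup(self, id_feats: np.ndarray, id_labels: np.ndarray) -> None:
        # In the real reproduction, per-class statistics used by the
        # CombOOD score would be fitted here from ID training features.
        self.num_classes = int(id_labels.max()) + 1

    def postprocess(self, logits: np.ndarray):
        # Placeholder confidence: maximum softmax probability.
        shifted = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
        return probs.argmax(axis=1), probs.max(axis=1)
```

The predicted class and per-sample confidence returned by `postprocess` are what the v1.5 pipeline consumes to compute AUROC per OOD dataset.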

Summary

CIFAR-10 and CIFAR-100 results match the leaderboard within a few points (~1–4), confirming our implementation is sound.
The problem is limited to Near-OOD: the ImageNet-1K and ImageNet-200 Near-OOD numbers diverge by ~13–22 points, while Far-OOD agrees within ~1–3 points.
One possible explanation is that the leaderboard entries were computed using the v1.0 Near-OOD split rather than the v1.5 split (SSB-hard, NINCO). We note that the official CombOOD codebase defines `nearood: [species, inaturalist, openimageo, imageneto]` and does not reference SSB-hard or NINCO, which may be related (see issue #235).
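To make the suspected mismatch concrete: the two Near-OOD definitions share no datasets at all (dataset identifiers below are transcribed from the respective configs; exact spellings in the config files may differ).

```python
# Near-OOD dataset list in the official CombOOD config (v1.0-era split).
combood_nearood = ["species", "inaturalist", "openimageo", "imageneto"]

# Near-OOD dataset list in the OpenOOD v1.5 ImageNet benchmarks.
openood_v15_nearood = ["ssb_hard", "ninco"]

# The intersection is empty: a score tuned/reported on one split says
# nothing about performance on the other.
shared = set(combood_nearood) & set(openood_v15_nearood)
print(shared)  # set()
```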

Results — ImageNet-1K (ResNet50, torchvision pretrained)

| Dataset | Leaderboard AUROC (issue #235) | Our Reproduced AUROC |
| --- | --- | --- |
| SSB-hard | 92.62 | 64.94 |
| NINCO | 97.82 | 80.92 |
| Near-OOD | 95.22 | 72.93 |
| iNaturalist | 87.13 | 85.16 |
| Textures | 97.01 | 96.78 |
| OpenImage-O | 86.59 | 86.32 |
| Far-OOD | 90.24 | 89.42 |

Results — ImageNet-200 (ResNet18, 3-seed average)

| Dataset | Leaderboard AUROC (issue #235) | Our Reproduced AUROC |
| --- | --- | --- |
| SSB-hard | 93.66 | 78.09 ± 0.10 |
| NINCO | 97.81 | 86.60 ± 0.20 |
| Near-OOD | 95.74 ± 0.00 | 82.35 ± 0.10 |
| iNaturalist | 92.22 | 92.02 ± 0.50 |
| Textures | 96.18 | 95.75 ± 0.10 |
| OpenImage-O | 89.31 | 88.98 ± 0.40 |
| Far-OOD | 92.57 ± 0.00 | 92.25 ± 0.30 |

Results — CIFAR-10 (ResNet18, 3-seed average)

| Split | Leaderboard AUROC | Our Reproduced AUROC |
| --- | --- | --- |
| Near-OOD | 91.13 ± 0.00 | 90.81 ± 0.20 |
| Far-OOD | 94.65 ± 0.00 | 93.26 ± 0.20 |

Results — CIFAR-100 (ResNet18, 3-seed average)

| Split | Leaderboard AUROC | Our Reproduced AUROC |
| --- | --- | --- |
| Near-OOD | 78.77 ± 0.00 | 80.78 ± 0.10 |
| Far-OOD | 85.87 ± 0.00 | 81.74 ± 0.20 |

Observations

  1. CIFAR-10/100: Our reproduction is within ~1–4 points of the leaderboard, validating the implementation.
  2. ImageNet-1K Near-OOD: 22-point gap (95.22 vs 72.93). Far-OOD agrees within ~1 point (90.24 vs 89.42).
  3. ImageNet-200 Near-OOD: 13-point gap (95.74 vs 82.35). Far-OOD agrees within ~0.3 points (92.57 vs 92.25).
  4. The pattern is consistent: Near-OOD diverges sharply while Far-OOD matches, which points to a Near-OOD split mismatch (v1.0 vs. v1.5) rather than an implementation difference.
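As a small sanity check on how the split-level numbers above are formed: assuming the Near-OOD AUROC is the unweighted mean of the two per-dataset AUROCs (our understanding of the reporting convention), both the leaderboard and our reproduced columns are internally consistent, so the gap is not an aggregation artifact.

```python
# Per-dataset Near-OOD AUROCs for ImageNet-1K, taken from the tables above.
leaderboard = {"ssb_hard": 92.62, "ninco": 97.82}
reproduced = {"ssb_hard": 64.94, "ninco": 80.92}

# Unweighted mean across the Near-OOD datasets.
for name, scores in [("leaderboard", leaderboard), ("reproduced", reproduced)]:
    near_ood = sum(scores.values()) / len(scores)
    print(name, round(near_ood, 2))
# leaderboard 95.22
# reproduced 72.93
```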
Could the ImageNet-1K and ImageNet-200 Near-OOD entries be inspected?
We are happy to share our reproduction code and OpenOOD postprocessor implementation.
