Benchmarking Omni-Vision Representation through the Lens of Visual Realms

Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu

Abstract


"Though impressive performance has been achieved in specific visual realms (\eg faces, dogs, and places), an omni-vision representation that can generalize to many natural visual domains is highly desirable. Nonetheless, the existing benchmark for evaluating visual representations, such as ImageNet, VTAB-natural, and CLIP benchmark suite, is either limited in the spectrum of realms or built by arbitrarily integrating the current datasets. In this paper, we propose Omni-Realm Benchmark (OmniBenchmark) that enables systematically measuring the generalization ability across a wide range of visual realms. OmniBenchmark firstly integrates the concepts from Wikidata to enlarge the storage of concepts of each sub-tree of WordNet. Then, it leverages expert knowledge from WordNet to define a comprehensive spectrum of 21 semantic realms in the natural domain, which is twice of ImageNet’s. Finally, we manually annotate all 7,372 valid concepts, forming a 21-realm dataset with 1,074,346 images. With OmniBenchmark, we propose a hierarchical instance contrastive learning framework for learning better omni-vision representation, \ie Relational Contrastive learning (ReCo), boosting the performance of representation learning across omni-realms. As the hierarchical semantic relation naturally emerges in the label system of visual datasets, ReCo attracts the representations within the same semantic realm during pre-training, facilitating the model converges faster than conventional contrastive learning when ReCo is further fine-tuned to the specific realm. Extensive experiments demonstrate the superior performance of ReCo over state-of-the-art contrastive learning methods on both ImageNet and OmniBenchmark. Beyond that, We conduct a systematic investigation of recent advances in both architectures (from CNNs to transformers) and learning paradigms (from supervised learning to self-supervised learning) on our benchmark. Multiple practical observations are revealed to facilitate future research."
