Benchmarking Omni-Vision Representation through the Lens of Visual Realms
Yuanhan Zhang, Zhenfei Yin, Jing Shao, Ziwei Liu
"Though impressive performance has been achieved in specific visual realms (\eg faces, dogs, and places), an omni-vision representation that can generalize to many natural visual domains is highly desirable. Nonetheless, the existing benchmark for evaluating visual representations, such as ImageNet, VTAB-natural, and CLIP benchmark suite, is either limited in the spectrum of realms or built by arbitrarily integrating the current datasets. In this paper, we propose Omni-Realm Benchmark (OmniBenchmark) that enables systematically measuring the generalization ability across a wide range of visual realms. OmniBenchmark firstly integrates the concepts from Wikidata to enlarge the storage of concepts of each sub-tree of WordNet. Then, it leverages expert knowledge from WordNet to define a comprehensive spectrum of 21 semantic realms in the natural domain, which is twice of ImageNet’s. Finally, we manually annotate all 7,372 valid concepts, forming a 21-realm dataset with 1,074,346 images. With OmniBenchmark, we propose a hierarchical instance contrastive learning framework for learning better omni-vision representation, \ie Relational Contrastive learning (ReCo), boosting the performance of representation learning across omni-realms. As the hierarchical semantic relation naturally emerges in the label system of visual datasets, ReCo attracts the representations within the same semantic realm during pre-training, facilitating the model converges faster than conventional contrastive learning when ReCo is further fine-tuned to the specific realm. Extensive experiments demonstrate the superior performance of ReCo over state-of-the-art contrastive learning methods on both ImageNet and OmniBenchmark. Beyond that, We conduct a systematic investigation of recent advances in both architectures (from CNNs to transformers) and learning paradigms (from supervised learning to self-supervised learning) on our benchmark. Multiple practical observations are revealed to facilitate future research."