Foundation models for visual perception address the fundamental tasks of object recognition and localization. This line of research focuses on visual backbone networks and object detection models, which provide the core architectures for general-purpose visual perception.
Representative Works:
High-Accuracy, High-Efficiency Object Detection Foundation Model
R-FCN: Object Detection via Region-based Fully Convolutional Networks
[3rd Most Influential Paper at NeurIPS 2016]
[Included in the PyTorch Vision (torchvision) Operator Library]
Deformable-Convolution-Based Visual Backbone Networks and Large-Scale General-Purpose Vision Foundation Models
Deformable Convolutional Networks v1/v2
[6th Most Influential Paper at ICCV 2017]
[Included in the PyTorch Vision (torchvision) Operator Library]
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
[CVPR 2023 Highlight Paper]