This research explores unified task representations, network architectures, and training methodologies for visual and vision-language multitask learning. It aims to build general models for multimodal tasks and to establish a new paradigm of universal perception based on large models, enabling general capabilities on open-world and open-ended tasks.
Representative Works:
Unified Pretraining Algorithm for Vision-Language Multimodal Foundation Models
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
[7th Most Influential Paper at ICLR 2021]
Unified Representations for General Visual Tasks: a single model architecture with shared parameters is used to solve diverse multimodal tasks (a minimal sketch of this idea appears at the end of this section)
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
[NeurIPS 2022 Spotlight paper]
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
[CVPR 2023 Highlight paper]
Large Vision-Language Models for Open-World Tasks
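To make the unified-representation idea in the entries above more concrete (one shared encoder, with every task cast as scoring inputs against candidate targets in a single representation space, in the spirit of the Uni-Perceiver line of work), the following is a minimal Python/PyTorch sketch. The module names, dimensions, pooling choice, and pre-tokenized inputs are illustrative assumptions and do not reproduce the papers' actual implementations.

```python
# Illustrative sketch only (not the released Uni-Perceiver code): a generalist
# model that encodes arbitrary inputs and candidate targets with one shared
# encoder and scores input-target pairs by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoder(nn.Module):
    """One Transformer encoder, shared across modalities and tasks."""

    def __init__(self, dim: int = 512, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); modality-specific tokenizers
        # (image patches, text tokens, ...) are assumed to run beforehand.
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1)  # pool to one vector per sequence


def task_as_retrieval(encoder: SharedEncoder,
                      input_tokens: torch.Tensor,
                      candidate_tokens: torch.Tensor) -> torch.Tensor:
    """Score every candidate target for every input.

    Classification, retrieval, and caption selection over a fixed candidate
    set all reduce to picking the highest-scoring target in the shared space.
    """
    x = F.normalize(encoder(input_tokens), dim=-1)       # (B, D)
    y = F.normalize(encoder(candidate_tokens), dim=-1)   # (C, D)
    return x @ y.t()                                      # (B, C) similarities


if __name__ == "__main__":
    enc = SharedEncoder()
    images = torch.randn(2, 16, 512)   # e.g. 16 patch tokens per image
    labels = torch.randn(10, 4, 512)   # e.g. 10 class names, 4 tokens each
    logits = task_as_retrieval(enc, images, labels)
    print(logits.shape)  # torch.Size([2, 10])
```

Framing each task as selecting the most similar target in a shared space is what allows one set of parameters to serve diverse tasks without task-specific heads.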