Research Focus

This research explores unified task representations, network architectures, and training methodologies for visual and vision-language multitask learning. It aims to build general models for multimodal tasks and to design a new paradigm of universal perception based on large models, enabling general capabilities for open-world and open-ended tasks.

Representative Works:

Unified Pretraining Algorithm for Vision-Language Multimodal Foundation Models


Unified Representations for General Visual Tasks: a single model architecture with shared parameters used to solve diverse multimodal tasks


[Figure: Uni-Perceiver]

[Figure: Uni-Perceiver-MoE]

[Figure: Uni-Perceiver v2]

Vision-Language Large Models for Open-World Tasks

[Project page]

[Figures: VisionLLM]


Doctoral Degree in Engineering

Jifeng DAI