I am a Research Intern at Tencent Hunyuan (青云人才计划 / Qingyun Top Talent Program), where my research centers on omni-video understanding — jointly interpreting a video together with its accompanying audio and language (e.g., Script-a-Video, deep structured audio-visual captioning). Previously, I was a remote research intern at Vision-CAIR, KAUST, advised by Mohamed Elhoseiny, and a research intern at SenseTime. I received my B.S. in Mathematics (minor in Management) from Lanzhou University.
My research seeks to advance multimodal models from surface-level recognition toward genuine understanding — reasoning about why events occur and what follows, with inferences grounded in evidence that is consistent across modalities and over time. My work spans four interconnected directions:
In the long term, I aim to develop general-purpose multimodal systems that perceive, reason, and communicate across vision, audio, language, and action in dynamic, real-world environments.
I am always open to research collaborations and discussions — please feel free to reach out.
† Equal contribution; * Corresponding author
ICML 2026
CVPR 2026
ECCV 2026
KDD 2026
WWW 2026
ICCV 2025
AAAI 2026
TMLR 2025
EMNLP 2025
ECCV 2026
Tech Report
Tech Report
Tech Report
SIGGRAPH Asia 2025
ACL 2026
CVPR 2026
Tech Report
Renewable Energy
FRL