I am a Research Intern at Tencent Hunyuan (青云人才计划 / Qingyun Top Talent Program), where I build models for omni-video understanding — making sense of a video together with the audio and language that come with it (e.g., Script-a-Video, deep structured audio-visual captioning). Before that, I was a remote research intern at Vision-CAIR, KAUST, advised by Mohamed Elhoseiny, and a research intern at SenseTime. I hold a B.S. in Mathematics (minor in Management) from Lanzhou University.
What I care about is moving models past recognition and into understanding — not just naming what appears in a scene, but reasoning about why it happens and what comes next, with the reasoning grounded in evidence from every modality and across time. The threads below are different angles on that same goal.
In the long run, I want to build general-purpose systems that perceive, reason, and communicate across vision, audio, language, and action in dynamic, real-world environments.
Feel free to reach out to me for collaborations, questions, or just to chat!
ICML 2026
CVPR 2026
WWW 2026
ICCV 2025
AAAI 2026
TMLR 2025
EMNLP 2025
Tech Report
SIGGRAPH Asia 2025
ACL 2026
CVPR 2026
Tech Report
Renewable Energy
FRL