Yean Cheng

yean_cheng[at]126.com

Haidian, Beijing, China

I am a research engineer at Zhipu.AI, focusing on improving the performance of VLMs on video, alignment, reasoning, and related tasks. My research interests lie in visual understanding and generation, language modeling, and world modeling.

I received my Master’s degree from the School of Computer Science at Peking University, where I worked at CILab & AIIC and was advised by Prof. Boxin Shi and Mr. Ming Lei. I received my Bachelor of Engineering degree in Automation and Bachelor of Arts degree in Economics from Tsinghua University in 2021. My academic research covers 3D modeling with neural implicit representations, computational photography, and image quality enhancement.

Deep learning is a useful tool (arguably more useful than most people think) for real-world applications. I enjoy tackling various tasks (e.g., molecule design, quantitative trading, interior design, recommendation systems, image quality enhancement) with AI techniques and seeing the real-world impact of my work. I have worked (mostly as an intern) at wonderful AI start-ups (Collov, QuanMol), large corporations (Alibaba, ByteDance), and a quantitative investment firm (Definitive Capital Management). Along the way, I have met many mentors and peers in the field of intelligence.

news

Mar 3, 2025	Our paper “MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models” is accepted by CVPR 2025.
Jan 22, 2024	I will join Zhipu.AI as a Research Engineer.
Dec 9, 2023	Our paper “Colorizing Monochromatic Radiance Fields” is accepted by The 38th AAAI Conference on Artificial Intelligence and selected for oral presentation.
Oct 21, 2023	Our paper “SPLiT: Single Portrait Lighting Estimation Via a Tetrad of Face Intrinsics” is accepted by T-PAMI.

selected publications

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong*, Yean Cheng*, Zhuoyi Yang*, and 6 more authors

CVPR, 2025

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models’ motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs’ ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension.
@article{hong2025motionbench,
  author = {Hong*, Wenyi and Cheng*, Yean and Yang*, Zhuoyi and Wang, Weihan and Wang, Lefan and Gu, Xiaotao and Huang, Shiyu and Dong, Yuxiao and Tang, Jie},
  journal = {CVPR},
  title = {MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models},
  year = {2025},
  volume = {},
  number = {},
  pages = {1-10},
  doi = {10.48550/arXiv.2501.02955},
}

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, and 8 more authors

ICLR, 2025

We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos that align seamlessly with text prompts, with a frame rate of 16 fps and resolution of 768 x 1360 pixels. Previous video generation models often struggled with limited motion and short durations. It is especially difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we introduce a 3D Variational Autoencoder (VAE) to compress videos across spatial and temporal dimensions, enhancing both the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing progressive training and multi-resolution frame packing, CogVideoX excels at generating coherent, long-duration videos with diverse shapes and dynamic movements. In addition, we develop an effective pipeline that includes various pre-processing strategies for text and video data. Our innovative video captioning model significantly improves generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance in both automated benchmarks and human evaluation.
@article{yang2025cogvideox,
  author = {Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal = {ICLR},
  title = {CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  year = {2025},
  volume = {},
  number = {},
  pages = {1-10},
  doi = {10.48550/arXiv.2408.06072},
}

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, and 8 more authors

Technical Report, 2024

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344×1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.
@article{hong2024cogvlm2,
  author = {Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},
  journal = {Technical Report},
  title = {CogVLM2: Visual Language Models for Image and Video Understanding},
  year = {2024},
  volume = {},
  number = {},
  pages = {1-10},
  doi = {10.48550/arXiv.2408.16500},
}

DreamPolish: Domain Score Distillation With Progressive Geometry Generation

Yean Cheng, Ziqi Cai, Ming Ding, and 10 more authors

2024

We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying fields of view. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key topic of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.
@article{cheng2024dreampolish,
  author = {Cheng, Yean and Cai, Ziqi and Ding, Ming and Zheng, Wendi and Huang, Shiyu and Dong, Yuxiao and Tang, Jie and Shi, Boxin and Wan, Renjie and Weng, Shuchen and Zhu, Chengxuan and Chang, Yakun and Shi, Boxin},
  year = {2024},
  eprint = {2411.01602},
  title = {DreamPolish: Domain Score Distillation With Progressive Geometry Generation},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
}

[Oral] Colorizing Monochromatic Radiance Fields

Yean Cheng, Renjie Wan, Shuchen Weng, and 3 more authors

AAAI, 2024

Though Neural Radiance Fields (NeRF) can produce colorful 3D representations of the world by using a set of 2D images, such ability becomes non-existent when only monochromatic images are provided. Since color is necessary in representing the world, reproducing color from monochromatic radiance fields becomes crucial. To achieve this goal, instead of manipulating the monochromatic radiance fields directly, we consider it as a representation-prediction task in the Lab color space. By first constructing the luminance and density representation using monochromatic images, our prediction stage can recreate color representation on the basis of an image colorization module. We then reproduce a colorful implicit model through the representation of luminance, density, and color. Extensive experiments have been conducted to validate the effectiveness of our approaches.
@article{cheng2024colornerf,
  author = {Cheng, Yean and Wan, Renjie and Weng, Shuchen and Zhu, Chengxuan and Chang, Yakun and Shi, Boxin},
  journal = {AAAI},
  volume = {},
  number = {},
  pages = {1317-1325},
  title = {[Oral] Colorizing Monochromatic Radiance Fields},
  doi = {10.1609/aaai.v38i2.27895},
  year = {2024},
}

SPLiT: Single Portrait Lighting Estimation Via a Tetrad of Face Intrinsics

Fei Fan*, Yean Cheng*, Yongjie Zhu, and 4 more authors

IEEE T-PAMI, 2023

This paper proposes a novel pipeline to estimate a non-parametric environment map with high dynamic range from a single human face image. Lighting-independent and -dependent intrinsic images of the face are first estimated separately in a cascaded network. The influence of face geometry on the two lighting-dependent intrinsics, diffuse shading and specular reflection, is further eliminated by distributing the intrinsics pixel-wise onto spherical representations using the surface normal as indices. This results in two representations simulating images of a diffuse sphere and a glossy sphere under the input scene lighting. Taking into account the distinctive nature of light sources and ambient terms, we further introduce a two-stage lighting estimator to predict both accurate and realistic lighting from these two representations. Our model is trained in a supervised manner on a large-scale and high-quality synthetic face image dataset. We demonstrate that our method allows accurate and detailed lighting estimation and intrinsic decomposition, outperforming state-of-the-art methods both qualitatively and quantitatively on real face images.
@article{10301699,
  author = {Fan*, Fei and Cheng*, Yean and Zhu, Yongjie and Zheng, Qian and Li, Si and Pan, Gang and Shi, Boxin},
  journal = {IEEE T-PAMI},
  title = {SPLiT: Single Portrait Lighting Estimation Via a Tetrad of Face Intrinsics},
  year = {2023},
  volume = {},
  number = {},
  pages = {1-14},
  doi = {10.1109/TPAMI.2023.3328453},
}

Fault Diagnosis of Energy Networks Based on Improved Spatial–Temporal Graph Neural Network With Massive Missing Data

Jingfei Zhang, Yean Cheng, and Xiao He

IEEE Transactions on Automation Science and Engineering, 2023

In order to ensure the safe and reliable operation of the energy system, real-time fault diagnosis technology is indispensable. Energy systems are typically complex systems consisting of multiple subsystems that are coupled with each other. Before and after the occurrence of a fault, the system is generally in an abnormal or even harsh environment, which may cause a large number of randomly missing measurement data and make the application of fault diagnosis technology extremely difficult. In this paper, the graph attention network (GAT) is improved by a Gaussian mixture model (GMM) for incomplete-data representation. The iteratively updated expectation of the GMM serves as the characterization of missing data, which significantly improves the ability to fill in missing data. The GAT fuses multi-source data according to the topology structure so as to comprehensively exploit the spatial information. The gated recurrent units (GRU) extract dynamic fault information from embedded spatial features and classify the time series into various fault types. Moreover, we propose a loss function in the form of weighted focal loss so that the fault-class imbalance issue brought by the data deficiency can be solved. The proposed uniform spatial-temporal graph neural network classification framework together with the GMM (GM-STGNN) can effectively improve fault diagnosis performance and is applied on an experimental platform of an authentic industrial estate. Results of comparative experiments under different conditions of both sufficient and deficient data illustrate the efficiency and advancement of the proposed method.

Note to Practitioners: This paper presents a fault diagnosis method for large-scale energy systems with massive missing data. The proposed GM-STGNN framework can be applied in complex energy networks consisting of coupling subsystems, such as power grids, heating networks, and gas networks. With an incomplete-data representation mechanism, the proposed method utilizes topology information to comprehensively exploit spatial features; it also recurrently transmits historical embedded features and extracts dynamic fault characteristics. Therefore, it can effectively improve energy-network fault identification accuracy when more than half of the samples contain randomly missing values. In the training procedure, after pre-setting the model scale, data acquired by multi-source sensors is put into the model according to the real topology structure, and corresponding fault labels serve as the supervision. The statistical characteristics of missing data are learned with neural-network parameters until the loss converges. In practical application, the sampling data is divided by a time window of a few seconds. The missing data is mitigated by the estimated expectation of the GMM. Therefore, real-time fault classification results can be obtained with high accuracy. The effectiveness of the proposed method is illustrated by fault diagnosis of a typical distributed heating network under the noise influence. Benefiting from the ability to learn fault knowledge, the proposed method can be easily applied to new scenarios where the process data and topology structure of the system are known.
@article{10374148,
  author = {Zhang, Jingfei and Cheng, Yean and He, Xiao},
  journal = {IEEE Transactions on Automation Science and Engineering},
  title = {Fault Diagnosis of Energy Networks Based on Improved Spatial–Temporal Graph Neural Network With Massive Missing Data},
  year = {2023},
  volume = {},
  number = {},
  pages = {1-12},
  doi = {10.1109/TASE.2023.3281394},
}

Fault Diagnosis of Energy Networks: A Graph Embedding Learning Approach

Jingfei Zhang, Yean Cheng, and Xiao He

IEEE Transactions on Instrumentation and Measurement, 2022

For industrial parks containing energy systems, fault diagnosis technology is of great significance for their safe operation. In recent years, the topology of energy systems has become more complex due to the use of technologies such as cogeneration, leading to multienergy coupling. Critical equipment and user nodes in these complex energy networks are vulnerable to a lack of sensor data or non-idealities in the measurement environment. There is an urgent need for a unified and robust fault-diagnosis framework for the overall system to identify faults even under non-ideal data conditions. In this article, to address the problem of fault identification and state prediction, a novel deep learning model is constructed based on graph-embedded recurrent neural networks (RNNs) with self-attentional layers. Unstructured data are put into the graph neural network to extract common spatial features. An additive attention mechanism is implemented in the graph attention network (GAT) to integrate multiscale node information. The graph operator is computed within a gated recurrent unit (GRU) that captures the full range of temporal features. In addition, loss functions are introduced for fault identification and state prediction. Data from an industrial park experiment platform are used for fault identification experiments. The advantages of the proposed approach are illustrated by comparative experiments with different levels of missing data.
@article{9927491,
  author = {Zhang, Jingfei and Cheng, Yean and He, Xiao},
  journal = {IEEE Transactions on Instrumentation and Measurement},
  title = {Fault Diagnosis of Energy Networks: A Graph Embedding Learning Approach},
  year = {2022},
  volume = {71},
  number = {},
  pages = {1-11},
  doi = {10.1109/TIM.2022.3216669},
}

Structure-Preserving Super Resolution With Gradient Guidance

Cheng Ma, Yongming Rao, Yean Cheng, and 3 more authors

CVPR, 2020

Structures matter in single image super resolution (SISR). Recent studies benefiting from generative adversarial network (GAN) have promoted the development of SISR by recovering photo-realistic images. However, there are always undesired structural distortions in the recovered images. In this paper, we propose a structure-preserving super resolution method to alleviate the above issue while maintaining the merits of GAN-based methods to generate perceptual-pleasant details. Specifically, we exploit gradient maps of images to guide the recovery in two aspects. On the one hand, we restore high-resolution gradient maps by a gradient branch to provide additional structure priors for the SR process. On the other hand, we propose a gradient loss which imposes a second-order restriction on the super-resolved images. Along with the previous image-space loss functions, the gradient-space objectives help generative networks concentrate more on geometric structures. Moreover, our method is model-agnostic, which can be potentially used for off-the-shelf SR networks. Experimental results show that we achieve the best PI and LPIPS performance and meanwhile comparable PSNR and SSIM compared with state-of-the-art perceptual-driven SR methods. Visual results demonstrate our superiority in restoring structures while generating natural SR images.
@inproceedings{DBLP:conf/cvpr/MaRCCL020,
  author = {Ma, Cheng and Rao, Yongming and Cheng, Yean and Chen, Ce and Lu, Jiwen and Zhou, Jie},
  title = {Structure-Preserving Super Resolution With Gradient Guidance},
  booktitle = {CVPR},
  pages = {7766--7775},
  publisher = {Computer Vision Foundation / {IEEE}},
  year = {2020},
  url = {},
  doi = {10.1109/CVPR42600.2020.00779},
  timestamp = {Tue, 31 Aug 2021 14:00:04 +0200},
  biburl = {https://dblp.org/rec/conf/cvpr/MaRCCL020.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}