OpenVid-1M: A High-Quality Dataset for Text-to-Video Generation

OpenVid-1M addresses two challenges in text-to-video (T2V) generation: the scarcity of high-quality datasets and the underuse of textual information. The dataset comprises more than one million text-video pairs, including 433K high-definition videos. Alongside it, the Multi-modal Video Diffusion Transformer (MVDiT) is introduced, a model that improves video generation by integrating text and visual data more effectively. Experimental results show improvements over prior methods.

