Title: Multimodal Data Lifecycles: Labeling, Selection, Mixing, and Synthesis
Speaker: Tzu-Heng Huang 黃子恒
When & where: 10:45-12:00 on Thursday, Jan 15th, at BL-101
Abstract:
High-quality human-annotated data has fueled significant ML breakthroughs. However, as multimodal models scale, the demand for such data outpaces human labeling capacity, creating bottlenecks in cost and efficiency. In this talk, we will explore the multimodal data lifecycle through the lens of data-centric AI methods: labeling, selection, mixing, and synthesis. First, I will introduce The ALCHEmist, a new automated labeling system based on program distillation, and its application to LLM judges (NeurIPS 2024 Spotlight). Next, I will discuss Grad-Mimic, a method for fine-grained data selection for efficient pretraining (ICML 2025 DataWorld Oral). Then, I will cover R&B, an overhead-reduced data mixing strategy that optimizes multimodal data portfolios. Finally, I will introduce a learnability-aware data synthesis framework. Ultimately, these data-centric approaches, grounded in weak supervision, promise more effective selection, efficient generation, and annotation-light ML systems for scaling future multimodal models.
Bio:
Tzu-Heng Huang is a final-year Ph.D. candidate in Computer Science at the University of Wisconsin-Madison, advised by Prof. Frederic Sala. He earned his B.S. in Computer Science from National Chengchi University in 2020. His current research centers on Data-Centric AI for multimodal models, emphasizing methods that allow models to learn more with less supervision. His key publications cover (i) online domain mixing, (ii) model-aware pretraining data selection, (iii) data curation ensembles, and (iv) cost-effective automated labeling systems. His research experience includes internships at Apple AIML (focusing on large-scale CLIP pre-training) and Meta GenAI (on synthetic data generation), as well as earlier positions at Argonne National Laboratory. Beyond academia, he founded Awan.AI LLC (now integrated into TechTCM) to apply AI agents to Traditional Chinese Medicine. Tzu-Heng has published in top venues such as NeurIPS, ICML, and ICCV, earning awards including a NeurIPS 2024 Spotlight, an ICML 2025 DataWorld Oral, and ICCV DataComp 2023 First Place.