-Date & Time: July 8, 2025 (Tuesday), 15:00-16:30
-Venue: Room E6-A203
-Speaker: Chair Professor C.-C. Jay Kuo (郭宗杰), Department of Electrical and Computer Engineering, University of Southern California
-Topic: Interpretable and Efficient Multimodal Learning
-Abstract:
Conventional computer vision tasks rely on visual input, categorical labels, and bounding boxes. A recent trend is to consider multimodal inputs that contain images, videos, scene graphs, and captions. Multimodal learning, e.g., OpenAI’s CLIP (Contrastive Language–Image Pre-training), has received considerable attention. Although large language models (LLMs) achieve impressive performance on multimodal benchmarks, their black-box nature and high training and inference costs remain major challenges. Over the last three years, my lab at USC has developed multimodal learning algorithms grounded in the Green Learning principle. These algorithms are interpretable and efficient. They cluster heterogeneous multimodal data into homogeneous subgroups and then establish an interpretable, efficient mechanism that connects visual and textual data. The modular design yields intermediate results with semantic meaning, and final decisions are made by aggregating conditional probabilities. Furthermore, subgroup inference eliminates the need to train complex large models that handle heterogeneous data simultaneously. I will illustrate these points with three examples: human-object-interaction (HOI) detection, image-text retrieval, and video-text retrieval.
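
For attendees who want a concrete picture of the "cluster into subgroups, then aggregate conditional probabilities" idea described above, the following is a minimal Python sketch. It is not the speaker's Green Learning pipeline; the synthetic embeddings, the use of KMeans for subgrouping, and logistic-regression subgroup models are placeholder assumptions chosen only to make the flow of the computation explicit.

```python
# Hypothetical sketch: subgroup clustering of multimodal embeddings followed by
# aggregation of per-subgroup conditional probabilities, i.e.
#   P(y | x) = sum_g P(g | x) * P(y | x, g)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n, d_img, d_txt, n_groups, n_classes = 600, 32, 16, 4, 3

# Stand-ins for pre-extracted image and caption embeddings plus labels.
X_img = rng.normal(size=(n, d_img))
X_txt = rng.normal(size=(n, d_txt))
y = rng.integers(0, n_classes, size=n)

# Joint multimodal representation: here simply L2-normalized concatenation.
X = normalize(np.hstack([X_img, X_txt]))

# Step 1: partition heterogeneous data into homogeneous subgroups.
km = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit(X)
groups = km.labels_

# Step 2: fit one simple, interpretable model per subgroup.
experts = []
for g in range(n_groups):
    idx = groups == g
    experts.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))

# Step 3: soft subgroup assignment from distances to cluster centers,
# then aggregate the experts' conditional probabilities.
dist = km.transform(X)                                   # (n, n_groups)
p_group = np.exp(-dist) / np.exp(-dist).sum(axis=1, keepdims=True)
p_class = np.zeros((n, n_classes))
for g, clf in enumerate(experts):
    p_class[:, clf.classes_] += p_group[:, [g]] * clf.predict_proba(X)

print("aggregated posterior shape:", p_class.shape)     # (600, 3)
```

Because each subgroup model only ever sees relatively homogeneous data, it can stay small and inspectable, which is the efficiency and interpretability argument the abstract makes; the aggregation step is what combines these local decisions into a final prediction.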