TUM Master's Seminar: Advanced Topics in Vision-Language Models (SS 2024)


The seminar explores cutting-edge advances in Vision-Language Models (VLMs), focusing on topics central to their development and application. Through a deep dive into seminal papers and the latest research, students will gain an understanding of how models like CLIP, Llama, and Stable Diffusion work at an architectural and mathematical level. By the end of the seminar, students should have a comprehensive perspective on the current state and future potential of vision-language modeling. They will be equipped to evaluate new research, identify promising applications, and contribute meaningfully to the responsible development of this important field.
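To illustrate the "mathematical level" at which models like CLIP will be discussed, the sketch below shows the core of CLIP's contrastive scoring: images and captions are embedded into a shared space, and temperature-scaled cosine similarities yield a distribution over captions for each image. This is a minimal toy illustration, not CLIP itself: the random vectors stand in for the outputs of real image and text encoders, and `clip_logits` is a hypothetical helper name.

```python
import numpy as np

def clip_logits(image_emb, text_emb, temperature=0.07):
    """Temperature-scaled cosine similarities between image and text embeddings."""
    # L2-normalize so the dot product equals cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # One row per image, one column per caption
    return image_emb @ text_emb.T / temperature

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 8))  # 2 toy "image embeddings" of dimension 8
texts = rng.normal(size=(3, 8))   # 3 toy "caption embeddings"

logits = clip_logits(images, texts)
# Softmax over captions: each image gets a probability distribution
# over the candidate captions -- the basis of zero-shot classification.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # (2, 3)
```

During training, CLIP optimizes a symmetric cross-entropy on exactly these logits so that matching image-caption pairs score highest; at inference, the same scoring enables zero-shot classification by treating class names as captions.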

This is a Master's level course. Since these topics are very complex, prior participation in at least one of the following lectures is required:

  • Introduction to Deep Learning (IN2346)
  • Machine Learning (IN2064)

Additionally, we recommend having taken at least one advanced deep learning lecture, for example:

  • AML: Deep Generative Models (CIT4230003)
  • Machine Learning for Graphs and Sequential Data (IN2323)
  • Computer Vision III: Detection, Segmentation, and Tracking (IN2375)
  • Machine Learning for 3D Geometry (IN2392)
  • Advanced Natural Language Processing (CIT4230002)
  • ADL4CV (IN2390)
  • ADL4R (IN2349)

or a related practical.


The preliminary meeting will take place at 1pm on Thursday, 8th of February 2024, on Zoom. You can find the slides here.

The seminar awards 5 ECTS credits and will take place in person at Helmholtz AI Munich.

All students will be matched to one topic group consisting of a primary paper and two secondary papers. They are expected to give one short and one long presentation on their primary paper (from the perspective of an academic reviewer), as well as a one-slide presentation on each of the secondary papers from two different perspectives (industry practitioner and academic researcher).

The tentative schedule of the seminar is as follows:

  • Online Introductory Session, April 18th, (1-2:30pm), https://zoom.us/j/94077626596. You can find the slides here.
  • Onsite Short presentations on May 16th, 23rd (1-3pm)
  • Onsite Long presentations on June 6th, 13th, 20th, 27th, July 4th (1-3pm)

For questions, please contact yiran.huang@helmholtz-munich.de or luca.eyring@helmholtz-munich.de.

Topics to select from:

Foundation VLMs

  1. Learning Transferable Visual Models From Natural Language Supervision (CLIP, https://arxiv.org/abs/2103.00020)
  2. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (https://arxiv.org/abs/2201.12086)
  3. Visual Instruction Tuning (https://arxiv.org/abs/2304.08485)
  4. CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation (https://arxiv.org/pdf/2311.18775.pdf)
  5. High-Resolution Image Synthesis With Latent Diffusion Models (Stable Diffusion, https://arxiv.org/abs/2112.10752)

Zero/few-shot learning in VLMs

  1. Test-time Adaptation with CLIP Reward for Zero-shot Generalization in VLMs (https://arxiv.org/pdf/2305.18010.pdf)
  2. DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models (https://arxiv.org/pdf/2310.16436.pdf)
  3. Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners (https://arxiv.org/pdf/2303.02151.pdf)
  4. Waffling around for Performance: Visual Classification with Random Words and Broad Concepts (https://arxiv.org/pdf/2306.07282.pdf)
  5. Learning Vision from Models Rivals Learning Vision from Data (https://arxiv.org/pdf/2312.17742.pdf)

Language guidance in computer vision

  1. DeViL: Decoding Vision features into Language (https://arxiv.org/abs/2309.01617)
  2. Label-Free Concept Bottleneck Models (https://arxiv.org/abs/2304.06129)
  3. Visual Classification via Description from Large Language Models (https://arxiv.org/abs/2210.07183)
  4. ViperGPT: Visual Inference via Python Execution for Reasoning (https://arxiv.org/abs/2303.08128)
  5. Image-free Classifier Injection for Zero-Shot Classification (https://arxiv.org/abs/2308.10599)

Personalized text-to-image models

  1. Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet, https://arxiv.org/abs/2302.05543)
  2. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion (https://textual-inversion.github.io/)
  3. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation (https://dreambooth.github.io/)
  4. Differential Diffusion: Giving Each Pixel Its Strength (https://differential-diffusion.github.io/)
  5. Anti-DreamBooth: Protecting users from personalized text-to-image synthesis (https://arxiv.org/pdf/2303.15433.pdf)

Compositionality in VLMs

  1. Vision-by-Language for Training-Free Compositional Image Retrieval (https://arxiv.org/abs/2310.09291)
  2. CoVR: Learning Composed Video Retrieval from Web Video Captions (https://imagine.enpc.fr/~ventural/covr/dataset/covr.pdf)
  3. Zero-Shot Composed Image Retrieval with Textual Inversion (https://arxiv.org/abs/2303.15247)
  4. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models (https://arxiv.org/abs/2301.13826)
  5. When and why vision-language models behave like bags-of-words, and what to do about it? (https://arxiv.org/abs/2210.01936)


A successful participation in the seminar includes:

  • Active participation in the entire event: we have a 70% attendance policy for this seminar, i.e., you need to attend at least 5 of the 7 sessions.
  • Short presentation on May 16th or 23rd (10 minutes talk including questions)
  • Long presentation on June 6th, 13th, 20th, 27th, or July 4th (20 minutes talk including questions)


The registration must be done through the TUM Matching Platform.