The seminar explores cutting-edge advancements in Vision-Language Models (VLMs), focusing on topics crucial to their development and application. Through a deep dive into seminal papers and the latest research, students will gain an understanding of how models like CLIP, Llama, and Stable Diffusion work at an architectural and mathematical level. By the end of the seminar, students should have a comprehensive perspective on the current state and future potential of vision-language modeling. They will be equipped to evaluate new research, identify promising applications, and contribute meaningfully to the responsible development of this important field.
This is a Master's-level course. Since these topics are very complex, prior participation in at least one of the following lectures is required:
Additionally, we recommend having taken at least one advanced deep learning lecture, for example:
or a related practical.
The seminar awards 5 ECTS Credits and will take place in person at Helmholtz AI Munich.
All students will be matched to one topic group consisting of a primary paper and two secondary papers. They are expected to give one short and one long presentation on their primary paper, as well as a one-slide summary of the secondary papers from two different perspectives. Additionally, they must submit a two-page final report at the end of the seminar.
The tentative schedule of the seminar is as follows:
Successful participation in the seminar includes:
The registration must be done through the TUM Matching Platform.