
Advanced Multimodal AI Engineering
This advanced course covers cutting-edge multimodal AI architectures, large-scale training techniques, and real-world deployment. It examines recent multimodal models such as GPT-4V, Gemini, CLIP, and Flamingo, along with topics including multimodal reasoning, retrieval-augmented generation (RAG), sensor fusion, and autonomous AI agents. Learners will gain expertise in optimising multimodal AI systems for efficiency, robustness, and scalability through hands-on projects and real-world case studies.
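As a taste of the material, contrastive image–text models such as CLIP can be exercised in a few lines. The sketch below is illustrative only: it assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, and "photo.jpg" is a placeholder path.

```python
# Minimal CLIP zero-shot image-text matching sketch.
# Assumes: pip install torch transformers pillow
# "photo.jpg" is a placeholder; the checkpoint is the public
# openai/clip-vit-base-patch32 release on the Hugging Face Hub.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarities scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```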
Course Duration: 36 hours
Level: Advanced

Course Objectives
Master advanced multimodal fusion strategies (early, late, and hybrid fusion); see the fusion sketch after this list.
Understand and implement state-of-the-art multimodal transformers (GPT-4V, Gemini, LLaVA).
Optimise multimodal training with self-supervised learning, contrastive learning, and LoRA/QLoRA (a LoRA sketch follows this list).
Develop multimodal generative AI models for text-to-image, text-to-video, and speech synthesis.
Explore multimodal reasoning and autonomous AI agents for decision-making.
Learn scalable deployment techniques (cloud, edge AI, streaming).
Address ethical, security, and privacy challenges in multimodal AI.
Build and deploy a full-scale multimodal AI system as a capstone project.
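To make the fusion objective concrete, here is a minimal PyTorch sketch contrasting early fusion (concatenate modality features, learn a joint representation) with late fusion (score each modality independently, combine the decisions). All modules, dimensions, and the class count are illustrative placeholders, not a prescribed architecture; hybrid fusion mixes the two by exchanging information between modality streams at intermediate layers, e.g. via cross-attention.

```python
# Sketch: early vs. late fusion of image and text features (PyTorch).
# All modules and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenate modality features, then learn a joint representation."""
    def __init__(self, img_dim=512, txt_dim=768, num_classes=10):
        super().__init__()
        self.joint = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        return self.joint(torch.cat([img_feat, txt_feat], dim=-1))

class LateFusionClassifier(nn.Module):
    """Score each modality independently, then combine the decisions."""
    def __init__(self, img_dim=512, txt_dim=768, num_classes=10):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)
        self.txt_head = nn.Linear(txt_dim, num_classes)

    def forward(self, img_feat, txt_feat):
        # Simple average of per-modality logits; a learned gate is also common.
        return 0.5 * (self.img_head(img_feat) + self.txt_head(txt_feat))

img = torch.randn(4, 512)   # e.g. pooled vision-encoder features
txt = torch.randn(4, 768)   # e.g. pooled text-encoder features
print(EarlyFusionClassifier()(img, txt).shape)  # torch.Size([4, 10])
print(LateFusionClassifier()(img, txt).shape)   # torch.Size([4, 10])
```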
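For the parameter-efficient fine-tuning objective, a minimal LoRA sketch using the Hugging Face peft library. The checkpoint name is a placeholder (the Llama 2 weights are gated), and the target_modules names depend on the base model's layer naming; "q_proj"/"v_proj" match LLaMA-style attention projections and are an assumption here, not a universal setting.

```python
# Sketch: attaching LoRA adapters to a pretrained model via peft.
# Assumes: pip install transformers peft
# Checkpoint name is a placeholder; target_modules names vary by
# architecture ("q_proj"/"v_proj" assume LLaMA-style attention layers).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model
```

QLoRA follows the same adapter pattern but first loads the base model in 4-bit precision (e.g. via bitsandbytes) before attaching the LoRA layers.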
Prerequisites
Strong foundation in deep learning (CNNs, RNNs, Transformers)
Proficiency in Python and a deep learning framework (PyTorch or TensorFlow)
Experience with computer vision, NLP, or speech/audio AI models
Understanding of self-supervised learning and generative AI techniques
