
Advanced Multimodal AI Engineering

This advanced course covers cutting-edge multimodal AI architectures, large-scale training techniques, and real-world deployment. It examines state-of-the-art multimodal models such as GPT-4V, Gemini, CLIP, and Flamingo, along with advanced topics including multimodal reasoning, retrieval-augmented generation (RAG), sensor fusion, and autonomous AI agents. Learners will gain expertise in optimising multimodal AI systems for efficiency, robustness, and scalability through hands-on projects and real-world case studies.


Course Duration: 36 hours

Level: Advanced

Course Objectives

  • Master advanced multimodal fusion strategies (early, late, and hybrid fusion).

  • Understand and implement state-of-the-art multimodal transformers (GPT-4V, Gemini, LLaVA).

  • Optimise multimodal training with self-supervised learning, contrastive learning, and LoRA/QLoRA.

  • Develop multimodal generative AI models for text-to-image, text-to-video, and speech synthesis.

  • Explore multimodal reasoning and autonomous AI agents for decision-making.

  • Learn scalable deployment techniques (cloud, edge AI, streaming).

  • Address ethical, security, and privacy challenges in multimodal AI.

  • Build and deploy a full-scale multimodal AI system as a capstone project.
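To make the first objective concrete, here is a minimal toy sketch of early versus late fusion. It uses plain Python lists as stand-in feature vectors and made-up scores; in a real system these would come from pretrained image and text encoders, and a hybrid approach would combine both strategies.

```python
# Toy illustration of two multimodal fusion strategies.
# All features and scores below are hypothetical placeholders.

def early_fusion(image_feats, text_feats):
    """Early fusion: concatenate modality features into one joint
    vector BEFORE any classifier sees them."""
    return image_feats + text_feats

def late_fusion(image_score, text_score, w_image=0.5, w_text=0.5):
    """Late fusion: each modality is scored by its own model first,
    then the per-modality scores are combined (here, a weighted sum)."""
    return w_image * image_score + w_text * text_score

image_feats = [0.2, 0.8, 0.5]   # e.g. pooled ViT/CNN features (toy values)
text_feats = [0.9, 0.1]         # e.g. pooled transformer features (toy values)

joint = early_fusion(image_feats, text_feats)
print(joint)                    # one joint vector: [0.2, 0.8, 0.5, 0.9, 0.1]

score = late_fusion(image_score=0.7, text_score=0.9)
print(score)                    # combined score, approximately 0.8
```

Early fusion lets a single model learn cross-modal interactions but requires aligned inputs; late fusion is more modular and robust to a missing modality, at the cost of shallower interaction between modalities.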

Prerequisites

  • Strong foundation in deep learning (CNNs, RNNs, Transformers)

  • Proficiency in Python, PyTorch, and TensorFlow

  • Experience with computer vision, NLP, or speech/audio AI models

  • Understanding of self-supervised learning and generative AI techniques
