Multimodal AI: What Is It and Why Is It Better Than Regular AI?

by Roveen

In the early days of ChatGPT, if you asked whether it could accept images or videos as prompts, the chatbot would turn down the request.

More recently, however, ChatGPT Plus (GPT-4) has been able to accept a variety of prompts, from text to images and voice. GPT-4, then, is what we would call a multimodal AI.

The name multimodal is a combination of ‘multiple’ and ‘modes’. A multimodal AI is one that has been trained on several types of input and can therefore accept prompts as text, images, video, or voice.

What Makes Multimodal AI Better?

First, multimodal AI can be put to use across a much wider range of platforms and settings. Because it can take input in two or more modes, its usefulness extends far beyond a chatbot that simply answers typed questions.

Second, multimodal AI systems are more ‘knowledgeable’ than regular, unimodal AI. They have been trained on a broader range of data so that they can interpret different types of input. This can make them more accurate than their unimodal predecessors, especially on tasks that draw on more than one kind of information.

Because of these benefits, multimodal AI is quickly finding uses in various industries. Mobile phone makers, for example, are already using it to improve many aspects of their phones, including cameras, microphones, light and depth sensors, and editing tools.

Examples of Multimodal AIs

GPT-4 is one example of a multimodal AI. Another is Runway Gen-2, which can turn text prompts or images into original video content.

Google Gemini is also a multimodal AI, as is Meta's ImageBind, which accepts text, image, and audio prompts.

Multimodal AI, for all its flaws and worst-case scenarios, has the potential to transform fields around the world, including medical research, disease prevention, and engineering.

