Have you ever read a text or Slack message and found yourself trying to discern whether the sender was joking or being serious? Had you heard their tone of voice, their intent would likely have been clear. In person, you could also read their facial expressions and body language. That gap shows how much voice and imagery add to communication.
Last week, OpenAI introduced voice and image capabilities in ChatGPT. At its core, generative AI relies on a Large Language Model (LLM) trained on trillions of words of content, including web pages, articles, books, and code. Until now, users interacted with ChatGPT primarily through text. Now you can also supply images and speak to it, and it will understand both the image and your voice command.
Multimodal AI combines multiple modalities – text, images, audio, and video – and understands information across these different data types. This allows for richer interactions, whether with a single data type such as an image or a combination of several.
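As a concrete illustration of what "text plus image in one request" looks like in practice, here is a minimal sketch using the message format of OpenAI's Chat Completions API, where a single user message carries both a text part and an image part. The question, image URL, and model name are illustrative placeholders, not values from this article.

```python
def build_multimodal_message(question: str, image_url: str) -> dict:
    """Combine a text prompt and an image reference in one user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_multimodal_message(
    "What landmark is shown in this photo?",
    "https://example.com/eiffel-tower.jpg",
)

# The message would then be sent to a vision-capable model, e.g.:
# client.chat.completions.create(model="gpt-4-vision-preview", messages=[message])
```

The key idea is that the `content` field is no longer a single string but a list of typed parts, which is what lets one prompt mix modalities.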
Almost all documentation includes images. Think back to the last time you assembled IKEA furniture. Did you read text instructions? No – you were given only pictures showing how to put together your six-drawer dresser, because a picture is worth a thousand words.
Multimodal AI will have an impact across industries:
- Travel: Imagine strolling through Paris, capturing photos of landmarks and asking a chatbot for historical context. You're given a mini history lesson on the spot.
- eCommerce: Struggling with assembling an appliance? Just snap a picture and get guidance from a chatbot.
- Education: Get an explanation of any image used in classes like science, history, or art.
- Healthcare: Instead of tediously typing insurance details, you can now simply upload your insurance card.
- Software: Have the AI write code for you just by looking at an image.
- Finance: Receive insights by having the AI analyze financial graphs and market trends.
- Home Design: Upload photos of the rooms in a house and get interior styling suggestions.
Now, with voice, you can ask a question by speaking instead of typing. But also imagine a scenario where a colleague is double-booked and has to miss a meeting. An AI bot could transcribe and summarize each participant's spoken words, and it could gauge urgency from their tone and facial expressions.
In the future, humanoid robots might even replace copilots in aircraft. Armed with advanced AI-driven senses, these humanoids could instantly respond to visual cues from the plane’s instruments and audio commands from Air Traffic Control.
Multimodal AI is set to change how we work and communicate in the business realm. It’s all about richer interactions, smarter solutions, and bridging gaps in communication.