← All terms

define multimodal --plain-english

Illustration for "Multimodal" from the Non-Technical Technical Dictionary

Multimodal

TLDR:For most of their history, the AI models you chat with could only read and write.

For most of their history, the AI models you chat with could only read and write. Text in, text out. They were brilliant pen pals who happened to be blind and deaf. Multimodal is the word for the models that aren't anymore.

A multimodal model is one that can take in, and sometimes produce, more than just text: images, audio, video, not only words. "Modal" means mode, as in mode of input. One mode is text. Add the ability to see a picture or hear a clip and you've gone multi-modal.

The simplest way to feel it is senses. An old text-only model had one sense, reading. A multimodal model has more. It can look at a photo you paste in and describe what's in it, read a screenshot, pull the numbers off a receipt, watch a short video, or listen to a voice note. Same underlying idea as before, predicting the next token, but now what it can take in isn't limited to typed words.

Real examples, so it's not abstract. Claude can read text and images in the same message, so you can hand it a screenshot or a PDF and ask about it. Google's Gemini goes further on input, taking in audio and video as well. And GPT-4o was the first model to handle voice natively, hearing and speaking in a single step instead of bolting a separate transcriber and voice onto a text model, which is what made real, flowing spoken conversation with an AI possible. The exact mix differs by model and changes fast, but the direction is one way: toward models that handle whatever you throw at them, not just text.

Why this quietly changes what you can build. The instant a model can see, "read this contract," "what's wrong with this design," "pull the line items off this invoice photo," and "what's happening in this clip" all become things you can just ask. You stop writing fussy descriptions of what's in an image and start handing over the image itself. For a non-coder, that's huge, because most of your real-world stuff isn't neat text. It's screenshots, PDFs, photos, recordings.

One practical note tied to cost. Feeding a model an image or an audio clip isn't free. It gets measured and billed too, often converted into a surprising number of tokens. So a detailed image can cost more to "look at" than a paragraph costs to read. Worth knowing before you pipe a thousand photos through it.

Multimodal means the model has more than one sense. It can see, and sometimes hear, not just read. The practical unlock: stop describing your screenshots, contracts, and photos, and just hand them over.