Multi-Modal AI is rapidly taking over.

It’s truly amazing how fast LlamaIndex incorporated a robust pipeline for multi-modal RAG capabilities.

Here’s a beginner-friendly guide to get started with multi-modal RAG using LlamaIndex.


Multi-Modal LLM

First let’s start with some simple stuff.

We just want to ask questions about our images.

OpenAIMultiModal is a wrapper around OpenAI’s latest vision model that lets us do exactly that.

  • First we create documents as usual, like below:
image_documents = SimpleDirectoryReader("./input_images").load_data()
  • Next we initialize the multi-modal llm instance:
openai_mm_llm = OpenAIMultiModal()
  • Then, we pass the image_documents to the complete() method of the llm. Use stream_complete() instead of complete() for token streaming.
response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)


We can see that the multi-modal llm responded with an alternative text for each of the input images.

Told you it was easy. LlamaIndex handles all the underlying logic for converting those image_documents to a format the multi-modal llm can consume.

But there’s an issue!

Just like text-based RAG, where we were limited by the context length, here we’re also limited by how many images we can pass.

We can’t just pass loads of images. We would only want to pass those images that are related to our query.

How do you find images related to your query? Yep, with the help of embeddings.
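To make the idea concrete, here’s a minimal sketch of picking only the most relevant images by embedding similarity. It is not LlamaIndex code: the 3-dimensional vectors, the file names, and the helper names (cosine_similarity, top_k_images) are all made up for illustration, and real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_images(query_vec, image_vecs, k=2):
    """Return the k image names whose embeddings are closest to the query."""
    ranked = sorted(
        image_vecs,
        key=lambda name: cosine_similarity(query_vec, image_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings (assumption: hand-written, not model output).
image_vecs = {
    "porsche.jpg": [0.9, 0.1, 0.0],
    "tesla.jpg": [0.7, 0.3, 0.1],
    "van_gogh.jpg": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]  # stand-in for an embedded car-related query

selected = top_k_images(query_vec, image_vecs, k=2)
print(selected)
```

Only the selected images would then be passed to the multi-modal llm, which keeps the request small no matter how many images we have on disk.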

Multi-Modal Vector Store Index

LlamaIndex has MultiModalVectorStoreIndex, which creates embeddings for both image and text nodes and stores them in vector stores.

By default, it uses CLIP to embed image nodes and OpenAI’s ada model to embed text nodes.

Let’s create the multi-modal index:

documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(documents)

Here we pass the loaded documents, which could include both image and text nodes.

from_documents() also lets us specify which vector stores to use for text and for images. By default, it uses simple in-memory vector stores.
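If it helps to picture what “one store for text, one for images” means, here’s a toy sketch. It is not how LlamaIndex implements the index: ToyMultiModalIndex, Node, and both embed functions are hypothetical stand-ins, and the fake embeddings only exist to show that the two modalities live in separate stores with different vector sizes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str     # "text" or "image"
    content: str  # raw text, or a file path for images


@dataclass
class ToyMultiModalIndex:
    """Keeps text and image embeddings in two separate in-memory stores."""

    text_store: dict = field(default_factory=dict)
    image_store: dict = field(default_factory=dict)

    def embed_text(self, text):
        # Hypothetical stand-in for a text embedding model (2-d output).
        return [float(len(text)), float(text.count(" "))]

    def embed_image(self, path):
        # Hypothetical stand-in for an image embedding model (3-d output).
        return [float(len(path)), 1.0, 0.0]

    def add(self, node):
        # Route each node to the store that matches its modality.
        if node.kind == "image":
            self.image_store[node.content] = self.embed_image(node.content)
        else:
            self.text_store[node.content] = self.embed_text(node.content)


toy_index = ToyMultiModalIndex()
for node in [
    Node("text", "The Porsche 911 is a sports car."),
    Node("image", "porsche.jpg"),
    Node("image", "tesla.jpg"),
]:
    toy_index.add(node)

print(len(toy_index.text_store), len(toy_index.image_store))
```

Because the two embedding models produce vectors of different sizes, mixing them in one collection would not work, which is why each modality gets its own store.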

Here’s how to use two different Qdrant vector stores for image and text nodes:

import qdrant_client

# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_mm_db")

# One collection per modality
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
# The storage context holds the text store; the image store is passed separately
storage_context = StorageContext.from_defaults(vector_store=text_store)

documents = SimpleDirectoryReader("./mixed_wiki/").load_data()
index = MultiModalVectorStoreIndex.from_documents(
    documents, storage_context=storage_context, image_vector_store=image_store
)

Now that we’ve created the multi-modal index, we wanna retrieve relevant nodes (both image and text) from that index, based on our query.

Multi-Modal Vector Store Index Retriever

So we create a MultiModalVectorIndexRetriever from the index:

mm_retriever = index.as_retriever()
retrieval_results = mm_retriever.retrieve('here goes my query')

We can specify how many relevant text and image nodes to retrieve by passing similarity_top_k=3, image_similarity_top_k=3 while creating the retriever.

When encoding a query, the above retriever embeds the query text with both the embed_model and the image_embed_model.

Thus, the query_embedding from the embed_model is used to retrieve text nodes from the text vector store.

And the query_embedding from the image_embed_model is used to retrieve image nodes from the image vector store.
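Here’s a small sketch of that two-track retrieval. It is a toy illustration, not the retriever’s actual implementation: the keyword-based embedders, the store contents, and the helper names (search, retrieve) are assumptions; only the parameter names similarity_top_k and image_similarity_top_k mirror the ones mentioned above.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(store, query_vec, top_k):
    """Rank a store's entries by similarity to the query and keep the best top_k."""
    ranked = sorted(store, key=lambda key: dot(query_vec, store[key]), reverse=True)
    return ranked[:top_k]


def retrieve(query, text_store, image_store, embed_text, embed_image_query,
             similarity_top_k=2, image_similarity_top_k=1):
    # The same query string is embedded twice, once per embedding space.
    text_hits = search(text_store, embed_text(query), similarity_top_k)
    image_hits = search(image_store, embed_image_query(query), image_similarity_top_k)
    return text_hits, image_hits


# Hypothetical keyword-based embedders standing in for real models.
def toy_text_embed(query):
    q = query.lower()
    return [float("porsche" in q), float("tesla" in q)]


def toy_image_embed(query):
    q = query.lower()
    return [float("porsche" in q), float("tesla" in q), float("painting" in q)]


# Hypothetical pre-computed embeddings for stored nodes.
text_store = {
    "Porsche history": [1.0, 0.0],
    "Porsche 911 specs": [0.9, 0.1],
    "Tesla Model S": [0.0, 1.0],
}
image_store = {
    "porsche.jpg": [1.0, 0.0, 0.0],
    "tesla.jpg": [0.0, 1.0, 0.0],
    "van_gogh.jpg": [0.0, 0.0, 1.0],
}

text_hits, image_hits = retrieve(
    "Tell me more about the Porsche",
    text_store, image_store, toy_text_embed, toy_image_embed,
)
print(text_hits, image_hits)
```

Each store is only ever searched with a query vector from its own embedding space, so the text and image results are ranked independently and each has its own top-k limit.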

Multi-Modal Query Engine

Now that we have retrieved the relevant image and text nodes using the multi-modal retriever above, let’s create a multi-modal query engine so that we can query about the retrieved image and text nodes.

query_engine = index.as_query_engine()
query_str = "Tell me more about the Porsche"

Here we create the query engine with the index’s as_query_engine() method, which first creates a retriever like we did earlier and then builds the query engine on top of it.

We can pass the multi_modal_llm and text_qa_template if we don’t wanna use the default one:

qa_tmpl_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

query_engine = index.as_query_engine(
    multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
)

The query engine then passes the retrieved text and image nodes as context to the Vision API.
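Roughly speaking, the retrieved text ends up inside the prompt while the images travel alongside it. The sketch below shows that assembly step with plain string formatting. It is an approximation under assumptions, not the query engine’s real code: build_request is a hypothetical helper, the node contents are placeholders, and the {context_str} placeholder in the template is assumed to be where the retrieved text goes.

```python
qa_template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_request(template, text_nodes, image_paths, query_str):
    """Fill the template with retrieved text; images are attached separately."""
    context_str = "\n\n".join(text_nodes)
    prompt = template.format(context_str=context_str, query_str=query_str)
    return {"prompt": prompt, "images": list(image_paths)}


# Placeholder retrieval results (assumption: not real index output).
request = build_request(
    qa_template,
    text_nodes=["The Porsche 911 is a sports car.", "Porsche is based in Stuttgart."],
    image_paths=["porsche.jpg"],
    query_str="Tell me more about the Porsche",
)
print(request["prompt"])
```

The takeaway is that a custom text_qa_template only changes the wording around the context and the query; the retrieved nodes are filled in for us.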

Now let’s use the query engine:

query_str = "Tell me more about the Porsche"
response = query_engine.query(query_str)

The multi-modal query engine stores the image and text nodes retrieved by the multi-modal retriever in the response’s metadata.

If we inspect those nodes from the metadata, we can see that 3 text nodes and 2 images were retrieved for the given query.

Retrieved nodes

These were all sent to the multi-modal llm as context to get the final response.
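As a rough idea of what that inspection looks like, here’s a sketch that counts the retrieved nodes per modality. The dictionary below is a hand-written stand-in for the response metadata: the key names "text_nodes" and "image_nodes" are assumptions, and the entries are placeholders rather than real retrieved nodes.

```python
# Hypothetical stand-in for the response metadata. The key names
# "text_nodes" and "image_nodes" are assumptions, and the entries
# are placeholders rather than real nodes.
metadata = {
    "text_nodes": ["text node 1", "text node 2", "text node 3"],
    "image_nodes": ["image node 1", "image node 2"],
}


def count_retrieved(metadata):
    """Count how many nodes of each modality were used as context."""
    return {key: len(nodes) for key, nodes in metadata.items()}


counts = count_retrieved(metadata)
print(counts)
```

A quick check like this is handy when tuning similarity_top_k and image_similarity_top_k, since it tells us how much context actually reached the multi-modal llm.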

Here’s the official documentation for LlamaIndex multi-modal retrieval

Thanks for reading. Stay tuned for more.

I tweet regularly about these topics and anything else I’m exploring. Follow me on Twitter.