Multimodal RAG Using Generative Search
Create comprehensive insights by processing multiple types of data simultaneously: images, text, video, and audio. Multimodal RAG combines these modalities to generate a richer, more contextual understanding of your content.
Image Analysis with Metadata
When you need to understand multiple images beyond their visual content, combining OCR results, auto-generated captions, and metadata, use .generate.fromMany() with messages and vars.
const {
  data: { generate },
} = await unbody.get
  .imageBlock
  // Match JPEG images only
  .where({ mimeType: "image/jpeg" })
  .limit(2)
  .select("url", "autoCaption", "alt", "autoOCR", "originalName")
  // Generate a single analysis grounded in all matched images
  .generate.fromMany({
    messages: [
      {
        content:
          "Analyze these images comprehensively. Focus on visual content, caption accuracy, and any text found in the images.",
      },
      {
        // Image message: {imageData.urls} is interpolated from the var below
        type: "image",
        content: "{imageData.urls}",
      },
      {
        content: `Please analyze based on this information:
Image Details:
{imageData.details}
OCR Text Found:
{imageData.ocrText}
Generated Captions:
{imageData.captions}
Consider:
1. Are the auto-generated captions accurate?
2. How does the OCR text relate to the visual content?
3. Compare the alt descriptions with what you see.`,
      },
    ],
    options: {
      model: "gpt-4-turbo",
      vars: [
        {
          // Reshape the selected records with a jq expression
          name: "imageData",
          formatter: "jq",
          expression: `{
            urls: [.[].url],
            details: map({
              filename: .originalName,
              altDescription: .alt
            }),
            ocrText: map(.autoOCR),
            captions: map(.autoCaption)
          }`,
        },
      ],
      maxTokens: 1500,
      temperature: 0.7,
    },
  })
  .exec();
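To build an intuition for what the jq var produces, the expression above reshapes the selected records into a single object of parallel arrays, roughly like this TypeScript sketch (the ImageRecord type and buildImageData helper are illustrative, not part of the Unbody SDK):

```typescript
// Hypothetical shape of a record after the .select() above.
type ImageRecord = {
  url: string;
  autoCaption: string;
  alt: string;
  autoOCR: string;
  originalName: string;
};

// Mirrors the jq expression: one object with urls, details, ocrText, and captions.
function buildImageData(records: ImageRecord[]) {
  return {
    urls: records.map((r) => r.url),                 // [.[].url]
    details: records.map((r) => ({                   // map({filename, altDescription})
      filename: r.originalName,
      altDescription: r.alt,
    })),
    ocrText: records.map((r) => r.autoOCR),          // map(.autoOCR)
    captions: records.map((r) => r.autoCaption),     // map(.autoCaption)
  };
}
```

Each field of the resulting object is then available to the messages as {imageData.urls}, {imageData.details}, and so on.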
Learn more about generative search in our Generative Search Guide.