Multimodal RAG Using Generative Search
Build richer insights by processing multiple types of data at once: visual content (images), text (OCR output and captions), and metadata. Multimodal RAG combines these modalities to generate a richer, more contextual understanding of your content.
Image Analysis with Metadata
Need to analyze multiple images while considering their metadata, OCR text, and captions? Use .generate.fromMany() with messages and vars to combine visual analysis with metadata examination for comprehensive image insights.
const {
  data: { payload },
} = await unbody.get
  .imageBlock
  .where({ mimeType: "image/jpeg" })
  .limit(2)
  .select("url", "autoCaption", "alt", "autoOCR", "originalName")
  .generate.fromMany({
    messages: [
      {
        content:
          "Analyze these images comprehensively. Focus on visual content, caption accuracy, and any text found in the images.",
      },
      {
        type: "image",
        content: "{imageData.urls}",
      },
      {
        content: `Please analyze based on this information:
Image Details:
{imageData.details}
OCR Text Found:
{imageData.ocrText}
Generated Captions:
{imageData.captions}
Consider:
1. Are the auto-generated captions accurate?
2. How does the OCR text relate to the visual content?
3. Compare the alt descriptions with what you see.`,
      },
    ],
    options: {
      model: "gpt-4-turbo",
      vars: [
        {
          name: "imageData",
          formatter: "jq",
          expression: `{
            urls: [.[].url],
            details: map({
              filename: .originalName,
              altDescription: .alt
            }),
            ocrText: map(.autoOCR),
            captions: map(.autoCaption)
          }`,
        },
      ],
      maxTokens: 1500,
      temperature: 0.7,
    },
  })
  .exec();
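To see what the jq-formatted imageData variable resolves to before it is interpolated into the prompt, the same transformation can be sketched in plain JavaScript. This is an illustration only, and the sample records below are hypothetical; at runtime the jq expression runs over the actual imageBlock results:

```javascript
// Hypothetical records mimicking two selected imageBlock results.
const records = [
  {
    url: "https://example.com/a.jpg",
    originalName: "a.jpg",
    alt: "A cat on a couch",
    autoOCR: "SALE",
    autoCaption: "a cat lying on a couch",
  },
  {
    url: "https://example.com/b.jpg",
    originalName: "b.jpg",
    alt: "An exit sign",
    autoOCR: "EXIT",
    autoCaption: "a green exit sign",
  },
];

// Plain-JavaScript equivalent of the jq expression in `vars`:
// each map() below mirrors the corresponding jq map()/iteration.
const imageData = {
  urls: records.map((r) => r.url),
  details: records.map((r) => ({
    filename: r.originalName,
    altDescription: r.alt,
  })),
  ocrText: records.map((r) => r.autoOCR),
  captions: records.map((r) => r.autoCaption),
};

// imageData now holds parallel arrays (urls, details, ocrText, captions),
// which the message templates reference as {imageData.urls}, etc.
console.log(JSON.stringify(imageData, null, 2));
```

Each field of imageData is an array aligned by image, so the prompt can cross-reference a caption, its OCR text, and its alt description for the same image.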
Learn more about generative search in our Generative Search Guide.