Building Multimodal LLM Applications: Engineering Guide
A practical AI engineering guide to building multimodal LLM apps that handle images, audio, and documents reliably — with modality-specific evals and grounding.

Text-only LLM apps are relatively forgiving. The input is a string, the output is a string, and a well-placed unit test can cover a surprising amount of failure surface. Add images, audio, or multi-page PDFs and the picture changes: the input pipeline grows, the failure modes multiply across modalities, and the eval suite that worked for text tells you almost nothing about whether the vision or audio path is actually working.
Multimodal AI is not a feature you bolt on. It touches your ingestion pipeline, prompt design, model routing, error handling, and evaluation infrastructure all at once. Teams that treat it as a simple model swap ("the API accepts images now, so we're multimodal") ship systems that look fine in demos and fall apart on production traffic within two weeks.
At Laxaar we've built multimodal systems for document processing, visual QA, and audio transcription pipelines. The engineering patterns here are drawn from those deployments, not from toy examples.
What you'll learn
- What multimodal LLM applications actually are
- Ingestion pipelines for each modality
- Prompt design for mixed-modality inputs
- Grounding: connecting model output to source content
- Model and routing decisions
- Modality-specific evals that actually catch regressions
- Infrastructure and cost considerations
- Frequently Asked Questions
What multimodal LLM applications actually are
A multimodal LLM application accepts two or more input types (text, image, audio, video, or documents), processes them through a language model, and produces grounded outputs tied back to the source material.
The word "grounded" is doing real work in that definition. A multimodal app that describes an image is not the same as one that answers questions about specific regions of that image and cites page numbers from an uploaded PDF. The second is grounded; the first is a caption generator.
The distinction matters because grounding is what makes multimodal outputs trustworthy enough to act on. Without it, the model's vision or audio understanding is essentially a black box assertion: plausible but unverifiable. With grounding, downstream systems can validate claims against source coordinates, timestamps, or page references.
Modalities in production systems fall into roughly three buckets:
- Vision. Static images, screenshots, charts, scanned documents
- Audio. Speech, meetings, calls, short-form voice commands
- Documents. PDFs, Word files, spreadsheets (often treated as image + text combined)
Each bucket has its own ingestion requirements, its own failure modes, and its own eval approach. Conflating them into a single "non-text" category is the most common architectural mistake we see.
Ingestion pipelines for each modality
The ingestion pipeline transforms raw input into whatever representation the model accepts. Getting this wrong silently is far more dangerous than a hard failure: the model receives a degraded or malformed input and produces a confident-sounding wrong answer with no indication anything went awry.
Vision ingestion involves three concerns: resolution, format, and content density. Models have a token budget for images. In high-detail mode, GPT-4o first scales an image to fit within a 2048x2048 square, then shrinks it so the shortest side is 768px, and only then splits it into 512px tiles. Sending a 6000px architectural diagram as a single image therefore means the model is working with a heavily downscaled version. For detail-sensitive tasks, tile the image yourself and pass tiles with positional context rather than relying on the model's automatic tiling.
```python
from PIL import Image
import base64
import io


def prepare_image_tiles(image_path: str, max_tile_px: int = 1024) -> list[dict]:
    """Split a large image into tiles with positional metadata."""
    img = Image.open(image_path).convert("RGB")
    width, height = img.size
    tiles = []
    # Walk the image in a grid; edge tiles are clipped to the image bounds.
    for row_start in range(0, height, max_tile_px):
        for col_start in range(0, width, max_tile_px):
            box = (
                col_start,
                row_start,
                min(col_start + max_tile_px, width),
                min(row_start + max_tile_px, height),
            )
            tile = img.crop(box)
            buf = io.BytesIO()
            tile.save(buf, format="JPEG", quality=85)
            b64 = base64.b64encode(buf.getvalue()).decode()
            tiles.append({
                "b64": b64,
                # "row" and "col" are pixel offsets of the tile's top-left corner.
                "position": {"row": row_start, "col": col_start, "box": box},
            })
    return tiles
```
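To pass tiles "with positional context", one option is to interleave a short text caption with each tile in the message content. The sketch below assumes the tile dictionaries produced by `prepare_image_tiles` and the `image_url` content format used elsewhere in this guide; `build_tile_messages` and the caption wording are hypothetical, not a prescribed format.

```python
def build_tile_messages(tiles: list[dict], question: str) -> list[dict]:
    """Interleave a positional caption with each tile, then append the question.

    Hypothetical helper: the caption wording is an assumption to tune per task.
    """
    content = []
    for i, tile in enumerate(tiles, start=1):
        left, top, right, bottom = tile["position"]["box"]
        content.append({
            "type": "text",
            "text": (
                f"Tile {i} of {len(tiles)}: covers x={left}-{right}, "
                f"y={top}-{bottom} (pixels in the original image)."
            ),
        })
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{tile['b64']}"},
        })
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]
```

The returned list can be passed as `messages` to a chat completion call. Stating pixel coordinates in the caption also gives the model something concrete to cite when you ask for bounding-box grounding.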
Audio ingestion requires a transcription step before the LLM ever sees the content. Whisper (or a managed equivalent) is the standard choice. The real decision is speaker diarisation: do you attribute transcript segments to specific speakers? Diarisation adds latency and cost but is non-negotiable for meeting summarisation or call analysis where "who said what" is part of the business question.
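Diarisation and transcription usually come back as two separate lists: timed transcript segments and timed speaker turns. A minimal sketch of joining them, assuming both use seconds and the hypothetical field names `start`, `end`, `text`, and `speaker`, is to label each segment with the speaker whose turn overlaps it most:

```python
def assign_speakers(segments: list[dict], turns: list[dict]) -> list[dict]:
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: [{"start": float, "end": float, "text": str}]
    turns:    [{"start": float, "end": float, "speaker": str}]
    Field names are assumptions; adapt them to your transcription output.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            # Length of the intersection of the two time intervals (negative if disjoint).
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        # Segments that overlap no turn keep speaker=None so they can be reviewed.
        labelled.append({**seg, "speaker": best_speaker})
    return labelled
```

Leaving unmatched segments as `None` rather than guessing keeps attribution errors visible in evals.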
Document ingestion is the most under-engineered step in most pipelines. A PDF is not a flat text file. Tables, headers, footers, multi-column layouts, and embedded images all require structural parsing that naive text extraction destroys. Tools like pdfplumber for structured extraction and pymupdf for coordinate-aware text give you the bounding-box metadata you need to cite specific locations later.
```python
import pdfplumber


def extract_pdf_with_structure(pdf_path: str) -> list[dict]:
    """Extract PDF content preserving table structure and page coordinates."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            # Each word dict carries its bounding box (x0, x1, top, bottom).
            text_blocks = page.extract_words(
                extra_attrs=["fontname", "size"],
                use_text_flow=True,
            )
            tables = page.extract_tables()
            pages.append({
                "page": page_num,
                "text_blocks": text_blocks,
                "tables": tables,
                "width": page.width,
                "height": page.height,
            })
    return pages
```
Prompt design for mixed-modality inputs
Text prompts for pure-text tasks have well-understood failure modes. Mixed-modality prompts have all of those plus several new ones: the model may attend to text instructions and ignore the image, or answer from training data rather than the provided visual, or hallucinate detail that doesn't exist in a low-resolution region.
A few patterns that consistently improve reliability:
Explicit modality instruction. Tell the model which modality to prioritise for which part of the task. "Answer only from the chart provided. Do not use external knowledge about this company" is more effective than assuming the model will infer that.
Ask for source citations in the output schema. Requesting structured output that includes a source field (page number, bounding box, timestamp) forces the model to make its grounding explicit. This also creates a natural signal for eval: if the cited page doesn't contain the claimed information, the answer is wrong.
Separate perception from reasoning. For complex visual tasks, a two-step prompt often outperforms a single prompt. First, ask the model to describe what it sees in the image (raw perception). Then, in a second call, provide that description as text context and ask the analytical question. This separates transcription errors from reasoning errors, which makes debugging far easier.
```python
from openai import OpenAI

client = OpenAI()


def two_step_visual_analysis(image_b64: str, question: str) -> dict:
    # Step 1: perception
    perception = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                    {
                        "type": "text",
                        "text": (
                            "Describe every visible element in this image in detail. "
                            "Include all text, numbers, labels, and visual structures. "
                            "Do not interpret; only describe."
                        ),
                    },
                ],
            }
        ],
    )
    description = perception.choices[0].message.content

    # Step 2: reasoning over the description
    analysis = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer the question using only the image description provided.",
            },
            {
                "role": "user",
                "content": f"Image description:\n{description}\n\nQuestion: {question}",
            },
        ],
    )
    return {
        "perception": description,
        "answer": analysis.choices[0].message.content,
    }
```
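The citation-in-schema pattern described earlier in this section can be sketched the same way. The example below assumes you ask the model for JSON and then reject replies whose citations are missing or impossible; the schema, `CITATION_INSTRUCTIONS`, and `parse_cited_answer` are hypothetical names for illustration, not part of any API.

```python
import json

# Hypothetical instruction text appended to the prompt; adjust the schema to your task.
CITATION_INSTRUCTIONS = (
    "Respond with JSON only, in the form "
    '{"answer": str, "sources": [{"page": int, "quote": str}]}. '
    "Every claim in the answer must be supported by at least one source."
)


def parse_cited_answer(raw: str, page_count: int) -> dict:
    """Parse the model's JSON reply and reject missing or impossible citations."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    sources = data.get("sources") or []
    if not sources:
        raise ValueError("answer has no sources")
    for src in sources:
        page = src.get("page")
        if not isinstance(page, int) or not 1 <= page <= page_count:
            raise ValueError(f"cited page {page!r} is outside 1..{page_count}")
        if not src.get("quote"):
            raise ValueError("source is missing a quote")
    return data
```

This only checks that a citation is well-formed and points at a page that exists. Whether the cited page actually supports the claim is a separate check, covered in the grounding section.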
Grounding: connecting model output to source content
Grounding is the mechanism that ties a model's claim to a specific location in the source material. Without it, you have a fluent answer with no way to verify it.
The simplest form of grounding for documents is page-and-paragraph citation: the model's output includes a reference like [Page 3, Table 2] and your system can render that reference back to the user or validate it programmatically. For images, bounding-box coordinates serve the same role. For audio, timestamps.
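One way to keep the three reference types uniform is a small set of value types with a shared renderer. This is a sketch under assumed names (`PageRef`, `RegionRef`, `TimeRef`, `render_ref` are all hypothetical), not a required data model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRef:
    page: int
    label: str = ""  # optional locator within the page, e.g. "Table 2"


@dataclass(frozen=True)
class RegionRef:
    box: tuple[int, int, int, int]  # left, top, right, bottom in image pixels


@dataclass(frozen=True)
class TimeRef:
    start_s: float
    end_s: float


def _mmss(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_ref(ref) -> str:
    """Render any reference type as a short citation string for the UI or logs."""
    if isinstance(ref, PageRef):
        return f"[Page {ref.page}, {ref.label}]" if ref.label else f"[Page {ref.page}]"
    if isinstance(ref, RegionRef):
        left, top, right, bottom = ref.box
        return f"[Region ({left}, {top}) to ({right}, {bottom})]"
    if isinstance(ref, TimeRef):
        return f"[{_mmss(ref.start_s)} to {_mmss(ref.end_s)}]"
    raise TypeError(f"unsupported reference type: {type(ref).__name__}")
```

For example, `render_ref(PageRef(3, "Table 2"))` produces `[Page 3, Table 2]`, the citation form mentioned above.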
The honest trade-off: strict grounding constraints reduce the model's ability to synthesise across sources. An analysis that draws on findings from five different sections of a report is harder to ground than a direct quote from one location. The right balance depends on your use case. A legal document review tool needs strict citation; a general document Q&A assistant can be somewhat more liberal. Define your grounding policy explicitly before you build, because retrofitting it is genuinely painful.
Grounding also creates a validation surface. Your eval pipeline can automatically check whether cited pages contain the claimed content (even a fuzzy string match catches a large fraction of hallucinations). This turns grounding from a UX nicety into an engineering quality gate.
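As a rough illustration of the fuzzy check, the sketch below slides a window the length of the quote across the page text and compares word sequences with the standard library's `difflib.SequenceMatcher`. The function name `citation_supported` and the 0.8 threshold are assumptions to tune against your own data, and the quadratic scan is fine for single pages but not for whole corpora.

```python
from difflib import SequenceMatcher


def _words(text: str) -> list[str]:
    # Lowercase and collapse whitespace so layout differences do not matter.
    return text.lower().split()


def citation_supported(quote: str, page_text: str, threshold: float = 0.8) -> bool:
    """Return True if some window of the page closely matches the quoted text."""
    quote_words = _words(quote)
    page_words = _words(page_text)
    if not quote_words:
        return False  # an empty citation supports nothing
    window = len(quote_words)
    last_start = max(1, len(page_words) - window + 1)
    for start in range(last_start):
        candidate = page_words[start:start + window]
        ratio = SequenceMatcher(None, quote_words, candidate, autojunk=False).ratio()
        if ratio >= threshold:
            return True
    return False
```

Comparing at word level rather than character level keeps a long page from matching an unrelated quote through scattered single-character overlaps.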
Model and routing decisions
Not every modality benefits from the same model, and not every request in a multimodal app needs the most capable (and expensive) model available.
| Task | Recommended model tier | Notes |
|---|---|---|
| Image description / OCR | Mid-tier vision model (GPT-4o mini) | Sufficient for most; escalate for complex layouts |
| Chart and diagram analysis | High-tier vision model (GPT-4o) | Requires fine spatial reasoning |
| Audio transcription | Whisper large-v3 or managed STT | Model quality, not LLM routing |
| Document Q&A over extracted text | Mid-tier text model + retrieval | Vision only for scanned docs |
| Multi-document synthesis | High-tier text model | Context length and reasoning quality matter |
Routing on input type and confidence keeps costs down on high-volume simple tasks while sending edge cases to stronger models. Confidence-based escalation is a pattern worth building early: when the mid-tier model flags an output as low-confidence, the request re-runs on a stronger model.
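A minimal sketch of that escalation logic follows. It assumes each model call is wrapped in a function that returns an answer plus a confidence score between 0 and 1; how you derive that score (self-reported, log-probabilities, a validator) is up to you. `ModelResult`, `answer_with_escalation`, the task names, and the 0.7 cut-off are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelResult:
    answer: str
    confidence: float  # 0.0 to 1.0, however your pipeline estimates it


# Hypothetical task names that go straight to the stronger model,
# mirroring the "high-tier" rows in the table above.
HIGH_TIER_TASKS = {"chart_analysis", "multi_document_synthesis"}


def answer_with_escalation(
    request: dict,
    mid_tier: Callable[[dict], ModelResult],
    high_tier: Callable[[dict], ModelResult],
    min_confidence: float = 0.7,
) -> tuple[ModelResult, str]:
    """Route by task type first, then escalate low-confidence mid-tier answers."""
    if request["task"] in HIGH_TIER_TASKS:
        return high_tier(request), "high"
    first = mid_tier(request)
    if first.confidence >= min_confidence:
        return first, "mid"
    # Low confidence: pay for the stronger model on this request only.
    return high_tier(request), "high"
```

Returning the tier alongside the result makes it easy to log escalation rates, which is the number to watch when tuning the threshold.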
One opinionated take: GPT-4o's vision is good enough to replace a dedicated OCR pipeline for most printed-text documents in 2026. Purpose-built OCR tools still win on handwriting, very degraded scans, and non-Latin scripts. But for standard business documents, the "vision model as OCR" pattern is simpler to maintain, produces comparable accuracy, and drops a separate service dependency entirely.
Modality-specific evals that actually catch regressions
This is where most multimodal systems are weakest. Teams write evals for their text paths and assume the multimodal behaviour is covered. It isn't.
Effective multimodal evals need to be modality-specific because the failure modes are different:
Vision evals should test OCR accuracy on a representative sample of your actual document types, bounding-box grounding accuracy (does the cited region actually contain the claim?), and robustness to degraded inputs: rotated images, low contrast, partially occluded text.
Audio evals should test transcription accuracy (word error rate against ground truth), speaker attribution correctness if you're using diarisation, and handling of overlapping speech or background noise.
Document evals should test table extraction accuracy (do the extracted tables match the source?), multi-column layout handling, and cross-page reference resolution.
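```python
# Example: grounding accuracy eval for document Q&A
def evaluate_grounding(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """
    Each prediction: {"answer": str, "cited_page": int, "cited_text": str}
    Each ground truth: {"answer": str, "source_page": int, "source_text": str}
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    n = len(predictions)
    if n == 0:
        return {"page_accuracy": 0.0, "grounding_rate": 0.0, "n": 0}
    correct_page = 0
    grounded = 0
    for pred, gt in zip(predictions, ground_truth):
        if pred["cited_page"] == gt["source_page"]:
            correct_page += 1
        # Substring match after case and whitespace normalisation.
        # An empty citation never counts as grounded.
        cited = " ".join(pred["cited_text"].lower().split())
        source = " ".join(gt["source_text"].lower().split())
        if cited and cited in source:
            grounded += 1
    return {
        "page_accuracy": correct_page / n,
        "grounding_rate": grounded / n,
        "n": n,
    }
```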
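For the audio path, word error rate is the word-level edit distance between the reference transcript and the model's transcript, divided by the number of reference words. A minimal standard-library sketch follows; it assumes lowercase-and-split tokenisation, which is a simplification (production evals usually also normalise punctuation and numerals).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, so treat it as an error rate to minimise rather than a percentage.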
Run these evals in CI on every model version change, every prompt change, and on a scheduled basis against a held-out production sample. Multimodal models update frequently, and a model upgrade that improves text tasks can quietly degrade vision performance. We've seen it happen in Laxaar's deployments.
Infrastructure and cost considerations
Multimodal inputs are expensive, and image tokens are priced differently from text tokens. GPT-4o charges per tile; a high-resolution image processed at full quality can consume 1000+ tokens before you've written a single word of prompt. Audio adds transcription costs on top of LLM inference costs.
A few patterns that keep costs manageable:
Pre-filter before the model. For document pipelines, extract text with a traditional parser first. Only route pages to a vision model when the extracted text looks unreliable: OCR engines report per-word confidence scores you can threshold on, and for native-text PDFs a simple heuristic such as the amount of text recovered per page serves the same purpose. This can reduce vision model calls by 60–80% on well-formatted document sets.
Cache aggressively. The same image paired with the same prompt should produce an equivalent answer, so there is no reason to pay for it twice. A content-addressed cache keyed on the image hash avoids re-processing the same document. OpenAI's prompt caching covers the prefix of the conversation, including image tokens. Structure your prompts to put static content (including images) before variable content to maximise cache hits.
Size images to task requirements. A 4000px image sent for a yes/no classification wastes tokens. Resize to the minimum resolution that preserves the relevant detail for your specific task. For most text extraction tasks, 1500px on the long edge is sufficient.
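The three patterns above can be sketched with small standard-library helpers. All names and default values here are hypothetical: `needs_vision` is a crude stand-in for a real extraction-quality signal, `cache_key` assumes the prompt and model name should be part of the key alongside the image bytes, and `fit_long_edge` only computes target dimensions (the actual resize would be done by your imaging library).

```python
import hashlib


def needs_vision(extracted_text: str, min_chars: int = 200,
                 min_alnum_ratio: float = 0.6) -> bool:
    """Heuristic pre-filter: send a page to the vision model when parser
    output is too thin or too noisy to trust. Thresholds are assumptions."""
    stripped = "".join(extracted_text.split())
    if len(stripped) < min_chars:
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio


def cache_key(image_bytes: bytes, prompt: str, model: str) -> str:
    """Content-addressed key: identical image, prompt, and model hash the same."""
    digest = hashlib.sha256()
    # Separator bytes keep ("ab", "c") and ("a", "bc") from colliding.
    for part in (model.encode(), b"\x00", prompt.encode(), b"\x00", image_bytes):
        digest.update(part)
    return digest.hexdigest()


def fit_long_edge(width: int, height: int, max_long_edge: int = 1500) -> tuple[int, int]:
    """Target size that caps the long edge while preserving aspect ratio."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # never upscale
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Including the prompt and model in the cache key matters: the same image asked a different question, or sent to a different model version, should not return a stale answer.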
For custom software development projects that include multimodal pipelines, the infrastructure architecture decisions made here (caching, routing, eval coverage) directly affect both the operating cost and the system's long-term maintainability.
Our AI development services cover the full stack from ingestion pipeline design through production deployment, including the modality-specific eval infrastructure that keeps multimodal systems honest over time.
If you're evaluating whether to build multimodal capability in-house or partner with a specialist team, the Laxaar portfolio includes several production multimodal deployments with real-world scale and performance data.
Teams considering generative AI development services for document or vision workloads often underestimate the ingestion and grounding work relative to the model integration. It's consistently the part that takes longest and breaks most often in production.
Frequently Asked Questions
How many tokens does an image consume with GPT-4o?
GPT-4o uses a tile-based system. In low-detail mode every image costs a flat 85 tokens. In high-detail mode the image is scaled to fit within a 2048x2048 square, shrunk so its shortest side is 768px, and split into 512px tiles; the cost is 85 base tokens plus 170 tokens per tile. A 1024x1024 image in high-detail mode is scaled to 768x768 and consumes 765 tokens (85 base plus 4 tiles at 170 each). For cost estimation, budget roughly 800–1200 tokens per average business document page sent as an image.
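The arithmetic is easy to script for budgeting. The sketch below follows the tile calculation OpenAI documents for GPT-4o as we understand it (fit within 2048px, shrink the shortest side to 768px, count 512px tiles); `estimate_image_tokens` is a hypothetical helper, other models use different formulas, and treating small images as never upscaled is an assumption. Treat the result as an estimate and check the current pricing documentation.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens: 85 base plus 170 per 512px tile."""
    if detail == "low":
        return 85  # flat cost regardless of size
    # Step 1: scale down to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2: scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512px tiles needed to cover the result.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

For the 1024x1024 example, the image is scaled to 768x768, which needs a 2 by 2 grid of tiles, giving 85 + 4 × 170 = 765 tokens.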
Should we use a vision model for OCR or a dedicated OCR service?
For standard printed business documents (contracts, invoices, reports), a vision model like GPT-4o produces accuracy comparable to purpose-built OCR services while also understanding layout and structure contextually. For handwriting, badly degraded scans, or non-Latin scripts, purpose-built OCR (Google Document AI, AWS Textract, Azure Document Intelligence) still outperforms vision models and is worth the extra dependency.
What's the best way to handle very long audio files?
Chunk the audio into segments (10–15 minutes is a practical size for Whisper), transcribe each segment independently, and then run a second pass to clean up chunk boundaries and re-attribute speakers across segment boundaries. Store timestamps with each transcript segment so you can trace claims back to specific moments in the recording.
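A minimal sketch of the chunk planning and timestamp bookkeeping follows. The few seconds of overlap between chunks are our assumption (they give the clean-up pass some shared audio to reconcile at each boundary), and `plan_chunks` and `to_recording_time` are hypothetical helper names; the actual audio slicing and transcription calls are left out.

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0,
                overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        # Step back slightly so consecutive chunks share some audio.
        start = end - overlap_s
    return chunks


def to_recording_time(segments: list[dict], chunk_start_s: float) -> list[dict]:
    """Shift chunk-relative segment timestamps onto the full recording's timeline."""
    return [
        {**seg, "start": seg["start"] + chunk_start_s, "end": seg["end"] + chunk_start_s}
        for seg in segments
    ]
```

Converting every segment back to recording time before storage is what lets a citation like "at 32:10" point at the right moment regardless of which chunk it came from.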
How do we prevent the model from hallucinating image content?
Three mitigations work well together: ask for citations (force the model to ground claims to visible elements), use the two-step perception-then-reasoning pattern to separate description from interpretation, and run eval cases specifically designed around low-detail or ambiguous images where hallucination is most likely. No approach eliminates hallucination entirely. Build your application to surface uncertainty rather than hide it.
Can we run multimodal evals automatically in CI?
Yes, and you should. The practical setup is a held-out eval dataset of representative inputs with known ground-truth outputs, a scoring function for each modality (grounding accuracy for documents, word error rate for audio, bounding-box accuracy for vision), and a CI step that runs the eval suite on every model or prompt change and fails the build if scores drop below a threshold. LLM-as-judge scoring works for open-ended outputs; rule-based scoring works for structured outputs.
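The CI step itself can be very small. The sketch below assumes your eval run produces a dictionary of scores where higher is better (invert word error rate, or give it its own maximum check); the metric names and threshold values are hypothetical placeholders.

```python
import sys

# Hypothetical thresholds; set these from your own baseline runs.
THRESHOLDS = {"page_accuracy": 0.90, "grounding_rate": 0.85}


def failing_metrics(scores: dict, thresholds: dict) -> list[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval output")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below threshold {minimum:.3f}")
    return failures


def gate(scores: dict, thresholds: dict = THRESHOLDS) -> int:
    """Print failures to stderr and return a process exit code for CI."""
    failures = failing_metrics(scores, thresholds)
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0
```

In a CI job, calling `sys.exit(gate(scores))` after the eval run fails the build whenever any metric regresses. Treating a missing metric as a failure prevents an eval that silently stopped running from passing the gate.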
Building a multimodal system and running into ingestion or eval design questions? Talk to the Laxaar team. We can review your pipeline architecture and help you build the eval coverage that keeps multimodal behaviour reliable as models evolve.


