How to Build Zero-Shot Anomaly Detection with Vision Language Models, NVIDIA Jetson, and Ollama

January 06, 2026

Introduction

Here at ZEDEDA, we’re always pushing the limits of running AI at the edge. We’re especially excited by powerful edge platforms like the NVIDIA Jetson AGX series. For computer vision applications, there are plenty of vision-language models that can run on AGX-class devices, and by combining them with tools built on an Agentic AI framework, we can address a broad range of use cases on edge devices.

In this post, we’ll explore a high-performance vision processing application that combines classical computer vision (CV) models with modern Generative AI (GenAI) models. While GenAI “agents” have become ubiquitous, many market solutions rely on rigid SDKs that struggle with hardware compatibility in constrained environments. This pattern shows up in factories where redundant systems keep running deprecated models even after the incoming data has drifted well outside the distributions those models were trained on. Other devices are stuck on slow infrastructure because the engines and runtimes they were built on are monolithic, with no separation of concerns in the application layer.

Our solution sidesteps these issues by building an agent that interacts directly with hardware and tools. By running this pipeline on edge devices running EVE, we can identify boxes with missing labels in about 500 ms. That is far faster than the typical round-trip delays of cloud-based AI processing, which requires sending a video stream to a cloud datacenter over an expensive, high-bandwidth network connection.

The result is a robust, customizable, and resilient architecture that brings intelligent decision-making directly to the source of the data.

Use Case: Real-Time Monitoring and Anomaly Detection

Our application is designed for real-time monitoring of production lines or shipping processes to ensure that no boxes are missing their shipping labels.

In a busy facility, an unlabeled box is a lost box, leading to shipping errors and customer dissatisfaction. By deploying overhead or side-mounted cameras paired with our AI agent, the system instantly validates label presence. If a label is absent or unreadable, the package is flagged for immediate intervention, preserving supply chain reliability and cutting operational recovery costs.

Note that Cloud AI – sending a video stream for inference at a cloud datacenter – is not feasible for high-throughput logistics. By the time a cloud model flagged a box without a label, the box could be well down the production line. Crossbelt sorters can process 14,000 units per hour (UPH) – nearly four packages every second, or roughly 250 milliseconds per package – which is why sub-second latency is crucial.

Edge AI is the better fit here: it can flag an unlabeled box immediately and stop it from progressing further down the line, so that factory workers can intervene, remove the box, and get it properly labeled.

 

But what if we told you that you can go from this to solving entirely different problems without any retraining?

Because this is a true GenAI agent, it can be redirected to new tasks simply by asking. Unlike traditional computer vision models, which are rigid and single-purpose, this system is driven by natural language and is therefore flexible enough to be applied to multiple use cases. By simply changing the prompt, you can repurpose the same architecture to spot safety hazards (e.g., workers not wearing helmets, gloves, or goggles), inspect product quality (e.g., poor welds or cracks), or monitor machines to determine whether they need maintenance (also called predictive maintenance). The agent adapts to your instructions instantly; the core functionality remains the same, but the problem it solves is up to you.

The system relies on:

  • Capturing and processing high-throughput video streams.
  • Analyzing frames using a combination of fast, classical CV models and powerful GenAI models.
  • Making intelligent decisions based on your specific prompts using a dedicated Large Language Model (LLM).
  • Providing immediate alerts displayed on a live web-based operational dashboard.

Why This Matters: The Advantage of Agentic Edge AI

Adopting this architecture represents more than just a technology upgrade; it is a fundamental change in how we deploy intelligence to the edge.

  1. A Paradigm Shift from Static to Dynamic. Traditional Machine Learning has long been the standard for edge vision, but it is inherently rigid. A model trained specifically to detect safety helmets cannot suddenly detect a spill on the floor or a misaligned label without a complete development cycle. This system introduces a shift from Static to Dynamic intelligence. By leveraging Generative AI, the agent possesses semantic reasoning capabilities, allowing it to adapt to unseen scenarios and understand complex, context-heavy queries that rigid convolutional neural networks (CNNs) simply miss.
  2. Operational Agility via Zero-Shot Adaptation. This dynamic nature unlocks unprecedented operational agility. In the past, modifying a vision pipeline meant gathering new datasets, annotating images, retraining models, and redeploying binaries – a process that could take weeks or months. With our Agentic approach, this is replaced by Zero-Shot Adaptation (sometimes called Zero-Shot Learning). System behavior is updated instantly through prompt engineering. If business requirements shift – for example, from “counting boxes” to “identifying damaged corners” – operators can simply update the text prompt, and the agent adapts immediately without a single line of code changing.
  3. Real-World Viability: Solving the Latency Challenge. Deploying such powerful models at the edge does come with implementation challenges. Large Multimodal Models (LMMs) naturally impose a higher computational overhead than lightweight, purpose-built CNNs. To make this viable on hardware like the NVIDIA Jetson, we have architected a solution that balances power with performance. Rather than brute-forcing a continuous stream, we optimize throughput by implementing intelligent frame sampling.

Intelligent frame sampling is an AI-driven technique in video processing that selectively captures specific, high-value frames based on content (such as motion, scene changes, or objects) rather than capturing frames at fixed time intervals. By ignoring redundant static footage and focusing only on “key moments,” this method significantly reduces data storage and processing costs while maintaining high accuracy for analysis.

By processing frames only when necessary, using lightweight pre-filters to ignore static scenes, we achieve the deep reasoning power of GenAI while maintaining the sub-second latency required for industrial operations.
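To make this concrete, here is a minimal sketch of such a lightweight pre-filter, assuming OpenCV and a simple frame-differencing heuristic; the thresholds and function names are illustrative, not the production code:

    # Hypothetical pre-filter: forward a frame to the heavier AI stages only when
    # enough pixels have changed relative to the previous frame (i.e., the scene moved).
    import cv2
    import numpy as np

    PIXEL_DELTA = 25        # per-pixel intensity change that counts as "different"
    MOTION_RATIO = 0.02     # fraction of changed pixels that counts as activity

    def frame_has_activity(prev_gray, curr_gray):
        diff = cv2.absdiff(prev_gray, curr_gray)
        changed = np.count_nonzero(diff > PIXEL_DELTA)
        return changed / diff.size > MOTION_RATIO

    cap = cv2.VideoCapture(0)                     # /dev/video0
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if frame_has_activity(prev_gray, gray):
            pass                                  # hand the frame to the analysis pipeline
        prev_gray = gray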

Putting the pieces together: The ZEDEDA GenAI Solution Blueprint for the Edge

The system is designed with Efficiency First in mind. Running Large Language Models (LLMs) and Vision Language Models (VLMs) on the edge requires minimizing expensive model calls. We achieved this by compartmentalizing the system into four main stages: Frame Capture, Monitoring & AI Pipeline, Alerting & Persistence, and Web UI & Services.

The foundation of the system is the lightweight, local deployment of GenAI models via the Ollama service, which runs directly on the edge device. Ollama runs on a broad range of edge devices and can be deployed in containers running on EVE.

Here’s the overall architecture of what we’re building:

Technical Data Flow

Our solution has the following components within its data flow:

Component | Function | Technology / Runtime
Camera | Hardware Interface | /dev/video0
Publisher | Video Stream Fan-out | CameraFeedPublisher (OpenCV)
Monitor | Primary Processing Service | WebCameraAgent (CameraMonitoringService) – ZEDEDA-built using GStreamer
Agent | Pipeline Orchestrator | MonitorDetectionAgent – ZEDEDA-built
SSIM | Similarity Caching | SSIM (Structural Similarity Index) Guard (Similarity Cache)
Classical | Fast, Rule-Based Analysis | Classical Computer Vision Analyzers
Vision | Semantic Understanding | Ollama Vision Model / Unified VLM (Scene Summary), such as Gemma 3 or Llama 3.2
Decision | Tool-Calling and Logic | Decision LLM (Ollama Tool Calling)
Flask | Operator Interface Backend | Flask Web App
SocketIO | Live Metrics / Events | Socket.IO Live Updates
Ollama | Local GenAI Runtime | Ollama API endpoints (/api/generate & /api/chat)

Data Flow Diagram

The following flowchart illustrates the complete application pipeline:

1. Frame Acquisition and Distribution

The CameraFeedPublisher acts as a zero-latency distributor. It uses OpenCV to capture frames from the dedicated hardware device (/dev/video0) and fans out the stream to multiple consumers: the Monitor service for processing and the Stream endpoint for the live Web UI feed.
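As an illustration of this fan-out pattern (not the actual CameraFeedPublisher source; the class and queue names here are hypothetical), a publisher can push each captured frame into several bounded queues so that a slow consumer never blocks the others:

    # Sketch of a frame publisher fanning out to multiple consumers.
    import queue
    import threading
    import cv2

    class FramePublisher:
        def __init__(self, device_index=0):
            self._cap = cv2.VideoCapture(device_index)   # /dev/video0
            self._subscribers = []

        def subscribe(self, maxsize=2):
            q = queue.Queue(maxsize=maxsize)
            self._subscribers.append(q)
            return q

        def run(self):
            while True:
                ok, frame = self._cap.read()
                if not ok:
                    continue
                for q in self._subscribers:
                    if q.full():
                        q.get_nowait()        # drop the stale frame instead of blocking
                    q.put_nowait(frame)

    publisher = FramePublisher()
    monitor_queue = publisher.subscribe()     # consumed by the Monitor service
    stream_queue = publisher.subscribe()      # consumed by the live Web UI feed
    threading.Thread(target=publisher.run, daemon=True).start()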

2. Smart Caching with SSIM

The SSIM Guard implements a key optimization for edge deployment. Before engaging expensive models, it compares the current frame against the Detection Cache using the Structural Similarity Index Measure (SSIM). If the frame’s content is virtually identical to a recently analyzed frame (e.g., the conveyor belt is stopped), the pipeline reuses the prior decision. This cuts local LLM calls by up to 95%, drastically reducing Jetson resource utilization and inference time.
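A minimal sketch of this guard, assuming scikit-image’s structural_similarity and an illustrative threshold (the real threshold and cache behavior are deployment-specific), looks roughly like this:

    # Hypothetical SSIM guard: reuse the previous decision when the new frame is
    # structurally almost identical to the last analyzed one.
    import cv2
    from skimage.metrics import structural_similarity as ssim

    SSIM_THRESHOLD = 0.95   # illustrative value; tune per camera and scene

    class SsimGuard:
        def __init__(self):
            self._last_gray = None
            self._last_decision = None

        def check(self, frame):
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if self._last_gray is not None:
                score = ssim(self._last_gray, gray)
                if score >= SSIM_THRESHOLD and self._last_decision is not None:
                    return self._last_decision   # cache hit: skip the LLM call
            self._last_gray = gray
            return None                          # cache miss: run the full pipeline

        def store(self, decision):
            self._last_decision = decision

    # Usage (downstream analysis call is hypothetical):
    # guard = SsimGuard()
    # cached = guard.check(frame)
    # if cached is None:
    #     decision = run_full_pipeline(frame)
    #     guard.store(decision)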

3. Layered Intelligence and Unified Vision Language Model (VLM)

  • Classical CV Analyzers: Highly optimized, low-latency models (like YOLO for object detection) handle routine, high-frequency checks.
  • Unified VLM Client: When a change is detected, the frame is passed to our Unified Vision Language Model. Unlike older pipelines that separated “vision” and “decision” models, our streamlined agent uses a single, powerful VLM (like Gemma 3 or Llama 3.2) to both analyze the image and make a business decision in one pass, which further reduces latency (a sketch of such a call follows this list).
  • Hybrid Intelligence (Confidence Blending): We fuse the output of traditional computer vision techniques into the input for the LLM agent. This ensures that a “hallucination” from the AI doesn’t trigger a false alarm.
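The sketch below shows what a single-pass VLM call might look like against a local Ollama instance, using its /api/generate endpoint with a base64-encoded frame; the model tag, prompt, and helper name are examples rather than the exact production values:

    # Hypothetical helper: send one frame to a local multimodal model via Ollama.
    import base64
    import cv2
    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def analyze_frame(frame, prompt, model="gemma3:4b"):
        ok, jpeg = cv2.imencode(".jpg", frame)
        image_b64 = base64.b64encode(jpeg.tobytes()).decode("utf-8")
        resp = requests.post(OLLAMA_URL, json={
            "model": model,
            "prompt": prompt,
            "images": [image_b64],
            "stream": False,
        }, timeout=60)
        resp.raise_for_status()
        return resp.json()["response"]

    frame = cv2.imread("frame.jpg")   # in practice, a frame from the publisher above
    print(analyze_frame(frame, "Is there a shipping box without a visible label? "
                               "Answer yes or no and explain briefly."))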

4. Decision LLM and Tool Calling

The Decision LLM serves as the brain of the operation. It is configured for Ollama Tool Calling: the LLM is instructed to call a predefined internal function (a “tool”) only when it identifies a high-confidence anomaly – for example, a packaging box on the production line that is ready to ship but has no shipping label – which in turn generates an alert.

Here is an example of how we define such a tool:

    "save_evidence": ToolDefinition(
        name="save_evidence",
        description="Save the current frame as evidence for later review. Use when you detect something that should be documented.",
        parameters={
            "label": {
                "type": "string",
                "description": "Label describing what was detected (e.g., 'unlabeled_box', 'ppe_violation')"
            },
            "metadata": {
                "type": "object",
                "description": "Additional metadata to store with the evidence"
            }
        },
        required_params=["label"],
        handler=_tool_save_evidence,
    ),
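To show how such a definition reaches the model, here is a hedged sketch of passing an equivalent tool schema to Ollama’s /api/chat endpoint and dispatching any returned tool calls; the model tag and message content are illustrative:

    # Sketch: expose the save_evidence tool to the Decision LLM via Ollama tool calling.
    import requests

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "save_evidence",
            "description": "Save the current frame as evidence for later review.",
            "parameters": {
                "type": "object",
                "properties": {
                    "label": {"type": "string", "description": "What was detected"},
                    "metadata": {"type": "object", "description": "Extra context"},
                },
                "required": ["label"],
            },
        },
    }]

    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": "llama3.2",
        "messages": [{"role": "user",
                      "content": "Scene summary: a sealed box with no visible shipping label. Decide whether to act."}],
        "tools": TOOLS,
        "stream": False,
    }, timeout=60)

    for call in resp.json()["message"].get("tool_calls", []):
        if call["function"]["name"] == "save_evidence":
            print("Tool requested with arguments:", call["function"]["arguments"])
            # _tool_save_evidence(**call["function"]["arguments"])  # dispatch to the handler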

5. Resiliency: Circuit Breaker Pattern

To ensure high availability, we implemented a Circuit Breaker. If the local AI service running on the edge device (Ollama) becomes unresponsive or overloaded, the breaker “trips,” preventing cascading failures and allowing the system to recover gracefully.
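A circuit breaker can be as simple as the following sketch; the failure threshold and cool-down period are illustrative, and the production implementation may differ:

    # Minimal circuit-breaker sketch: after repeated failures, skip requests
    # until a cool-down period has passed.
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=3, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None

        def allow(self):
            if self.opened_at is None:
                return True
            if time.time() - self.opened_at > self.reset_timeout:
                self.opened_at = None    # half-open: let one request probe the service
                self.failures = 0
                return True
            return False

        def record_success(self):
            self.failures = 0

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()

Callers check allow() before each Ollama request and report the outcome via record_success() or record_failure(); once the failure threshold is reached, requests are skipped until the cool-down elapses.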

6. Agent Memory & Context

The system maintains Agent Memory, allowing it to recall recent events. This prevents duplicate alerts for the same issue and enables context-aware summaries of the day’s activity.
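One simple way to realize this, sketched below with hypothetical names, is a sliding window of recent (timestamp, label) events that suppresses repeat alerts for the same label:

    # Sketch of agent memory for alert de-duplication over a time window.
    import time
    from collections import deque

    class AgentMemory:
        def __init__(self, window_seconds=60.0):
            self.window = window_seconds
            self.events = deque()            # (timestamp, label) pairs

        def should_alert(self, label):
            now = time.time()
            while self.events and now - self.events[0][0] > self.window:
                self.events.popleft()        # drop events outside the window
            if any(stored == label for _, stored in self.events):
                return False                 # duplicate of a recent alert
            self.events.append((now, label))
            return True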

Deployment and User Interface

Once the solution blueprint is finalized, the deployment process begins. To do this, we use ZEDEDA Edge Kubernetes App Flows, documented here. First, we’ll upload the blueprint to the ZEDEDA Kubernetes Marketplace. From there, we select the specific application—in this case, the Camera Agent—and initiate the installation. During this phase, we configure the Helm values to suit the specific environment, allowing for seamless differentiation between testing and production configurations.

First, we select camera-agent in ZEDEDA Kubernetes Marketplace:

Next, we select the application version to deploy, 1.0.7:

We then fill in the Install name and title, test1:

And we see that camera-agent has been successfully deployed and is online:

After confirming the setup, ZEDEDA’s Kubernetes engine automates the installation across the selected clusters. Initially the application status will be Unknown.

We can monitor the deployment progress in real time until the application status transitions to “Ready.”

Once the agent is live, our application dashboard provides a comprehensive overview of system health, including:

  • Real-time Metrics: Monitoring of system resources and throughput.
  • Object Detection: The agent is highly sensitive; for instance, it can immediately identify objects like shipping boxes even under poor lighting or at awkward angles.
  • Log Management: The frontend provides direct access to database logs, allowing users to review previous activities, analyze trends, and generate custom reports.

Our application dashboard counts the number of unlabeled boxes. Below, you see that Detections has a value of 1, due to the unlabeled box against the far orange wall:

We can also set a Custom Prompt – for instance, “send an email if a box is ripped open, crushed, or otherwise damaged”:

When a box is detected, a summary of what the AI agent sees is provided in Vision Model Analysis:

The application logs contain a record of any previous object detections. Below, we see that logs appear in the frontend, and the agent can analyze trends and report on anything the user asks of it.

 

Configuration and LLM Management

The underlying Large Language Model (LLM) powering the agent can be managed via the settings page. While the model can be reconfigured post-deployment, it is generally recommended to finalize this during the initial setup. Changing the LLM on an active cluster requires a significant data download to an edge device, which can lead to increased downtime or latency.

Cloud-Native on NVIDIA Jetson

The entire application stack, including the Ollama service running the GenAI models (e.g., Gemma 3:4b), is containerized with Docker and orchestrated via Helm charts and K3s using ZEDEDA Edge Kubernetes App Flows. This isolates the components and maximizes the Jetson’s hardware acceleration features (CUDA, TensorRT), making deployment seamless across fleets of edge AI nodes.

Configuration is flexible; the system and user prompts, as well as the tools used, can be updated via the config.yaml file consumed by the MonitorDetectionAgent.
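For illustration, a configuration of this kind might carry keys along the following lines; the key names and values here are hypothetical, not the actual schema consumed by MonitorDetectionAgent:

    # Hypothetical example of loading agent configuration (key names are illustrative).
    import textwrap
    import yaml   # PyYAML

    EXAMPLE_CONFIG = textwrap.dedent("""
        model: gemma3:4b
        system_prompt: You are a production-line monitoring agent.
        user_prompt: Flag any shipping box that has no visible shipping label.
        tools:
          - save_evidence
          - send_email_alert
    """)

    config = yaml.safe_load(EXAMPLE_CONFIG)
    print(config["user_prompt"], config["tools"])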

Web UI & Notifications

We decoupled the monitoring interface from the agent logic to ensure human supervision does not affect performance. The Flask Web App provides a central interface integrating:

  • Live Updates: Socket.IO pushes real-time events and performance statistics to the operator’s browser.
  • Video Feed: The Stream endpoint delivers the raw video feed for visual inspection.
  • Persistent Storage: All detection logs are stored in a local SQLite database (camera_agent.db).
  • Alerting: Critical events trigger immediate Desktop Notification and external SMTP Email via the AlertManager.
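As an example of the email path, a minimal SMTP alert helper could look like the following sketch; the server address, credentials, and recipients are placeholders, and the real AlertManager may differ:

    # Sketch of sending an alert email over SMTP with STARTTLS.
    import smtplib
    from email.message import EmailMessage

    def send_alert_email(subject, body):
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "camera-agent@example.com"
        msg["To"] = "ops-team@example.com"
        msg.set_content(body)
        with smtplib.SMTP("smtp.example.com", 587) as server:
            server.starttls()
            server.login("camera-agent", "app-password")   # use a secret store in practice
            server.send_message(msg)

    send_alert_email("Unlabeled box detected",
                     "Camera 0 flagged a box without a shipping label.")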

Next Steps

The ZEDEDA Camera Monitoring Agent represents a shift towards autonomous edge intelligence. By combining lightweight preprocessing with capable local LLMs, robust software patterns like Circuit Breakers, and a containerized architecture, we have built a system that is not only smart but resilient enough for the real world.

Deploy your own edge models using our Kubernetes product, and contact us to see how we might help you build GenAI solutions for your edge devices.

FAQ

Q: How does the architecture handle VRAM contention in Kubernetes? Do models like YOLO and the VLM stay resident in memory simultaneously?

A: Yes, in our current architecture, both the Object Detection model (YOLO) and the Vision Language Model (e.g., Gemma 3) are deployed within the same Kubernetes pod. This design keeps both models resident in GPU memory simultaneously to minimize inference latency, avoiding the high time cost of swapping models in and out of VRAM (cold starts).

To manage VRAM contention and prevent Out Of Memory (OOM) errors on edge devices with limited resources, we rely on Quantization. This allows powerful model combinations to run concurrently on consumer-grade GPUs or embedded edge hardware without saturating the available VRAM.


Q: Can we split the models into separate pods? How would GPU resources be shared?

A: Yes, the architecture supports decoupling the models into separate Kubernetes pods (e.g., one pod for YOLO, another for the VLM). However, since a physical GPU cannot natively be shared by multiple containers simultaneously without specific configurations, this requires enabling GPU Sharing mechanisms. This depends on the hardware and operator support.


Q: How does this “Agentic” approach differ from traditional computer vision (CV) models?

A: Traditional CV relies on discriminative models (like CNNs) that are trained to classify specific, pre-defined categories (e.g., “helmet” or “no helmet”). These models are “static,” meaning they cannot detect new anomalies—like a spill on the floor—without retraining and redeployment. Our Agentic approach utilizes Generative AI and Vision Language Models (VLMs), which possess semantic reasoning. This allows for Zero-Shot Adaptation, where the system can solve entirely new problems simply by changing the text prompt (e.g., from “check for labels” to “check for damaged corners”) without altering the underlying code.


Q: How does the system achieve sub-second latency while running heavy GenAI models on the edge?

A: We utilize a “Hybrid Intelligence” pipeline that minimizes expensive model calls.

  1. Frame Acquisition: Frames are captured via OpenCV and GStreamer-based pipelines for zero-latency distribution.
  2. Intelligent Sampling: We implement an SSIM Guard (Structural Similarity Index Measure). This metric compares the current frame to the cache; if the structural similarity is high (indicating a static scene), the system reuses the previous decision, reducing Large Language Model (LLM) calls by up to 95%. This concept aligns with similarity caching principles, where returning a “close enough” result drastically improves performance without sacrificing accuracy.
  3. Layered Inference: Fast, classical models like YOLO (You Only Look Once) handle object detection first, while heavier VLMs are invoked only when deep semantic understanding is required.


Q: What specific AI models power this solution?

A: The system leverages Large Multimodal Models (LMMs) or VLMs, such as Gemma 3 or Llama 3.2. Unlike unimodal LLMs that process only text, these models utilize vision encoders (often Vision Transformers) to map visual data into vector embeddings the AI can understand. For example, Gemma 3 provides multimodal understanding (text and image) with a 128k-token context window, allowing it to reason about complex visual scenes directly on edge hardware.


Q: How does the AI “act” on what it sees?

A: The system uses a Decision LLM configured for Tool Calling (or function calling). Instead of just outputting text, the model can generate structured requests to execute specific functions, such as sending an alert via SMTP, updating a dashboard, or stopping a conveyor belt. This transforms the AI from a passive observer into an active agent capable of interacting with external tools and APIs.


Q: What prevents the AI from hallucinating and generating false alerts?

A: Generative models can sometimes produce factually incorrect outputs, known as hallucinations. To mitigate this, we employ Agent Memory and Context. By retaining a history of recent events (short-term memory), the agent can cross-reference current detections with past frames to ensure consistency. Additionally, the Hybrid Intelligence approach fuses the confidence scores of classical CV models (which are less prone to hallucination) with the reasoning of the GenAI model to validate anomalies before triggering an alert.


Q: How is the edge infrastructure secured and managed?

A: The application runs on EVE-OS, a universal, open Linux-based operating system from the LF Edge organization. EVE-OS provides a “secure by default” foundation, utilizing hardware roots of trust (like TPM) and eliminating the risk of “bricking” devices during updates. The orchestration is handled by ZEDEDA, which allows for zero-touch provisioning and secure management of the containerized AI apps (Docker/K3s) across distributed fleets, even in air-gapped or intermittent connectivity environments.


Q: Can this system be applied to industries other than logistics?

A: Yes. While the blog highlights shipping label detection, the architecture is sector-agnostic.

  • Manufacturing: It can be adapted for predictive maintenance by visually monitoring machine health (e.g., vibrations or misalignments) to prevent costly downtime.
  • Retail: It can handle complex tasks like shelf monitoring or customer behavior analysis.
  • Energy: It can help oil producers increase worker safety and reduce injuries, by finding people not wearing PPE (personal protective equipment), such as hard hats, goggles, and gloves.


Q: What hardware is required to run this?

A: To ensure optimal performance, we recommend evaluating the computational cost (FLOPs) of your ML models against the processing power (FLOPS) of your target hardware. This ensures that latency requirements can be met. ZEDEDA supports a wide variety of hardware vendors; please contact us for assistance with your specific hardware target.
