You deployed a chatbot. It answers customer questions, handles support tickets, recommends products. It works well. Your monitoring dashboard shows green. Response times are fine. Uptime is 100%.
Then someone screenshots your chatbot telling a grieving customer about a discount that doesn't exist. Or giving legal advice that's illegal. Or writing a poem about how terrible your company is.
This keeps happening. Air Canada's chatbot invented a fake bereavement fare and the company had to pay damages. Character.AI's chatbot was linked to a teenager's death. NYC's government chatbot told citizens they could legally discriminate against employees. A Chevrolet dealership's ChatGPT bot agreed to sell a Tahoe for $1 and confirmed the deal was "legally binding." DPD's delivery bot swore at a customer and called DPD "the worst delivery firm in the world."
In every case, the infrastructure was fine. The servers were up. The API was responding. The chatbot was working exactly as deployed. It was just saying things it shouldn't.
And in most cases, the company found out because someone posted a screenshot on social media. Not because their monitoring caught it.
The monitoring gap
If you've deployed an LLM-powered chatbot, you probably have some combination of these:
- Infrastructure monitoring (is the server up, is the API responding)
- Application monitoring (error rates, latency, throughput)
- Log aggregation (conversations stored somewhere, maybe sampled for review)
What you probably don't have is something watching what the chatbot actually says in real time and flagging when it crosses a line.
Most teams rely on one of two approaches, and both have problems.
Provider-level safety filters. OpenAI has a moderation API. Google's Vertex AI has configurable safety thresholds. Amazon Bedrock has guardrails. Anthropic bakes safety into Claude's training. These are useful. They catch the obvious stuff: slurs, explicit content, direct incitement to violence. But they're designed to catch clearly harmful content, not the subtle failures that cause real damage. Air Canada's chatbot didn't use hate speech. It just made up a policy. NYC's chatbot didn't threaten anyone. It just gave confidently wrong legal advice. Provider filters don't catch hallucinated facts, off-brand responses, or creative misinterpretations of your business rules.
Retrospective log review. Many teams log conversations and sample-review them later. A QA person reads through 5% of yesterday's conversations looking for problems. This catches patterns over time, but it's slow. The DPD incident went viral within hours. By the time a weekly review would have caught it, the screenshot was already on every tech news site. And if your sampling rate is 5%, you're missing 95% of what the chatbot says.
Neither approach monitors the actual live output in real time.
What "monitoring your chatbot" actually means
There are really three layers to this, and most teams only have the first one:
Layer 1: Is it running? Uptime checks, health endpoints, response time. Every monitoring tool handles this. If your chatbot API goes down, you know immediately.
Layer 2: Is it running correctly? Error rates, timeout rates, model version tracking, token usage, cost per conversation. Application performance monitoring tools handle this. Datadog, New Relic, custom dashboards.
Layer 3: Is it saying acceptable things? Content safety classification of actual outputs. This is the layer almost nobody has. It's the layer that would have caught the Air Canada, DPD, Character.AI, NYC, and Chevrolet incidents before they became news.
Approaches to layer 3
Here's what exists right now for monitoring what your chatbot actually says:
Inline guardrails (pre-response filtering)
Tools like Guardrails AI, Nvidia NeMo Guardrails, and Lakera Guard sit between your LLM and the user. Every response passes through a filter before it reaches the user. The filter checks for toxicity, prompt injection, PII leakage, off-topic responses, and whatever rules you define.
This is the most common approach for teams that take content safety seriously. It works well for blocking clearly harmful outputs. The tradeoff is latency: a classification step on every response adds 100-500ms, depending on the tool and the complexity of your rules. For a support chatbot where response time matters, that's noticeable.
The bigger limitation: these tools see individual responses in isolation. They catch "this response contains a slur" but not "this response invented a company policy that doesn't exist" unless you've written explicit rules for every policy. They're only as good as the rules you configure, and writing comprehensive rules for everything a chatbot might hallucinate is not a tractable problem.
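To make the pattern concrete, here's a stripped-down sketch of an inline guardrail. The rules and fallback message are placeholders, and real tools like Guardrails AI and NeMo Guardrails ship far richer validators. This toy version also inherits exactly the limitation just described: it only catches what you wrote a rule for.

```python
import re
from typing import Callable

# Hypothetical rule set; real guardrail tools ship far richer validators.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US-SSN-shaped numbers
    re.compile(r"\b\d{16}\b"),             # bare 16-digit card-like numbers
]
BANNED_PHRASES = ["legally binding", "guaranteed refund"]  # example business rules

FALLBACK = "I can't help with that directly. Let me connect you with a human agent."

def check_response(text: str) -> list[str]:
    """Return a list of rule violations found in a candidate response."""
    violations = []
    for pat in PII_PATTERNS:
        if pat.search(text):
            violations.append(f"pii:{pat.pattern}")
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            violations.append(f"banned_phrase:{phrase}")
    return violations

def guarded_reply(user_msg: str, generate: Callable[[str], str]) -> str:
    """Run the model, then gate its output through the rule checks."""
    candidate = generate(user_msg)
    return FALLBACK if check_response(candidate) else candidate
```

The structure is the whole point: the model's output never reaches the user without passing through `check_response` first, which is where the added latency comes from.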
LLM-as-judge (automated review)
Use a second LLM to evaluate whether the first LLM's output is appropriate. You send each response (or a sample) to a judge model with instructions like "evaluate whether this response is factually accurate, on-brand, and safe."
This catches subtler issues than rule-based filters. A judge model can understand context and identify hallucinated policies or off-brand tone. The tradeoffs: cost (you're paying for two LLM calls per interaction), latency (if inline), and reliability (the judge model can also be wrong).
Some teams use this asynchronously. The chatbot responds immediately, but every response is queued for judge evaluation in the background. If the judge flags something, the team gets notified and can intervene. This doesn't prevent the harmful response from being seen, but it catches it faster than manual review.
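The async variant is simple to sketch. The judge prompt wording, the PASS/FAIL verdict format, and the stubbed-out judge call below are all assumptions for illustration; the pattern is what matters:

```python
import queue
import threading

JUDGE_PROMPT = """You are reviewing a customer-support chatbot response.
Reply with exactly one line: PASS, or FAIL: <short reason>.
Check: factual accuracy, on-brand tone, no invented policies.

Response to review:
{response}"""

def parse_verdict(judge_output: str) -> tuple[bool, str]:
    """Parse the judge model's one-line verdict (format is an assumption)."""
    line = judge_output.strip().splitlines()[0]
    if line.startswith("PASS"):
        return True, ""
    if line.startswith("FAIL"):
        return False, line.partition(":")[2].strip()
    return False, "unparseable verdict"  # fail closed on malformed judge output

def judge_worker(work: queue.Queue, call_judge, alert) -> None:
    """Background loop: evaluate responses without adding user-facing latency."""
    while True:
        response = work.get()
        if response is None:         # sentinel: shut down the worker
            break
        ok, reason = parse_verdict(call_judge(JUDGE_PROMPT.format(response=response)))
        if not ok:
            alert(response, reason)  # e.g. notify on-call, post to incident channel
        work.task_done()
```

The chatbot enqueues each response and answers the user immediately; the worker thread pays the judge-model cost off the critical path. Note the fail-closed branch: if the judge itself returns something malformed, the response gets flagged rather than silently passed.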
Live page monitoring
This is what I built into Monit247. Instead of inserting into the application pipeline, it monitors the live rendered page where the chatbot's responses appear. It visits the URL, reads the content, and classifies it against 11 harm categories (hate speech, harassment, self-harm, sexual content, violence, dangerous content, discrimination, profanity, threats, child safety, substance abuse).
The advantage: no integration into your application code. You don't need to modify your chatbot pipeline or add middleware. You give it the URL where your chatbot widget lives and it monitors what's actually visible to users.
The limitation: it monitors at intervals (1 to 60 minutes), not in real time per-message. It catches persistent harmful content on a page, not a single bad response in a conversation that's already gone. It's better suited to catching defacement, sustained chatbot misbehavior, or harmful content that stays visible than to catching a one-off bad response.
For catching every individual response, inline guardrails or LLM-as-judge are more appropriate. For catching "the chatbot has been giving bad advice for the last 3 hours and nobody noticed," live page monitoring fills the gap.
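For illustration only (this is not how Monit247 works internally), the polling pattern looks roughly like this. The keyword "classifier" here is a toy stand-in; classifying against the 11 harm categories takes an ML model, not a watch list, and real monitors render JavaScript rather than parsing raw HTML:

```python
import time
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude visible-text extractor; real monitors render JS-driven widgets."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def page_text(url: str) -> str:
    """Fetch a page and flatten it to plain text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)

# Toy stand-in for a real harm classifier (hypothetical watch terms).
WATCH_TERMS = ["worst delivery firm", "legally binding"]

def classify(text: str) -> list[str]:
    lowered = text.lower()
    return [t for t in WATCH_TERMS if t in lowered]

def monitor(url: str, interval_s: int, alert, rounds: int) -> None:
    """Poll the page on an interval and alert on any classifier hits."""
    for _ in range(rounds):
        hits = classify(page_text(url))
        if hits:
            alert(url, hits)
        time.sleep(interval_s)
```

The property worth noticing: nothing here touches your chatbot's code. The monitor only sees what a visitor would see, which is both its strength (zero integration) and its limit (interval granularity).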
What you should actually set up
There's no single tool that covers everything. Here's what a practical setup looks like:
At minimum (if you do nothing else):
- Use your LLM provider's built-in safety features. Turn them on. Don't lower the thresholds to reduce refusals unless you've thought carefully about the tradeoffs.
- Log every conversation. Not just errors, every conversation. Storage is cheap. The ability to investigate an incident after it happens is essential.
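Logging every conversation doesn't need infrastructure. An append-only JSONL file is enough to start; the field names below are illustrative, but recording the model version alongside each turn is what makes post-incident investigation possible:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("conversations.jsonl")  # illustrative location

def log_turn(conversation_id: str, role: str, text: str, model: str) -> dict:
    """Append one chat turn as a JSON line: cheap, greppable, replayable."""
    record = {
        "ts": time.time(),
        "conversation_id": conversation_id,
        "role": role,     # "user" or "assistant"
        "model": model,   # which model version produced this turn
        "text": text,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

When something goes wrong, `grep` over a JSONL file answers "what did the bot actually say, and which model said it" in seconds, which is the entire point of logging everything rather than sampling.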
If you're serious about it:
- Add inline guardrails for your highest-risk categories (PII leakage, off-topic responses, explicit content). Guardrails AI and NeMo Guardrails are both open source.
- Set up live page monitoring for the pages where your chatbot appears. This catches the case where your chatbot goes off the rails and nobody on your team notices because the infrastructure monitoring says everything is fine.
- Define your chatbot's boundaries explicitly. Not just "be helpful." Write down: what topics can it discuss? What should it refuse? What's the escalation path when it doesn't know the answer? Vague instructions produce vague behavior.
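A boundary spec can be as simple as a dict that feeds the system prompt. The topics and wording below are hypothetical; the point is that the boundaries live in code where they can be reviewed, versioned, and tested, instead of in someone's head:

```python
# Hypothetical boundary spec: the point is that it's written down, not implied.
BOUNDARIES = {
    "allowed_topics": ["order status", "shipping", "returns", "product questions"],
    "refuse_topics": ["legal advice", "medical advice", "pricing exceptions"],
    "escalation": "offer to connect the customer with a human agent",
}

def build_system_prompt(b: dict) -> str:
    """Render an explicit boundary spec into the chatbot's system prompt."""
    return (
        "You are a customer support assistant.\n"
        f"Only discuss: {', '.join(b['allowed_topics'])}.\n"
        f"Politely refuse: {', '.join(b['refuse_topics'])}.\n"
        f"If you are unsure or the question is out of scope, {b['escalation']}.\n"
        "Never invent policies, discounts, or commitments not listed above."
    )
```

A spec like this also gives your guardrails and your judge model something concrete to check against: "is this response inside the allowed topics" is an answerable question; "is this response helpful" is not.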
If you're regulated:
- The EU AI Act requires deployers of high-risk AI systems to monitor operations and report serious incidents. If your chatbot operates in healthcare, employment, education, or financial services, you probably have reporting obligations.
- The UK Online Safety Act requires platforms to take proactive steps to prevent harmful content, including AI-generated content.
- Log retention and audit trails become requirements, not nice-to-haves.
The real problem
The real problem isn't technical. The tools exist. Guardrails, moderation APIs, LLM-as-judge, live page monitoring. They all work to varying degrees.
The real problem is that most teams deploy a chatbot and then treat it like a static feature. They test it before launch, maybe run a red-team session, and then move on to the next thing. The chatbot keeps running, the conversations keep happening, and nobody is watching.
Stanford's AI Index Report tracked a 56% year-over-year increase in AI safety incidents between 2023 and 2024. The incidents aren't getting less frequent. The chatbots are getting more capable, and more capable models fail in more creative ways.
If you've deployed a customer-facing chatbot, monitoring what it says is not optional anymore. The question is just how much coverage you need.
If you want to start with live page monitoring for your chatbot, Monit247 has AI content monitoring on every plan including free. It won't catch every bad response in real time, but it will catch the case where your chatbot page is serving harmful content and your uptime monitor is reporting all clear.
Try Monit247 free
7 monitor types including AI content monitoring. No credit card required.
Get Started Free