Output moderation
Output moderation runs after the upstream model has answered but before the response is returned to the caller. The LLM does see the prompt; routeur.ai protects the end user from what the model produced.
Output moderation is different from DLP. DLP protects the upstream model from your input. Output moderation protects your users from the model's output. They run at different points in the request lifecycle.
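The ordering described above can be sketched as a small pipeline. This is only an illustration of where each stage sits: `handle_request` and its three callables (`dlp`, `upstream`, `moderate`) are hypothetical names and are not part of the routeur.ai API.

```python
from typing import Callable


def handle_request(
    prompt: str,
    dlp: Callable[[str], str],
    upstream: Callable[[str], str],
    moderate: Callable[[str], str],
) -> str:
    """Hypothetical request lifecycle: DLP, then the model, then output moderation."""
    checked_prompt = dlp(prompt)           # DLP runs on the input, before the upstream call
    raw_answer = upstream(checked_prompt)  # the upstream model answers the prompt
    return moderate(raw_answer)            # output moderation runs before returning to the caller
```

With stub callables that record their own invocation, the call order comes out as DLP, upstream, moderation, which is the point the two paragraphs above make.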
Actions
- log (non-blocking): Records the match on the trace without modifying the response.
- redact: Replaces the matched substring with [REDACTED] in the caller-visible content.
- rewrite: Replaces the matched span with a configured substitution string.
- block (replaces the response): Returns 403 blocked_by_moderation instead of the upstream content. Token usage is still recorded.
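To make the difference between the four actions concrete, here is a minimal sketch of how each one might transform a response. It assumes a single regular-expression detector; `apply_action`, `ModerationResult`, and the `substitution` parameter are hypothetical names used for illustration, and routeur.ai's actual implementation may work differently.

```python
import re
from dataclasses import dataclass


@dataclass
class ModerationResult:
    status: int    # HTTP status the caller would see
    body: str      # caller-visible content (or error code when blocked)
    matched: bool  # whether the detector fired


def apply_action(action: str, content: str, pattern: str,
                 substitution: str = "") -> ModerationResult:
    """Hypothetical illustration of the log / redact / rewrite / block actions."""
    if re.search(pattern, content) is None:
        return ModerationResult(200, content, False)
    if action == "log":
        # Non-blocking: the response is passed through unchanged.
        return ModerationResult(200, content, True)
    if action == "redact":
        return ModerationResult(200, re.sub(pattern, "[REDACTED]", content), True)
    if action == "rewrite":
        # A function replacement keeps backslashes in the substitution literal.
        return ModerationResult(200, re.sub(pattern, lambda m: substitution, content), True)
    if action == "block":
        # The upstream content is dropped and replaced by the error code.
        return ModerationResult(403, "blocked_by_moderation", True)
    raise ValueError(f"unknown action: {action}")
```

Only `block` changes the status code in this sketch; `log`, `redact`, and `rewrite` all return a normal response, with or without modified content.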
Example: block a managed category
Rule definition:
{
  "name": "secret_leak_block",
  "category": "secret-leak",
  "action": "block",
  "severity": "high",
  "enabled": true
}
Response returned to the caller when the rule matches:
{
  "error": {
    "code": "blocked_by_moderation",
    "message": "moderation:secret_leak_block",
    "type": "routeur_error"
  }
}
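On the caller side, a blocked response can be told apart from other 403 errors by its error code. The sketch below parses the error body shown above; `blocked_rule_name` is a hypothetical helper, and the `moderation:<rule name>` message format is an assumption inferred from this single example rather than a documented contract.

```python
import json
from typing import Optional


def blocked_rule_name(status: int, body: str) -> Optional[str]:
    """Return the moderation rule name if the response was blocked, else None."""
    if status != 403:
        return None
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return None
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict) or error.get("code") != "blocked_by_moderation":
        return None
    message = str(error.get("message", ""))
    prefix = "moderation:"  # assumed message format, taken from the example above
    return message[len(prefix):] if message.startswith(prefix) else message
```

A caller could use the returned rule name to show a neutral fallback message to the end user instead of surfacing the raw error.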
Managed categories
routeur.ai ships a managed set of categories that map to shared policy detectors:
- prompt-injection: response contains an attempt to subvert the system prompt.
- pii-leak: response surfaces personally-identifiable information not present in the prompt.
- secret-leak: response contains a credential format (API key, private key block).
- unsafe-code: response contains a runnable destructive command (e.g. rm -rf /).
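As a rough idea of what a category such as secret-leak looks for, the following sketch scans a response for two well-known credential formats: a PEM private key header and an AWS access key ID. These patterns and the `find_secret_formats` helper are illustrative assumptions only; they are not routeur.ai's managed detectors, which are not described here.

```python
import re

# Illustrative patterns only; a real detector set would be broader and better tuned.
SECRET_PATTERNS = {
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_secret_formats(text: str) -> list[str]:
    """Return the names of the credential formats found in text, in dictionary order."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```

A rule like the secret_leak_block example would then apply its configured action whenever such a detector reports a match.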