Rubric Gates: Hierarchical Runtime Verification and Rubric-Integrated Training for Clinical AI Safety
A Three-Tier Architecture with Conditioned Generation, Specialist Hives, Curriculum RL, and Continual Self-Evolution
Submitted to JAIR (Journal of Artificial Intelligence Research) / CHIL 2026
Training-time alignment provides distributional safety guarantees but does not verify individual outputs at inference time. This gap between population-level and instance-level safety assurance is a fundamental limitation of current approaches.
We introduce RubricGates, a hierarchical runtime verification framework inspired by surgical safety checklists. The system decomposes "is this output safe?" into dozens of independently verifiable clinical checks, organized into a hierarchy that resists gaming. Concretely, the framework comprises 62 rubrics across 7 clinical domains, arranged in three tiers: Frozen (constitutional safety constraints that no optimization process can touch), Governed (clinical domain knowledge requiring human approval to change), and Learnable (task-specific thresholds that self-improve within safety bounds). Each rubric operates as a gate with approve/revise/block semantics. A single Tier-1 failure blocks the output regardless of every other score.
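The gate semantics described above can be illustrated with a minimal sketch. It rests on assumptions not stated in the text: a Tier-1 "failure" is taken to mean any Tier-1 verdict other than approve, and the handling of the lower tiers (any block yields block, otherwise any revise yields revise) is a guess. All names (`Tier`, `Verdict`, `GateResult`, `aggregate`) and the rubric identifiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Verdict(Enum):
    APPROVE = "approve"
    REVISE = "revise"
    BLOCK = "block"


class Tier(Enum):
    FROZEN = 1      # Tier 1: constitutional safety constraints
    GOVERNED = 2    # Tier 2: clinical domain knowledge
    LEARNABLE = 3   # Tier 3: task-specific thresholds


@dataclass(frozen=True)
class GateResult:
    rubric_id: str  # hypothetical identifier, for illustration only
    tier: Tier
    verdict: Verdict


def aggregate(results: Iterable[GateResult]) -> Verdict:
    """Combine per-rubric verdicts into one pipeline decision."""
    results = list(results)
    # Assumption: any non-approve verdict from a Frozen (Tier-1) rubric
    # counts as a failure and blocks the output outright.
    if any(r.tier is Tier.FROZEN and r.verdict is not Verdict.APPROVE
           for r in results):
        return Verdict.BLOCK
    # Assumption: in the lower tiers the most severe verdict wins.
    if any(r.verdict is Verdict.BLOCK for r in results):
        return Verdict.BLOCK
    if any(r.verdict is Verdict.REVISE for r in results):
        return Verdict.REVISE
    return Verdict.APPROVE


if __name__ == "__main__":
    example = [
        GateResult("dose-ceiling", Tier.FROZEN, Verdict.REVISE),
        GateResult("guideline-match", Tier.GOVERNED, Verdict.APPROVE),
        GateResult("readability", Tier.LEARNABLE, Verdict.APPROVE),
    ]
    print(aggregate(example).value)
```

The point of the sketch is the veto: the Tier-1 check runs first, so no number of approvals from Governed or Learnable rubrics can outweigh a single Frozen failure.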
We also introduce four AI mechanisms that integrate rubrics into the generation and training process: (i) rubric-conditioned generation, which steers LLM hidden states toward rubric-compliant outputs during decoding; (ii) a hive of rubric-specialist small LLMs, where each 2–4B parameter specialist owns a subset of rubrics and a consensus mechanism aggregates their verdicts; (iii) rubric-structured curriculum RL with constrained PPO and PID Lagrangian updates; and (iv) rubric-guided continual self-evolution, a closed-loop system where rubric gate failures are analyzed, converted to targeted training data via self-play, and used to adapt the model through LoRA fine-tuning with catastrophic-forgetting safeguards.
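For mechanism (iii), the text names PID Lagrangian updates without giving the update rule. The sketch below shows the commonly used PID-controlled form of a Lagrange multiplier for a constrained policy objective; it is not necessarily the authors' formulation. The class name, the gains `kp`, `ki`, `kd`, and the `cost_limit` value are all hypothetical.

```python
class PIDLagrangian:
    """Sketch of a PID-controlled Lagrange multiplier (assumed form).

    The multiplier penalizes constraint cost in a constrained policy
    objective, e.g. reward - lam * cost. Gains and limit are hypothetical.
    """

    def __init__(self, kp: float, ki: float, kd: float, cost_limit: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None  # no derivative term on the first update

    def update(self, cost: float) -> float:
        """Return the new multiplier given the latest measured cost."""
        error = cost - self.cost_limit                    # proportional term
        # Integral term, clamped so it cannot go negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, cost - self.prev_cost)
        self.prev_cost = cost
        lam = (self.kp * error
               + self.ki * self.integral
               + self.kd * derivative)
        return max(0.0, lam)  # the multiplier must stay non-negative


if __name__ == "__main__":
    ctrl = PIDLagrangian(kp=1.0, ki=0.5, kd=0.0, cost_limit=0.1)
    for measured_cost in (0.3, 0.2, 0.05):
        print(round(ctrl.update(measured_cost), 4))
```

Compared with a purely integral (plain gradient-ascent) multiplier update, the proportional and derivative terms respond immediately to constraint violations, which is the usual motivation for the PID variant.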
In evaluation on 500 harm-injected clinical scenarios from MIMIC-IV and PhysioNet data, a DeepSeek-V3 LLM judge operating within the gate pipeline achieves HPR 0.875 at a 9.5% false alarm rate.