Research

Publications

Peer-reviewed research and preprints from the Medtwin AI team on clinical AI safety, runtime verification, and trustworthy medical systems.

📄
2
Papers Published
🔬
62
Clinical Rubrics Evaluated
🏥
500
Clinical Scenarios Tested

Rubric Gates: Hierarchical Runtime Verification and Rubric-Integrated Training for Clinical AI Safety

A Three-Tier Architecture with Conditioned Generation, Specialist Hives, Curriculum RL, and Continual Self-Evolution

Abhishek Sehgal, Aditi Garg, Dhruva Angachekar, Johnny Kim, Vibhuti Rajpal

Submitted to JAIR (Journal of Artificial Intelligence Research) / CHIL 2026

Training-time alignment provides distributional safety guarantees but does not verify individual outputs at inference time. This gap between population-level and instance-level safety assurance is a fundamental limitation of current approaches.

We introduce RubricGates, a hierarchical runtime verification framework inspired by surgical safety checklists. The system decomposes "is this output safe?" into dozens of independently verifiable clinical checks, organized into a hierarchy that resists gaming. Concretely: 62 rubrics across 7 clinical domains, arranged in three tiers — Frozen (constitutional safety constraints that no optimization process can touch), Governed (clinical domain knowledge requiring human approval to change), and Learnable (task-specific thresholds that self-improve within safety bounds). Each rubric operates as a gate with approve/revise/block semantics. A single Tier-1 failure blocks the output regardless of every other score.
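The tier hierarchy and approve/revise/block semantics described above can be sketched in a few lines. This is an illustrative reconstruction from the abstract, not the paper's implementation; the `Rubric` and `Verdict` names and the per-rubric `check` callables are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class Verdict(Enum):
    APPROVE = "approve"
    REVISE = "revise"
    BLOCK = "block"

@dataclass
class Rubric:
    name: str
    tier: int                          # 1 = Frozen, 2 = Governed, 3 = Learnable
    check: Callable[[str], Verdict]    # hypothetical per-rubric checker

def gate(output: str, rubrics: List[Rubric]) -> Verdict:
    """Aggregate rubric verdicts; a single Tier-1 BLOCK vetoes the output."""
    worst = Verdict.APPROVE
    for r in sorted(rubrics, key=lambda r: r.tier):
        v = r.check(output)
        if r.tier == 1 and v is Verdict.BLOCK:
            return Verdict.BLOCK       # constitutional veto: no other score matters
        if v is Verdict.BLOCK:
            worst = Verdict.BLOCK
        elif v is Verdict.REVISE and worst is Verdict.APPROVE:
            worst = Verdict.REVISE
    return worst
```

The key design point is the early return: Tier-1 (Frozen) failures short-circuit aggregation, so no weighted sum over lower tiers can ever outvote a constitutional safety constraint.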

We also introduce four AI mechanisms that integrate rubrics into the generation and training process: (i) rubric-conditioned generation, which steers LLM hidden states toward rubric-compliant outputs during decoding; (ii) a hive of rubric-specialist small LLMs, where each 2–4B parameter specialist owns a subset of rubrics and a consensus mechanism aggregates their verdicts; (iii) rubric-structured curriculum RL with constrained PPO and PID Lagrangian updates; and (iv) rubric-guided continual self-evolution, a closed-loop system where rubric gate failures are analyzed, converted to targeted training data via self-play, and used to adapt the model through LoRA fine-tuning with catastrophic-forgetting safeguards.
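For mechanism (iii), a PID Lagrangian update typically adjusts a multiplier λ that penalizes constraint violations in the constrained-PPO objective. The sketch below shows the standard form of such an update; the gains and the cost limit are placeholder values, since the abstract does not give the paper's actual hyperparameters.

```python
class PIDLagrangian:
    """PID controller for a Lagrange multiplier enforcing expected cost <= limit.

    Gains (kp, ki, kd) are illustrative defaults, not values from the paper.
    """

    def __init__(self, limit: float, kp: float = 0.1, ki: float = 0.01, kd: float = 0.05):
        self.limit, self.kp, self.ki, self.kd = limit, kp, ki, kd
        self.integral = 0.0
        self.prev_cost = None

    def update(self, cost: float) -> float:
        error = cost - self.limit                 # positive when the constraint is violated
        self.integral = max(0.0, self.integral + error)
        deriv = 0.0 if self.prev_cost is None else cost - self.prev_cost
        self.prev_cost = cost
        # lambda is clamped at zero: the penalty only pushes toward feasibility
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * deriv)
```

Compared to a pure integral (plain Lagrangian) update, the proportional and derivative terms react faster to sudden cost spikes and damp the oscillation between over- and under-penalizing the policy.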

In evaluation on 500 harm-injected clinical scenarios from MIMIC-IV and PhysioNet data, a DeepSeek-V3 LLM judge operating within the gate pipeline achieves HPR 0.875 at a 9.5% false alarm rate.

Clinical AI Safety Runtime Verification Rubric-Based Gating Small Language Models Constrained RL Self-Evolution Patient Safety Certificates
arXiv Preprint · January 2026

Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems

Amir Khurshid, Abhishek Sehgal

arXiv:2601.10681 [cs.AI] · January 2026

Large language model (LLM) contexts are typically constructed via retrieval-augmented generation (RAG), which ranks passages and selects the top k. This approach fragments the structural information graph of documents, over-retrieves, and duplicates content, while still leaving the query under-contextualized — in particular missing second- and third-order facets.


In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties.
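The constrained selection described above — balancing query relevance, marginal coverage, and redundancy under a token budget — resembles a greedy submodular/MMR-style loop. The sketch below is an assumed illustration of that loop; the span representation, the facet-overlap redundancy measure, and the weights `alpha`, `beta`, `gamma` are hypothetical, not the paper's actual scoring function.

```python
def build_bubble(spans, budget, alpha=1.0, beta=0.5, gamma=0.5):
    """Greedily assemble a context bubble under a token budget.

    Each span is a dict: {"text", "tokens", "rel", "facets": set}.
    Gain = alpha * relevance + beta * new facets covered - gamma * facet overlap.
    All weights are illustrative assumptions.
    """
    selected, covered, used = [], set(), 0
    pool = list(spans)
    while True:
        fits = [s for s in pool if used + s["tokens"] <= budget]
        if not fits:
            break

        def gain(s):
            return (alpha * s["rel"]
                    + beta * len(s["facets"] - covered)    # marginal coverage
                    - gamma * len(s["facets"] & covered))  # redundancy penalty
        best = max(fits, key=gain)
        if gain(best) <= 0:
            break                      # nothing left adds net information
        selected.append(best)
        covered |= best["facets"]
        used += best["tokens"]
        pool.remove(best)
    return selected
```

Because the coverage term is recomputed against already-covered facets at each step, a moderately relevant span introducing a new facet can outrank a highly relevant but redundant one — the behavior top-k ranking cannot express.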

Unlike top-k retrieval, the method explicitly constrains diversity and budget, producing compact, informative context sets. It also emits a full retrieval trace that records the scoring and selection decisions for every record, providing auditability and deterministic tuning.

Experiments on enterprise documents show that context bubbles significantly reduce redundant context, cover secondary facets more completely, and improve answer quality and citation faithfulness within a limited context window. Ablation studies show that both structural priors and diversity-constrained selection are necessary: removing either component reduces coverage and increases redundant or incomplete context.

Retrieval-Augmented Generation Context Construction Document Structure Diversity Constraints Enterprise RAG Token Budget Optimization

Active Research Areas

🛡️
Clinical AI Safety
Runtime verification frameworks that ensure every AI output meets clinical safety standards before reaching patients or papers.
🧠
Meta-Cognition Systems
AI agents that reason about their own uncertainty, know when to defer to human experts, and learn from failures.
📊
Reproducible Medical Statistics
Deterministic statistical pipelines with full provenance tracking — every number traces back to its source data and computation.
🔗
Biomedical Knowledge Integration
Federated search across 50+ biomedical databases with cross-reference evidence synthesis and confidence scoring.

Interested in collaborating?

We are actively seeking research partnerships with clinical institutions and AI safety researchers.

Get in Touch →