Vicert's Perspective: Applying AI and LLMs to Healthcare Documentation Challenges
Case Study
At Vicert, we set out to explore how AI and large language models (LLMs) could tackle one of healthcare's most persistent operational challenges: systematically measuring clinical documentation quality at scale.
Healthcare organizations have clear definitions of quality clinical documentation through standards like QNOTE and PDQI-9—criteria requiring thorough notes that demonstrate clinical reasoning, justify medical necessity, and enable care coordination. The real challenge wasn't understanding quality—it was systematically measuring it at the massive scale healthcare demands.
Manual documentation review provides thoroughness but reaches only a small sample of total documentation. Rule-based automated systems can analyze everything but fail to measure sophisticated criteria like "thoroughness" or "clarity of clinical reasoning"—qualities that demand contextual understanding rather than simple pattern recognition.
This fundamental disconnect between meaningful standards and practical measurement capabilities creates an ongoing operational challenge. Organizations must either water down their standards to match their technological limitations, or preserve sophisticated standards while accepting severely limited evaluation coverage.
At Vicert, we developed a clinical documentation quality evaluation framework to demonstrate how artificial intelligence could connect sophisticated healthcare standards with the operational scale healthcare requires.
The Healthcare Challenge We Addressed
We focused on a challenge with real, measurable business impact. Inadequate documentation quality drives tangible costs: insurance claims face denial or downcoding when documentation lacks necessary elements, directly reducing revenue. Compliance audits uncover documentation gaps. Quality performance metrics drop when high-quality care lacks proper documentation. Provider evaluation becomes inconsistent when different reviewers interpret standards subjectively.
We addressed the core issue underlying these problems—the basic incompatibility between complexity and scale:
The complexity dimension: Healthcare standards like QNOTE and PDQI-9 establish quality through inherently sophisticated criteria. Documentation needs to achieve "clarity with comprehensiveness" and "conciseness with completeness." Clinical notes must "show clinical reasoning," "justify medical necessity," and "enable care coordination." These aren't simple binary conditions—they represent nuanced assessments along continuous spectra that traditional technology cannot measure.
The scale dimension: Major healthcare systems produce millions of clinical documents each year. Manual evaluation reaches only small samples. Traditional improvement methods attempted to address this disconnect by reducing standards to simple checkboxes that automated systems could process—but this fundamentally altered the measurement and overlooked critical quality dimensions.
The Vicert Solution: A Flexible AI Evaluation Framework
We built our platform around a fundamental insight: large language models possess the capability to comprehend and evaluate nuanced criteria in ways traditional technology cannot achieve. While rule-based systems search for specific patterns and conditions, LLMs actually interpret meaning and context—they can assess whether clinical documentation "shows empathy" or "contains adequate detail for care continuity" because they comprehend the content rather than merely scanning for specific terms.
Our framework incorporates three core components:
1. Configurable Evaluation Standards
The platform accommodates both established industry standards (QNOTE, PDQI-9) and customized evaluation criteria matching organizational needs. This adaptability proved essential—different healthcare organizations operate with varying payer requirements, patient demographics, and regulatory obligations.
We organized evaluation standards hierarchically, reflecting how healthcare professionals naturally conceptualize documentation quality. Question Groups arrange related evaluation criteria into coherent sections. QNOTE, for instance, contains groups such as "Chief Complaint(s)", "History of Present Illness (HPI)", "Assessment (diagnosis; differential)", and "Plan of Care".
Within each group, Questions or Attributes specify particular evaluation criteria. Each question includes an attribute name identifying what to evaluate (e.g., "Conciseness", "Sufficient information", "Clear clinical reasoning") and an ideal note description establishing the quality benchmark.
For example, the "History of Present Illness" group evaluates whether the HPI contains sufficient information, is concise, clear, and organized—each with specific descriptions of what the ideal documentation looks like.
The platform supports four distinct scoring approaches, providing organizations flexibility to select evaluation scales matching their requirements:
- Rubrics - Categorical scales with levels like "Fully", "Partially", and "Unacceptable" for comprehensive quality evaluation
- Yes/No Binary - Straightforward binary evaluation for compliance verification
- Scale 1 to 5 - Five-point ordinal scale with descriptive anchors for comparative tracking
- Scale 1 to 10 - Ten-point scale for maximum precision in performance evaluation
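One plausible way to operationalize these scales is to map each scoring approach to the instruction embedded in the evaluation prompt. The sketch below illustrates the idea; the enum values and prompt wording are assumptions, not the platform's actual prompt text:

```python
from enum import Enum

class ScoreType(Enum):
    RUBRIC = "rubric"
    YES_NO = "yes_no"
    SCALE_1_5 = "scale_1_5"
    SCALE_1_10 = "scale_1_10"

# Hypothetical prompt fragments; the real platform's wording will differ.
SCORING_INSTRUCTIONS = {
    ScoreType.RUBRIC:
        'Rate the attribute as "Fully", "Partially", or "Unacceptable" '
        "and justify the rating.",
    ScoreType.YES_NO:
        'Answer "Yes" or "No": does the note satisfy this requirement?',
    ScoreType.SCALE_1_5:
        "Rate the attribute from 1 (poor) to 5 (excellent) using the "
        "descriptive anchors provided.",
    ScoreType.SCALE_1_10:
        "Rate the attribute from 1 to 10, using the full range to "
        "capture fine-grained differences in performance.",
}
```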
Organizations can create highly specialized evaluation standards matching specific clinical departments, encounter categories, or organizational objectives. We developed examples like a cardiology-focused standard that evaluates cardiovascular documentation using criteria cardiology teams prioritize—criteria that general standards don't address.
This custom cardiology standard employs four distinct scoring approaches strategically aligned with evaluation requirements. It uses rubrics for patient history and treatment domains, a 1-5 scale for diagnostics, a 1-10 scale for risk stratification where granular differentiation matters, and Yes/No for checking the presence of regulatory requirements like patient education and consent.
The criteria capture cardiology-specific elements that general standards miss: cardiovascular disease family history specifically (not just "family history"), ECG findings contextualized with symptoms (not just "test results included"), and ejection fraction, arrhythmia burden, and sudden cardiac death risk assessment (not just "risk discussed").
A cardiology department using this standard can verify consistent documentation of cardiovascular risk factors, monitor appropriate cardiac risk stratification, confirm adequate documentation detail for device implantations and surgical procedures, track informed consent documentation compliance, and discover training opportunities where providers inadequately document critical cardiovascular prognostic measures.
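As an illustration, a custom standard like this one could be declared as plain data, pairing each section with the scoring approach it uses. The structure and attribute names below are hypothetical:

```python
# Hypothetical declaration of the cardiology standard described above.
cardiology_standard = {
    "name": "Cardiology Documentation Quality",
    "groups": [
        {"name": "Patient History",
         "score_type": "rubric",
         "attributes": ["Cardiovascular disease family history",
                        "Cardiac risk factors documented"]},
        {"name": "Treatment",
         "score_type": "rubric",
         "attributes": ["Procedure and device documentation detail"]},
        {"name": "Diagnostics",
         "score_type": "scale_1_5",
         "attributes": ["ECG findings contextualized with symptoms"]},
        {"name": "Risk Stratification",
         "score_type": "scale_1_10",
         "attributes": ["Ejection fraction documented",
                        "Arrhythmia burden assessed",
                        "Sudden cardiac death risk assessed"]},
        {"name": "Regulatory Requirements",
         "score_type": "yes_no",
         "attributes": ["Patient education documented",
                        "Informed consent documented"]},
    ],
}
```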
The critical advantage: organizations can establish quality definitions matching their experts' standards, not their technology's constraints.
2. Scalable Evaluation API
The evaluation engine is exposed through a straightforward API that accepts clinical notes and returns structured evaluations. Organizations submit a clinical note along with the evaluation standard to apply; the system retrieves the corresponding evaluation prompts; the LLM generates a structured evaluation with a score and a comprehensive explanation for each criterion; and results are returned in a standardized format that enables workflow integration.
This architecture achieves scalable evaluation with consistency. All notes undergo evaluation using identical sophisticated criteria through the same methodology. Providers receive feedback within hours of documentation completion.
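To show the integration shape, here is a hedged sketch of what a client call might look like. The endpoint, payload fields, and authentication scheme are assumptions for illustration, not our published interface:

```python
import requests
from pathlib import Path

# Hypothetical endpoint and payload shape, not the actual Vicert API.
API_URL = "https://api.example.com/v1/evaluations"

payload = {
    "standard": "qnote",                          # evaluation standard to apply
    "note": Path("clinical_note.txt").read_text(),
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},
    timeout=120,
)
response.raise_for_status()
evaluation = response.json()  # per-criterion scores plus explanations
```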
What evaluation results look like: Each evaluation includes both scores and comprehensive explanations justifying each score. For example, evaluating a STEMI patient note might show that the History of Present Illness receives "Fully" ratings for sufficient information and conciseness, with detailed reasoning explaining why. The Plan of Care might receive "Partially" with specific guidance on missing elements like heparin dosing, consent documentation, and disposition plans.
The system doesn't merely identify problems—it precisely indicates what's missing and provides actionable guidance. It acknowledges strengths alongside improvement areas, and demonstrates comprehension of clinical context, recognizing urgent situations rather than mechanically checking boxes.
For a complete QNOTE evaluation, a clinical note receives 43 individual assessments (one per QNOTE attribute across all sections). Each delivers specific, contextualized guidance.
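The structured output for the STEMI example above might look roughly like this; the field names and wording are illustrative only:

```python
# Illustrative shape of two per-attribute results from the STEMI example;
# field names and explanation text are assumptions, not actual output.
assessments = [
    {
        "group": "History of Present Illness (HPI)",
        "attribute": "Sufficient information",
        "rating": "Fully",
        "explanation": "Onset, character, and timing of chest pain are "
                       "documented and support the STEMI diagnosis.",
    },
    {
        "group": "Plan of Care",
        "attribute": "Sufficient information",
        "rating": "Partially",
        "explanation": "The plan omits heparin dosing, consent "
                       "documentation, and a disposition plan.",
    },
]
```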
Following complete evaluation, the system generates summary scores aggregating all individual assessments. Each rating receives numeric values (Fully = 10 points, Partially = 5 points, Unacceptable = 0 points), producing an overall score. For QNOTE evaluations with 43 attributes, the maximum possible score is 430 points.
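The aggregation itself is simple arithmetic. A minimal sketch, assuming rubric ratings only:

```python
# Minimal sketch of the summary-score aggregation described above.
RUBRIC_POINTS = {"Fully": 10, "Partially": 5, "Unacceptable": 0}

def overall_score(ratings: list[str]) -> int:
    """Sum the point values of per-attribute rubric ratings."""
    return sum(RUBRIC_POINTS[r] for r in ratings)

# 43 QNOTE attributes rated "Fully" yield the 430-point maximum.
assert overall_score(["Fully"] * 43) == 430
```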
The overall score enables benchmarking across departments and providers, trend monitoring over time, establishing performance standards for various scenarios, and tracking individual provider growth through training initiatives.
3. Quality Verification Layer
A fundamental challenge with AI-generated evaluations involves establishing trust. Our platform incorporates a verification layer addressing this issue.
Using a framework called OpenEvals, the system runs quality assessments on its own evaluations, checking for hallucinations (does an evaluation make claims unsupported by the actual documentation?) and conciseness (is an evaluation clear and targeted, or verbose and redundant?).
These checks run automatically per evaluation, generating quality scores for the evaluations themselves. Organizations establish confidence requirements—evaluations failing quality thresholds trigger human review rather than automatic acceptance.
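A sketch of what such a verification gate could look like with OpenEvals' prebuilt judges follows. The prompt variables, the continuous scoring flag, and the 0.8 threshold are assumptions drawn from the library's public examples, and the gating policy is ours for illustration; check the openevals documentation for exact signatures:

```python
# Sketch of a verification pass using openevals' prebuilt LLM-as-judge
# prompts; exact signatures and prompt variables may differ by version.
from openevals.llm import create_llm_as_judge
from openevals.prompts import HALLUCINATION_PROMPT, CONCISENESS_PROMPT

hallucination_judge = create_llm_as_judge(
    prompt=HALLUCINATION_PROMPT,
    feedback_key="hallucination",
    model="openai:o3-mini",
    continuous=True,  # score in [0, 1] instead of a boolean
)
conciseness_judge = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="conciseness",
    model="openai:o3-mini",
    continuous=True,
)

def passes_verification(note: str, evaluation: str,
                        threshold: float = 0.8) -> bool:
    """Gate an AI-generated evaluation; failures go to human review."""
    grounded = hallucination_judge(
        inputs="Evaluate this clinical note against the standard.",
        outputs=evaluation,
        context=note,           # the documentation the claims must match
        reference_outputs="",   # required by the prompt template, unused
    )
    concise = conciseness_judge(
        inputs="Evaluate this clinical note against the standard.",
        outputs=evaluation,
    )
    return grounded["score"] >= threshold and concise["score"] >= threshold
```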
This verification architecture establishes accountability and preserves trust, recognizing AI systems' fallibility and incorporating systematic error detection before operational impact occurs.
Real-World Applications: From Clinical Notes to Broader Documentation Challenges
Clinical documentation quality served as our starting point due to clear business impacts and volumes making manual review unfeasible. However, the platform architecture reaches well beyond clinical notes. The core capability—comprehending nuanced criteria and applying them scalably—functions wherever healthcare organizations require systematic evaluation of human-created content against sophisticated standards.
Clinical Note Quality Assessment: The primary use case is assessing clinical notes against QNOTE, PDQI-9, or organizational standards before claim submission. Organizations can detect documentation deficiencies while corrections remain possible, monitor provider performance uniformly, deliver immediate feedback, and identify systematic patterns.
Prior Authorization Documentation: Organizations can establish payer-specific evaluation criteria and assess requests pre-submission. This exceeds simple checkbox verification; it evaluates whether clinical narratives successfully demonstrate medical necessity per each payer's nuanced requirements, resulting in enhanced approval rates with decreased administrative workload.
Discharge Summary Quality: Discharge summaries undergo evaluation for completeness and clarity before patient discharge. The platform checks whether summaries contain medication reconciliation, clear follow-up directions, and essential narrative context for subsequent providers—complex criteria that directly affect readmission rates but resist rule-based evaluation.
Quality Measure Documentation: Before quality reporting deadlines, organizations systematically verify the documentation supporting quality measures. Quality scores frequently suffer not because of care failures, but because of documentation inadequacies.
Peer Review and Case Analysis: Organizations can establish custom peer review evaluation frameworks—care decision appropriateness, clinical pathway compliance, reasoning quality—and apply them uniformly across reviews, introducing consistency to traditionally subjective processes.
The unifying element across these applications: sophisticated quality standards systematically applied at scale to human-created healthcare content.
Key Learnings from Building This Platform
Our development and testing revealed several critical insights:
Customization proved indispensable. Healthcare's diversity precludes universal solutions. Various organizations face different documentation demands based on payer relationships, patient characteristics, and internal priorities. Platforms forcing organizations to accept pre-defined criteria fail to address practical requirements.
Verification required foundational integration. Establishing trust in AI evaluations demands systematic quality validation. Organizations need assurance that evaluations are accurate and grounded in the real documentation. Embedding verification in the core architecture—rather than adding it afterward—proves essential for operational implementation.
Standards existed, but scalable application represented the gap. Healthcare professionals had created sophisticated documentation standards like QNOTE and PDQI-9. The technology gap wasn't in quality definition—it involved consistent evaluation across millions of documents. LLMs address that gap.
The pattern generalizes broadly. Though we initially targeted clinical notes, the architecture functions wherever healthcare professionals create content requiring complex standard compliance. The investment creates reusable capability, not single-function tools.
Human expertise gains power rather than losing relevance. The platform doesn't replace clinicians and quality professionals who established these standards. It amplifies their expertise. What previously applied only to small samples through costly manual review now scales across comprehensive documentation.
Implications for Healthcare Organizations
Our platform at Vicert proves that the divide between sophisticated documentation standards and operational scale can be bridged. Organizations needn't choose between substantive quality criteria and comprehensive assessment.
For healthcare organizations considering AI-powered quality assurance, several insights from our work prove valuable:
Begin with clear business value. We targeted clinical notes because denied claims, audit findings, and quality measure deficiencies create quantifiable costs, which makes the ROI of improvement measurable.
Favor adaptability over fixed solutions. Healthcare organizations possess varied requirements. Platforms adjusting to your standards and criteria provide greater value than inflexible, vendor-specified solutions.
Integrate verification fundamentally. Healthcare organizations require AI output confidence. Quality validation shouldn't be optional or subsequent—it must be an architectural foundation.
Consider capabilities over isolated solutions. Organizations implementing AI evaluation for clinical notes gain insights applicable to prior authorizations, discharge summaries, peer reviews, and numerous other documentation categories.
Moving Forward
Our clinical documentation quality evaluation framework at Vicert demonstrates what AI enables when thoughtfully applied to genuine healthcare operational problems. Sophisticated standards like QNOTE and PDQI-9 receive evaluation at healthcare's required scale—comprehensively, consistently, and with suitable quality safeguards.
This doesn't replace human expertise. It scales that expertise. The clinicians and quality professionals who created documentation standards remain vital—their knowledge now extends to every clinical note rather than limited samples.
At Vicert, we developed this platform after recognizing a distinct gap between healthcare organizations' requirements and existing tools' capabilities. Technology can connect sophisticated standards with operational scale when implemented with adaptability, transparency, and trust.
Organizations successfully deploying AI for quality assurance begin with defined business challenges, implement solutions adjusting to their particular requirements, preserve suitable human supervision, and develop organizational capabilities spanning multiple applications.
We've validated the approach for clinical documentation quality. The identical methodology functions wherever healthcare organizations must systematically assess human-created content against sophisticated standards.
The technology stands ready. The standards exist. The opportunity is clear.