Research Release

AlphaGrid v0.1

Digital twins for precision oncology. From static scans to predictive tumor models.

Dec 14, 2025
AlphaGrid Research Team
Download PDF

Automated systems for oncologic imaging have made rapid progress in perception and language generation. However, many remain difficult to audit, fragile under numeric drift, or limited to slice-level or report-level reasoning. As a result, even strong models often fail to produce reports that can be reliably compared across patients, sites, or time.

AlphaGrid v0.1 is an end-to-end system for automated lung cancer reporting from thoracic CT, evaluated against a deliberately narrow objective: generate factually grounded, structured oncology reports directly from full 3D CT volumes, and quantify their clinical correctness using standard report-level evaluators.

This post reports quantitative evaluation results, compares AlphaGrid against prior models using a CheXbert-style Micro-F1 analysis, highlights what is structurally different about AlphaGrid, and explains how this structure enables a path toward cancer digital twins.

Evaluation Setup

AlphaGrid v0.1 was evaluated on a held-out internal cohort of 450 de-identified thoracic CT studies, split at the patient level and stratified by site. All studies were processed independently. When prior CT scans were available, they were provided only as comparators; no longitudinal modeling or progression reasoning was performed.

The system was pretrained on open CT datasets for detection and segmentation and finetuned on internal data spanning multiple scanners, reconstruction kernels, and slice thicknesses between 1 and 2.5 mm. Ground truth consisted of structured annotations and audited radiology reports, with a subset reviewed by radiologists to assess material clinical errors.

End-to-End Reporting Performance

Figure 1 summarizes end-to-end reporting performance across progressively stronger system classes. AlphaGrid v0.1 improves consistently across lesion detection sensitivity, primary tumor segmentation accuracy, TNM staging accuracy, and factual consistency between structured outputs and generated reports.

Figure 1. CT lung cancer reporting performance comparison across system classes.

Detection sensitivity reaches 91.8% at one false positive per scan. Primary tumor segmentation achieves a Dice score of 87.4%. Exact TNM match reaches 78.6%. Factual consistency between structured slots and generated text reaches 94.7%, reflecting the elimination of numeric drift.
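"Sensitivity at one false positive per scan" is an operating point on a FROC curve. A minimal pure-Python sketch of how such a point is read off from scored candidate detections (function and argument names are illustrative, and the sketch assumes at most one candidate per true lesion):

```python
def sensitivity_at_fp_rate(scores, is_tp, n_scans, n_lesions, fp_per_scan=1.0):
    """Fraction of true lesions detected when the score threshold is set
    so that false positives per scan stay within `fp_per_scan`. Undetected
    lesions never appear as candidates, so `n_lesions` is supplied separately."""
    ranked = sorted(zip(scores, is_tp), key=lambda c: -c[0])  # highest score first
    tp = fp = best_tp = 0
    budget = fp_per_scan * n_scans
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        if fp <= budget:      # threshold still within the FP budget
            best_tp = tp
    return best_tp / n_lesions

# Toy example: 4 scored candidates on 1 scan containing 2 true lesions.
sens = sensitivity_at_fp_rate(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_tp=[True, False, True, False],
    n_scans=1, n_lesions=2,
)
```

Lowering the allowed false-positive rate moves the threshold up and trades sensitivity for precision; the 91.8% figure above corresponds to the 1 FP/scan budget.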

Factuality by Construction

AlphaGrid generates narrative reports from structured findings under explicit constraints. Numeric values, anatomical labels, and staging outputs are copied directly from normalized slots rather than inferred during text generation.
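As a sketch of what slot-constrained rendering can look like (the slot schema, template, and names here are assumptions for illustration, not AlphaGrid's actual interface):

```python
import re
from dataclasses import dataclass

# Illustrative slot schema; field names are assumptions, not AlphaGrid's API.
@dataclass
class LesionSlots:
    location: str        # normalized anatomical label
    long_axis_mm: float  # measured by the vision stage, never generated
    t_stage: str         # derived downstream, copied verbatim

TEMPLATE = ("There is a mass in the {location} measuring {long_axis_mm:.1f} mm "
            "in long axis, compatible with {t_stage} disease.")

def render(slots: LesionSlots) -> str:
    """Render narrative text by copying values from normalized slots.
    A final audit rejects any decimal measurement not traceable to a slot."""
    text = TEMPLATE.format(**vars(slots))
    # Check decimal measurements only; staging codes like "cT1c" contain
    # digits but are copied as opaque strings, not generated numbers.
    numbers = {float(n) for n in re.findall(r"\d+\.\d+", text)}
    if not numbers <= {round(slots.long_axis_mm, 1)}:
        raise ValueError("numeric drift detected")
    return text
```

Because the language model never free-generates a measurement, a mismatch between structured output and narrative text becomes a hard failure rather than a silent error.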

Figure 2 isolates the effect of this design decision. Removing constrained decoding reduces factual consistency from 94.7% to 81.2%, despite identical vision outputs. Constrained decoding thus contributes 13.5 percentage points of factual consistency and reduces variance across audited reports.

Figure 2. Effect of constrained decoding on factual consistency.

This demonstrates that factual correctness is enforced at the system level rather than emerging from the language model.

Staging as a Derived Outcome

In AlphaGrid, TNM staging is not predicted directly. It is derived deterministically from detected entities and measured attributes.
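A minimal illustration of deterministic derivation, using only the size-based portion of the AJCC 8th-edition T descriptor for lung tumors (the full rules also consider invasion, satellite nodules, and other factors, and this function is a sketch, not AlphaGrid's staging module):

```python
def t_stage_from_size(long_axis_cm: float) -> str:
    """Size-only T descriptor for lung tumors, AJCC 8th-edition cutoffs.
    Deterministic: the same measured size always yields the same stage."""
    if long_axis_cm <= 1.0:
        return "T1a"
    if long_axis_cm <= 2.0:
        return "T1b"
    if long_axis_cm <= 3.0:
        return "T1c"
    if long_axis_cm <= 4.0:
        return "T2a"
    if long_axis_cm <= 5.0:
        return "T2b"
    if long_axis_cm <= 7.0:
        return "T3"
    return "T4"
```

Because the mapping from measurement to stage is an explicit rule rather than a learned prediction, a guideline update means editing a lookup, not retraining a model.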

Figure 3 breaks down TNM accuracy by component. T-stage accuracy reaches 85.9%, N-stage accuracy reaches 81.1%, and M-stage accuracy reaches 92.0%. Exact TNM accuracy is lower, reflecting the compounding of component-level uncertainty rather than failures in any single stage.
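A back-of-the-envelope check of the compounding effect: if component errors were independent, exact-match accuracy would be the product of the component accuracies.

```python
t_acc, n_acc, m_acc = 0.859, 0.811, 0.920   # component accuracies from Figure 3

# Expected exact-match accuracy under an independence assumption:
independent_exact = t_acc * n_acc * m_acc    # about 0.641

# The observed exact-match accuracy (0.786) is well above this baseline,
# suggesting component errors tend to co-occur on the same difficult
# studies rather than striking different studies at random.
```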

Figure 3. TNM staging accuracy by component and exact match.

This separation between perceptual extraction and clinical derivation ensures that staging remains auditable and adaptable to guideline updates.

Model-Level Clinical Agreement

Following the evaluation protocol used in Mecha-Net v0.1, we report model-level Micro-F1 scores computed with a standardized clinical report labeler.

Specifically, we evaluate generated reports using RadGraph Micro-F1, which measures agreement at the level of clinical entities and relations and is commonly used to assess factual correctness in radiology reports. This metric plays a role analogous to CheXbert Micro-F1 for chest X-ray reporting.
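For intuition, micro-averaged F1 pools entity and relation counts across all reports before computing precision and recall, rather than averaging per-report scores. A toy sketch (the triple schema below is illustrative, not RadGraph's exact annotation format):

```python
def micro_f1(pairs):
    """Micro-averaged F1 over (predicted, gold) sets of clinical triples.
    True/false-positive and false-negative counts are pooled across all
    reports before precision and recall are computed."""
    tp = fp = fn = 0
    for pred, gold in pairs:
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Two toy reports: triples extracted from generated vs. reference text.
report_a = ({("mass", "located_at", "right upper lobe"),
             ("mass", "modify", "spiculated")},        # predicted
            {("mass", "located_at", "right upper lobe"),
             ("mass", "modify", "cavitating")})        # gold
report_b = ({("nodule", "located_at", "left lower lobe")},
            {("nodule", "located_at", "left lower lobe")})
```

Pooling means a long, triple-dense report influences the score more than a short one, which is the intended behavior when comparing report generators at the corpus level.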

Figure 4 reports RadGraph Micro-F1 scores for prior vision-language and radiology-specific models, using their reported configurations. AlphaGrid v0.1 achieves the highest Micro-F1 on both entities and relations.

Figure 4. Comparison of RadGraph Micro-F1 scores across models.

This comparison is intentionally conservative. Most prior models operate on 2D images, selected slices, or report-level inputs, whereas AlphaGrid operates directly on full 3D CT volumes. Despite this difference, all models are evaluated at the report level using the same clinical graph metric.

Operational Characteristics

Figure 5 reports end-to-end system latency as a function of CT volume size. Latency scales approximately linearly with slice count. For typical thoracic CT studies, median processing time is 14.3 seconds on a single datacenter GPU, including preprocessing, inference, and report generation.

Figure 5. End-to-end processing latency as a function of CT volume size.

What Is Structurally Different About AlphaGrid

Most models in the Micro-F1 comparison operate on 2D images or report-level abstractions. AlphaGrid differs in that it ingests full 3D CT volumes, performs volumetric detection and segmentation, and derives all downstream clinical facts from these measurements before any text is generated.

The primary output of AlphaGrid is not a paragraph, but a time-stamped, patient-scoped structured state. The report is a rendering of this state, not the state itself.

This distinction determines what the system can support next.

From Reporting to Cancer Digital Twins

A cancer digital twin requires strictly more than reporting: persistent entity identity, longitudinal state, and uncertainty propagation.

AlphaGrid v0.1 does not perform longitudinal reasoning, outcome forecasting, or treatment simulation. These capabilities are explicitly out of scope for this release.

However, because each CT study is converted into a normalized, auditable patient state, these states can be accumulated across time without reinterpreting prior scans. This makes longitudinal modeling a structural extension, not a redesign.
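One way to picture this accumulation, with an illustrative schema (class and field names are assumptions, not AlphaGrid's actual data model):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class StudyState:
    """One normalized, auditable patient state derived from a single CT study."""
    study_date: str   # ISO date of the study
    lesions: tuple    # lesion records measured from this study alone

@dataclass
class PatientTimeline:
    patient_id: str
    states: list = field(default_factory=list)

    def append(self, state: StudyState) -> None:
        # Prior states are immutable (frozen): a new study extends the
        # timeline without reinterpreting or rewriting earlier scans.
        self.states.append(state)
        self.states.sort(key=lambda s: s.study_date)
```

Longitudinal modeling then becomes a function over an ordered list of states, rather than a re-analysis of raw pixel data from prior studies.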

In this sense, reporting is not a precursor to digital twins. It is the substrate.

Conclusion

AlphaGrid v0.1 demonstrates that automated lung cancer reporting from CT can be accurate, auditable, and comparable across models when factual constraints are enforced by design and evaluation is grounded in clinical graph metrics.

The results presented here reflect what the system does today. The structure of the system determines what it can become next.