Introducing Jolia: Our Vision-Language Foundation Model for CT - Raidium Blog

At Raidium, building toward AGI in radiology requires models that do more than see. They need to understand — to read a scan the way a radiologist does, connecting visual structure to clinical language. Today, we are releasing Jolia, our vision-language foundation model for chest and abdominal CT, trained on paired scans and radiology reports.

Alongside this release, we are also introducing the Raidium model family: a named framework that describes how each layer of our technology — from foundation to application to clinical workflow — fits together.

Jolia Model Evaluation at a Glance

Jolia sets a new standard for vision-language AI in radiology. By combining global and concept-level image-text alignment, Jolia outperforms leading CT foundation models across findings classification, cross-center transfer, and report generation.

Jolia introduces ConQuer, a new pre-training method that augments global CLIP alignment with localized, per-concept alignments, one per anatomical region, learned end-to-end without any segmentation supervision
Jolia sets a new state-of-the-art on findings classification, outperforming all public baselines across chest and abdominal CT on both in-distribution and out-of-distribution benchmarks, including models trained on significantly larger private datasets
Jolia achieves the highest cross-center transfer performance, demonstrating robust generalization to unseen clinical institutions
Jolia leads on radiology report generation for abdominal CT, with a +34% relative gain on RadGraph-F1 over the previous best model and the highest clinical fidelity score across all evaluated systems

The Raidium Model Family

Raidium models are organized across three tiers.

At the foundation sit RadSAM, Curia, and Jolia. RadSAM is our original segmentation foundation model, designed to generalize across anatomy segmentation tasks. Curia is our vision-only foundation model family, trained on over 200 million CT and MRI slices to build deep structural understanding across modalities. Jolia is our vision-language family, trained on paired scans and radiology reports to align imaging and clinical text.

Above the foundation sit our application models: RaidiumSeg for segmentation, RaidiumDetect for lesion detection, and RaidiumProp for longitudinal propagation across timepoints. RaidiumFlow orchestrates these into coherent workflows.

At the top sit our clinical workflows. OncoPilot brings detection, segmentation, longitudinal tracking, and structured reporting into a single experience for oncology radiologists.

One platform. One foundation. Each layer built on the one below.

I. Why Vision-Language Alignment Matters

Curia and Curia-2 demonstrated the power of scale: pre-training on vast, unlabeled CT and MRI slices produces a model that generalizes across anatomy, modality, and disease in ways task-specific architectures cannot. But structural understanding alone has limits. The clinical value of a finding depends on being able to name it, describe it, and communicate it — tasks that require connecting imaging to language.

Jolia is trained to make that connection. Pre-trained on chest and abdominal CT paired with radiology reports, Jolia learns to map what it sees in the volume to the clinical concepts radiologists use to describe it. This enables classification of findings across 171 abnormalities, structured report generation, and representations that transfer reliably across centers — all from a single foundation model.

II. The Challenge: Structure Lost in Translation

The standard approach to vision-language pre-training, CLIP-style alignment, compresses the entire image into one global vector and the entire report into another, then learns to match them. This works well for natural images with short captions. It is a poor fit for radiology.
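To make the global objective concrete, here is a minimal NumPy sketch of a symmetric CLIP-style contrastive loss over a batch of (scan, report) embedding pairs. It is an illustration under stated assumptions, not Jolia's training code: the function names, the temperature of 0.07, and the toy embedding sizes are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is the same patient.

    The temperature value is an assumption chosen for illustration.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])            # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: correctly paired embeddings should score a lower loss than
# embeddings whose pairing has been scrambled.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 16))
matched = clip_loss(image_emb, image_emb)
shuffled = clip_loss(image_emb, image_emb[::-1])
```

Each scan and each report contributes exactly one vector to the similarity matrix, which is the bottleneck this section describes.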

A CT scan of the chest and abdomen spans dozens of organs. A radiological report describes findings organ by organ — liver, lungs, kidneys, lymph nodes — in structured sections that are much longer and richer than a typical caption. Encoding all of this into one global token inevitably loses detail. An observation about the liver can be overwhelmed by the rest of the report. A focal lesion in a single lobe can be averaged out.

Recent approaches have tried to recover this structure using segmentation masks, explicitly cropping organs before alignment. This works, but at a cost: it limits coverage to the organs a segmentation tool supports, and it adds a dependency on expensive spatial supervision.

III. ConQuer: Concept-Level Alignment Without Spatial Supervision

Jolia is trained using ConQuer (Concept Queries), a new image-text pre-training method we developed to address this gap.

ConQuer augments standard global CLIP alignment with a parallel set of localized alignments — one per anatomical concept. On the text side, we use an LLM to split each radiology report into concept-specific sections, grouping findings by organ. On the image side, we introduce a small set of learnable cross-attention queries — one per concept — that pool concept-specific features directly from the image encoder, with no segmentation mask or spatial annotation required.
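The image-side pooling described above can be sketched as single-head cross-attention from learnable queries to the encoder's output tokens. This is a simplified illustration, not the ConQuer implementation: it omits key/value projections and multiple heads, and the token count, embedding width, and initialization scale are assumptions. Only the figure of 102 concepts comes from this post.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def concept_pool(queries, tokens):
    """Pool one feature vector per concept from image encoder tokens.

    queries: (C, D) learnable parameters, one row per anatomical concept.
    tokens:  (N, D) encoder output tokens for one CT volume.
    Returns the pooled features (C, D) and the attention maps (C, N).
    """
    scale = np.sqrt(queries.shape[-1])
    attn = softmax(queries @ tokens.T / scale, axis=-1)  # where each query looks
    pooled = attn @ tokens                               # attention-weighted average
    return pooled, attn

rng = np.random.default_rng(0)
num_concepts, num_tokens, dim = 102, 512, 64   # 512 and 64 are hypothetical sizes
queries = rng.normal(size=(num_concepts, dim)) * 0.02
tokens = rng.normal(size=(num_tokens, dim))
pooled, attn = concept_pool(queries, tokens)
```

No mask enters the computation: the only signal that shapes `queries` would be the loss applied to `pooled`, and each row of `attn` can be reshaped to the token grid to inspect where a concept attends.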

Each query learns where to attend purely through the per-concept contrastive loss: for a given concept (e.g., liver), the image feature pooled by that concept's query is trained to match the text describing that concept for the same patient, and to differ from the text for the other patients in the batch. As a byproduct, these queries produce attention maps that are anatomically coherent and interpretable: the model learns to look at the liver when reading about the liver, without ever being told where the liver is.
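A per-concept contrastive loss of this kind might look like the sketch below, which applies a symmetric InfoNCE loss separately to each concept and averages the results. Several details are assumptions rather than facts from this post: the `present` mask (skipping patients whose report has no text for a concept), the temperature, the plain averaging over concepts, and all function names.

```python
import numpy as np

def info_nce(img, txt, temperature=0.07):
    # Symmetric contrastive loss for one concept; row i of img and txt
    # belongs to the same patient.
    img = img / np.linalg.norm(img, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature

    def nll(scores):
        # Mean negative log-likelihood of the diagonal (matching) pairs.
        scores = scores - scores.max(axis=1, keepdims=True)
        logp = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (nll(logits) + nll(logits.T))

def per_concept_loss(img_tokens, txt_tokens, present):
    """img_tokens, txt_tokens: (B, C, D) per-concept features.

    present: (B, C) boolean mask, True where the report has text for the
    concept. Handling missing concepts this way is an assumption.
    """
    losses = []
    for c in range(img_tokens.shape[1]):
        idx = np.flatnonzero(present[:, c])
        if idx.size < 2:          # a contrast needs at least two patients
            continue
        losses.append(info_nce(img_tokens[idx, c], txt_tokens[idx, c]))
    return float(np.mean(losses)) if losses else 0.0

# Toy check with 4 patients, 3 concepts, 8-dimensional features.
rng = np.random.default_rng(0)
img_tokens = rng.normal(size=(4, 3, 8))
present = np.ones((4, 3), dtype=bool)
aligned = per_concept_loss(img_tokens, img_tokens, present)
misaligned = per_concept_loss(img_tokens, img_tokens[::-1], present)
```

In a full objective this term would be added to the global CLIP loss; the relative weighting of the two is not stated here.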

Jolia uses 102 anatomical concepts spanning chest and abdominal CT, and is pre-trained on 74,434 public CT–report pairs.

IV. Results: State of the Art Across the Board

Findings Classification

Jolia sets a new state-of-the-art on findings classification across both chest and abdominal CT, evaluated on four datasets covering 252 abnormalities in- and out-of-distribution.

Against our in-house CLIP baseline (same encoders, global alignment only), Jolia gains +1.7 AUROC on average, with the largest gains on out-of-distribution test sets. Against the strongest public baselines, Jolia outperforms Pillar-0 by +2.2 AUROC on CT-RATE and beats SPECTRE by +2.0 AUROC on external abdominal CT — while also surpassing segmentation-based methods that rely on organ-level masks, by up to +9.4 AUROC on chest.
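For readers less familiar with the metric, the sketch below shows one common way to compute an average AUROC over many abnormalities: score each label independently, then take the unweighted mean. Whether the benchmarks above average in exactly this way is an assumption, and the labels and scores are made-up toy values, not results.

```python
import numpy as np

def auroc(labels, scores):
    """Probability that a random positive case outscores a random negative one."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        return float("nan")       # undefined without both classes
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()   # ties count as half a win
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def macro_auroc(label_matrix, score_matrix):
    # One AUROC per abnormality (column), then an unweighted mean that
    # ignores columns where the metric is undefined.
    per_label = [auroc(label_matrix[:, j], score_matrix[:, j])
                 for j in range(label_matrix.shape[1])]
    return float(np.nanmean(per_label))

# Toy data: 4 scans, 2 abnormalities.
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.5, 0.1]])
result = macro_auroc(y_true, y_score)   # (0.75 + 1.0) / 2 = 0.875
```

On this scale, a gain of "+2.2 AUROC" as quoted above corresponds to 0.022 when AUROC is expressed between 0 and 1.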

Crucially, per-concept tokens add consistent value beyond the global [CLS] representation, and the combination of both is best on every evaluated configuration.

Cross-Center Transfer

Medical AI models are routinely tested at the center they were trained on — and routinely fail when deployed elsewhere. Jolia is designed to transfer. Trained on public chest and abdominal CT datasets, it achieves the highest cross-center transfer performance on both external evaluation sets: 77.05% average AUROC on out-of-distribution data, outperforming all baselines including models trained on significantly larger private corpora.

The ConQuer loss adds +2.7 AUROC on average over the global CLIP baseline in the transfer setting, with the largest gains concentrated on the hardest, out-of-distribution splits.

Radiology Report Generation

We fine-tuned Jolia's encoder with a Qwen3.5-9B language model to generate the findings section of abdominal CT reports from the volume alone. Jolia achieves state-of-the-art performance on the Merlin-Abd-CT benchmark, leading on four of six clinical metrics including a +34% relative gain on RadGraph-F1 over Merlin, and the highest GREEN score across all evaluated models.

V. Looking Ahead

Jolia is the first member of a family of vision-language models. The next steps are clear: scaling to larger and more diverse data, extending to new modalities and anatomical regions, and continuing to improve the ConQuer methodology — including jointly training the text encoder and scaling the concept taxonomy beyond anatomy.

Within the Raidium platform, the structural backbone provided by Curia already powers our application and workflow layers. Jolia opens the next chapter: by aligning imaging and clinical language, it creates the foundation for a findings application model that translates what the model sees in a scan into structured, clinically precise observations.

Ready to explore Jolia?

Read our preprint on arXiv
Download our model on Hugging Face

Acknowledgements

Thank you to our partners CIN, IDRIS and GENCI.