Introducing RadImageNet-VQA: A New Benchmark for Radiologic Visual Understanding
April 22, 2026
Interpreting radiology exams requires combining visual recognition with clinical reasoning. To be useful in this setting, a model must not only recognize what appears in an image, but also connect those visual patterns to medically meaningful findings.
By training on large datasets of image-text pairs, vision-language models (VLMs) are well suited to tackle this challenge. Medical VLMs, such as MedGemma and Lingshu, now incorporate curated medical data to better align with clinical knowledge.
These models are particularly promising for radiology, where generating reports and supporting clinical reasoning require combining visual and textual information. But measuring progress is not straightforward. Radiology report generation is difficult to evaluate reliably, because standard free-text metrics often miss what matters most in practice: clinical correctness, factual consistency, and alignment with the actual findings in the scan.
Radiologic visual question answering offers a useful alternative. Instead of scoring an entire report, it tests whether a model can answer targeted questions about an image. This makes evaluation more structured, more interpretable, and closer to the way radiologists examine imaging findings. Yet, existing medical VQA benchmarks remain limited: many are small, cover only a narrow range of anatomy or disease, focus mainly on X-ray images or biomedical figures, or contain textual shortcuts that let models guess the answer without truly using the image.
To help address this gap, we introduce RadImageNet-VQA, a large-scale dataset for radiologic VQA on CT and MRI. Built from expert-curated annotations, it includes 750,000 images paired with 7.5 million generated samples for training and evaluation. The dataset and benchmark are publicly available on Hugging Face, and we will present this work at MIDL 2026.

Figure 1. Overview of the RadImageNet-VQA dataset
I. RadImageNet-VQA: Dataset and Benchmark
RadImageNet-VQA is a large-scale CT and MRI dataset for training and evaluating VLMs on radiologic VQA. It is built from the CT and MRI portion of RadImageNet, an expert-annotated medical imaging database where each image carries structured labels for modality, anatomical region, and pathology category. Figure 1 gives an overview of the dataset structure.
In total, RadImageNet-VQA includes 750,000 images and 7.5 million generated samples: 750,000 caption pairs for image-text alignment and 6.75 million QA pairs spanning three core tasks:
Anatomy recognition focuses on identifying the body region being imaged.
Example: Which anatomical region is shown in this scan?
Abnormality detection asks whether the image shows a meaningful finding or appears normal.
Example: Is there any abnormality present?
Fine-grained pathology identification tests whether the model can identify the specific disease or lesion visible in the image.
Example: What pathology is shown in this scan?
The dataset covers 8 anatomical regions and 97 pathology categories, with questions in open-ended, closed-ended, and multiple-choice formats. For evaluation, we built a curated benchmark of 9,000 QA pairs from 1,000 images, stratified across tasks and question types.
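To make the structure above concrete, here is a minimal sketch of what a single sample might look like. The field names and values are illustrative assumptions, not the dataset's published schema:

```python
# Hypothetical record shape for one RadImageNet-VQA sample.
# Field names and values are illustrative, not the actual schema.
sample = {
    "image": "ct_abdomen_000123.png",       # CT or MRI slice
    "task": "pathology",                    # anatomy | abnormality | pathology
    "question_type": "multiple_choice",     # open-ended | closed-ended | multiple_choice
    "question": "What pathology is shown in this scan?",
    "options": ["Liver cirrhosis", "Pancreatitis",
                "Splenomegaly", "No pathology seen"],
    "answer": "Liver cirrhosis",
}

def check_sample(s: dict) -> bool:
    """Basic consistency check a loader might apply to each record."""
    if s["question_type"] == "multiple_choice":
        # The ground-truth answer must appear among the listed options.
        return s["answer"] in s["options"]
    return bool(s["answer"])
```

A loader built around records like this can validate every sample before training or evaluation.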
A key design choice was to reduce shortcut answering. For pathology multiple-choice questions, incorrect options are restricted to clinically plausible alternatives from the same anatomical region, so the task cannot be solved by anatomy cues alone. Each pathology question also includes a “no pathology seen” option, so the model must verify that a finding is truly present rather than assuming one exists.
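The option-construction scheme described above can be sketched as follows. The pathology lists and function names here are illustrative placeholders, not the dataset's actual label set or pipeline:

```python
import random

# Illustrative pathology labels per region (not the real 97-category label set).
PATHOLOGIES_BY_REGION = {
    "abdomen": ["liver cirrhosis", "pancreatitis", "splenomegaly", "renal cyst"],
    "knee": ["ACL tear", "meniscal tear", "joint effusion"],
}

def build_options(region, answer, n_distractors=2, seed=0):
    """Build shortcut-resistant multiple-choice options:
    distractors come only from the same anatomical region, and a
    'no pathology seen' option is always included."""
    rng = random.Random(seed)
    # Only clinically plausible alternatives from the same region.
    pool = [p for p in PATHOLOGIES_BY_REGION[region] if p != answer]
    options = [answer] + rng.sample(pool, n_distractors) + ["no pathology seen"]
    rng.shuffle(options)  # randomize position of the correct answer
    return options
```

Because every distractor shares the anatomical region with the answer, a model cannot narrow the choices by recognizing the body part alone.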
II. What the Benchmark Reveals: Key Findings
We evaluated a broad set of state-of-the-art VLMs on RadImageNet-VQA under zero-shot conditions, including: OpenAI’s GPT-5, Google DeepMind’s Gemini 2.5 Pro, Shanghai AI Lab’s InternVL, Alibaba’s Qwen2.5-VL, Google DeepMind’s MedGemma, and Alibaba DAMO Academy’s Lingshu (as shown in Table 1).

Table 1. Zero-shot accuracies (%) of VLMs on RadImageNet-VQA benchmark.
The benchmark reveals a consistent pattern across today’s VLMs: anatomy is relatively easy, pathology is not. Models perform strongly on anatomy recognition, with InternVL3.5-8B reaching 93.3% accuracy on anatomy multiple-choice questions. But performance drops sharply on fine-grained pathology identification, especially in open-ended settings. Most models score below 20% there, and the best result on open-ended pathology comes from MedGemma-4B at 30.6%.
This suggests that current models are much better at recognizing the part of the body being imaged than at identifying the precise disease process visible within it.
Another notable result is the relative strength of general-purpose models. Across the benchmark, InternVL3.5-14B achieves the best overall average accuracy at 63.6%. Medical specialization helps in some settings, but it does not yet consistently outperform broader multimodal pre-training.
On abnormality detection, GPT-5 and Gemini 2.5 Pro score below random guessing. This likely reflects safety alignment that biases these models toward conservative “no abnormality” responses rather than engaging directly with the visual content.
III. Does the Benchmark Require Visual Reasoning?
One of the central questions for any benchmark is whether it measures the capability it claims to measure.
To test this directly, we ran a text-only ablation: models are asked the questions without access to the image. On older medical VQA datasets such as VQA-RAD and SLAKE, models still achieve non-trivial accuracy from the question text alone, showing that those benchmarks contain textual shortcuts. On RadImageNet-VQA, by contrast, text-only accuracy on open-ended questions falls to near-random levels, as shown in Figure 2.

Figure 2. Text-only analysis of multiple VLMs’ accuracy for open-ended questions on RadImageNet-VQA.
The same pattern holds for multiple-choice questions. On MMMU-Med-val, text-only accuracy remains noticeably above random, while on RadImageNet-VQA it collapses to the expected 25% baseline, as shown in Figure 3.

Figure 3. Text-only analysis of multiple VLMs’ accuracy for multiple choice questions on RadImageNet-VQA.
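One simple way to check whether a text-only score is consistent with chance is a normal approximation to the binomial test. The counts below are illustrative, not the paper's exact numbers:

```python
from math import sqrt

def z_vs_chance(correct, total, n_options):
    """z-score of observed accuracy against the 1/n_options chance baseline,
    using the normal approximation to the binomial distribution."""
    p0 = 1.0 / n_options
    p_hat = correct / total
    se = sqrt(p0 * (1 - p0) / total)  # standard error under the null
    return (p_hat - p0) / se

# Example with made-up counts: 270/1000 correct on 4-option questions is
# within ~1.5 standard errors of the 25% baseline (consistent with chance),
# while 400/1000 would sit far above it.
```

A |z| below roughly 2 means the text-only score is statistically indistinguishable from guessing, which is the behavior a shortcut-free benchmark should produce.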
These findings matter because they show that good performance on RadImageNet-VQA requires actually using the image. The benchmark is not just measuring familiarity with medical phrasing or dataset priors; it measures visual grounding.
IV. What Happens After Fine-Tuning
Fine-tuning on RadImageNet-VQA leads to strong and consistent gains across model families. In the paper’s experiments, average performance improves by roughly 19.5 to 22.5 points, depending on the model. Anatomy recognition reaches near-ceiling performance after training, and abnormality detection improves substantially as well.
However, even after fine-tuning, fine-grained pathology identification remains the hardest task. The models become better, but the core challenge does not disappear. That makes RadImageNet-VQA useful not only as a training dataset, but also as a diagnostic tool for understanding where current radiologic VLMs still fall short.
One additional finding is worth noting. In the paper’s ablations, models initialized with the medically pre-trained MedSigLIP vision encoder did not outperform those using standard SigLIP. In this setting, broad visual pre-training appeared to matter more than medical-domain pre-training for downstream CT/MRI VQA performance.
V. Why This Matters for Radiology
A benchmark's value is not just that it ranks models; it is that it clarifies what they can and cannot do.
At Raidium, building foundation models for radiology and evaluating them rigorously are two sides of the same effort. Curia is our foundation model for radiology; RadImageNet-VQA is part of how we measure whether models like it are actually learning what matters: distinguishing specific diseases from visual evidence rather than relying on language patterns. You cannot build better models without better ways to measure them.
The dataset and benchmark are fully open and available to researchers in both academia and industry on Hugging Face. We see this as a shared resource for the community, and we hope it helps move the field toward more precise measurement of what foundation models can actually do in radiology.
Ready to explore RadImageNet-VQA? Here’s where to start:
Download RadImageNet-VQA dataset from HuggingFace:
Access the RadImageNet-VQA paper here:
Read our other blog posts:
Acknowledgements:
Thank you to our partner GENCI
