MedQA: Fine-Tuning a Clinical AI on AMD ROCm — No CUDA Required
A complete walkthrough of LoRA fine-tuning Qwen3-1.7B on MedMCQA using AMD MI300X, built for the AMD Developer Hackathon on lablab.ai.

Medical question answering is one of those tasks where the stakes are genuinely high. A model that confidently picks the wrong answer on a clinical MCQ isn't just wrong — it's dangerous. At the same time, most open-source medical AI work assumes you have an NVIDIA GPU.
CUDA is the default. Everything else is an afterthought.

MedQA is a LoRA fine-tuned clinical question-answering model built entirely on AMD hardware using ROCm.
It takes a multiple-choice medical question and returns both the correct answer letter and a clinical explanation of the reasoning. The entire training pipeline — from data loading to adapter export — runs on an AMD Instinct MI300X without a single CUDA dependency.

The AMD Instinct MI300X is a remarkable piece of hardware: 192 GB of HBM3 memory in a single device.
For LLM fine-tuning, VRAM is often the binding constraint — it dictates batch size, sequence length, and whether you need to quantize at all. With 192 GB available, we trained Qwen3-1.7B with LoRA in full fp16 without any 4-bit or 8-bit quantization hacks. More importantly, the goal was to prove that the HuggingFace ecosystem — Transformers, PEFT, TRL, Accelerate — works seamlessly on ROCm.
It does. The same training code that runs on CUDA runs on ROCm with three environment variables set. That's it. No code changes. No custom kernels. No CUDA compatibility shims.
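
A minimal sketch of what that setup can look like. The specific variables below are common ROCm choices and are assumptions, not necessarily the project's exact three; the key point is that ROCm builds of PyTorch expose the GPU through the familiar "cuda" device name, so the training script itself is untouched:

```python
# Illustrative only: these are typical environment variables on a ROCm box,
# not a reproduction of the project's exact configuration.
import os

os.environ.setdefault("HIP_VISIBLE_DEVICES", "0")          # ROCm's analogue of CUDA_VISIBLE_DEVICES
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")   # silence HuggingFace tokenizers fork warnings
os.environ.setdefault("HF_HOME", "/workspace/hf_cache")    # hypothetical cache location on the training node

import torch

# ROCm builds of PyTorch report the GPU through the usual CUDA-style API.
print(torch.cuda.is_available())        # True on a working ROCm install
print(torch.cuda.get_device_name(0))    # e.g. "AMD Instinct MI300X"
print(torch.version.hip)                # HIP version string; None on CUDA builds
```
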
MedMCQA is a large-scale multiple-choice question dataset derived from Indian medical entrance exams (AIIMS, USMLE-style).
Each example contains a question, four answer options, the index of the correct option, and an expert-written explanation. For this project we used 2,000 training samples — a deliberately small slice to demonstrate that meaningful fine-tuning is achievable quickly. Training took approximately 5 minutes on the MI300X.
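
Loading and slicing the data is only a few lines with the datasets library. A minimal sketch, assuming the openlifescienceai/medmcqa Hub id and a seeded shuffle for the 2,000-sample slice:

```python
from datasets import load_dataset

# Assumption: the "openlifescienceai/medmcqa" Hub id; the seed is illustrative.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")

# A deliberately small, reproducible slice of 2,000 examples.
train_ds = medmcqa.shuffle(seed=42).select(range(2000))

sample = train_ds[0]
# Relevant fields: question, opa/opb/opc/opd (the four options),
# cop (index of the correct option) and exp (the expert explanation).
print(sample["question"], sample["cop"], sample["exp"])
```
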
The base model is Qwen/Qwen3-1.7B — Alibaba's latest small-scale language model. At 1.7 billion parameters it's compact enough to fine-tune cheaply but capable enough to produce coherent clinical reasoning, and it loads cleanly with HuggingFace Transformers when passed trust_remote_code=True.
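
Loading it on the MI300X looks the same as on any other Transformers backend. A minimal sketch; the fp16 dtype and device_map are assumptions consistent with the full-precision, single-GPU setup described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-1.7B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # full fp16; no 4-bit/8-bit quantization needed with 192 GB of HBM3
    device_map="auto",           # the single MI300X shows up as a "cuda" device under ROCm
    trust_remote_code=True,
)
```
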
Consistency in prompt formatting is critical for instruction fine-tuning. Every training example and every inference call uses the same template. During training the model sees the full sequence, including the answer and explanation. During inference we provide everything up to ### Answer: and let the model complete from there.
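
A hypothetical reconstruction of that template is sketched below. Only the ### Answer: marker comes from the description above; the surrounding section headers are assumptions, and the field names are taken from MedMCQA:

```python
# Hypothetical template: only the "### Answer:" marker is taken from the write-up.
PROMPT = """### Question:
{question}

### Options:
A. {opa}
B. {opb}
C. {opc}
D. {opd}

### Answer:"""

def training_text(ex):
    # Training: the full sequence, answer letter plus explanation.
    letter = "ABCD"[ex["cop"]]  # assumes cop is a 0-based index of the correct option
    return PROMPT.format(**ex) + f" {letter}\n\n### Explanation:\n{ex['exp']}"

def inference_prompt(ex):
    # Inference: everything up to "### Answer:"; the model completes from there.
    return PROMPT.format(**ex)
```
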
Rather than fine-tuning all 1.7 billion parameters, we use LoRA (Low-Rank Adaptation) via the PEFT library. LoRA injects small trainable rank-decomposition matrices into the attention layers, leaving the base weights frozen. Only ~2.2 million of the model's 1.7 billion parameters are trained. This keeps memory usage low and training fast.
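
With PEFT the adapter setup is only a few lines. A sketch, assuming it wraps the model loaded earlier; the rank and target modules shown are illustrative choices rather than the project's exact values:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                        # rank of the decomposition matrices (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # the attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameters; only the adapters train
```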