Title: A multimodal deep learning and machine learning framework for predicting breast cancer recurrence using imaging, clinical, and genomic data
Abstract:
Breast cancer is the most commonly diagnosed cancer among women worldwide, with over two million new cases annually, and remains a leading cause of cancer-related mortality. Despite improvements in early detection and therapy, 20–30% of early-stage patients experience recurrence, rising to ~40% in aggressive subtypes such as triple-negative breast cancer (TNBC). Early identification of high-risk patients is critical for guiding personalized treatment and improving outcomes. We developed a multimodal framework that integrates magnetic resonance imaging (MRI)-based predictions, radiomic features, clinical variables, and genomic data to improve breast cancer recurrence prediction. Data were obtained from the Duke Breast Cancer MRI dataset (The Cancer Imaging Archive, TCIA), comprising 922 patients and 201 MRI volumes after preprocessing. MRI scans were resized to 32 × 64 × 64 voxels and normalized for convolutional neural network (CNN) input. Clinical variables included age, tumor size, hormone receptor status, and treatment response, while genomic features comprised human epidermal growth factor receptor 2 (HER2), breast cancer gene 1 (BRCA1), estrogen receptor alpha (ESR1), and progesterone receptor (PGR). Missing values were imputed, key genomic features were selected, and all features were normalized. The dataset was split into training (70%), validation (15%), and test (15%) sets. A CNN trained solely on MRI data showed limited predictive performance (area under the curve, AUC = 0.6). In contrast, the multimodal approach combining MRI, clinical, and genomic data substantially improved performance (accuracy ~72%, F1 score 0.77). Key predictive features included tumor volume, signal enhancement ratio (SER)-based heterogeneity, and wash-in enhancement dynamics, reflecting tumor characteristics associated with aggressiveness. Predictive performance was constrained by class imbalance, highlighting the need for data augmentation and cohort expansion in future work.
These results demonstrate that integrating imaging, clinical, and genomic data can predict breast cancer recurrence more accurately than single-modality approaches, supporting early identification of high-risk patients and advancing the application of artificial intelligence in precision oncology.
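The preprocessing steps summarized above (resizing each MRI volume to 32 × 64 × 64 voxels, normalizing intensities, and splitting the data 70/15/15) can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the function names are hypothetical, and the nearest-neighbour resampling, min-max scaling, and per-class (stratified) split are assumptions, since the abstract does not specify the interpolation method, the normalization scheme, or how the split was drawn. The input volume and labels are synthetic.

```python
import numpy as np


def resize_volume(volume, target_shape=(32, 64, 64)):
    """Resample a 3D volume to target_shape by nearest-neighbour sampling.

    Assumption: the interpolation used in the study is not stated.
    """
    idx = [
        np.minimum((np.arange(t) * s / t).astype(int), s - 1)
        for t, s in zip(target_shape, volume.shape)
    ]
    return volume[np.ix_(*idx)]


def normalize_volume(volume, eps=1e-8):
    """Scale intensities to roughly [0, 1] (min-max scaling is an assumption)."""
    volume = volume.astype(np.float32)
    lo, hi = volume.min(), volume.max()
    return (volume - lo) / (hi - lo + eps)


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return train/val/test index arrays, splitting each class separately.

    Assumption: stratification is used here to keep class ratios similar
    across subsets; the abstract only gives the 70/15/15 proportions.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = ([], [], [])
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(fractions[0] * len(cls_idx)))
        n_val = int(round(fractions[1] * len(cls_idx)))
        parts[0].extend(cls_idx[:n_train])
        parts[1].extend(cls_idx[n_train:n_train + n_val])
        parts[2].extend(cls_idx[n_train + n_val:])  # remainder goes to test
    return tuple(np.array(sorted(p)) for p in parts)


if __name__ == "__main__":
    # Synthetic stand-ins for one MRI volume and for binary recurrence labels.
    rng = np.random.default_rng(42)
    raw = rng.normal(size=(40, 128, 100))
    cnn_input = normalize_volume(resize_volume(raw))
    print(cnn_input.shape, cnn_input.dtype)

    labels = np.array([0] * 160 + [1] * 40)  # imbalanced, synthetic
    train_idx, val_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting per class is one simple way to keep the minority (recurrence) class represented in the validation and test sets, which matters when class imbalance is a limiting factor.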

