Author: Luca Candelori
In our previous work with BlackRock (article, paper), we demonstrated that QCML excels in environments where datasets are sparse and imbalanced. This advantage proves particularly valuable in healthcare analytics, where patient data varies widely because of individual differences in medical profiles, inconsistent documentation, and the inherent sparsity of medical records. Medical data presents unique challenges.
These factors make identifying patient similarity particularly challenging, precisely the kind of environment where QCML demonstrates its strengths.
Imbalanced datasets require fundamentally different analytical approaches than balanced ones. Consider our UTI prediction case with data from only 1,436 patients: when 95% of urinalysis results are negative, a model that simply predicts "negative" for every case would achieve 95% accuracy… while providing absolutely no clinical value.
Balanced accuracy provides a more representative performance metric by giving equal weight to positive and negative classes regardless of their proportions in the dataset:
Formula: Balanced Accuracy = (Sensitivity + Specificity)/2
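Where sensitivity is the true positive rate, TP / (TP + FN), and specificity is the true negative rate, TN / (TN + FP).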
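To illustrate this difference, our example "always negative" model would score 95% on standard accuracy but only 50% on balanced accuracy, because its sensitivity is 0 and its specificity is 1.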
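The arithmetic for the "always negative" model can be checked with a short sketch. This is an illustration only: the cohort of 1,000 results with 50 positives is a hypothetical stand-in for a 95% negative dataset, not the data used in this study, and the function names are ours.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical cohort: 1,000 urinalysis results, 5% positive.
y_true = [1] * 50 + [0] * 950
# The "always negative" model predicts 0 for everyone.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Standard accuracy rewards the model for the 950 easy negatives, while balanced accuracy exposes that it never finds a single positive case.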
ROC AUC (Receiver Operating Characteristic Area Under Curve) evaluates model performance across all possible threshold values by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity).
The Area Under Curve provides a comprehensive evaluation metric ranging from 0 to 1, where 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.
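ROC AUC is particularly valuable for imbalanced datasets because the true positive rate and false positive rate are each computed within their own class, so the metric does not depend on class proportions, and because it summarizes how well the model ranks patients without committing to a single decision threshold.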
In our UTI example, while a model always predicting "negative" would achieve high standard accuracy (95%), its ROC AUC would correctly reveal its lack of discriminative power (0.5).
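The 0.5 figure can be reproduced with a minimal sketch that computes ROC AUC in its rank-based form: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores below are hypothetical and the function name is ours; this is not the evaluation code used in the study.

```python
def roc_auc(y_true, scores):
    """ROC AUC as P(score of random positive > score of random negative).

    Ties contribute 0.5, which matches the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: 100 patients, 5% positive.
y_true = [1] * 5 + [0] * 95

# An "always negative" model gives every patient the same score.
constant_scores = [0.0] * 100
print(roc_auc(y_true, constant_scores))  # 0.5

# A hypothetical model that ranks most positives above the negatives.
informative_scores = [0.9, 0.8, 0.7, 0.6, 0.2] + [0.1] * 90 + [0.3] * 5
print(roc_auc(y_true, informative_scores))
```

Because a constant score ties every positive with every negative, the model earns exactly half credit on each pair, regardless of how imbalanced the classes are.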
When we use the term "feature", we're referring to interpretable properties of input that systems respond to. In medical contexts, features might include patient symptoms, lab values, or demographic information.
Collecting these features from patient data is challenging: the data is inconsistently reported as well as sparse, which makes finding patient similarity difficult. QCML handles this kind of dataset well.
For our investigation of patient similarity in urinalysis data, we utilized the Urinalysis Test Results dataset available on Kaggle.
We began by creating a classification baseline using standard QCML models. The performance metrics demonstrated promising results:
| Model | Balanced Accuracy | ROC AUC |
|---|---|---|
| QCML | 0.7902 | 0.765 |
| QCML Weighted | 0.7921 | 0.7668 |
| QCML Ensemble | 0.7995 | 0.8033 |
Fig 1: ROC curves comparison showing baseline QCML model performance
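Building on the baseline results, we implemented a K-Nearest Neighbors (KNN) classifier using a QCML-generated similarity matrix. The KNN algorithm works by finding the k patients most similar to a new patient and assigning the majority label among those neighbors.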
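The neighbor-voting idea can be sketched on a precomputed matrix. Everything here is an assumption for illustration: the six-patient distance matrix is made up, we assume QCML similarities have already been converted to distances (smaller means more similar), and `knn_predict` is a hypothetical helper rather than part of any QCML API.

```python
from collections import Counter


def knn_predict(distances, labels, query_idx, k=3):
    """Predict a label by majority vote among the k nearest other patients.

    distances[i][j] is a precomputed distance between patients i and j.
    The query patient is excluded from its own neighbor list (leave-one-out).
    """
    candidates = [j for j in range(len(labels)) if j != query_idx]
    neighbors = sorted(candidates, key=lambda j: distances[query_idx][j])[:k]
    votes = Counter(labels[j] for j in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical symmetric distance matrix for six patients.
distances = [
    [0.00, 0.10, 0.20, 0.90, 0.80, 0.70],
    [0.10, 0.00, 0.15, 0.85, 0.90, 0.80],
    [0.20, 0.15, 0.00, 0.80, 0.75, 0.90],
    [0.90, 0.85, 0.80, 0.00, 0.10, 0.20],
    [0.80, 0.90, 0.75, 0.10, 0.00, 0.15],
    [0.70, 0.80, 0.90, 0.20, 0.15, 0.00],
]
# 1 = positive UTI result, 0 = negative (hypothetical labels).
labels = [1, 1, 1, 0, 0, 0]

# Leave-one-out predictions: each patient is classified by its neighbors.
predictions = [knn_predict(distances, labels, i, k=3) for i in range(len(labels))]
print(predictions)  # [1, 1, 1, 0, 0, 0]
```

Working from a precomputed matrix keeps the classifier independent of how similarity is measured, so any similarity model can supply the matrix while the voting step stays the same.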
The visualization of patient distances in our QCML model reveals clear clustering patterns. In the scatter plot, we can observe how positive UTI cases (red points) form distinct clusters, particularly along the top and bottom edges of the distribution, while negative cases (blue points) show more diffuse patterns throughout the feature space. This visual representation confirms that our QCML approach successfully captures meaningful patterns in patient similarity that correlate with UTI outcomes.
Fig 2: QCML proximity visualization showing patient clustering patterns
When comparing our KNN QCML approach to the baseline models, we observed significant improvements in predictive performance. Fig 3 demonstrates that:
Fig 3: KNN QCML performance aggregated across 32 seeds
Our results demonstrate that QCML is a powerful tool for medical prediction tasks involving sparse, high-dimensional data. Its performance on urinalysis-based UTI prediction highlights its potential for broader clinical applications where data sparsity and class imbalance are persistent challenges. The intuitive clustering of similar patients observed in our visualization has practical significance, as it enables clinicians to understand which patient characteristics drive predictions, enhancing trust and interpretability in AI-assisted medical decision-making.