Overview
MSc Artificial Intelligence mid-module assessment implementing binary classification to predict heart disease risk from the UCI Heart Disease dataset. The project compares three machine learning approaches: Random Forest (ensemble learning), Support Vector Machines (kernel methods), and Neural Networks (deep learning). Each model family was tested in three configurations to explore the effect of its hyperparameters.
The Problem
Develop a reliable predictive model for heart disease risk assessment using patient health metrics. The challenge involves handling a relatively small dataset (303 samples), identifying the most informative of 13 clinical variables, and choosing the best-performing algorithm and hyperparameters for this medical classification task.
The Approach
Implemented a complete ML pipeline, starting with exploratory data analysis: correlation heatmaps, feature importance ranking, and distribution analysis. Preprocessing standardised the features with StandardScaler and used stratified train-test splits to preserve the class balance. Trained 9 model configurations: Random Forest (default, depth-limited, sample-constrained), SVM (RBF, linear, tuned RBF), and Neural Networks (simple, with dropout, wider architecture). Evaluated each configuration using accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices.
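The preprocessing, training, and evaluation steps can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: it assumes scikit-learn, uses synthetic stand-in data with the same shape as the dataset (303 rows, 13 features), and the 80/20 split ratio and every hyperparameter value shown (max_depth, min_samples_split, min_samples_leaf, C, gamma) are hypothetical. The neural network configurations are left out because the summary does not say which framework was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 303 samples, 13 features (assumption).
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=42)

# Stratified split keeps the class ratio the same in train and test sets.
# The 80/20 ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hypothetical hyperparameter values for the six non-neural configurations.
configs = {
    "rf_default": RandomForestClassifier(random_state=42),
    "rf_depth_limited": RandomForestClassifier(max_depth=5, random_state=42),
    "rf_sample_constrained": RandomForestClassifier(
        min_samples_split=10, min_samples_leaf=5, random_state=42),
    "svm_rbf": SVC(kernel="rbf"),
    "svm_linear": SVC(kernel="linear"),
    "svm_rbf_tuned": SVC(kernel="rbf", C=10, gamma=0.01),
}


def evaluate(model):
    """Fit scaler + model on the training set and score on the test set."""
    # Putting the scaler in the pipeline means it is fitted on training data only.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    # ROC-AUC needs continuous scores: SVC exposes decision_function,
    # Random Forest exposes class probabilities.
    if hasattr(model, "decision_function"):
        scores = pipe.decision_function(X_test)
    else:
        scores = pipe.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, scores),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


results = {name: evaluate(model) for name, model in configs.items()}

for name, metrics in results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.4f}, "
          f"f1={metrics['f1']:.4f}, roc_auc={metrics['roc_auc']:.4f}")
```

Wrapping the scaler and the classifier in one pipeline is a design choice that avoids leaking test-set statistics into the scaling step. The scores printed here come from synthetic data and say nothing about the project's reported results.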
Outcome
Random Forest with sample constraints achieved the highest accuracy at 85.25%, with chest pain type (cp), maximum heart rate (thalachh), and number of major vessels (caa) identified as the most predictive features. Neural networks showed a tendency to overfit on the small dataset despite dropout regularisation. The deliverable is a comprehensive Jupyter notebook with visualisations, statistical analysis, and reproducible results.
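A feature ranking of this kind can be read from a fitted Random Forest's impurity-based importances. The sketch below is illustrative only: it assumes scikit-learn, fits on synthetic stand-in data, uses placeholder feature names (feature_0 to feature_12) in place of the clinical columns, and the min_samples_split and min_samples_leaf values are hypothetical. The summary does not confirm which importance measure the notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names; with the real data these would be the clinical columns.
feature_names = [f"feature_{i}" for i in range(13)]

# Synthetic stand-in for the real data: 303 samples, 13 features (assumption).
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=0)

# Sample-constrained forest; the constraint values are hypothetical.
forest = RandomForestClassifier(min_samples_split=10, min_samples_leaf=5,
                                random_state=0)
forest.fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = forest.feature_importances_

# Sort features from most to least important.
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)

for name, score in ranking[:3]:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances are computed from the training data, so on a dataset this small it can be worth cross-checking them against permutation importance on held-out data.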
