1. Introduction
Cancer is the second leading cause of death worldwide, responsible for approximately 9.6 million deaths in 2018. These numbers are on the rise: by 2040 there are expected to be as many as 29.5 million new cases per year. Regardless of the type of cancer, cancer cells are distinguished by growing out of control and becoming invasive (WHO). They consume the space and nutrients that healthy organs require, which can trigger the malfunctioning of certain body systems (NIH).
Breast cancer is a type of cancer that starts in the breast and occurs almost entirely in women; an estimated 2.1 million new cases were diagnosed in 2018 (WHO), and a total of 44,130 deaths from breast cancer were estimated to occur in 2021 (ASCO).
New techniques are therefore needed to correctly diagnose and treat breast cancer.
The objective of this project is to develop an algorithm that correctly predicts whether a breast cancer cell nucleus sample is benign or malignant. For this purpose, several machine learning algorithms are compared in terms of their accuracy and execution time.
2. Dataset
This dataset is hosted on Kaggle (Breast Cancer Wisconsin (Diagnostic) Data Set) and originates from the UCI Machine Learning Repository.
The dataset contains 30 features plus the ID number and diagnosis of the participants. The features were computed by measuring 10 parameters on each cell nucleus from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each parameter, the mean, standard error and "worst" value (mean of the three largest values) were obtained; a sketch of the resulting naming scheme follows the list below.
A breast fine needle aspiration (FNA) removes some fluid or cells from a breast lesion (a cyst, lump, sore or swelling) with a fine needle similar to that used for blood tests. The sample of fluid or cells (or tissue) is then examined (Borecky).
Real-valued features:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
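As an illustrative sketch (not part of the original analysis), the 30 column names can be reconstructed by crossing the 10 measured parameters with the 3 statistics:

# Illustrative sketch: the 30 feature names are the 10 parameters crossed
# with the 3 statistics (mean, standard error, worst).
params <- c("radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave.points", "symmetry",
            "fractal_dimension")
stats <- c("mean", "se", "worst")
feature_names <- as.vector(outer(params, stats, paste, sep = "_"))
length(feature_names) # 30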
The main libraries required for this project are loaded in the chunk below.
# Libraries
library(rpart.plot)
library(tidyverse)
library(skimr)
library(ggpubr)
library(DT) # for datatable()
library(vip) # for variable importance plots
# Helper packages
library(ggplot2) # for awesome graphics
library(dplyr) # for data manipulation
library(visdat) # for additional visualizations
# Feature engineering packages
library(caret) # for various ML tasks
library(recipes) # for feature engineering tasks
library(gridExtra)
The dataset can be explored in the table shown below. The sample ID and an empty column were removed, as they are not relevant to the experiment.
# Read the data
data <- read.csv('./files/data.csv')
datatable(data, class = 'cell-border stripe')
data <- data[,-c(1,33)] # delete id and empty column
3. Exploratory data analysis
Variable type
First, a descriptive exploration of the variables was performed to better understand them and detect possible problems.
skim(data)
Name | data |
Number of rows | 569 |
Number of columns | 31 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 30 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
radius_mean | 0 | 1 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 | ▂▇▃▁▁ |
texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 | ▃▇▃▁▁ |
perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 | ▃▇▃▁▁ |
area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 | ▇▃▂▁▁ |
smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 | ▁▇▇▁▁ |
compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
concave.points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 | ▇▁▁▁▁ |
smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
concave.points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
texture_worst | 0 | 1 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 | ▃▇▆▁▁ |
perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 | ▇▇▃▁▁ |
area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 | ▇▂▁▁▁ |
smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
concave.points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
After visually analyzing the distributions of all the features, no abnormalities were found. Other than the target variable, which is categorical, all variables are numeric, and the variable type is correctly assigned to all features. In addition, there are no missing values.
Dataset features
Target variable
The target variable indicates whether the cancer is benign (B) or malignant (M). As seen before, there are 357 benign samples and 212 malignant samples; a quick check of these counts follows the plot below.
ggplot(data, aes(diagnosis, fill=diagnosis)) +
geom_bar() +
labs(x="Diagnosis", y="Number of patients") +
guides(fill=FALSE) +
scale_fill_manual( values = c("#686868","#9F2042"))
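As a one-line sanity check (not in the original code), the counts behind the plot can be tabulated directly:

# Class counts cited above: 357 benign, 212 malignant.
table(data$diagnosis)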
Subsequently, boxplots of all the features were created in order to visualize their importance for classification. These boxplots are grouped into mean, standard error and "worst".
Mean
Other than the fractal_dimension_mean boxplot, there seems to be a significant difference in the values of all features when comparing benign and malignant samples. Many samples appear to have outliers in their feature values. In addition, the range of values varies greatly between features: for example, the mean area of a sample can reach a maximum value of 2500, while the highest fractal dimension of a sample doesn't reach 0.1.
p <- list()
for (j in colnames(data)[2:11]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
Standard Error
The standard error features show less difference between classes. As before, there are many outliers present.
p <- list()
for (j in colnames(data)[12:21]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
"Worst"
The “worst” features show a similar behaviour to that of the mean features.
p <- list()
for (j in colnames(data)[22:31]) {
p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
Correlation
Because the features are derived from the same underlying measurements, correlation is obviously expected. As seen in the correlation heatmap below, the highest correlation is found between the perimeter, area and radius features.
library(pheatmap)
pheatmap(cor(data[,-1]))
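To complement the heatmap, a small sketch (not part of the original analysis, with an arbitrary 0.97 cutoff chosen purely for illustration) lists the most strongly correlated pairs numerically:

# Sketch: list the most highly correlated feature pairs numerically.
# The 0.97 cutoff is arbitrary and only for illustration.
corr <- cor(data[,-1])
corr[upper.tri(corr, diag = TRUE)] <- NA # keep each pair only once
high <- which(abs(corr) > 0.97, arr.ind = TRUE)
data.frame(feature_1 = rownames(corr)[high[,1]],
           feature_2 = colnames(corr)[high[,2]],
           r = round(corr[high], 3))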
4. Division of data in training and testing & data preprocessing
Stratified sampling
As seen before, there are more benign than malignant samples. Stratified sampling was used to guarantee that each split preserves the class proportions of the complete data set.
library(rsample)
set.seed(123)
split_strat <- initial_split(data,prop = 0.8, strata = "diagnosis")
datos_train <- training(split_strat)
datos_test <- testing(split_strat)
prop.table(table(data$diagnosis))
##
## B M
## 0.6274165 0.3725835
table(datos_test$diagnosis) %>% prop.table()
##
## B M
## 0.626087 0.373913
table(datos_train$diagnosis) %>% prop.table()
##
## B M
## 0.6277533 0.3722467
5. Feature Engineering
Some transformations were performed on the raw data with the aim of improving the performance of the algorithms. A recipe object was created to pre-process the data.
obj_recipe <- recipe(diagnosis~.,data = datos_train)
obj_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
Imputation of missing values
There are no missing values, thus there is no need to impute.
Variables with variance close to zero
As seen in the table below, all predictors have significant variance. Hence, no features were removed based on this criterion.
datatable(nearZeroVar(data,saveMetrics = T), class = 'cell-border stripe')
Standardization and scaling
The boxplots demonstrated the need for centering and scaling: the difference in scales can have a great impact on the model. The step_normalize function centers and scales the data.
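As a worked formula, step_normalize maps each numeric column x to z = (x - mean(x)) / sd(x), with the mean and standard deviation estimated from the training data and later reused, unchanged, on the test data.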
obj_recipe <- obj_recipe %>%
step_normalize(all_numeric())
obj_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Operations:
##
## Centering and scaling for all_numeric()
Once the recipe object has been created with the preprocessing transformations, they are learned from the training data and applied to both sets.
trained_recipe <- obj_recipe %>% prep(training=datos_train)
trained_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Training data contained 454 data points and no missing data.
##
## Operations:
##
## Centering and scaling for radius_mean, texture_mean, perimeter_mean, area... [trained]
datos_train_prep <-trained_recipe %>% bake(new_data = datos_train)
datos_test_prep <- trained_recipe %>% bake(new_data = datos_test)
A “glimpse” of the values of the features after applying the transformations is shown below.
glimpse(datos_train_prep)
## Rows: 454
## Columns: 31
## $ radius_mean <dbl> -0.1331989, -0.2666662, -0.2811735, -1.6837408…
## $ texture_mean <dbl> -1.15699712, -0.83706049, -0.19481734, -0.5692…
## $ perimeter_mean <dbl> -0.1500359, -0.2273717, -0.3549968, -1.6608301…
## $ area_mean <dbl> -0.22350716, -0.35979687, -0.34861110, -1.2961…
## $ smoothness_mean <dbl> 0.12069668, 0.80586200, -0.44098365, -0.711239…
## $ compactness_mean <dbl> -0.39892053, 0.50036692, -1.25728657, -0.82898…
## $ concavity_mean <dbl> -0.23484688, -0.51550068, -0.78410351, -0.9145…
## $ concave.points_mean <dbl> 0.02249063, -0.43068270, -0.48139689, -1.11364…
## $ symmetry_mean <dbl> 0.31700458, 0.62488714, -1.25244555, -0.118536…
## $ fractal_dimension_mean <dbl> -0.72673063, 0.74189265, -0.59040866, 0.309035…
## $ radius_se <dbl> -0.48714657, -0.82574532, -0.83094223, -0.9412…
## $ texture_se <dbl> -0.785250299, -0.860917446, 2.088621247, -0.47…
## $ perimeter_se <dbl> -0.39912768, -0.78454448, -0.90616489, -0.9495…
## $ area_se <dbl> -0.36505392, -0.58327249, -0.59579121, -0.7419…
## $ smoothness_se <dbl> 0.43268936, -0.96202503, -0.88054687, 0.594367…
## $ compactness_se <dbl> -0.5831375, -0.3325462, -1.1381571, -0.4767220…
## $ concavity_se <dbl> -0.2349066, -0.4748685, -0.5985064, -0.5131788…
## $ concave.points_se <dbl> 0.27878814, -0.85016740, 0.02282375, -0.947298…
## $ symmetry_se <dbl> -0.04727782, -0.42542760, 0.81795892, 0.696500…
## $ fractal_dimension_se <dbl> -0.5637065, -0.5160455, -0.7631200, -0.4561834…
## $ radius_worst <dbl> -0.2101238437, -0.3388391990, -0.5920497340, -…
## $ texture_worst <dbl> -1.03615624, -0.83677911, -0.46071819, -0.5984…
## $ perimeter_worst <dbl> -0.19285315, -0.30330931, -0.65915563, -1.4914…
## $ area_worst <dbl> -0.27323198, -0.41777357, -0.56930045, -1.1132…
## $ smoothness_worst <dbl> 0.49854224, -0.05383891, -1.52930076, -0.11857…
## $ compactness_worst <dbl> -0.4577360, 0.1763841, -1.2866442, -0.7207410,…
## $ concavity_worst <dbl> -0.12049281, -0.36870261, -1.06701605, -0.9653…
## $ concave.points_worst <dbl> 0.2596773, -0.6015747, -0.9508764, -1.3277223,…
## $ symmetry_worst <dbl> 0.152282948, 0.482873080, -1.428800296, 0.3567…
## $ fractal_dimension_worst <dbl> -0.61945910, -0.12328529, -1.20477236, -0.5389…
## $ diagnosis <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B…
Mean boxplots after centering and scaling
The boxplots below show the result of transforming the mean features (all other features were also transformed).
p <- list()
for (j in colnames(datos_train_prep)[1:10]) {
p[[j]] <- ggplot(data=datos_train_prep, aes_string(x="diagnosis", y=j)) +
geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
theme(axis.title.y = element_text(size=8)) +
geom_jitter(alpha = 0.2, width = 0.2) +
scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))
6. Predictive models
The train control for all models was created using the cross-validation resampling method with 10 folds.
control_train<-trainControl(method = "cv",
number=10,
returnResamp = "all", #all resampled performance measures are saved
classProbs = TRUE, # class probabilities are computed
savePredictions = TRUE)
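Each model below is summarised with a confMat helper whose definition is not included in this report. A minimal sketch consistent with the outputs shown, assuming it evaluates the fitted model on the preprocessed test set with caret's confusionMatrix:

# Hypothetical helper (the original definition is not shown): predict on the
# preprocessed test set and summarise with caret::confusionMatrix().
confMat <- function(modelo) {
  preds <- predict(modelo, newdata = datos_test_prep %>% dplyr::select(-diagnosis))
  confusionMatrix(preds, datos_test_prep$diagnosis, positive = "B")
}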
Model 1: Glmnet
Glmnet fits a generalized linear model via penalized maximum likelihood. The penalization parameters are alpha and lambda; hyperparameter selection was left at caret's default. This penalty handles correlated predictors, which is important in this project.
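For reference, glmnet minimizes the negative log-likelihood plus the elastic net penalty lambda * ((1 - alpha)/2 * sum(beta_j^2) + alpha * sum(|beta_j|)): alpha = 0 corresponds to ridge regression, alpha = 1 to the lasso, and intermediate values blend the two.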
set.seed(123)
modelo_glm <- train(diagnosis ~.,
method="glmnet",
family="binomial",
trControl=control_train,
data=datos_train_prep,
metric="Accuracy")
modelo_glm$bestTune
## alpha lambda
## 2 0.1 0.007604831
ggplot(modelo_glm, highlight = T)
The most important variable is radius_worst.
vip(modelo_glm, geom = "point")
confMat(modelo_glm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 0
## M 1 43
##
## Accuracy : 0.9913
## 95% CI : (0.9525, 0.9998)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9815
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9861
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9773
## Prevalence : 0.6261
## Detection Rate : 0.6174
## Detection Prevalence : 0.6174
## Balanced Accuracy : 0.9931
##
## 'Positive' Class : B
##
Model 2: Random Forest
Random forest is based on constructing a multitude of decision trees and aggregating their predictions.
set.seed(123)
hip <- data.frame(mtry=1:30) # mtry: number of predictors randomly sampled at each split
modelo_rf <-train(diagnosis ~.,
method="rf",
trControl=control_train,
data=datos_train_prep,
tuneGrid=hip,
metric="Accuracy")
modelo_rf$bestTune
## mtry
## 10 10
ggplot(modelo_rf, highlight = T)
The most important variable of the model is area_worst.
vip(modelo_rf, geom = "point")
confMat(modelo_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 70 0
## M 2 43
##
## Accuracy : 0.9826
## 95% CI : (0.9386, 0.9979)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9632
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9722
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9556
## Prevalence : 0.6261
## Detection Rate : 0.6087
## Detection Prevalence : 0.6087
## Balanced Accuracy : 0.9861
##
## 'Positive' Class : B
##
Model 3: Support Vector Machines with Linear and Polynomial Kernel
The SVM training algorithm builds a model that assigns new examples to one category or the other, and the resulting classifier can be linear or non-linear depending on the kernel. In this case, the hyperparameters are cost, degree and scale; the values explored are shown in the chunk below. The best model was the linear one (degree 1).
set.seed(123)
hip_svmP <- expand.grid(C=c(0.001, 0.01, 0.1, 0.5, 1, 10),degree=c(1,2,3),scale=1)
modelo_svmPoly <- train(diagnosis ~.,
method = "svmPoly",
trControl = control_train,
data = datos_train_prep,
tuneGrid = hip_svmP,
metric = "Accuracy")
modelo_svmPoly$bestTune
## degree scale C
## 4 1 1 0.01
ggplot(modelo_svmPoly, highlight = T)
confMat(modelo_svmPoly)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 0
## M 1 43
##
## Accuracy : 0.9913
## 95% CI : (0.9525, 0.9998)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9815
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9861
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9773
## Prevalence : 0.6261
## Detection Rate : 0.6174
## Detection Prevalence : 0.6174
## Balanced Accuracy : 0.9931
##
## 'Positive' Class : B
##
Model 4: Naive Bayes
The Naive Bayes algorithm applies Bayes' theorem with strong (naive) independence assumptions between the features. The hyperparameters were left at their defaults. The results obtained are less favorable than those of the other models, probably because Naive Bayes ignores the correlation between features.
set.seed(123)
modelo_nb <- train(diagnosis ~.,
method = "nb",
trControl = control_train,
data = datos_train_prep,
metric = "Accuracy")
modelo_nb$bestTune
## fL usekernel adjust
## 2 0 TRUE 1
ggplot(modelo_nb, highlight = T)
confMat(modelo_nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 1
## M 4 42
##
## Accuracy : 0.9565
## 95% CI : (0.9015, 0.9857)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9084
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9444
## Specificity : 0.9767
## Pos Pred Value : 0.9855
## Neg Pred Value : 0.9130
## Prevalence : 0.6261
## Detection Rate : 0.5913
## Detection Prevalence : 0.6000
## Balanced Accuracy : 0.9606
##
## 'Positive' Class : B
##
Model 5: k-Nearest Neighbors
k-nearest neighbors classifies new cases based on a similarity measure (the distance between neighbors). The hyperparameters were also left at their defaults.
set.seed(123)
modelo_kknn <- train(diagnosis ~.,
method = "kknn",
trControl = control_train,
data = datos_train_prep,
metric = "Accuracy")
modelo_kknn$bestTune
## kmax distance kernel
## 2 7 2 optimal
ggplot(modelo_kknn, highlight = T)
confMat(modelo_kknn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 1
## M 1 42
##
## Accuracy : 0.9826
## 95% CI : (0.9386, 0.9979)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9629
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9861
## Specificity : 0.9767
## Pos Pred Value : 0.9861
## Neg Pred Value : 0.9767
## Prevalence : 0.6261
## Detection Rate : 0.6174
## Detection Prevalence : 0.6261
## Balanced Accuracy : 0.9814
##
## 'Positive' Class : B
##
Comparison of first models
The accuracy and kappa metrics of the 10 folds for each model are shown in the following table.
modelos <- list(GLM=modelo_glm, RF=modelo_rf, SVM=modelo_svmPoly, NB=modelo_nb, KNN=modelo_kknn)
results_resamples <- resamples(modelos)
datatable(results_resamples$values)
The time taken to compute all the folds and the final model for each algorithm is shown in the next table.
datatable(results_resamples$timings)
Some models obtained an accuracy of 1 on one of the folds. For this reason, the mean accuracy over all folds was computed to decide which model performed best.
metricas_resamples <- results_resamples$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples <- metricas_resamples %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
metricas_resamples
The models with the highest accuracy are the SVM and GLM models. The SVM algorithm has the highest mean accuracy but takes more time to execute than the GLM algorithm, which is in fact the fastest. Their ROC curves are shown below; the AUC of both models demonstrates their near-perfect performance.
ROC curves of top models
library(ROCR)
predictions <- predict(modelo_glm, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)
plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("Glmnet")
grid()
predictions <- predict(modelo_svmPoly, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)
plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("SVM")
grid()
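Note that the curves above are built from hard class predictions, which yields a single operating point. A sketch of the probability-based alternative (not in the original analysis), which traces the full curve:

# Sketch: ROC from class probabilities instead of hard labels.
prob_M <- predict(modelo_glm, datos_test_prep %>% dplyr::select(-diagnosis),
                  type = "prob")[,"M"]
pred <- prediction(prob_M, datos_test_prep$diagnosis == "M")
perf <- performance(pred, "tpr", "fpr")
AUC <- as.numeric(performance(pred, "auc")@y.values)
plot(perf, colorize = TRUE)
text(0.5, 0.4, paste("AUC = ", round(AUC, 4)))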
The models obtained proved to be accurate and precise. Nevertheless, the correlation problem was only addressed by the Glmnet algorithm. Next, a principal component analysis is performed to reduce the dimensionality of the data, which also reduces multicollinearity.
7. Predictive models with PCA
The same five types of models were created after applying the PCA transformation to the data; a sketch of the refit follows the preprocessing below.
obj_recipe <- obj_recipe %>% step_pca(all_numeric(),num_comp = 10)
trained_recipe2 <- obj_recipe %>% prep(training=datos_train)
trained_recipe2
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 30
##
## Training data contained 454 data points and no missing data.
##
## Operations:
##
## Centering and scaling for radius_mean, texture_mean, perimeter_mean, area... [trained]
## PCA extraction with radius_mean, texture_mean, perimeter_mean, area_... [trained]
datos_train_prep2 <-trained_recipe2 %>% bake(new_data = datos_train)
datos_test_prep2 <- trained_recipe2 %>% bake(new_data = datos_test)
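The chunk that refits the models on the PCA scores is not included in this report. A minimal sketch for two of the five models under the same train control (the remaining three follow the same pattern):

# Sketch (original chunk not shown): refit models on the PCA-transformed data.
# Optional check: cumulative variance captured by the 10 retained components.
summary(prcomp(datos_train_prep[,1:30]))$importance[3, 10]
set.seed(123)
modelo_glm2 <- train(diagnosis ~., method = "glmnet", family = "binomial",
                     trControl = control_train, data = datos_train_prep2,
                     metric = "Accuracy")
set.seed(123)
modelo_svmPoly2 <- train(diagnosis ~., method = "svmPoly",
                         trControl = control_train, data = datos_train_prep2,
                         tuneGrid = hip_svmP, metric = "Accuracy")
modelos2 <- list(GLMpca = modelo_glm2, SVMpca = modelo_svmPoly2)
results_resamples2 <- resamples(modelos2)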
Comparison of models with PCA
Metrics…
datatable(results_resamples2$values)
Time…
datatable(results_resamples2$timings)
metricas_resamples2 <- results_resamples2$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples2 <- metricas_resamples2 %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
As seen in the graph below, the SVM model still outperforms all other models, indicating that reducing dimensionality does not necessarily improve the model, regardless of the number of variables or their correlation. If execution time is taken into account, the GLMpca model is much faster and also performs well.
metricas_resamples2
Finally, the top two algorithms (SVM and GLM) were used again to create two models that take the class imbalance into account. All of the features were used.
8. Best predictive models with weights
The weight of each class has to be computed before fitting the algorithm: each class is weighted by n_total / (n_classes * n_class), so the minority class receives the larger weight.
# Sort the training set by class so that the weight vector built below
# (all benign weights first, then all malignant) lines up with the rows.
datos_train_prep <- datos_train_prep[order(datos_train_prep$diagnosis),]
n1train <- length(which(datos_train_prep$diagnosis=="B"))
n2train <- length(which(datos_train_prep$diagnosis=="M"))
ntrain <- nrow(datos_train_prep) # total number of training samples
n_classes <- 2
weight1 <- ntrain/(n_classes*n1train)
weight2 <- ntrain/(n_classes*n2train)
allWeights <- c(rep(weight1,n1train),rep(weight2,n2train))
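The chunk that fits the weighted models is likewise not shown. A sketch of how modelo_glm3 and modelo_svmPoly3 could be obtained, assuming glmnet takes the per-observation weights through train() while kernlab's svmPoly receives class-level weights via ksvm's class.weights argument:

# Sketch (original chunk not shown): weighted versions of the two top models.
set.seed(123)
modelo_glm3 <- train(diagnosis ~., method = "glmnet", family = "binomial",
                     trControl = control_train, data = datos_train_prep,
                     weights = allWeights, metric = "Accuracy")
set.seed(123)
modelo_svmPoly3 <- train(diagnosis ~., method = "svmPoly",
                         trControl = control_train, data = datos_train_prep,
                         tuneGrid = hip_svmP,
                         class.weights = c(B = weight1, M = weight2),
                         metric = "Accuracy")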
Final comparison
library(rlist) # for list.append()
modelos3 <- list.append(modelos2, GLMweights=modelo_glm3, SVMweights=modelo_svmPoly3)
results_resamples3 <- resamples(modelos3)
Metrics…
datatable(results_resamples3$values)
Time…
datatable(results_resamples3$timings)
metricas_resamples3 <- results_resamples3$values %>%
gather(key = "modelo", value = "valor", -Resample) %>%
separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)
metricas_resamples3 <- metricas_resamples3 %>% filter(metrica == "Accuracy") %>%
group_by(modelo) %>%
mutate(media = mean(valor)) %>%
ungroup() %>%
ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
geom_boxplot(alpha = 0.6, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha = 0.6) +
theme_bw() +
labs(title = "Validation: Mean Accuracy of the repeated-CV",
subtitle = "Models are ordered based on the mean") +
coord_flip()
metricas_resamples3
Surprisingly, the SVM model that takes the class proportions into account has the same mean accuracy as the unweighted SVM model, but takes slightly less time to compute.
9. Conclusions
As seen, the algorithm that best predicted the outcome for this data set was the SVM, and the addition of weights had a positive impact on the model. The GLM model also shows outstanding performance and takes less time. Run time is of great importance in some applications; therefore, if time is relevant, the Glmnet with PCA is also a suitable choice, since reducing the number of features reduces the execution time of the algorithm. The Glmnet model also achieved high accuracy and a high AUC on all performed folds. In the future, other algorithms could be tested or other hyperparameters chosen in search of better results, and other approaches could be adopted to deal with multicollinearity.