BREAST CANCER WISCONSIN DATA SET.

Katyna Sada

2/17/2021

1. Introduction

Cancer is the second leading cause of death worldwide: in 2018 it caused approximately 9.6 million deaths. These numbers are on the rise, and by 2040 there are expected to be as many as 29.5 million new cases per year. Regardless of the type of cancer, cancer cells are distinguished by growing out of control and becoming invasive (WHO). They consume the space and nutrients that healthy organs require, which can then trigger the malfunction of body systems (NIH).

Breast cancer is a type of cancer that starts in the breast and occurs almost entirely in women; an estimated 2.1 million new cases were diagnosed in 2018 (WHO). A total of 44,130 deaths from breast cancer are estimated to occur in 2021 (ASCO).

There is a need for new techniques to correctly diagnose and treat breast cancer.

The objective of this project is to develop an algorithm that correctly predicts whether a breast cancer cell nucleus sample is benign or malignant. For this purpose, several machine learning algorithms will be compared in terms of accuracy and execution time.

2. Dataset

This dataset is hosted on Kaggle (Breast Cancer Wisconsin (Diagnostic) Data Set) and originally comes from the UCI Machine Learning Repository.

The dataset contains 30 features plus the ID number and diagnosis of the participants. The features were computed by measuring 10 parameters on each cell nucleus from a digitized image of a fine needle aspirate (FNA) of a breast mass. For each parameter, the mean, standard error and "worst" value (mean of the three largest values) were obtained, yielding the 30 features; a naming sketch follows the list below.

A breast fine needle aspiration (FNA) removes some fluid or cells from a breast lesion (a cyst, lump, sore or swelling) with a fine needle similar to that used for blood tests. The sample of fluid or cells (or tissue) is then examined (Borecky).

Real-valued features:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)
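For illustration, the 30 feature names in the data follow a parameter_statistic pattern (e.g. radius_mean, radius_se, radius_worst). A minimal sketch of how the full set of names can be reconstructed; the mean, se and worst suffixes are taken from the dataset's column names:

# Build the 30 feature names: 10 nucleus parameters x 3 summary statistics
parameters <- c("radius", "texture", "perimeter", "area", "smoothness",
                "compactness", "concavity", "concave.points", "symmetry",
                "fractal_dimension")
statistics <- c("mean", "se", "worst")
feature_names <- as.vector(outer(parameters, statistics, paste, sep = "_"))
length(feature_names) # 30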

The main libraries required for the development of this project are loaded in the chunk below.

# Libraries
library(rpart.plot)
library(tidyverse)
library(skimr)
library(ggpubr)

# Helper packages
library(ggplot2)  # for awesome graphics
library(dplyr)    # for data manipulation
library(visdat)   # for additional visualizations
library(DT)       # for interactive tables (datatable)
library(vip)      # for variable-importance plots
library(rlist)    # for list.append

# Feature engineering packages
library(caret)    # for various ML tasks
library(recipes)  # for feature engineering tasks
library(gridExtra)

The dataset can be inspected in the table shown below. The sample ID and an empty trailing column were removed, as they have no relevance to the experiment.

# Read the data
data <- read.csv('./files/data.csv')
datatable(data, class = 'cell-border stripe')
data <- data[,-c(1,33)] # delete id and empty column

3. Exploratory data analysis

Variable type

First, a descriptive exploration of the variables was performed in order to understand them better and detect possible problems.

skim(data)
Data summary
Name data
Number of rows 569
Number of columns 31
_______________________
Column type frequency:
character 1
numeric 30
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
diagnosis 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
radius_mean 0 1 14.13 3.52 6.98 11.70 13.37 15.78 28.11 ▂▇▃▁▁
texture_mean 0 1 19.29 4.30 9.71 16.17 18.84 21.80 39.28 ▃▇▃▁▁
perimeter_mean 0 1 91.97 24.30 43.79 75.17 86.24 104.10 188.50 ▃▇▃▁▁
area_mean 0 1 654.89 351.91 143.50 420.30 551.10 782.70 2501.00 ▇▃▂▁▁
smoothness_mean 0 1 0.10 0.01 0.05 0.09 0.10 0.11 0.16 ▁▇▇▁▁
compactness_mean 0 1 0.10 0.05 0.02 0.06 0.09 0.13 0.35 ▇▇▂▁▁
concavity_mean 0 1 0.09 0.08 0.00 0.03 0.06 0.13 0.43 ▇▃▂▁▁
concave.points_mean 0 1 0.05 0.04 0.00 0.02 0.03 0.07 0.20 ▇▃▂▁▁
symmetry_mean 0 1 0.18 0.03 0.11 0.16 0.18 0.20 0.30 ▁▇▅▁▁
fractal_dimension_mean 0 1 0.06 0.01 0.05 0.06 0.06 0.07 0.10 ▆▇▂▁▁
radius_se 0 1 0.41 0.28 0.11 0.23 0.32 0.48 2.87 ▇▁▁▁▁
texture_se 0 1 1.22 0.55 0.36 0.83 1.11 1.47 4.88 ▇▅▁▁▁
perimeter_se 0 1 2.87 2.02 0.76 1.61 2.29 3.36 21.98 ▇▁▁▁▁
area_se 0 1 40.34 45.49 6.80 17.85 24.53 45.19 542.20 ▇▁▁▁▁
smoothness_se 0 1 0.01 0.00 0.00 0.01 0.01 0.01 0.03 ▇▃▁▁▁
compactness_se 0 1 0.03 0.02 0.00 0.01 0.02 0.03 0.14 ▇▃▁▁▁
concavity_se 0 1 0.03 0.03 0.00 0.02 0.03 0.04 0.40 ▇▁▁▁▁
concave.points_se 0 1 0.01 0.01 0.00 0.01 0.01 0.01 0.05 ▇▇▁▁▁
symmetry_se 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.08 ▇▃▁▁▁
fractal_dimension_se 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.03 ▇▁▁▁▁
radius_worst 0 1 16.27 4.83 7.93 13.01 14.97 18.79 36.04 ▆▇▃▁▁
texture_worst 0 1 25.68 6.15 12.02 21.08 25.41 29.72 49.54 ▃▇▆▁▁
perimeter_worst 0 1 107.26 33.60 50.41 84.11 97.66 125.40 251.20 ▇▇▃▁▁
area_worst 0 1 880.58 569.36 185.20 515.30 686.50 1084.00 4254.00 ▇▂▁▁▁
smoothness_worst 0 1 0.13 0.02 0.07 0.12 0.13 0.15 0.22 ▂▇▇▂▁
compactness_worst 0 1 0.25 0.16 0.03 0.15 0.21 0.34 1.06 ▇▅▁▁▁
concavity_worst 0 1 0.27 0.21 0.00 0.11 0.23 0.38 1.25 ▇▅▂▁▁
concave.points_worst 0 1 0.11 0.07 0.00 0.06 0.10 0.16 0.29 ▅▇▅▃▁
symmetry_worst 0 1 0.29 0.06 0.16 0.25 0.28 0.32 0.66 ▅▇▁▁▁
fractal_dimension_worst 0 1 0.08 0.02 0.06 0.07 0.08 0.09 0.21 ▇▃▁▁▁

After visually analyzing the distributions of all the features, no abnormalities were found. Other than the target variable, which is categorical, all variables are numeric, and the variable type is correctly assigned to every feature. In addition, there are no missing values.

Dataset features

Target variable

The target variable indicates whether the cancer is benign (B) or malignant (M). As seen before, there are 357 benign samples and 212 malignant samples.
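These counts can be verified directly (a quick check, not part of the original chunk):

# Class counts: 357 benign (B) and 212 malignant (M)
table(data$diagnosis)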

ggplot(data, aes(diagnosis, fill=diagnosis)) + 
  geom_bar() +
  labs(x="Diagnosis", y="Number of patients") +
  guides(fill=FALSE) +
  scale_fill_manual( values = c("#686868","#9F2042"))

Next, boxplots of all the features were created in order to visualize their usefulness for classification. The boxplots are grouped into mean, standard error and “worst” features.

Mean BOXPLOTS

Other than fractal_dimension_mean, there seems to be a clear difference in the values of all features when comparing benign and malignant samples. Many samples show outliers in the feature values. In addition, the range of values varies greatly between features: for example, the mean area of a sample can reach 2500, while the highest fractal dimension does not reach 0.1.

library(gridExtra)
p <- list()
for (j in colnames(data)[2:11]) {
  p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) + 
            geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
            theme(axis.title.y = element_text(size=8)) +  
            geom_jitter(alpha = 0.2, width = 0.2) +
            scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))

Standard Error BOXPLOTS

The standard error features show less separation between the classes. As before, many outliers are present.

p <- list()
for (j in colnames(data)[12:21]) {
  p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) + 
            geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
            theme(axis.title.y = element_text(size=8)) +  
            geom_jitter(alpha = 0.2, width = 0.2) +
            scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))

"Worst" BOXPLOTS

The “worst” features show a similar behaviour to that of the mean features.

p <- list()
for (j in colnames(data)[22:31]) {
  p[[j]] <- ggplot(data=data, aes_string(x="diagnosis", y=j)) + 
            geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
            theme(axis.title.y = element_text(size=8)) +  
            geom_jitter(alpha = 0.2, width = 0.2) +
            scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))

Correlation

Because of how the features were derived from one another, correlation is clearly expected. As seen in the correlation heatmap, the strongest correlations are found among perimeter, area and radius.

library(pheatmap)
pheatmap(cor(data[,-1]))
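Since the area and perimeter of a near-circular nucleus are essentially deterministic functions of its radius (area ≈ πr², perimeter ≈ 2πr), near-perfect correlations among these three features are expected. A short sketch (not in the original analysis) that lists the most strongly correlated pairs; the 0.98 threshold is arbitrary:

# List feature pairs with |r| above a threshold, counting each pair once
cors <- cor(data[,-1])
cors[upper.tri(cors, diag = TRUE)] <- NA
high <- which(abs(cors) > 0.98, arr.ind = TRUE)
data.frame(feature1 = rownames(cors)[high[,1]],
           feature2 = colnames(cors)[high[,2]],
           r = cors[high])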

4. Division of data into training and test sets & data preprocessing

Stratified sampling

As seen before, there are more benign than malignant samples. Stratified sampling was used to guarantee that the training and test sets preserve the class proportions of the complete dataset.

library(rsample)
set.seed(123)
split_strat <- initial_split(data,prop = 0.8, strata = "diagnosis")

datos_train <- training(split_strat)
datos_test <- testing(split_strat)

prop.table(table(data$diagnosis))
## 
##         B         M 
## 0.6274165 0.3725835
table(datos_test$diagnosis) %>% prop.table()
## 
##        B        M 
## 0.626087 0.373913
table(datos_train$diagnosis) %>% prop.table()
## 
##         B         M 
## 0.6277533 0.3722467

5. Feature Engineering

Some transformations were performed on the raw data with the aim of improving the performance of the algorithms.

A recipe object was created to define the preprocessing steps.

obj_recipe <- recipe(diagnosis~.,data = datos_train) 
obj_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30

Imputation of missing values

There are no missing values, so there is no need to impute.

Variables with variance close to zero

As seen in the table below, all predictors have non-negligible variance. Hence, no features were removed based on this criterion.

datatable(nearZeroVar(data,saveMetrics = T), class = 'cell-border stripe') 
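For completeness, the same filter could also be expressed inside the recipe pipeline; a sketch using recipes::step_nzv (with this data it would remove nothing):

# Equivalent recipe step (sketch): drop near-zero-variance predictors
obj_recipe %>% step_nzv(all_predictors())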

Standardization and scaling

The boxplots demonstrated the need for centering and scaling: the difference in scales between features can have a great impact on many models. The step_normalize function centers and scales the data so that each numeric variable ends up with mean 0 and standard deviation 1 (z = (x - mean) / sd).

obj_recipe <- obj_recipe %>%
  step_normalize(all_numeric()) 

obj_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Operations:
## 
## Centering and scaling for all_numeric()

Once the recipe object has been created with the preprocessing transformations, these are estimated on the training data and then applied to both sets.

trained_recipe <- obj_recipe %>%  prep(training=datos_train)
trained_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Training data contained 454 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for radius_mean, texture_mean, perimeter_mean, area... [trained]
datos_train_prep <-trained_recipe %>% bake(new_data = datos_train)
datos_test_prep <- trained_recipe %>% bake(new_data = datos_test)
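A quick sanity check (added here, not in the original report) confirms what step_normalize promises: after baking, every numeric predictor has mean ≈ 0 and standard deviation ≈ 1 on the training set.

# Verify centering and scaling on the baked training set
num_cols <- sapply(datos_train_prep, is.numeric)
round(colMeans(datos_train_prep[, num_cols]), 10)[1:3] # ~0
apply(datos_train_prep[, num_cols], 2, sd)[1:3]        # ~1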

A “glimpse” of the values of the features after applying the transformations is shown below.

glimpse(datos_train_prep)
## Rows: 454
## Columns: 31
## $ radius_mean             <dbl> -0.1331989, -0.2666662, -0.2811735, -1.6837408…
## $ texture_mean            <dbl> -1.15699712, -0.83706049, -0.19481734, -0.5692…
## $ perimeter_mean          <dbl> -0.1500359, -0.2273717, -0.3549968, -1.6608301…
## $ area_mean               <dbl> -0.22350716, -0.35979687, -0.34861110, -1.2961…
## $ smoothness_mean         <dbl> 0.12069668, 0.80586200, -0.44098365, -0.711239…
## $ compactness_mean        <dbl> -0.39892053, 0.50036692, -1.25728657, -0.82898…
## $ concavity_mean          <dbl> -0.23484688, -0.51550068, -0.78410351, -0.9145…
## $ concave.points_mean     <dbl> 0.02249063, -0.43068270, -0.48139689, -1.11364…
## $ symmetry_mean           <dbl> 0.31700458, 0.62488714, -1.25244555, -0.118536…
## $ fractal_dimension_mean  <dbl> -0.72673063, 0.74189265, -0.59040866, 0.309035…
## $ radius_se               <dbl> -0.48714657, -0.82574532, -0.83094223, -0.9412…
## $ texture_se              <dbl> -0.785250299, -0.860917446, 2.088621247, -0.47…
## $ perimeter_se            <dbl> -0.39912768, -0.78454448, -0.90616489, -0.9495…
## $ area_se                 <dbl> -0.36505392, -0.58327249, -0.59579121, -0.7419…
## $ smoothness_se           <dbl> 0.43268936, -0.96202503, -0.88054687, 0.594367…
## $ compactness_se          <dbl> -0.5831375, -0.3325462, -1.1381571, -0.4767220…
## $ concavity_se            <dbl> -0.2349066, -0.4748685, -0.5985064, -0.5131788…
## $ concave.points_se       <dbl> 0.27878814, -0.85016740, 0.02282375, -0.947298…
## $ symmetry_se             <dbl> -0.04727782, -0.42542760, 0.81795892, 0.696500…
## $ fractal_dimension_se    <dbl> -0.5637065, -0.5160455, -0.7631200, -0.4561834…
## $ radius_worst            <dbl> -0.2101238437, -0.3388391990, -0.5920497340, -…
## $ texture_worst           <dbl> -1.03615624, -0.83677911, -0.46071819, -0.5984…
## $ perimeter_worst         <dbl> -0.19285315, -0.30330931, -0.65915563, -1.4914…
## $ area_worst              <dbl> -0.27323198, -0.41777357, -0.56930045, -1.1132…
## $ smoothness_worst        <dbl> 0.49854224, -0.05383891, -1.52930076, -0.11857…
## $ compactness_worst       <dbl> -0.4577360, 0.1763841, -1.2866442, -0.7207410,…
## $ concavity_worst         <dbl> -0.12049281, -0.36870261, -1.06701605, -0.9653…
## $ concave.points_worst    <dbl> 0.2596773, -0.6015747, -0.9508764, -1.3277223,…
## $ symmetry_worst          <dbl> 0.152282948, 0.482873080, -1.428800296, 0.3567…
## $ fractal_dimension_worst <dbl> -0.61945910, -0.12328529, -1.20477236, -0.5389…
## $ diagnosis               <fct> B, B, B, B, B, B, B, B, B, B, B, B, B, B, B, B…

Mean boxplots after centering and scaling

The boxplots below show the transformed mean features (all other features were transformed as well).

p <- list()
for (j in colnames(datos_train_prep)[1:10]) {
  p[[j]] <- ggplot(data=datos_train_prep, aes_string(x="diagnosis", y=j)) + 
            geom_boxplot(aes(fill=factor(diagnosis))) + guides(fill=FALSE) +
            theme(axis.title.y = element_text(size=8)) +  
            geom_jitter(alpha = 0.2, width = 0.2) +
            scale_fill_manual( values = c("#686868","#9F2042"))
}
do.call(grid.arrange, c(p, ncol=5))

6. Predictive models

The train control for all models was created using the cross-validation resampling method with 10 folds.

control_train<-trainControl(method = "cv", 
                           number=10, 
                           returnResamp = "all", #all resampled performance measures are saved
                           classProbs = TRUE, # class probabilities are computed
                           savePredictions = TRUE)
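The confusion matrices in this section are produced with confMat(), which is not a caret function and whose definition is not shown in the rendered report. A minimal sketch of what it presumably does: predict on the preprocessed test set and wrap caret::confusionMatrix.

# Assumed helper (hypothetical reconstruction, matching the outputs below):
# predict on the held-out test set and compute the confusion matrix
confMat <- function(model) {
  preds <- predict(model, newdata = datos_test_prep %>% dplyr::select(-diagnosis))
  caret::confusionMatrix(preds, datos_test_prep$diagnosis)
}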

Model 1: Glmnet

Glmnet fits a generalized linear model via penalized maximum likelihood. The penalization hyperparameters are alpha (the elastic-net mixing parameter between ridge and lasso) and lambda (the penalty strength); their selection was left to caret's default grid. This penalty copes well with correlated predictors, which is important in this project.

set.seed(123)
modelo_glm <- train(diagnosis ~.,
                    method="glmnet", 
                    family="binomial", 
                    trControl=control_train, 
                    data=datos_train_prep, 
                    metric="Accuracy")
modelo_glm$bestTune
##   alpha      lambda
## 2   0.1 0.007604831
ggplot(modelo_glm, highlight = T)
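Caret searched its small built-in grid here. Had a custom search been needed, a grid could be supplied explicitly; a sketch with illustrative values (not used in this report):

# Illustrative tuning grid for glmnet: alpha mixes ridge (0) and lasso (1),
# lambda sets the overall penalty strength
grid_glmnet <- expand.grid(alpha = seq(0, 1, by = 0.25),
                           lambda = 10^seq(-4, -1, length.out = 10))
# train(diagnosis ~ ., method = "glmnet", tuneGrid = grid_glmnet, ...)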

The most important variable is radius_worst.

vip(modelo_glm, geom = "point")

confMat(modelo_glm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  0
##          M  1 43
##                                           
##                Accuracy : 0.9913          
##                  95% CI : (0.9525, 0.9998)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9815          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9861          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9773          
##              Prevalence : 0.6261          
##          Detection Rate : 0.6174          
##    Detection Prevalence : 0.6174          
##       Balanced Accuracy : 0.9931          
##                                           
##        'Positive' Class : B               
## 

Model 2: Random Forest

Random forest is based on constructing a multitude of decision trees and aggregating their predictions. Its tuning parameter in caret is mtry, the number of predictors randomly sampled as split candidates at each node; here all values from 1 to 30 were tried.

set.seed(123)
hip <- data.frame(mtry=1:30) # Randomly selected predictors
modelo_rf <-train(diagnosis ~.,
                  method="rf", 
                  trControl=control_train, 
                  data=datos_train_prep, 
                  tuneGrid=hip,  
                  metric="Accuracy")

modelo_rf$bestTune
##    mtry
## 10   10
ggplot(modelo_rf, highlight = T)

The most important variable of the model is area_worst.

vip(modelo_rf, geom = "point")

confMat(modelo_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  0
##          M  2 43
##                                           
##                Accuracy : 0.9826          
##                  95% CI : (0.9386, 0.9979)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9632          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9722          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9556          
##              Prevalence : 0.6261          
##          Detection Rate : 0.6087          
##    Detection Prevalence : 0.6087          
##       Balanced Accuracy : 0.9861          
##                                           
##        'Positive' Class : B               
## 

Model 3: Support Vector Machines with Linear and Polynomial Kernel

The SVM training algorithm builds a model that assigns new examples to one category or the other; the decision boundary can be linear or non-linear depending on the kernel. For the polynomial kernel the hyperparameters are cost, degree and scale. The chosen values are shown in the chunk below. The best model was the linear one (degree 1).

set.seed(123)
hip_svmP <- expand.grid(C=c(0.001, 0.01, 0.1, 0.5, 1, 10),degree=c(1,2,3),scale=1)
modelo_svmPoly <- train(diagnosis ~.,
                    method = "svmPoly",
                    trControl = control_train,
                    data = datos_train_prep,
                    tuneGrid = hip_svmP,
                    metric = "Accuracy")
modelo_svmPoly$bestTune
##   degree scale    C
## 4      1     1 0.01
ggplot(modelo_svmPoly, highlight = T)

confMat(modelo_svmPoly)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  0
##          M  1 43
##                                           
##                Accuracy : 0.9913          
##                  95% CI : (0.9525, 0.9998)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9815          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9861          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9773          
##              Prevalence : 0.6261          
##          Detection Rate : 0.6174          
##    Detection Prevalence : 0.6174          
##       Balanced Accuracy : 0.9931          
##                                           
##        'Positive' Class : B               
## 

Model 4: Naive Bayes

The Naive Bayes algorithm applies Bayes' theorem with strong (naive) independence assumptions between the features. The hyperparameters were left at their defaults. The results obtained are less favorable than those of the other models, probably because the algorithm ignores any correlation between features.
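For reference, the classifier picks the class y in {B, M} that maximizes P(y) * Π_j P(x_j | y); this equals the true posterior argmax only if the features are conditionally independent given the class, which is exactly the assumption the correlation heatmap contradicts here.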

set.seed(123)
modelo_nb <- train(diagnosis ~.,
                    method = "nb",
                    trControl = control_train,
                    data = datos_train_prep,
                    metric = "Accuracy")
modelo_nb$bestTune
##   fL usekernel adjust
## 2  0      TRUE      1
ggplot(modelo_nb, highlight = T)

confMat(modelo_nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  1
##          M  4 42
##                                           
##                Accuracy : 0.9565          
##                  95% CI : (0.9015, 0.9857)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9084          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.9767          
##          Pos Pred Value : 0.9855          
##          Neg Pred Value : 0.9130          
##              Prevalence : 0.6261          
##          Detection Rate : 0.5913          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.9606          
##                                           
##        'Positive' Class : B               
## 

Model 5: k-Nearest Neighbors

k-nearest neighbors classifies new cases based on a similarity measure (the distance between neighbors). The hyperparameters were also left at their defaults.

set.seed(123)
modelo_kknn <- train(diagnosis ~.,
                    method = "kknn",
                    trControl = control_train,
                    data = datos_train_prep,
                    metric = "Accuracy")
modelo_kknn$bestTune
##   kmax distance  kernel
## 2    7        2 optimal
ggplot(modelo_kknn, highlight = T)

confMat(modelo_kknn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  1
##          M  1 42
##                                           
##                Accuracy : 0.9826          
##                  95% CI : (0.9386, 0.9979)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9629          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9861          
##             Specificity : 0.9767          
##          Pos Pred Value : 0.9861          
##          Neg Pred Value : 0.9767          
##              Prevalence : 0.6261          
##          Detection Rate : 0.6174          
##    Detection Prevalence : 0.6261          
##       Balanced Accuracy : 0.9814          
##                                           
##        'Positive' Class : B               
## 

Comparison of first models

The accuracy and kappa metrics of the 10 folds for each model are shown in the following table.

modelos <- list(GLM=modelo_glm, RF=modelo_rf, SVM=modelo_svmPoly, NB=modelo_nb, KNN=modelo_kknn)
results_resamples <- resamples(modelos)

datatable(results_resamples$values)

The time taken to compute all the folds and the final model for each algorithm is shown in the next table.

datatable(results_resamples$timings)

Some models reached an accuracy of 1 on one of the folds. For this reason, in order to decide which model performed best, the mean accuracy across all folds was computed.

metricas_resamples <- results_resamples$values %>%
                         gather(key = "modelo", value = "valor", -Resample) %>%
                         separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)

metricas_resamples <- metricas_resamples %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
    geom_boxplot(alpha = 0.6, outlier.shape = NA) +
    geom_jitter(width = 0.1, alpha = 0.6) +
    theme_bw() +
    labs(title = "Validation: Mean Accuracy of the repeated-CV",
         subtitle = "Models are ordered based on the mean") +
    coord_flip()

metricas_resamples

The models with the highest accuracy are the SVM and GLM models. The SVM algorithm has the highest mean accuracy but takes longer to execute than the GLM algorithm, which is in fact the fastest of all. Their ROC curves are shown below; the AUC of both models demonstrates their near-perfect performance.

ROC curves of top models

library(ROCR)
predictions <- predict(modelo_glm, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)

plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("Glmnet")
grid()

predictions <- predict(modelo_svmPoly, datos_test_prep %>% dplyr::select(-diagnosis))
pred <- prediction(as.numeric(predictions),as.numeric(datos_test_prep$diagnosis))
perf <- performance(pred,"tpr","fpr")
AUC <- as.numeric(performance(pred,"auc")@y.values)

plot(perf,colorize=TRUE)
text(0.5, 0.4, paste("AUC = ",round(AUC,4)))
title("SVM")
grid()
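Note that predict() above returns hard class labels, so each ROC curve is built from a single operating point. A fuller curve can be obtained from class probabilities; a sketch for the Glmnet model (same ROCR workflow, type = "prob" is standard caret API):

# ROC from class probabilities instead of hard labels
prob_glm <- predict(modelo_glm, datos_test_prep %>% dplyr::select(-diagnosis),
                    type = "prob")
pred_p <- prediction(prob_glm$M, as.numeric(datos_test_prep$diagnosis == "M"))
perf_p <- performance(pred_p, "tpr", "fpr")
plot(perf_p, colorize = TRUE)
title("Glmnet (probability-based ROC)")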

The models obtained proved to be accurate and precise. Nevertheless, the correlation problem was only addressed by the Glmnet algorithm. Next, a principal component analysis is performed to reduce the dimensionality of the data, which also reduces multicollinearity.

7. Predictive models with PCA

The same five types of models were created after applying the PCA transformation to the data.

obj_recipe <- obj_recipe %>% step_pca(all_numeric(),num_comp = 10)

trained_recipe2 <- obj_recipe %>%  prep(training=datos_train)
trained_recipe2
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         30
## 
## Training data contained 454 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for radius_mean, texture_mean, perimeter_mean, area... [trained]
## PCA extraction with radius_mean, texture_mean, perimeter_mean, area_... [trained]
datos_train_prep2 <-trained_recipe2 %>% bake(new_data = datos_train)
datos_test_prep2 <- trained_recipe2 %>% bake(new_data = datos_test)
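The chunk that refits the models on the PCA scores is not shown in the rendered report; presumably it mirrors Section 6 with datos_train_prep2. A minimal sketch (the *pca model names are assumptions based on the "GLMpca" label used below):

# Refit the same five models on the PCA-reduced training set (sketch)
set.seed(123)
modelo_glm2 <- train(diagnosis ~ ., method = "glmnet", family = "binomial",
                     trControl = control_train, data = datos_train_prep2,
                     metric = "Accuracy")
set.seed(123)
modelo_svmPoly2 <- train(diagnosis ~ ., method = "svmPoly",
                         trControl = control_train, data = datos_train_prep2,
                         tuneGrid = hip_svmP, metric = "Accuracy")
# ... rf, nb and kknn are refit the same way ...
modelos2 <- list(GLMpca = modelo_glm2, SVMpca = modelo_svmPoly2) # plus the rest
results_resamples2 <- resamples(modelos2)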

Comparison of models with PCA

Metrics…

datatable(results_resamples2$values)

Time…

datatable(results_resamples2$timings)
metricas_resamples2 <- results_resamples2$values %>%
                         gather(key = "modelo", value = "valor", -Resample) %>%
                         separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)

metricas_resamples2 <- metricas_resamples2 %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
    geom_boxplot(alpha = 0.6, outlier.shape = NA) +
    geom_jitter(width = 0.1, alpha = 0.6) +
    theme_bw() +
    labs(title = "Validation: Mean Accuracy of the repeated-CV",
         subtitle = "Models are ordered based on the mean") +
    coord_flip()

As seen in the graph below, the SVM model still outperforms all other models, indicating that reducing dimensionality does not necessarily improve a model regardless of the number of variables or their correlation. If execution time is taken into account, the GLM model with PCA (GLMpca) is much faster and also performs well.

metricas_resamples2

Finally, the top two algorithms (SVM and GLM) were used once more to create two models that take the class imbalance into account. All features were used (without PCA).

8. Best predictive models with weights

The weight of each class has to be computed before fitting the models. Each class is weighted inversely to its frequency, weight_c = n / (n_classes * n_c), so the minority class (M) receives the larger weight.

datos_train_prep <- datos_train_prep[order(datos_train_prep$diagnosis),] # sort by class so the weight vector aligns with the rows
n1train <- length(which(datos_train_prep$diagnosis=="B"))
n2train <- length(which(datos_train_prep$diagnosis=="M"))
ntrain <- nrow(datos_train_prep) # training-set size (454), not the full dataset

n_classes <- 2
weight1 <- ntrain/(n_classes*n1train)
weight2 <- ntrain/(n_classes*n2train)

allWeights <- c(rep(weight1,n1train),rep(weight2,n2train))
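The weighted fits themselves are not shown in the rendered report. A minimal sketch of how they were presumably trained: glmnet accepts per-observation weights through caret's weights argument, while for kernlab's ksvm class-level weights are the supported mechanism, forwarded through train's "..." (both calls below are assumptions; the model names match those used in the comparison).

# Sketch of the weighted fits (not shown in the original report)
set.seed(123)
modelo_glm3 <- train(diagnosis ~ ., method = "glmnet", family = "binomial",
                     trControl = control_train, data = datos_train_prep,
                     weights = allWeights, metric = "Accuracy")
set.seed(123)
modelo_svmPoly3 <- train(diagnosis ~ ., method = "svmPoly",
                         trControl = control_train, data = datos_train_prep,
                         tuneGrid = hip_svmP,
                         class.weights = c(B = weight1, M = weight2),
                         metric = "Accuracy")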

Final comparison

modelos3 <- list.append(modelos2, GLMweights=modelo_glm3, SVMweights=modelo_svmPoly3) # rlist::list.append
results_resamples3 <- resamples(modelos3)

Metrics…

datatable(results_resamples3$values)

Time…

datatable(results_resamples3$timings)
metricas_resamples3 <- results_resamples3$values %>%
                         gather(key = "modelo", value = "valor", -Resample) %>%
                         separate(col = "modelo", into = c("modelo", "metrica"), sep = "~", remove = TRUE)

metricas_resamples3 <- metricas_resamples3 %>% filter(metrica == "Accuracy") %>%
  group_by(modelo) %>% 
  mutate(media = mean(valor)) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(modelo, media), y = valor, color = modelo)) +
    geom_boxplot(alpha = 0.6, outlier.shape = NA) +
    geom_jitter(width = 0.1, alpha = 0.6) +
    theme_bw() +
    labs(title = "Validation: Mean Accuracy of the repeated-CV",
         subtitle = "Models are ordered based on the mean") +
    coord_flip()

metricas_resamples3

Surprisingly, the SVM model that accounts for the class proportions has the same mean accuracy as the unweighted SVM model, although it takes slightly less time to compute.

9. Conclusions

As seen, the algorithm that best predicted the outcome for this dataset was the SVM; adding class weights preserved its accuracy while slightly reducing computation time. The GLM model also performs outstandingly and runs faster. Run time is of great importance in some applications; therefore, if time is relevant, Glmnet with PCA is also a suitable choice, since reducing the number of features reduces execution time. The Glmnet model also achieved high accuracy and a high AUC on all performed folds. In the future, other algorithms or other hyperparameter choices can be tested in search of better results, and other approaches can be adopted to deal with multicollinearity.