Urinary Biomarkers for Pancreatic Cancer

Authors: Katyna Sada, Cristina Tobías and Angélica Santos

Date: March 2021

1. Pancreatic Cancer

Pancreatic ductal adenocarcinoma (PDAC) is a highly fatal disease with a 5-year survival rate of approximtely 10% in the USA, and it is becoming an increasingly common cause of cancer mortality.

However, with an early detection the odds of surviving are much better. Unfortunately, many cases of pacreatic cancer show no symptoms until the cancer has spread throughout the body.

New strategies for screening high-risk patients to detect pancreatic tumours at early stages are needed to make a clinically significant impact.

2. Objective

Validate a biomarker panel for earlier detection of PDAC in urine.

3. Dataset

The dataset evaluates 590 patients, of which, 183 are healthy individuals, 208 have a benign pancreatic tumor and the rest have pancreatic cancer in different stages.

The data has been obtained by performing 2 different studies (cohort 1 and cohort 2). In the first study, the biomarkers analysed were creatinine, LYVE1, REG1B, TFF1 and REG1A, whilst in the second, the last biomarker was not analysed.

Other parameters that have been taken into account, besides the biomarkers, are the sex and age of the patients, the origin of the sample and, for the patients with a benign tumor, the symptoms.

3.1 Loading libraries

Code
# Library import
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.ticker as ticker
import seaborn as sns

# For creating the pipelines
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# For the metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

%matplotlib inline

3.2 Data

The key features are the aforementioned urinary biomarkers: creatinine, LYVE1, REG1B and TFF1. * Creatinine: a protein that is often used as an indicator of kidney function. * YVLE1: lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role in tumor metastasis. * REG1B: a protein that may be associated with pancreas regeneration * TFF1: trefoil factor 1, which may be related to regeneration and repair of the urinary tract.

Code
data = pd.read_csv("DebernardiDataModificado.csv")
data.head()
sample_id patient_cohort sample_origin age sex diagnosis stage benign_sample_diagnosis plasma_CA19_9 creatinine LYVE1 REG1B T001 REG1A
0 S1 Cohort1 1 33 0 1 NaN NaN 11.7 1.83222 0.893219 52.94884 654.282174 1262.000
1 S10 Cohort1 1 81 0 1 NaN NaN NaN 0.97266 2.037585 94.46703 209.488250 228.407
2 S100 Cohort2 1 51 1 1 NaN NaN 7.0 0.78039 0.145589 102.36600 461.141000 NaN
3 S101 Cohort2 1 61 1 1 NaN NaN 8.0 0.70122 0.002805 60.57900 142.950000 NaN
4 S102 Cohort2 1 62 1 1 NaN NaN 9.0 0.21489 0.000860 65.54000 41.088000 NaN

As it can be seen in the matrix above, the variables stage and benign sample diagnosis are exclusive to a particular diagnosis. This can be used for further studies, nonetheless, in this project, those columns are going to be removed since the absent values cannot be imputed.

Code
# There are some features which are exclusive for the type of diagnosis.   
# In case further studies have to be performed, that columns will be stored in these variables. 
stage = data.stage # Only for cancer patients
benign_sample_diagnosis = data.benign_sample_diagnosis # Only for benign tumors

#We remove the unusable columns
data = data.drop(columns = ['sample_id','patient_cohort','stage','benign_sample_diagnosis'])
Code
data.head()
sample_origin age sex diagnosis plasma_CA19_9 creatinine LYVE1 REG1B T001 REG1A
0 1 33 0 1 11.7 1.83222 0.893219 52.94884 654.282174 1262.000
1 1 81 0 1 NaN 0.97266 2.037585 94.46703 209.488250 228.407
2 1 51 1 1 7.0 0.78039 0.145589 102.36600 461.141000 NaN
3 1 61 1 1 8.0 0.70122 0.002805 60.57900 142.950000 NaN
4 1 62 1 1 9.0 0.21489 0.000860 65.54000 41.088000 NaN
Code
# Define numerical and categorical features
numeric_cols = ['age', 'plasma_CA19_9','creatinine', 'LYVE1', 'REG1B', 'T001', 'REG1A']
cat_cols = ['sample_origin','sex']
data[cat_cols] = data[cat_cols].astype('category')
data.diagnosis = data.diagnosis.astype('category')

4. Exploratory data analysis

The first step is to explore the data, this analysis is divided into different parts:

  • Identify missing values
  • Plot the distributions of the numerical variables
  • Analyse the correlation between features

Code
data.isna()
data.isna().sum() # Total 0s
sample_origin      0
age                0
sex                0
diagnosis          0
plasma_CA19_9    240
creatinine         0
LYVE1              0
REG1B              0
T001               0
REG1A            284
dtype: int64

As stated before, there are several missing values for the plasma and the biomarker REG1A.

Code
# Distribution graphs for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat

for i, colum in enumerate(numeric_cols):
    sns.histplot(
        data    = data,
        x       = colum,
        stat    = "count",
        kde     = True,
        palette   = sns.color_palette("muted", 3),
        line_kws= {'linewidth': 2},
        alpha   = 0.3,
        ax      = axes[i-2],
        hue     = 'diagnosis'
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables distribution', fontsize = 10, fontweight = "bold");

The graphs show similar distributions among the diagnosis, therefore, a transformation will be applied afterwards in the models.

Code
# Boxplots for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = colum,
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables boxplot', fontsize = 10, fontweight = "bold");

Code
# Boxplot of age
sns.boxplot(x="diagnosis", y="age", data=data) 
<AxesSubplot:xlabel='diagnosis', ylabel='age'>

It is important to know the distribuition of all variables taken into account in the models, in this case, all the diagnosis have a similar number of men and women. The difference in age range is also not that significant, however the age range on the cancer pantients is higher.

Code
# Sex based on diagnosis
sns.countplot(x ='diagnosis', hue = "sex", data = data)
<AxesSubplot:xlabel='diagnosis', ylabel='count'>

Code
# Sample_origin based on diagnosis
sns.countplot(x ='diagnosis', hue = "sample_origin", data = data)
<AxesSubplot:xlabel='diagnosis', ylabel='count'>

This is not the case for the sample origin, since most of the healthy samples come from the first hospital and that all of the samples from the 4th hospital have a benign tumor.

Code

sns.pairplot(data, hue = 'diagnosis', markers=["o", "s", "D"], palette= sns.color_palette("muted", 3), corner=True)
<seaborn.axisgrid.PairGrid at 0x7f8c5ac014f0>

There is barely correlation between the features, the diagnosis does not seem to have an effect neither.

After considering the plots, it is clear that the models will have suboptimal results when predicting 3 diagnosis. The proposed solution to this problem is to create two different types of models. The first type will try to predict all 3 diagnosis, whilst the second will only predict whether a patient is healthy or not.

5. Models with 3 outcomes

For these models we will considered three possible outcomes: * 1: Healthy controls. * 2: Patients with non-cancerous pancreatic conditions, like chronic pancreatitis. * 3: Patients with pancreatic ductal adenocarcinoma.

5.1 Division of data

Stratified sampling was used in order to guarantee the same proportion in each class than the one had in the complete data set.

Code
from sklearn.model_selection import train_test_split
Code
X = data.drop(columns='diagnosis') 
y = data[['diagnosis']]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.33, random_state = 42,stratify=y)
X_train, X_val, y_train, y_val  = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

5.2 Preprocessing

As stated before, transformations ought to be applied to the data in order to analyse it. The chosen transformation has been the logarithmic transform, since this operation stretches the distribution of numerical variables.

Code
def log_transform(x):
    return np.log(x + 1) # required for applying the logarithm to the numeric variables

transformer = FunctionTransformer(log_transform)
scaler = StandardScaler()
Code
 # Boxplots of data when applying the transform
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = (transformer.transform(data[colum])),
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Logarithm of numerical variables boxplot', fontsize = 10, fontweight = "bold");
#

Code
plotting_data = transformer.transform(data.select_dtypes(include=['float64', 'int']))
plotting_data = pd.concat([plotting_data.reset_index(drop=True), data.diagnosis], axis=1)

sns.pairplot(plotting_data, hue = 'diagnosis', markers=["o", "s", "D"], palette= sns.color_palette("muted", 3), corner=True)
<seaborn.axisgrid.PairGrid at 0x7f8c5aed39a0>

A more significant difference can be observed between classes when applying the log transform. Furthermore, some variables seem to be correlated in this scale.

After checking the effectiveness of the transformation, the pipeline is created. All the numerical variables will undergo the previous transformation, moreover, the categorical variables will be encoded so that the models can take them into account.

The missing values will be imputed by the means of a KNN imputer in both cases.

Code
# Numerical variables
numeric_transformer = Pipeline(
                        steps=[('transformer',transformer), # Log
                               ('imputer', KNNImputer()), #Imputation for completing missing values using k-Nearest Neighbors.
                               ('scaler', scaler)]) # Standardize features by removing the mean and scaling to unit variance

# Categorical variables
categorical_transformer = Pipeline(
                            steps=[('imputer', KNNImputer()),
                                   ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Preprocessor
preprocessor = ColumnTransformer(
                    transformers=[
                        ('cat', categorical_transformer, cat_cols),
                        ('numeric', numeric_transformer, numeric_cols)],
                        remainder='passthrough')

5.3 Predictive models

Different predictive models were first validated with the default tunning parameters.

Code
key = ['KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier']
value = [KNeighborsClassifier(weights ='distance'), SVC(random_state=42), DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42), GradientBoostingClassifier(random_state=42), AdaBoostClassifier()]
models = dict(zip(key,value))
models

predicted =[]
for name,algo in models.items():
    model=algo
    pipe = Pipeline([('preprocessing', preprocessor),  ('modelo', model)])
    pipe.fit(X=X_train, y=y_train)
    predict = pipe.predict(X_val)
    acc = accuracy_score(y_val, predict)
    predicted.append(acc)
    
    
plt.figure(figsize = (10,5))
sns.barplot(x = predicted, y = key, palette='muted')
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/pipeline.py:394: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/ensemble/_gb.py:494: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
<AxesSubplot:>

Code
predictions = dict(zip(key,predicted))
print(predictions)
{'KNeighborsClassifier': 0.6565656565656566, 'SVC': 0.5959595959595959, 'DecisionTreeClassifier': 0.6464646464646465, 'RandomForestClassifier': 0.696969696969697, 'GradientBoostingClassifier': 0.696969696969697, 'AdaBoostClassifier': 0.5858585858585859}

The best predictive models are Random Forest and GradientBoostingClassifier. The chosen model to tune was GradientBoostingClassifier.

5.4 Tunning the chosen model

Gradient boosting
Code
tuned_parameters = [ {'modelo__loss': 
                      ['deviance', 'exponential'], 
                      'modelo__learning_rate': [0.1, 0.01, 0.001, 0.0001],
                      'modelo__subsample': [0.1, 0.5, 1, 2],
                      'modelo__criterion': ['friedman_mse', 'mse', 'mae']}
                   ]

pipe = Pipeline([('preprocessing', preprocessor),
                 ('modelo', GradientBoostingClassifier(random_state=42))]) 

from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(pipe,tuned_parameters,refit=True,verbose=2, cv = 5, 
                     scoring='accuracy')
import warnings
warnings.filterwarnings('ignore') 
clf.fit(X_train, y_train)
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
Code
results = pd.DataFrame(clf.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)
param_modelo__criterion param_modelo__learning_rate param_modelo__loss param_modelo__subsample mean_test_score std_test_score
1 friedman_mse 0.1 deviance 0.5 0.669209 0.061035
33 mse 0.1 deviance 0.5 0.665819 0.053922
34 mse 0.1 deviance 1 0.655424 0.012859
2 friedman_mse 0.1 deviance 1 0.655424 0.012859
9 friedman_mse 0.01 deviance 0.5 0.631864 0.036484
... ... ... ... ... ... ...
91 mae 0.0001 deviance 2 NaN NaN
92 mae 0.0001 exponential 0.1 NaN NaN
93 mae 0.0001 exponential 0.5 NaN NaN
94 mae 0.0001 exponential 1 NaN NaN
95 mae 0.0001 exponential 2 NaN NaN

96 rows × 6 columns

The best parameters found based on the accuracy are shown below.

Code
clf.best_params_
{'modelo__criterion': 'friedman_mse',
 'modelo__learning_rate': 0.1,
 'modelo__loss': 'deviance',
 'modelo__subsample': 0.5}

Before the prediction, a new validation will be performed, so as to check whether the new model has indeed improved with respect to the previous one.

Code
new_val = clf.predict(X_val)
print(accuracy_score(y_val, new_val))
0.696969696969697

Finally, the tuned model is tested.

Code
prediction = clf.predict(X_test)
Code
from sklearn.metrics import classification_report, confusion_matrix
print('Confusion matrix for 3 diagnosis: ')
print(' ')
print(confusion_matrix(y_test,prediction))
print(' ')
print(classification_report(y_test,prediction))
print(' ')
print('Accuracy of the model: ', accuracy_score(y_test, prediction))
Confusion matrix for 3 diagnosis: 
 
[[44 13  3]
 [18 36 15]
 [ 3  5 58]]
 
              precision    recall  f1-score   support

           1       0.68      0.73      0.70        60
           2       0.67      0.52      0.59        69
           3       0.76      0.88      0.82        66

    accuracy                           0.71       195
   macro avg       0.70      0.71      0.70       195
weighted avg       0.70      0.71      0.70       195

 
Accuracy of the model:  0.7076923076923077

6. Models with 2 outcomes

For these models we will only consider two possible outcomes: PADC or no-PADC. For this, the diagnosis data of healthy (1) and non-cancerous pancreas condition (2) will be merged.

Code
new_data = data
new_data.diagnosis[new_data.diagnosis == 2] = 1
new_data.diagnosis[new_data.diagnosis == 3] = 2
new_data.diagnosis = new_data.diagnosis.cat.remove_unused_categories()
Code
new_data.diagnosis
0      1
1      1
2      1
3      1
4      1
      ..
585    2
586    2
587    2
588    2
589    2
Name: diagnosis, Length: 590, dtype: category
Categories (2, int64): [1, 2]

A new exploratory analysis can be performed to assess whether this study will have a more positive result than the previous one.

The boxplots below show a more significant difference than the ones where 3 diagnosis were being evaluated. Nonetheless, it is still advisable to apply a logarithmic transform.

Code
# Boxplots for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = new_data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = new_data,
        y       = colum,
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables boxplot', fontsize = 10, fontweight = "bold");

As for the densities and correlations, this separation of the data allows a better visualization of clusters.

Code
sns.pairplot(new_data, hue = 'diagnosis', markers=["o", "s"], palette= sns.color_palette("muted", 2), corner=True)
<seaborn.axisgrid.PairGrid at 0x7f8c489db400>

The same graphs are evaluated after the application of a log transform and the results show clear difference between the two groups.

Code
 # Boxplots of data when applying the transform
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = new_data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = (transformer.transform(new_data[colum])),
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Logarithm of numerical variables boxplot', fontsize = 10, fontweight = "bold");
#

Code
plotting_data = transformer.transform(new_data.select_dtypes(include=['float64', 'int']))
plotting_data = pd.concat([plotting_data.reset_index(drop=True), new_data.diagnosis], axis=1)
sns.pairplot(plotting_data, hue = 'diagnosis', markers=["o", "s"], palette= sns.color_palette("muted", 2), corner=True)
<seaborn.axisgrid.PairGrid at 0x7f8c3c8b9a00>

Code
y2class = new_data.diagnosis

6.1 Division of data

The new data will be divided into train, validation and test using the same method as before.

Code
X_train2, X_test2, y_train2, y_test2 = train_test_split(X,y2class, test_size = 0.33, random_state = 42,stratify=y2class)
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train2, y_train2, test_size=0.25, random_state=1)

6.2 Predictive models

The same models as before are validated, the tunning parameters are the ones set by default.

Code
key2 = ['LogisticRegression','KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier']
value2 = [LogisticRegression(),KNeighborsClassifier(weights ='distance'), SVC(random_state=42), DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42), GradientBoostingClassifier(random_state=42), AdaBoostClassifier()]
models2 = dict(zip(key2,value2))
models2

predicted2 =[]
for name,algo in models2.items():
    model=algo
    pipe = Pipeline([('preprocessing', preprocessor),  ('modelo', model)])
    pipe.fit(X=X_train2, y=y_train2)
    predict2 = pipe.predict(X_val2)
    acc = accuracy_score(y_val2, predict2)
    predicted2.append(acc)
    
    
plt.figure(figsize = (10,5))
sns.barplot(x = predicted2, y = key2, palette='muted')
<AxesSubplot:>

As expected, these models have significantly improved their results in this scenario.

Code
predictions2 = dict(zip(key2,predicted2))
print(predictions2)
{'LogisticRegression': 0.8383838383838383, 'KNeighborsClassifier': 0.8585858585858586, 'SVC': 0.8888888888888888, 'DecisionTreeClassifier': 0.8282828282828283, 'RandomForestClassifier': 0.8888888888888888, 'GradientBoostingClassifier': 0.8686868686868687, 'AdaBoostClassifier': 0.8484848484848485}

The best predictors are the SVC and the RandomForestClassfier. The chosen model to tune is the RF.

6.3 Tunning the chosen two class model

Random Forest Classifier

Code
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier
import multiprocessing

# Se combinan los pasos de preprocesado y el modelo en un mismo pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('modelo', RandomForestClassifier())])

# Hyperparameters optimization

param_distributions = {
    'modelo__n_estimators': [20, 30, 50, 100, 150, 200],
    'modelo__criterion': ["gini", "entropy"],
    'modelo__max_features': ["auto", 3, 5, 7],
    'modelo__max_depth'   : [3, 5, 10, 20]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'accuracy',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 2,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train2, y = y_train2)
Fitting 15 folds for each of 20 candidates, totalling 300 fits
RandomizedSearchCV(cv=RepeatedKFold(n_repeats=3, n_splits=5, random_state=None),
                   estimator=Pipeline(steps=[('preprocessing',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat',
                                                                               Pipeline(steps=[('imputer',
                                                                                                KNNImputer()),
                                                                                               ('onehot',
                                                                                                OneHotEncoder(handle_unknown='ignore'))]),
                                                                               ['sample_origin',
                                                                                'sex']),
                                                                              ('numeric',
                                                                               Pipeline(steps=[('transformer',
                                                                                                Functi...
                                                                                'creatinine',
                                                                                'LYVE1',
                                                                                'REG1B',
                                                                                'T001',
                                                                                'REG1A'])])),
                                             ('modelo',
                                              RandomForestClassifier())]),
                   n_iter=20, n_jobs=7,
                   param_distributions={'modelo__criterion': ['gini',
                                                              'entropy'],
                                        'modelo__max_depth': [3, 5, 10, 20],
                                        'modelo__max_features': ['auto', 3, 5,
                                                                 7],
                                        'modelo__n_estimators': [20, 30, 50,
                                                                 100, 150,
                                                                 200]},
                   random_state=123, return_train_score=True,
                   scoring='accuracy', verbose=2)
Code
#Parameters chosen
grid.best_params_
{'modelo__n_estimators': 200,
 'modelo__max_features': 5,
 'modelo__max_depth': 5,
 'modelo__criterion': 'gini'}
Code
#Check the validation again
new_val2 = grid.predict(X_val2)
print(accuracy_score(y_val2, new_val2))
0.8686868686868687
Code
#See the results in the test data
prediction = grid.predict(X_test2)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test2,prediction))
print(classification_report(y_test2,prediction))
[[120   9]
 [  8  58]]
              precision    recall  f1-score   support

           1       0.94      0.93      0.93       129
           2       0.87      0.88      0.87        66

    accuracy                           0.91       195
   macro avg       0.90      0.90      0.90       195
weighted avg       0.91      0.91      0.91       195

7. Conclusions

The tunned models show a high performance in both cases, nevertheless, due to the similarities between groups in the first case, a second analysis was performed in order to see if better predictions could be obtained.

The best model was the random forest classifier, which was built for only two classes. It uses the gini criterion, a maximum depth of 5 and 50 estimators. The accuracy obtained on the test set was 0,89.

These results show that the analysed biomarkers can accurately predict whether a patient has cancer or not.

Future lines of research can include the prediction of symptoms for benign tumors or the prediction of the stage of the cancer in the case of a positive diagnosis.