Urinary Biomarkers for Pancreatic Cancer

Authors: Katyna Sada, Cristina Tobías and Angélica Santos

Date: March 2021

1. Pancreatic Cancer

Pancreatic ductal adenocarcinoma (PDAC) is a highly fatal disease with a 5-year survival rate of approximtely 10% in the USA, and it is becoming an increasingly common cause of cancer mortality.

However, with an early detection the odds of surviving are much better. Unfortunately, many cases of pacreatic cancer show no symptoms until the cancer has spread throughout the body.

New strategies for screening high-risk patients to detect pancreatic tumours at early stages are needed to make a clinically significant impact.

2. Objective

Validate a biomarker panel for earlier detection of PDAC in urine.

3. Dataset

The dataset evaluates 590 patients, of which, 183 are healthy individuals, 208 have a benign pancreatic tumor and the rest have pancreatic cancer in different stages.

The data has been obtained by performing 2 different studies (cohort 1 and cohort 2). In the first study, the biomarkers analysed were creatinine, LYVE1, REG1B, TFF1 and REG1A, whilst in the second, the last biomarker was not analysed.

Other parameters that have been taken into account, besides the biomarkers, are the sex and age of the patients, the origin of the sample and, for the patients with a benign tumor, the symptoms.

3.1 Loading libraries

Code

# Library import
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.ticker as ticker
import seaborn as sns

# For creating the pipelines
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# For the metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

%matplotlib inline

3.2 Data

The key features are the aforementioned urinary biomarkers: creatinine, LYVE1, REG1B and TFF1. * Creatinine: a protein that is often used as an indicator of kidney function. * YVLE1: lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role in tumor metastasis. * REG1B: a protein that may be associated with pancreas regeneration * TFF1: trefoil factor 1, which may be related to regeneration and repair of the urinary tract.

Code

data = pd.read_csv("DebernardiDataModificado.csv")
data.head()

	sample_id	patient_cohort	sample_origin	age	sex	diagnosis	stage	benign_sample_diagnosis	plasma_CA19_9	creatinine	LYVE1	REG1B	T001	REG1A
0	S1	Cohort1	1	33	0	1	NaN	NaN	11.7	1.83222	0.893219	52.94884	654.282174	1262.000
1	S10	Cohort1	1	81	0	1	NaN	NaN	NaN	0.97266	2.037585	94.46703	209.488250	228.407
2	S100	Cohort2	1	51	1	1	NaN	NaN	7.0	0.78039	0.145589	102.36600	461.141000	NaN
3	S101	Cohort2	1	61	1	1	NaN	NaN	8.0	0.70122	0.002805	60.57900	142.950000	NaN
4	S102	Cohort2	1	62	1	1	NaN	NaN	9.0	0.21489	0.000860	65.54000	41.088000	NaN

As it can be seen in the matrix above, the variables stage and benign sample diagnosis are exclusive to a particular diagnosis. This can be used for further studies, nonetheless, in this project, those columns are going to be removed since the absent values cannot be imputed.

Code

# There are some features which are exclusive for the type of diagnosis.   
# In case further studies have to be performed, that columns will be stored in these variables. 
stage = data.stage # Only for cancer patients
benign_sample_diagnosis = data.benign_sample_diagnosis # Only for benign tumors

#We remove the unusable columns
data = data.drop(columns = ['sample_id','patient_cohort','stage','benign_sample_diagnosis'])

Code

data.head()

	sample_origin	age	sex	diagnosis	plasma_CA19_9	creatinine	LYVE1	REG1B	T001	REG1A
0	1	33	0	1	11.7	1.83222	0.893219	52.94884	654.282174	1262.000
1	1	81	0	1	NaN	0.97266	2.037585	94.46703	209.488250	228.407
2	1	51	1	1	7.0	0.78039	0.145589	102.36600	461.141000	NaN
3	1	61	1	1	8.0	0.70122	0.002805	60.57900	142.950000	NaN
4	1	62	1	1	9.0	0.21489	0.000860	65.54000	41.088000	NaN

Code

# Define numerical and categorical features
numeric_cols = ['age', 'plasma_CA19_9','creatinine', 'LYVE1', 'REG1B', 'T001', 'REG1A']
cat_cols = ['sample_origin','sex']
data[cat_cols] = data[cat_cols].astype('category')
data.diagnosis = data.diagnosis.astype('category')

4. Exploratory data analysis

The first step is to explore the data, this analysis is divided into different parts:

Identify missing values
Plot the distributions of the numerical variables
Analyse the correlation between features

Code

data.isna()
data.isna().sum() # Total 0s

sample_origin      0
age                0
sex                0
diagnosis          0
plasma_CA19_9    240
creatinine         0
LYVE1              0
REG1B              0
T001               0
REG1A            284
dtype: int64

As stated before, there are several missing values for the plasma and the biomarker REG1A.

Code

# Distribution graphs for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat

for i, colum in enumerate(numeric_cols):
    sns.histplot(
        data    = data,
        x       = colum,
        stat    = "count",
        kde     = True,
        palette   = sns.color_palette("muted", 3),
        line_kws= {'linewidth': 2},
        alpha   = 0.3,
        ax      = axes[i-2],
        hue     = 'diagnosis'
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables distribution', fontsize = 10, fontweight = "bold");

The graphs show similar distributions among the diagnosis, therefore, a transformation will be applied afterwards in the models.

Code

# Boxplots for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = colum,
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables boxplot', fontsize = 10, fontweight = "bold");

Code

# Boxplot of age
sns.boxplot(x="diagnosis", y="age", data=data)

<AxesSubplot:xlabel='diagnosis', ylabel='age'>

It is important to know the distribuition of all variables taken into account in the models, in this case, all the diagnosis have a similar number of men and women. The difference in age range is also not that significant, however the age range on the cancer pantients is higher.

Code

# Sex based on diagnosis
sns.countplot(x ='diagnosis', hue = "sex", data = data)

<AxesSubplot:xlabel='diagnosis', ylabel='count'>

Code

# Sample_origin based on diagnosis
sns.countplot(x ='diagnosis', hue = "sample_origin", data = data)

<AxesSubplot:xlabel='diagnosis', ylabel='count'>

This is not the case for the sample origin, since most of the healthy samples come from the first hospital and that all of the samples from the 4th hospital have a benign tumor.

Code


sns.pairplot(data, hue = 'diagnosis', markers=["o", "s", "D"], palette= sns.color_palette("muted", 3), corner=True)

<seaborn.axisgrid.PairGrid at 0x7f8c5ac014f0>

There is barely correlation between the features, the diagnosis does not seem to have an effect neither.

After considering the plots, it is clear that the models will have suboptimal results when predicting 3 diagnosis. The proposed solution to this problem is to create two different types of models. The first type will try to predict all 3 diagnosis, whilst the second will only predict whether a patient is healthy or not.

5. Models with 3 outcomes

For these models we will considered three possible outcomes: * 1: Healthy controls. * 2: Patients with non-cancerous pancreatic conditions, like chronic pancreatitis. * 3: Patients with pancreatic ductal adenocarcinoma.

5.1 Division of data

Stratified sampling was used in order to guarantee the same proportion in each class than the one had in the complete data set.

Code

from sklearn.model_selection import train_test_split

Code

X = data.drop(columns='diagnosis') 
y = data[['diagnosis']]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.33, random_state = 42,stratify=y)
X_train, X_val, y_train, y_val  = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

5.2 Preprocessing

As stated before, transformations ought to be applied to the data in order to analyse it. The chosen transformation has been the logarithmic transform, since this operation stretches the distribution of numerical variables.

Code

def log_transform(x):
    return np.log(x + 1) # required for applying the logarithm to the numeric variables

transformer = FunctionTransformer(log_transform)
scaler = StandardScaler()

Code

 # Boxplots of data when applying the transform
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = (transformer.transform(data[colum])),
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Logarithm of numerical variables boxplot', fontsize = 10, fontweight = "bold");
#

Code

plotting_data = transformer.transform(data.select_dtypes(include=['float64', 'int']))
plotting_data = pd.concat([plotting_data.reset_index(drop=True), data.diagnosis], axis=1)

sns.pairplot(plotting_data, hue = 'diagnosis', markers=["o", "s", "D"], palette= sns.color_palette("muted", 3), corner=True)

<seaborn.axisgrid.PairGrid at 0x7f8c5aed39a0>

A more significant difference can be observed between classes when applying the log transform. Furthermore, some variables seem to be correlated in this scale.

After checking the effectiveness of the transformation, the pipeline is created. All the numerical variables will undergo the previous transformation, moreover, the categorical variables will be encoded so that the models can take them into account.

The missing values will be imputed by the means of a KNN imputer in both cases.

Code

# Numerical variables
numeric_transformer = Pipeline(
                        steps=[('transformer',transformer), # Log
                               ('imputer', KNNImputer()), #Imputation for completing missing values using k-Nearest Neighbors.
                               ('scaler', scaler)]) # Standardize features by removing the mean and scaling to unit variance

# Categorical variables
categorical_transformer = Pipeline(
                            steps=[('imputer', KNNImputer()),
                                   ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Preprocessor
preprocessor = ColumnTransformer(
                    transformers=[
                        ('cat', categorical_transformer, cat_cols),
                        ('numeric', numeric_transformer, numeric_cols)],
                        remainder='passthrough')

5.3 Predictive models

Different predictive models were first validated with the default tunning parameters.

Code

key = ['KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier']
value = [KNeighborsClassifier(weights ='distance'), SVC(random_state=42), DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42), GradientBoostingClassifier(random_state=42), AdaBoostClassifier()]
models = dict(zip(key,value))
models

predicted =[]
for name,algo in models.items():
    model=algo
    pipe = Pipeline([('preprocessing', preprocessor),  ('modelo', model)])
    pipe.fit(X=X_train, y=y_train)
    predict = pipe.predict(X_val)
    acc = accuracy_score(y_val, predict)
    predicted.append(acc)
    
    
plt.figure(figsize = (10,5))
sns.barplot(x = predicted, y = key, palette='muted')

/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/neighbors/_classification.py:198: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  return self._fit(X, y)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/pipeline.py:394: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self._final_estimator.fit(Xt, y, **fit_params_last_step)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/ensemble/_gb.py:494: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/Users/katyna/opt/anaconda3/envs/SparseGoNew/lib/python3.8/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

<AxesSubplot:>

Code

predictions = dict(zip(key,predicted))
print(predictions)

{'KNeighborsClassifier': 0.6565656565656566, 'SVC': 0.5959595959595959, 'DecisionTreeClassifier': 0.6464646464646465, 'RandomForestClassifier': 0.696969696969697, 'GradientBoostingClassifier': 0.696969696969697, 'AdaBoostClassifier': 0.5858585858585859}

The best predictive models are Random Forest and GradientBoostingClassifier. The chosen model to tune was GradientBoostingClassifier.

5.4 Tunning the chosen model

Gradient boosting

Code

tuned_parameters = [ {'modelo__loss': 
                      ['deviance', 'exponential'], 
                      'modelo__learning_rate': [0.1, 0.01, 0.001, 0.0001],
                      'modelo__subsample': [0.1, 0.5, 1, 2],
                      'modelo__criterion': ['friedman_mse', 'mse', 'mae']}
                   ]

pipe = Pipeline([('preprocessing', preprocessor),
                 ('modelo', GradientBoostingClassifier(random_state=42))]) 

from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(pipe,tuned_parameters,refit=True,verbose=2, cv = 5, 
                     scoring='accuracy')
import warnings
warnings.filterwarnings('ignore') 
clf.fit(X_train, y_train)
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']

Code

results = pd.DataFrame(clf.cv_results_)
results.filter(regex = '(param.*|mean_t|std_t)')\
    .drop(columns = 'params')\
    .sort_values('mean_test_score', ascending = False)

	param_modelo__criterion	param_modelo__learning_rate	param_modelo__loss	param_modelo__subsample	mean_test_score	std_test_score
1	friedman_mse	0.1	deviance	0.5	0.669209	0.061035
33	mse	0.1	deviance	0.5	0.665819	0.053922
34	mse	0.1	deviance	1	0.655424	0.012859
2	friedman_mse	0.1	deviance	1	0.655424	0.012859
9	friedman_mse	0.01	deviance	0.5	0.631864	0.036484
...	...	...	...	...	...	...
91	mae	0.0001	deviance	2	NaN	NaN
92	mae	0.0001	exponential	0.1	NaN	NaN
93	mae	0.0001	exponential	0.5	NaN	NaN
94	mae	0.0001	exponential	1	NaN	NaN
95	mae	0.0001	exponential	2	NaN	NaN

96 rows × 6 columns

The best parameters found based on the accuracy are shown below.

Code

clf.best_params_

{'modelo__criterion': 'friedman_mse',
 'modelo__learning_rate': 0.1,
 'modelo__loss': 'deviance',
 'modelo__subsample': 0.5}

Before the prediction, a new validation will be performed, so as to check whether the new model has indeed improved with respect to the previous one.

Code

new_val = clf.predict(X_val)
print(accuracy_score(y_val, new_val))

0.696969696969697

Finally, the tuned model is tested.

Code

prediction = clf.predict(X_test)

Code

from sklearn.metrics import classification_report, confusion_matrix
print('Confusion matrix for 3 diagnosis: ')
print(' ')
print(confusion_matrix(y_test,prediction))
print(' ')
print(classification_report(y_test,prediction))
print(' ')
print('Accuracy of the model: ', accuracy_score(y_test, prediction))

Confusion matrix for 3 diagnosis: 
 
[[44 13  3]
 [18 36 15]
 [ 3  5 58]]
 
              precision    recall  f1-score   support

           1       0.68      0.73      0.70        60
           2       0.67      0.52      0.59        69
           3       0.76      0.88      0.82        66

    accuracy                           0.71       195
   macro avg       0.70      0.71      0.70       195
weighted avg       0.70      0.71      0.70       195

 
Accuracy of the model:  0.7076923076923077

6. Models with 2 outcomes

For these models we will only consider two possible outcomes: PADC or no-PADC. For this, the diagnosis data of healthy (1) and non-cancerous pancreas condition (2) will be merged.

Code

new_data = data
new_data.diagnosis[new_data.diagnosis == 2] = 1
new_data.diagnosis[new_data.diagnosis == 3] = 2
new_data.diagnosis = new_data.diagnosis.cat.remove_unused_categories()

Code

new_data.diagnosis

0      1
1      1
2      1
3      1
4      1
      ..
585    2
586    2
587    2
588    2
589    2
Name: diagnosis, Length: 590, dtype: category
Categories (2, int64): [1, 2]

A new exploratory analysis can be performed to assess whether this study will have a more positive result than the previous one.

The boxplots below show a more significant difference than the ones where 3 diagnosis were being evaluated. Nonetheless, it is still advisable to apply a logarithmic transform.

Code

# Boxplots for numerical variables
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = new_data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = new_data,
        y       = colum,
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Numerical variables boxplot', fontsize = 10, fontweight = "bold");

As for the densities and correlations, this separation of the data allows a better visualization of clusters.

Code

sns.pairplot(new_data, hue = 'diagnosis', markers=["o", "s"], palette= sns.color_palette("muted", 2), corner=True)

<seaborn.axisgrid.PairGrid at 0x7f8c489db400>

The same graphs are evaluated after the application of a log transform and the results show clear difference between the two groups.

Code

 # Boxplots of data when applying the transform
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(9, 5))
axes = axes.flat
columnas_numeric = new_data.select_dtypes(include=['float64', 'int']).columns

for i, colum in enumerate(numeric_cols):
    sns.boxplot(
        data    = data,
        y       = (transformer.transform(new_data[colum])),
        ax      = axes[i-2],
        x       = 'diagnosis',
    )
    axes[i-2].set_title(colum, fontsize = 7, fontweight = "bold")
    axes[i-2].tick_params(labelsize = 6)
    axes[i-2].set_xlabel("")
    
    
fig.tight_layout()
plt.subplots_adjust(top = 0.9)
fig.suptitle('Logarithm of numerical variables boxplot', fontsize = 10, fontweight = "bold");
#

Code

plotting_data = transformer.transform(new_data.select_dtypes(include=['float64', 'int']))
plotting_data = pd.concat([plotting_data.reset_index(drop=True), new_data.diagnosis], axis=1)
sns.pairplot(plotting_data, hue = 'diagnosis', markers=["o", "s"], palette= sns.color_palette("muted", 2), corner=True)

<seaborn.axisgrid.PairGrid at 0x7f8c3c8b9a00>

Code

y2class = new_data.diagnosis

6.1 Division of data

The new data will be divided into train, validation and test using the same method as before.

Code

X_train2, X_test2, y_train2, y_test2 = train_test_split(X,y2class, test_size = 0.33, random_state = 42,stratify=y2class)
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train2, y_train2, test_size=0.25, random_state=1)

6.2 Predictive models

The same models as before are validated, the tunning parameters are the ones set by default.

Code

key2 = ['LogisticRegression','KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier']
value2 = [LogisticRegression(),KNeighborsClassifier(weights ='distance'), SVC(random_state=42), DecisionTreeClassifier(random_state=42), RandomForestClassifier(random_state=42), GradientBoostingClassifier(random_state=42), AdaBoostClassifier()]
models2 = dict(zip(key2,value2))
models2

predicted2 =[]
for name,algo in models2.items():
    model=algo
    pipe = Pipeline([('preprocessing', preprocessor),  ('modelo', model)])
    pipe.fit(X=X_train2, y=y_train2)
    predict2 = pipe.predict(X_val2)
    acc = accuracy_score(y_val2, predict2)
    predicted2.append(acc)
    
    
plt.figure(figsize = (10,5))
sns.barplot(x = predicted2, y = key2, palette='muted')

<AxesSubplot:>

As expected, these models have significantly improved their results in this scenario.

Code

predictions2 = dict(zip(key2,predicted2))
print(predictions2)

{'LogisticRegression': 0.8383838383838383, 'KNeighborsClassifier': 0.8585858585858586, 'SVC': 0.8888888888888888, 'DecisionTreeClassifier': 0.8282828282828283, 'RandomForestClassifier': 0.8888888888888888, 'GradientBoostingClassifier': 0.8686868686868687, 'AdaBoostClassifier': 0.8484848484848485}

The best predictors are the SVC and the RandomForestClassfier. The chosen model to tune is the RF.

6.3 Tunning the chosen two class model

Random Forest Classifier

Code

from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
from sklearn.ensemble import RandomForestClassifier
import multiprocessing

# Se combinan los pasos de preprocesado y el modelo en un mismo pipeline.
pipe = Pipeline([('preprocessing', preprocessor),
                 ('modelo', RandomForestClassifier())])

# Hyperparameters optimization

param_distributions = {
    'modelo__n_estimators': [20, 30, 50, 100, 150, 200],
    'modelo__criterion': ["gini", "entropy"],
    'modelo__max_features': ["auto", 3, 5, 7],
    'modelo__max_depth'   : [3, 5, 10, 20]
}

# Random grid search
grid = RandomizedSearchCV(
        estimator  = pipe,
        param_distributions = param_distributions,
        n_iter     = 20,
        scoring    = 'accuracy',
        n_jobs     = multiprocessing.cpu_count() - 1,
        cv         = RepeatedKFold(n_splits = 5, n_repeats = 3),
        refit      = True, 
        verbose    = 2,
        random_state = 123,
        return_train_score = True
       )

grid.fit(X = X_train2, y = y_train2)

Fitting 15 folds for each of 20 candidates, totalling 300 fits

RandomizedSearchCV(cv=RepeatedKFold(n_repeats=3, n_splits=5, random_state=None),
                   estimator=Pipeline(steps=[('preprocessing',
                                              ColumnTransformer(remainder='passthrough',
                                                                transformers=[('cat',
                                                                               Pipeline(steps=[('imputer',
                                                                                                KNNImputer()),
                                                                                               ('onehot',
                                                                                                OneHotEncoder(handle_unknown='ignore'))]),
                                                                               ['sample_origin',
                                                                                'sex']),
                                                                              ('numeric',
                                                                               Pipeline(steps=[('transformer',
                                                                                                Functi...
                                                                                'creatinine',
                                                                                'LYVE1',
                                                                                'REG1B',
                                                                                'T001',
                                                                                'REG1A'])])),
                                             ('modelo',
                                              RandomForestClassifier())]),
                   n_iter=20, n_jobs=7,
                   param_distributions={'modelo__criterion': ['gini',
                                                              'entropy'],
                                        'modelo__max_depth': [3, 5, 10, 20],
                                        'modelo__max_features': ['auto', 3, 5,
                                                                 7],
                                        'modelo__n_estimators': [20, 30, 50,
                                                                 100, 150,
                                                                 200]},
                   random_state=123, return_train_score=True,
                   scoring='accuracy', verbose=2)

Code

#Parameters chosen
grid.best_params_

{'modelo__n_estimators': 200,
 'modelo__max_features': 5,
 'modelo__max_depth': 5,
 'modelo__criterion': 'gini'}

Code

#Check the validation again
new_val2 = grid.predict(X_val2)
print(accuracy_score(y_val2, new_val2))

0.8686868686868687

Code

#See the results in the test data
prediction = grid.predict(X_test2)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test2,prediction))
print(classification_report(y_test2,prediction))

[[120   9]
 [  8  58]]
              precision    recall  f1-score   support

           1       0.94      0.93      0.93       129
           2       0.87      0.88      0.87        66

    accuracy                           0.91       195
   macro avg       0.90      0.90      0.90       195
weighted avg       0.91      0.91      0.91       195

7. Conclusions

The tunned models show a high performance in both cases, nevertheless, due to the similarities between groups in the first case, a second analysis was performed in order to see if better predictions could be obtained.

The best model was the random forest classifier, which was built for only two classes. It uses the gini criterion, a maximum depth of 5 and 50 estimators. The accuracy obtained on the test set was 0,89.

These results show that the analysed biomarkers can accurately predict whether a patient has cancer or not.

Future lines of research can include the prediction of symptoms for benign tumors or the prediction of the stage of the cancer in the case of a positive diagnosis.