Bladder cancer
is one of the most common cancers
worldwide, with transitional cell carcinoma (TCC) being the predominant
form. It has being estimated 386,300 new cases and 150,200 deaths in the
coming years.
Previous studies showed that bladder cancer represents a
heterogeneous disease, with two distinct subtypes- superficial and
invasive-showing variable clinical presentation and genetic background
such us mutations in chromatin-remodeling genes
(UTX, MLL-MLL3, CREBBPEP300, NCOR1, ARID1A and CHD6
).
Nevertheless, the comprehension of bladder cancer alterations is far
from complete, and the key drivers of TCC tumor genesis remain poorly
understood.
The objective of this project is to compare sequence data variants between tumor and normal patient samples from
whole genome sequence data
(WGS).
For this, the fastaq files of 28 samples were downloaded, the genome was indexed and the corresponding BAMs were generated. The variant calling (vc) process was applied using the GATK (Genome Analysis Toolkit), the results of which were analyzed with Ensembl.
The bioinformatics pipeline
for a typical DNA sequencing
strategy involves aligning the raw sequence reads from a FASTQ or
unaligned BAM file against the human reference genome. The aligned
sequences and the related metadata are stored in a Sequence Alignment
Mapping (SAM/BAM) or CRAM file format. Downstream algorithms consume the
BAM file to identify a range of genetic alterations, including
single nucleotide
variants
,
insertions and deletions (indels)
, and
tumor mutation burden
. (AACC).
Different important paths for the realization of this pipeline are defined below: In pathSamples the directory in which the downloaded samples were saved is defined. On the other hand, pathGenome contains the directory to the fasta file of the GRCh38 reference human genome (hg38). For more information about this genome see NCBI. In path_gak the directory to GATK is saved, which is one of the most powerful and well-documented tools avaliable for calling variants. Finally, in pathOutput we will indicate the directory in which we will save our results.
<-"/datos/mgimeno/NGS/Samples"
pathSamples<-"/datos/mgimeno/NGS/FinalAssessment/hg38.fa"
pathGenome<-"/datos/mgimeno/NGS/FinalAssessment/DNA_analysis/"
pathOutput<- '/datos/mgimeno/NGS/FinalAssessment/gatk-4.1.6.0/gatk' path_gatk
Below is a table with the 28 genome sequence data (WGS) samples obtained from FTP. As indicated in column X17 and X24, for each “Bxx” patient there is a Cancer sample and a Normal sample, and the samples have paired-end reads (both ends of the fragments are sequenced which allows users to generate high-quality, alignable sequence data). Sequencing was obtained using the Illumina Genome Analyzer II instrument from the Illumina brand.
library(readr)
#selected<-read_delim("/datos/mgimeno/NGS/FinalAssessment/DNA_analysis/TablaFinal_DNA.csv",
# ";", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
<-read_delim("TablaFinal_DNA.csv",
selected";", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
<-selected[-29,] selected
library(DT)
::datatable(selected) DT
FASTA
: a text-based format to represent nucleotide or
protein sequences.FASTAQ
: An extension of the FASTA format is FASTQ
format. This format is designed to handle base quality metrics output
from sequencing machines. In this format, both the sequence and quality
scores are represented as single ASCII characters. The compressed paired
fastqs were downloaded using the axel command, which is equivalent to
wget but it can open more than just one connection. In many cases,
download using axel is faster than using wget.<-selected$X14
namesoutput
<- paste0("axel -n 8 -a http://ftp.ebi.ac.uk/vol1/fastq/",substring(namesoutput,1,6),"/",namesoutput,"/",namesoutput,"_1.fastq.gz .")
command1 <- paste0("axel -n 8 -a http://ftp.ebi.ac.uk/vol1/fastq/",substring(namesoutput,1,6),"/",namesoutput,"/",namesoutput,"_2.fastq.gz .")
command2
<- as.vector(rbind(command1,command2))
commandAll system(commandAll)
Where namesoutput is a list with the names of the samples appeared in column 14 of the previous table.
We have to build three indexes as follows…
<- paste0("samtools faidxx ",pathGenome) # Reference genome index
commandIndex system(commandIndex)
<- paste0("java -jar /datos/mgimeno/NGS/FinalAssessment/picard.jar CreateSequenceDictionary REFERENCE=",pathGenome) # Creates .dict
commandIndex2 system(commandIndex2)
<- paste0("bwa index ",pathGenome) # Creates .fai
commandIndex3 system(commandIndex3)
Picard
is a set of command line tools for manipulating high-throughput sequencing (HTS) data. Starting with version 4.0, gatk contains a copy of the Picard toolkit, so all Picard tools are available from within GATK itself.
The 28 samples were analyzed by different members of our team as follows :
<- 1:7
tomas <- 8:14
katyna <- 15:21
joseba <- 22:28
cris <- katyna # Define the samples to analyze samplesToAnalyze
First, folders for all samples were created.
for (i in samplesToAnalyze){
<- paste0('mkdir ', pathOutput, namesoutput[i])
commandFolders system(commandFolders)
}
The Sequence Alignment Map Format (SAM) is a tab-delimited text file that contains sequence alignment data. A BAM file (. bam) is the binary version of a SAM file, which takes up much less space.
In this step of the pipeline the .bam files were created from the
.sam files of the samples. - With the command commandSAM
we create the .sam file
from the fastaq
files.
- Read groups are aligned
to the reference genome using BWA
software (mean read length should be greater or equal to 70bp, otherwise
bwa aln must be used). The bwa mem can tag -and group- sets of reads
generated from a single run of a sequencing instrument. In our case, bwa
mem performs the mapping using 10 threads. - With commandBAM and
commandSAMremove we create the .bam files
and
remove the .sam
files, respectively.
for (i in samplesToAnalyze){
<-dir(path = pathSamples,pattern = namesoutput[i])
samples<-paste0("nice bwa mem -t 10 ", pathGenome, " ", pathSamples, '/',samples[1], " ", pathSamples, '/',samples[2], " > ",
commandSAM".sam " ) # Create .sam
pathOutput,namesoutput[i],system(commandSAM)
<-paste0("nice samtools view -@ 10 -S -b ", pathOutput,namesoutput[i],".sam > ", pathOutput,namesoutput[i],".bam") # Create .bam
commandBAM<-paste0("rm ", pathOutput,namesoutput[i],".sam") # Remove .sam (many GB)
commandSAMremovesystem(commandBAM)
system(commandSAMremove)
}
We then sort the .bam
files with samtools.
for (i in samplesToAnalyze){
<-paste0("nice samtools sort -@ 10 ",pathOutput,namesoutput[i],".bam ",pathOutput,namesoutput[i]," ") # SORT
commandSortsystem(commandSort)
}
The sorted .bam replaces the older ones, so there’s no need to remove them.
Finally, the index the .bam
files is created using
samtools.
for (i in samplesToAnalyze){
<-paste0("samtools index ", pathOutput,namesoutput[i], ".bam" ) # Creates .bai
commandBAIsystem(commandBAI)
}
Adding read group information. It
assigns all the reads in a file to a single new read-group
.
for (i in samplesToAnalyze){
<-paste0("java -jar /datos/mgimeno/NGS/FinalAssessment/picard.jar AddOrReplaceReadGroups I=",
commandReadGroup".bam O=", pathOutput,namesoutput[i], ".rg.bam SO=coordinate RGID=",
pathOutput,namesoutput[i], " RGLB=LB1 RGPL=Illumina RGPU=unit2 RGSM=", pathOutput,namesoutput[i])
pathOutput,namesoutput[i],system(commandReadGroup)
<-paste0("samtools index ", pathOutput,namesoutput[i], ".rg.bam" )
commandReadGroupIndexsystem(commandReadGroupIndex)
}
This step locates and tags duplicate
reads in our BAM
file. Duplicates can arise during sample preparation e.g. library
construction using PCR and can also result from a single amplification
cluster, incorrectly detected as multiple clusters by the optical sensor
of the sequencing instrument.
MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.
for (i in samplesToAnalyze){
<-paste0("java -jar /datos/mgimeno/NGS/FinalAssessment/picard.jar MarkDuplicates I=",
commandMarkDup".rg.bam O=", pathOutput,namesoutput[i], ".dupmark.bam METRICS_FILE=",
pathOutput,namesoutput[i], ".dupmark.stats REMOVE_DUPLICATES=FALSE")
pathOutput,namesoutput[i],system(commandMarkDup)
}
Folders permissions are given so that all users can access the files…
for (i in samplesToAnalyze){
<- paste0('mv ', pathOutput,'*',namesoutput[i],'* ', pathOutput,namesoutput[i]) # Move files to their folders (if they are not already in them)
commandMove system(commandMove)
<- paste0('chmod a+rwx ', pathOutput,namesoutput[i])
commandPermissions1 system(commandPermissions1)
<- paste0('chmod a+rwx ', pathOutput,namesoutput[i],'/',namesoutput[i],'*')
commandPermissions2 system(commandPermissions2)
}
Variant calling is the process by which we
identify variants from sequence data
. It can be a somatic
vc (where the reference is a related tissue from the same individual) or
a germline vc (where the reference genome is the standard for the
species of interest). Germline vc allows us to identify genotypes.
As most genomes are diploid, we expect to see that at any given locus, either all reads have the same base, indicating homozygosity, or approximately half of all reads have one base and half have another, indicating heterozygosity. In this pipeline we will be using HaplotypeCaller, which is a straightforward technique that identifies where the aligned reads (sample) differ from the reference genome and write to a vcf file.
system('sudo update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64') # Change the java version (not necessary)
for (i in samplesToAnalyze){
<-paste0(pathOutput,namesoutput[i])
path_dup<-paste0(path_gatk, ' HaplotypeCaller ',
command_vcf'--native-pair-hmm-threads 4 ',
'-I ', path_dup,'/',namesoutput[i],'.dupmark.bam ',
'-R ', pathGenome, ' ',
'-O ', path_dup,'/',namesoutput[i],'_snps_indels_g.vcf')
system(command_vcf)
}
As an input argument the
.bam files with marked duplicates
and the reference genome
were introduced, and as output a vcf file was generated using
HaplotypeCaller. Four threads were used for this task.
Through this process systematic errors in base
quality scores
are detected and a recalibration table is
generated based on various user-specified covariates (such as read
group, reported quality score, machine cycle, and nucleotide
context).
As an input argument the .bam files with marked duplicates, the reference genome and the regions of interest through the vcf file were introduced, and a table with the results of the calibration was generated as an output.
for (i in samplesToAnalyze){
<-paste0(path_gatk, ' BaseRecalibrator ',
command_vcfreports '-R ', pathGenome, ' ',
'-I ', path_dup,'/',namesoutput[i],'.dupmark.bam ',
'--known-sites ', path_dup,'/',namesoutput[i],'_snps_indels_g.vcf ',
'-O ', path_dup,'/',namesoutput[i],'_report.table')
system(command_vcfreports)
}
#install.packages("gsalib")
library(gsalib)
for (i in samplesToAnalyze){
<- paste0(namesoutput[i],'report <- gsa.read.gatkreport("',path_dup,'/',namesoutput[i],'_report.table','")')
commandVal eval(parse(text=(commandVal)))
}
Base quality score recalibration (BQSR) is a process in which we
apply machine learning to model quality score errors
empirically and adjust the quality scores accordingly. This allows us to
get more accurate base qualities, which in turn improves the accuracy of
our variant calls.
For each of the samples we introduce as input the same .bam files, and the report table of the previous vcf, and we obtain as output the recalibrated .bam files.
for (i in samplesToAnalyze){
<- paste0(path_gatk, ' ApplyBQSR ',
commandBSRecal '-I ', path_dup,'/',namesoutput[i],'.dupmark.bam ',
'--bqsr ', path_dup,'/',namesoutput[i],'_report.table ',
'-O ', path_dup,'/',namesoutput[i],'_output_recalibrated.bam')
system(commandBSRecal)
}
Next, a new HaplotypeCaller is performed to identify where the aligned reads differ from the reference genome using the recalibrated .bams and we again obtain the corresponding vcf files.
for (i in samplesToAnalyze){
<- paste0(path_gatk, ' HaplotypeCaller ',
commandRecal '-I ', path_dup,'/',namesoutput[i],'_output_recalibrated.bam ',
'-R ', pathGenome, ' ',
'-O ', path_dup,'/',namesoutput[i],'_snps_indels_g_rec.vcf')
system(commandRecal)
}
for (i in samplesToAnalyze){
<- paste0(path_gatk, ' BaseRecalibrator ',
commandRecalReport '-I ', path_dup,'/',namesoutput[i],'_output_recalibrated.bam ',
'-R ', pathGenome, ' ',
'--known-sites ', path_dup,'/',namesoutput[i],'_snps_indels_g.vcf ', #TENEMOS QUE VER PORQUE USA EL VCF ORIGINAL Y NO EL REC
'-O ', path_dup,'/',namesoutput[i],'_recalibrated_report.table')
system(commandRecalReport)
}
As shown on the next chunk, gatk AnalyzeCovariates creates a pdf report with the recalibration result and statidistics.
for (i in samplesToAnalyze){
<- paste0(path_gatk, ' AnalyzeCovariates ',
commandRecalReport '--before ', path_dup,'/',namesoutput[i],'_report.table ',
'--after ', path_dup,'/',namesoutput[i],'_recalibrated_report.table ',
'--plots ', path_dup,'/',namesoutput[i],'_recalibrated_report.pdf')
system(commandRecalReport)
}
system('sudo update-java-alternatives --set /usr/lib/jvm/java-1.11.0-openjdk-amd64') # Go back to the standard version of java