tar zxf SMUG.tgz
.
Detection of base substitutions: For each patient, we screen the genome for sites that are homozygous in the normal sample and have variants in the tumor sample. To adjust for depth variation, an empirical Bayes (EB) estimate of tumor variant rate is calculated. To minimize the effect of sequencing/alignment errors, the tumor EB-adjusted rate is further adjusted using the normal sample variant rate as baseline. (A site may be called homozygous with a high GQ in the normal sample and heterozygous with a high GQ in the tumor sample when both samples appear to carry similar fractions of variant alleles.)
The results will be consolidated across patients. The user can control how much to output. For example, the user may choose to output sites with mutants in at least two patients. Please see below for details.
Detection of LOH: For each patient, we screen the genome for sites that are heterozygous in the normal sample and have a significant departure from 50:50 in allele counts in the tumor sample. An LOH score is calculated for such a site. An LOH region is defined to be a region containing contiguous sites with high LOH scores. The user can specify the criteria for defining an LOH region. Please see below for details.
Please follow the steps blow to download and build GATK and SMUG.
mkdir GATK git clone git://github.com/broadgsa/gatk.git GATK
wget http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SMUG/SMUG.tgz tar xzf SMUG.tgz
cp -r SMUG/SMUGwalkers/ GATK/public/java/src/org/broadinstitute/sting/gatk/walkers/
cd GATK git checkout 1.6-2-gc2b74ec ant clean ant
git checkout
command ensures the posted version of SMUG can run successfully. To update GATK if you already have a clone of GATK:
cd GATK git pullIf you experience problems running GATK after updating via
git pull
,
try cleaning your clone and recompiling GATK following step 5 above.
NOTE: A walker needs to be run separately for each patient.
java -jar GATK/dist/GenomeAnalysisTK.jar -T SMUG -h java -jar GATK/dist/GenomeAnalysisTK.jar -T SMUG_BaseSub -h java -jar GATK/dist/GenomeAnalysisTK.jar -T SMUG_LOH -h
-R,--reference_sequence reference sequence file -I,--input_file list of SAM or BAM files (see below) -D,--dbsnp dbSNP file; optional but recommended -o,--out log file; default to screen -smT,--SampleName_Tumor sample ID of tumor tissue; it is the SN tag in sam/bam files -smN,--SampleName_Normal sample ID of normal tissue; it is the SN tag in sam/bam files -typeT,--SampleType_Tumor sample type of tumor tissue, user defined name for tumor sample, default = 'Tumor' -typeN,--SampleType_Normal sample type of normal tissue, user defined name for normal sample, default = 'Normal' -minBQ,--minBaseQ minimum base quality, default = 20 -minMQ,--minMappingQ minimum mapping quality, default = 20 -minAllN,--minAllAlleles_Normal minimum base count in Normal sample after filtering by BQ and MQ, default = 10 -minAllT,--minAllAlleles_Tumor minimum base count in Tumor sample after filtering by BQ and MQ, default = 5 -minRestT,--minMinorAlleles_Tumor minimum total base count for the minor alleles in Tumor sample after filtering by BQ and MQ, default = 2; Not applicable to the SMUG_LOH walker -minHRT,--minHeteroRate_Tumor minimun fraction of minor alleles in Tumor sample (RestT/AllT), default = 0.1; Not applicable to the SMUG_LOH walker -stand_call_conf minimum confidence for genotype calling (genotyping quality, GQ), default = 30In addition, some arguments of GATK UnifiedGenotypers can also be specified (please consult GATK documentations for details). For example, if the user wants to run our walkers only on a few regions, the
-L
option can be
used:
-L,--intervals <intervals>Detailed explanations of the options:
-D
are required, but most of them have default values if unspecified. Although -D
is optional, it is highly recommended.
-I
option contains the path to bam/sam files, one file per line. The order of files is not important as long as there are data matching normal and tumor sample IDs specified by the options -smN
and -smT
.
AlignmentContextUtils.splitContextBySampleName(context, sampleID)
-typeN
and -typeT
are defined by the user, the default values are "Normal" and "Tumor". They can be any string without space to help the user to describe the data. They will be used as part of the output file names to easy identification of files.
java -jar GATK/dist/GenomeAnalysisTK.jar -T SMUG \ -R human_b36_male.fa \ -D:VCF dbsnp_132.b36.vcf \ -I normal_and_tumor_samples_bam.list \ -o SMUG.txt \ -smN normal_sample_ID \ -smT tumor_sample_ID \ -typeN Normal \ -typeT Tumor \ -minBQ 20 \ -minMQ 20 \ -minAllN 10 \ -minAllT 10 \ -minRestT 2 \ -minHRT 0.01
-o
contains the information on
parameter values, result files, and number of loci reported. If the
user does not specify -o
, the information will be printed to the
screen. Here is an example log file:
The SMUG_BaseSub walker will generate result files for further analysis by the SMUG_BaseSub.pl script (see section III). The file names will follow the following pattern:
Here is an example result file of the SMUG_BaseSub walker:
loc chr:position rsID rs# or '.' ref reference allele gt_N normal genotype: '.' no call, '%' no variation, 'N/N' (N=A,C,G,T), * means ref gt_T tumor genotype gtQ_N normal genotype quality, -10 means '.', -100 means '%' gtQ_T tumor genotype quality, -10 means '.', -100 means '%' N_hR normal variant rate T_hR tumor variant rate N_hR_C normal variant rate in 'variant count/all allele count' format T_hR_C tumor variant rate in 'variant count/all allele count' format N_aC normal sample base counts: a;c;g;t T_aC tumor sample base counts: a;c;g;t N_+- normal sample base counts by strand: a+,a-;c+,c-;g+,g-;t+,t- T_+- tumor sample base counts by strand: a+,a-;c+,c-;g+,g-;t+,t- N_filtered normal sample read counts with format sampleID(sampleType):rawReads->fillteredReads=(reads+,reads-); for example: 5QC011(Normal):94->92=(57+,35-), meaning sample '5QC011' has 94 reads covering the site, 92 passing MQ/BQ filtering (57 on '+', 35 on '-' strand) T_filtered tumor sample read counts
Note that some columns may have information overlap.
The SMUG_LOH walker will generate result files for further analysis by the SMUG_LOH.pl script (see section IV). The file names will follow the following pattern:
Here is an example result file of the SMUG_LOH walker:
It has 16 columns:
loc chr:position rsID rs# or '.' ref reference allele gt_N normal genotype: '.' no call, '%' no variation, 'N/N' (N=A,C,G,T), * means ref gt_T tumor genotype gtQ_N normal genotype quality, -10 means '.', -100 means '%' gtQ_T tumor genotype quality, -10 means '.', -100 means '%' LOH_Score LOH score for this site N_aC normal sample base counts: a;c;g;t T_aC tumor sample base counts: a;c;g;t T_Major_aC major allele counts in tumor sample T_Minor_aC minor allele counts in tumor sample N_+- normal sample base counts by strand: a+,a-;c+,c-;g+,g-;t+,t- T_+- tumor sample base counts by strand: a+,a-;c+,c-;g+,g-;t+,t- N_filtered normal sample read counts with format sampleID(sampleType):rawReads->fillteredReads=(reads+,reads-); for example: 5QC011(Normal):94->92=(57+,35-), meaning sample '5QC011' has 94 reads covering the site, 92 passing MQ/BQ filtering (57 on '+', 35 on '-' strand) T_filtered tumor sample read counts
-L file containing names of SMUG_BaseSub output files, one on each line -restEB minimum number of alternative alleles for EB parameter estimation, default = 3 -allEB minimum depth for EB parameter estimation, default = 10 -nVote minimum number of patients carrying base substitution, default = 2 -restT minimum number of alternative alleles in Tumor for reporting, default = 2 -allT minimum depth in Tumor for reporting, default = 10 -maxT_hR maximum tumor variant rate, default = 0.9The
-maxT_hR
option is for filtering out sites with very high tumor variants
rate. These are likely due to alignment artifacts rather than real
mutations.
The -restEB
and -allEB
options are for estimating the EB-parameters
alpha and beta. We have found the default values work well for our data.
perl SMUG_BaseSub.pl -L BaseSub.list \ -restEB 3 -allEB 10 \ -nVote 2 \ -restT 2 -allT 10
There is a main output file (one file for all patients) with file name following the pattern:
Here is an example result file:
It has 8 columns plus one column for each patient:
loc chr:position rsID rs# or '.' samples No. of samples supporting this site as somatic mutated sumDiff sum of differences between empirical Bayes adjusted tumor and normal variation rates sum% sum of percentage of top positions of supporting samples meanDiff mean of differences between empirical Bayes adjusted tumor and normal variation rates, = sumDiff/samples mean% sum of percentage of top positions of supporting samples, = meanDiff/samples gt.N->gt.T genotype of normal -> genotype of tumor, could be multiple pairs and sperated with ';' P1.N->T patient 1: a,c,g,t of Normal -> a,c,g,t of Tumor; counts are after MQ and BQ filtering; if the sample doesn't support the site as somatic mutated, then '-' ... additional patients
In addition, there are three intermediate files for each patient. Their file names follow the patterns:
-f a SMUG_LOH output file (*.ToAnalyzeLOH.txt) -minLOH minimum LOH score for LOH region detection, default = 30 -minSNPs minimum number of continuous SNPs required to define a LOH region, default = 2
perl SMUG_LOH.region.pl -f 1055QC0011.Normal.1055QC0031.Tumor.ToAnalyzeLOH.txt \ -minLOH 30.0 \ -minSNPs 2
There is one output file for each patient, with file name following the pattern:
Here is an example result file:
It has 3 columns:
region chr:startPosition-endPosition nSNPs Number of contiguous sites supporting the region regionLength length of region in bp