SampleSeq Instructions *************** I. Introduction *************** SampleSeq is a probability-based algorithm for selecting samples for a targeted resequencing experiment. It can 1. Increase the yield of rare disease alleles substantially over random sampling of cases or controls. 2. Allow for smaller sample sizes for resequencing experiments. 3. Allow the capture of rarer risk alleles. 4. Select subjects with an even representation for multiple regions. 5. Be used to calculate the sample size needed for the resequencing. The paper is: Edwards TL, Song Z, Li C (2011) Enriching targeted sequencing experiments for rare disease alleles. Bioinformatics 27:2112-2118 (PMID 21700677; PMC3137214) *************** II. Requirement *************** The program requires Perl and its Math-MatrixReal library, and the statistical software R. ************************* III. Files in the package ************************* SampleSeq.pl (a Perl program) sampleseq.r (an R program called by the Perl program) SampleSeq.README (this file) example/ (example files; see below for details) ************** IV. Input file ************** The input file should contain genotypes at associated SNPs in regions where targeted resequencing will be performed. Each SNP represents a region, and each region has only one SNP in the input file. The SNP genotype will be used for calculating the expected number of disease variants in the region a subject carries. The input file should be in PED format (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped). It is a white-space (space or tab) delimited file without header row. The first 6 columns are mandatory: Family ID Sample ID Paternal ID Maternal ID Sex (1=male; 2=female; other=unknown) Phenotype (1=case, 0=control) Start from column 7 are genotypes. They can be any characters (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0, which is by default the missing genotype character. All markers should be biallelic. All SNPs (whether haploid or not) must have two alleles specified. Either both alleles are missing (i.e. 0) or neither. For SampleSeq, different SNPs can use different character sets. For example, the following is allowed: FAM001 1 0 0 1 2 A G 1 1 G C FAM002 2 0 0 1 2 A A 1 2 0 0 ... ******************** V. Running SampleSeq ******************** There are 7 parameters for the program: input (string) Input file nloci (integer) Number of regions (number of SNPs) riskallele (comma separated string) Disease associated alleles variantMAF (comma separated string) Projected rare variant MAFs prevalence (decimal number) Disease prevalence nselect (integer) Number of subjects to be selected ntotalvar (integer) Number of total rare variants per region For example, suppose we follow up three regions for a disease with prevalence 0.10. The associated SNPs have risk alleles coded as a, G, C, respectively, and the projected rare disease variant MAFs are 0.01, 0.01, 0.005, respectively. We wish to select 100 subjects, with each region having at least 10 copies of the rare disease variants in the selected subjects. The command will be: perl SampleSeq.pl -input input.ped -nloci 3 -riskalleles a,G,C \ -varaintMAF 0.01,0.01,0.005 -prevalence 0.1 -nselect 100 \ -ntotalvar 10 To save the output, redirect the output by adding "> filename" to the end of command: perl SampleSeq.pl -input input.ped -nloci 3 -riskalleles a,G,C \ -varaintMAF 0.01,0.01,0.005 -prevalence 0.1 -nselect 100 \ -ntotalvar 10 > output.log **************** VI. Output files **************** 1) expected.txt contains the following information for each subject: familyID sampleID isCase 1=case, 0=control a_sum total number of risk alleles over all regions (- for missing) ED_sum total expected number of rare variants over all regions a_1 number of risk alleles in region 1 (- for missing) ED_1 number of expected rare variants in region 1 etc. The program further processes the information in expected.txt using the R script. 2) topxxID.txt (where xx is replaced by the -nselect number) contains the IDs for the selected subjects and their ED_sum values. ****************** VII. Example files ****************** The example/ directory contains an input file, an example command file, and three output files: example/examplecommand.txt (example command) example/exampleinput.txt (example input file) example/expected.txt (example intermediate output file) example/top50ID.txt (example output of selected subjects) example/sampleseq.log (example output of log) The following command was used to generate these output files: perl ../SampleSeq.pl -input exampleinput.txt -nloci 3 -riskallele 1,1,1 \ -variantMAF 0.01,0.01,0.01 -prevalence 0.02 -nselect 50 -ntotalvar 15 \ > sampleseq.log