SampleSeq Instructions


***************
I. Introduction
***************

SampleSeq is a probability-based algorithm for selecting samples for a
targeted resequencing experiment.  It can

1. Increase the yield of rare disease alleles substantially over
random sampling of cases or controls.

2. Allow for smaller sample sizes for resequencing experiments.

3. Allow the capture of rarer risk alleles.

4. Select subjects with an even representation for multiple regions.

5. Be used to calculate the sample size needed for the resequencing.

The paper is:

Edwards TL, Song Z, Li C (2011) Enriching targeted sequencing
experiments for rare disease alleles. Bioinformatics 27:2112-2118
(PMID 21700677; PMC3137214)


***************
II. Requirement
***************

The program requires Perl and its Math-MatrixReal library, and the
statistical software R.


*************************
III. Files in the package
*************************

SampleSeq.pl          (a Perl program)
sampleseq.r           (an R program called by the Perl program)
SampleSeq.README      (this file)
example/              (example files; see below for details)


**************
IV. Input file
**************

The input file should contain genotypes at associated SNPs in regions
where targeted resequencing will be performed.  Each SNP represents a
region, and each region has only one SNP in the input file.  The SNP
genotype will be used for calculating the expected number of disease
variants in the region a subject carries.

The input file should be in PED format
(http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped).  It is a
white-space (space or tab) delimited file without header row.  The
first 6 columns are mandatory:

Family ID
Sample ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype (1=case, 0=control)

Start from column 7 are genotypes.  They can be any characters
(e.g. 1,2,3,4 or A,C,G,T or anything else) except 0, which is by
default the missing genotype character.  All markers should be
biallelic.  All SNPs (whether haploid or not) must have two alleles
specified.  Either both alleles are missing (i.e. 0) or neither.

For SampleSeq, different SNPs can use different character sets.  For
example, the following is allowed:

FAM001  1  0 0  1  2  A G  1 1  G C 
FAM002  2  0 0  1  2  A A  1 2  0 0 
...


********************
V. Running SampleSeq
********************

There are 7 parameters for the program:

input       (string)                  Input file
nloci       (integer)                 Number of regions (number of SNPs)
riskallele  (comma separated string)  Disease associated alleles
variantMAF  (comma separated string)  Projected rare variant MAFs
prevalence  (decimal number)          Disease prevalence
nselect     (integer)                 Number of subjects to be selected
ntotalvar   (integer)                 Number of total rare variants per region

For example, suppose we follow up three regions for a disease with
prevalence 0.10.  The associated SNPs have risk alleles coded as a, G,
C, respectively, and the projected rare disease variant MAFs are
0.01, 0.01, 0.005, respectively.  We wish to select 100 subjects, with
each region having at least 10 copies of the rare disease variants in
the selected subjects.  The command will be:

perl SampleSeq.pl -input input.ped -nloci 3 -riskalleles a,G,C \
     -varaintMAF 0.01,0.01,0.005 -prevalence 0.1 -nselect 100 \
     -ntotalvar 10


To save the output, redirect the output by adding "> filename" to the
end of command:

perl SampleSeq.pl -input input.ped -nloci 3 -riskalleles a,G,C \
     -varaintMAF 0.01,0.01,0.005 -prevalence 0.1 -nselect 100 \
     -ntotalvar 10 > output.log


****************
VI. Output files
****************

1) expected.txt contains the following information for each subject:

familyID
sampleID
isCase	1=case, 0=control
a_sum	total number of risk alleles over all regions (- for missing)
ED_sum	total expected number of rare variants over all regions
a_1	number of risk alleles in region 1 (- for missing)
ED_1	number of expected rare variants in region 1
etc.

The program further processes the information in expected.txt using
the R script.

2) topxxID.txt (where xx is replaced by the -nselect number) contains
the IDs for the selected subjects and their ED_sum values.


******************
VII. Example files
******************

The example/ directory contains an input file, an example command
file, and three output files:

example/examplecommand.txt  (example command)
example/exampleinput.txt    (example input file)
example/expected.txt        (example intermediate output file)
example/top50ID.txt         (example output of selected subjects)
example/sampleseq.log       (example output of log)

The following command was used to generate these output files:

perl ../SampleSeq.pl -input exampleinput.txt -nloci 3 -riskallele 1,1,1 \
  -variantMAF 0.01,0.01,0.01 -prevalence 0.02 -nselect 50 -ntotalvar 15 \
  > sampleseq.log