SampleSeq2 Instructions


***************
I. Introduction
***************

SampleSeq2 requires a squared matrix of pairwise distance among all
subjects that are candidates for selection.  The matrix can be
generated by PLINK using GWAS data or by Idcoefs using pedigree
information.  These are explained in detail in Sections II and III.
In Section IV, we describe how to use SampleSeq2 for subject
selection.  In Section V, we describe how to estimate the number of
independent genomes in a subset of subjects.


******************************
II. IBD estimation using Plink
******************************

Plink (http://pngu.mgh.harvard.edu/~purcell/plink/) is a program
written by Purcell.

Here is a procedure we used when using Plink to calculate IBD sharing.

1. Remove all nonautosomal SNPs from .bim file, which is supposed to
   have been through quality control filtering.  Example code:

awk '{if($1>22) print}' <file>.bim > nonautosomal.SNPs

plink --bfile <file> --exclude nonautosomal.SNPs --make-bed --out <file2>


2. Prune SNPs.  There are two approaches.  The simpler one is to
   randomly choose 5% of the SNPs (which is probably almost as good as LD
   pruning); example code:

plink --bfile <file2> --thin 0.05 --make-bed --out <file3>

   Alternatively, you can LD-prune.  The parameters for the pruning are
   arbitrary.  We have found that 50K autosomal SNPs give almost exactly
   the same estimate of IBD sharing as 1M SNPs, although these
   observations are anecdotal.  Also this affects run time greatly.
   Example code:

plink --bfile <file2> --indep-pairwise 50 5 0.1 --make-bed --out <file3>


3. Estimate pairwise IBD.  Example code (output file gzipped):

plink --bfile <file3> --Z-genome --out <outfile>


4. Extract IBD sharing information.  Example code:

zcat <outfile>.genome.gz | awk '{print $2, $4, $10}' > <IBDfile>


**************************************
III. Kinship calculation using Idcoefs
**************************************

Idcoefs (http://code.google.com/p/idcoefs/) is a program written by
Abney and described in the following paper.

Abney M. 2009. A graphical algorithm for fast computation of identity
coefficients and generalized kinship coefficients. Bioinformatics
25(12):1561-3.

Here are the steps we used to calculation kinship coefficient:

0. Prepare input files.  Idcoefs requires two input files.  One is a
   pedigree file with three columns: subject ID, father ID, mother
   ID. (Extra columns will be ignored.)  The other is a study file, which
   consists of a list of subject IDs for which pairwise kinship
   coefficent will be calculated.  Please consult Idcoefs documentations
   for details.


1. Run idcoefs to get 9 condensed identity coefficients for each
   subject pair.  Example code:

idcoefs -p file.ped -s file.study -o file.out -r 5000


2. Calculate the kinship coefficient for each pair.  Example code:

awk '{kk=$3+($5+$7+$9)*.5+$10*.25; print $1,$2,1-2*kk}' file.out > <IBDfile>


*******************************************
IV. Calling SampleSeq2() to select subjects
*******************************************

1. Prepare a matrix of pairwise IBD sharing.  Example R code (assuming
   subject IDs are 1-N):

ibd = read.table("IBDfile", sep=" ")
MIBD = matrix(0, N, N)
for(i in 1:N) {
  aa = ibd[i,1]; bb = ibd[i,2]; cc = ibd[i,3]
  MIBD[aa, bb] = MIBD[bb, aa] = cc
}

## Set distance with self to zero
diag(MIBD) = 0


2. Call SampleSeq2().

SampleSeq2 = function(M, K, N = NULL, REPL = 10)

Arguments:

  M:    A squared matrix of pairwise distance with zero diagonal.
  K:    The number of subjects to be selected, or a vector of the
        numbers of subjects to be selected in subgroups.
  N:    A vector of the numbers of subjects in subgroups.  Default
	is NULL, indicating there is a single group.  If not NULL,
	sum(N) should be the number of rows and columns in M,
	length(K) should equal length(N), and K should be less than N
	for all subgroups. 
  REPL: number of repeated runs.  Default is 10.

Details:

  When there are multiple subgroups, the subjects should be ordered
  when creating the matrix M, so that the N[1] subjects in subgroup 1
  are followed by the N[2] subjects in subgroup 2, which are followed
  by the N[3] subjects in subgroup 3, etc.

Values: 

  maxtpd: total pairwise distance (TPD) of the selected subjects
  idx:    indices for the selected subjects


******************************************************************
V. Calling GT() to estimate the total number of indepedent genomes
******************************************************************

1. Prepare the matrix of pairwise IBD sharing for the selected
   subjects.  Example R code:

aa = SampleSeq2(M, K)
Mselect = M[aa$idx, aa$idx]


2. Call GT(), which has one argument.

GT = function(Mselect)

Argument:

  Mselect:  a squared matrix of pairwise distance for the selected subjects

Value:

  A number of estimated total number of indepedent genomes among the
  selected subjects.