SampleSeq2 Instructions *************** I. Introduction *************** SampleSeq2 requires a squared matrix of pairwise distance among all subjects that are candidates for selection. The matrix can be generated by PLINK using GWAS data or by Idcoefs using pedigree information. These are explained in detail in Sections II and III. In Section IV, we describe how to use SampleSeq2 for subject selection. In Section V, we describe how to estimate the number of independent genomes in a subset of subjects. ****************************** II. IBD estimation using Plink ****************************** Plink (http://pngu.mgh.harvard.edu/~purcell/plink/) is a program written by Purcell. Here is a procedure we used when using Plink to calculate IBD sharing. 1. Remove all nonautosomal SNPs from .bim file, which is supposed to have been through quality control filtering. Example code: awk '{if($1>22) print}' .bim > nonautosomal.SNPs plink --bfile --exclude nonautosomal.SNPs --make-bed --out 2. Prune SNPs. There are two approaches. The simpler one is to randomly choose 5% of the SNPs (which is probably almost as good as LD pruning); example code: plink --bfile --thin 0.05 --make-bed --out Alternatively, you can LD-prune. The parameters for the pruning are arbitrary. We have found that 50K autosomal SNPs give almost exactly the same estimate of IBD sharing as 1M SNPs, although these observations are anecdotal. Also this affects run time greatly. Example code: plink --bfile --indep-pairwise 50 5 0.1 --make-bed --out 3. Estimate pairwise IBD. Example code (output file gzipped): plink --bfile --Z-genome --out 4. Extract IBD sharing information. Example code: zcat .genome.gz | awk '{print $2, $4, $10}' > ************************************** III. Kinship calculation using Idcoefs ************************************** Idcoefs (http://code.google.com/p/idcoefs/) is a program written by Abney and described in the following paper. Abney M. 2009. A graphical algorithm for fast computation of identity coefficients and generalized kinship coefficients. Bioinformatics 25(12):1561-3. Here are the steps we used to calculation kinship coefficient: 0. Prepare input files. Idcoefs requires two input files. One is a pedigree file with three columns: subject ID, father ID, mother ID. (Extra columns will be ignored.) The other is a study file, which consists of a list of subject IDs for which pairwise kinship coefficent will be calculated. Please consult Idcoefs documentations for details. 1. Run idcoefs to get 9 condensed identity coefficients for each subject pair. Example code: idcoefs -p file.ped -s file.study -o file.out -r 5000 2. Calculate the kinship coefficient for each pair. Example code: awk '{kk=$3+($5+$7+$9)*.5+$10*.25; print $1,$2,1-2*kk}' file.out > ******************************************* IV. Calling SampleSeq2() to select subjects ******************************************* 1. Prepare a matrix of pairwise IBD sharing. Example R code (assuming subject IDs are 1-N): ibd = read.table("IBDfile", sep=" ") MIBD = matrix(0, N, N) for(i in 1:N) { aa = ibd[i,1]; bb = ibd[i,2]; cc = ibd[i,3] MIBD[aa, bb] = MIBD[bb, aa] = cc } ## Set distance with self to zero diag(MIBD) = 0 2. Call SampleSeq2(). SampleSeq2 = function(M, K, N = NULL, REPL = 10) Arguments: M: A squared matrix of pairwise distance with zero diagonal. K: The number of subjects to be selected, or a vector of the numbers of subjects to be selected in subgroups. N: A vector of the numbers of subjects in subgroups. Default is NULL, indicating there is a single group. If not NULL, sum(N) should be the number of rows and columns in M, length(K) should equal length(N), and K should be less than N for all subgroups. REPL: number of repeated runs. Default is 10. Details: When there are multiple subgroups, the subjects should be ordered when creating the matrix M, so that the N[1] subjects in subgroup 1 are followed by the N[2] subjects in subgroup 2, which are followed by the N[3] subjects in subgroup 3, etc. Values: maxtpd: total pairwise distance (TPD) of the selected subjects idx: indices for the selected subjects ****************************************************************** V. Calling GT() to estimate the total number of indepedent genomes ****************************************************************** 1. Prepare the matrix of pairwise IBD sharing for the selected subjects. Example R code: aa = SampleSeq2(M, K) Mselect = M[aa$idx, aa$idx] 2. Call GT(), which has one argument. GT = function(Mselect) Argument: Mselect: a squared matrix of pairwise distance for the selected subjects Value: A number of estimated total number of indepedent genomes among the selected subjects.