Cross-Validation

TODO

  • Refactor Distance calculation.
    • Swap loops so the cross-validation (kFold) loop is inside the TopNManager (c) loop.
      • Separate cross-validation Scores by fold before the c loop.
      • Setup cross-validation DataGroupings by fold before the c loop.
    • Modify summaries to handle the ArrayList of DistanceResults.
  • Add testing DataSet and DataGrouping.
    • DistanceRun parameters.
    • TopNManager will use all winners for each N and cross-validation fold.
    • CutoffManager will use winners that appear in at least 80% of the cross-validation folds.

Score Generation

Generate the K sets of scores (one per fold) as normal. For the normal situation there will be 7 scores (12 columns): sam, asam, info, tvalue (ttest), prob_t (ttest), wga, cvalue (wilcoxon), prob_c (wilcoxon), fvalue (fisher), prob_f (fisher), dvalue (ks), prob_d (ks). Name each column scoreName--kn, where n = 1 to K; if K > 9, pad n with leading 0's (see the sketch after the examples below).

Ex:
  • sam--k01, wga--k01, etc
  • sam--k02, wga--k02, etc
  • ...
  • sam--k12, wga--k12, etc
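
A minimal sketch of the naming rule, assuming a hypothetical helper (ScoreColumnNames.columnName) rather than the program's real naming code:

// Hypothetical helper illustrating the scoreName--kn convention described above.
public class ScoreColumnNames {
    /** Builds e.g. "sam--k01" for fold n of K, zero-padding n when K > 9. */
    public static String columnName(String scoreName, int n, int K) {
        int width = String.valueOf(K).length();          // 1 digit for K <= 9, 2 for K <= 99, ...
        return String.format("%s--k%0" + width + "d", scoreName, n);
    }

    public static void main(String[] args) {
        System.out.println(columnName("sam", 1, 12));    // sam--k01
        System.out.println(columnName("wga", 12, 12));   // wga--k12
        System.out.println(columnName("info", 3, 7));    // info--k3
    }
}

Deriving the pad width from K keeps the column names sortable for any fold count.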

Stratification

When dividing the data into folds before generating scores, it is a good idea to make sure a proportionate number of columns from each group is placed into each fold. Currently, stratification only occurs on a binary level according to the group (1,2) used. In some cases, more categories might be needed.
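
A minimal sketch of stratified fold assignment, assuming a hypothetical representation in which groupOf[i] holds the group label of column i; the real DataSet and DataGrouping classes are not shown here:

import java.util.*;

// Hypothetical sketch of stratified fold assignment: columns from each group
// are dealt round-robin into folds so every fold gets a proportionate share.
public class StratifiedFolds {
    /** groupOf[i] is the group label (e.g. 1 or 2) of column i; returns foldOf[i] in [0, kFold). */
    public static int[] assignFolds(int[] groupOf, int kFold) {
        Map<Integer, List<Integer>> byGroup = new LinkedHashMap<>();
        for (int col = 0; col < groupOf.length; col++) {
            byGroup.computeIfAbsent(groupOf[col], g -> new ArrayList<>()).add(col);
        }
        int[] foldOf = new int[groupOf.length];
        for (List<Integer> columns : byGroup.values()) {
            for (int i = 0; i < columns.size(); i++) {
                foldOf[columns.get(i)] = i % kFold;   // round-robin within the group
            }
        }
        return foldOf;
    }

    public static void main(String[] args) {
        int[] groups = {1, 1, 1, 1, 2, 2, 2, 2, 2, 2};
        System.out.println(Arrays.toString(assignFolds(groups, 3)));
    }
}

Because the round-robin is done per group label, the same sketch would also handle more than two categories if finer stratification is ever needed.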

Randomization

Randomly reordering the columns within each group (1,2) of the grouping is also a good idea before dividing the data into folds. This occurs early in the process so that the same random permutation is used to generate all of the scores; otherwise, the scores are invalid because each score would see a different fold split. The current method is fairly simple and can be toggled on or off before scores are generated.
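
A minimal sketch of the seeded shuffle, assuming hypothetical names (GroupShuffle, groupOf); the single Random seed is what guarantees every score sees the same permutation:

import java.util.*;

// Hypothetical sketch: shuffle column indices within each group once, with a
// single seed, so every score sees the identical permutation before folding.
public class GroupShuffle {
    public static List<Integer> shuffledColumns(int[] groupOf, long seed) {
        Random rng = new Random(seed);                        // one seed for the whole run
        Map<Integer, List<Integer>> byGroup = new TreeMap<>();
        for (int col = 0; col < groupOf.length; col++) {
            byGroup.computeIfAbsent(groupOf[col], g -> new ArrayList<>()).add(col);
        }
        List<Integer> order = new ArrayList<>();
        for (List<Integer> columns : byGroup.values()) {
            Collections.shuffle(columns, rng);                // reorder within the group only
            order.addAll(columns);
        }
        return order;
    }
}

Wrapping the Collections.shuffle call in a boolean flag would give the on/off toggle mentioned above.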

Score Ranking

Generate ranks for the scores as normal, and keep all ranks as normal. For a normal situation, 6 scores get ranked (asam, prob_t, wga, prob_w, prob_f, prob_ks), so 6 * K + K ranking columns will be added (6 score ranks plus one final rank per fold). For the final rank, use rank--kn, where n = 1 to K; if K > 9, pad n with leading 0's. Each fold's scores are divided into separate objects and saved with the --kn suffix stripped off.

Meta Rank

Generate the super rank from rank--kn in the normal way: sum rank--kn across the folds and order by the sum. Tie breaking will have to be discussed with Dr. Shyr. Beyond this, no overall rank across all folds will be needed.
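
A minimal sketch of the sum-and-order step, assuming the per-fold ranks are available as a rank[fold][marker] array; tie handling here simply keeps the sort order, since the real tie-breaking rule is still to be decided:

import java.util.*;

// Hypothetical sketch of the meta rank: sum each marker's rank--kn across the
// K folds, then rank the markers by that sum (smaller sum = better meta rank).
public class MetaRank {
    /** rank[fold][marker] holds rank--kn for one fold; returns the meta rank per marker (1 = best). */
    public static int[] metaRank(int[][] rank) {
        int markers = rank[0].length;
        long[] sums = new long[markers];
        for (int[] foldRanks : rank) {
            for (int m = 0; m < markers; m++) sums[m] += foldRanks[m];
        }
        Integer[] order = new Integer[markers];
        for (int m = 0; m < markers; m++) order[m] = m;
        Arrays.sort(order, Comparator.comparingLong(m -> sums[m]));  // ties keep sort order; rule TBD
        int[] meta = new int[markers];
        for (int pos = 0; pos < markers; pos++) meta[order[pos]] = pos + 1;
        return meta;
    }
}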

Distance

For each fold, run topN. Combine and average results across folds for each run.

Distance Prefilter

Do a normal prefilter, expanded to encompass all of the scores. Take the minimum criteria needed and apply them to the generated scores.
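
A minimal sketch of one possible reading of the expanded prefilter, assuming the criterion is a maximum allowed value checked against every generated score column; the actual criteria live in the existing prefilter code:

import java.util.*;

// Hypothetical sketch of the expanded prefilter: a marker is kept only if it
// meets the minimum criterion on every score column it is checked against.
public class DistancePrefilter {
    /** scores maps a column name (e.g. "prob_t--k01") to per-marker values; maxAllowed is the cutoff. */
    public static boolean passes(Map<String, double[]> scores, int marker, double maxAllowed) {
        for (double[] column : scores.values()) {
            if (column[marker] > maxAllowed) return false;   // fails the minimum criterion
        }
        return true;
    }
}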

Distance Analysis

Do all folds for dataset k. For each run, keep the excluded patients' group assignments and the winner ids. Treat the k-dataset as the training set and the excluded patients as the testing set. Generate the patient scores for the training set, then calculate the means for group 1 and group 2. Then generate the patient scores for the testing (excluded) set, pass the training means to it, and store the resulting assignments. Update the winner id list with the winners from this dataset.
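
A minimal sketch of one fold of this step, assuming a single patient score per patient and assignment to whichever group mean is nearer; the real Distance calculation is not reproduced here:

// Hypothetical sketch of one fold of the distance analysis: compute group means
// from the training (k-dataset) patient scores, then assign each excluded
// (testing) patient to whichever group mean its own score is closer to.
public class FoldAssignment {
    public static int[] assign(double[] trainScores, int[] trainGroups, double[] testScores) {
        double sum1 = 0, sum2 = 0; int n1 = 0, n2 = 0;
        for (int i = 0; i < trainScores.length; i++) {
            if (trainGroups[i] == 1) { sum1 += trainScores[i]; n1++; }
            else                     { sum2 += trainScores[i]; n2++; }
        }
        double mean1 = sum1 / n1, mean2 = sum2 / n2;         // group 1 and group 2 means
        int[] assignment = new int[testScores.length];
        for (int j = 0; j < testScores.length; j++) {
            assignment[j] = Math.abs(testScores[j] - mean1) <= Math.abs(testScores[j] - mean2) ? 1 : 2;
        }
        return assignment;
    }
}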

Blind Data

Cross-validation takes the training data and holds out one part (10%, a single column, or whatever) for testing each time until every part has been held out once. In this model, none of the training data is truly blind, but the separate testing dataset never gets used. For the testing dataset, [Yu Shyr] has come up with different methods for the TopN and Cutoff criteria.

TopN

TopN will use the testing dataset for each fold, then take an average for each run. For example, for run 25 of top200 with 10-fold cross-validation, the entire blind set will be tested 10 times, once for each fold, and an average will be reported for run 25. The scores used will be from either the k-dataset or the training dataset.
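
A minimal sketch of one way to read "an average will be reported": the blind set's per-fold assignments for a run are averaged into the fraction of folds that voted group 1 for each patient. The array layout is an assumption, not the real DistanceResult structure:

// Hypothetical sketch for one TopN run: the blind set is classified once per
// fold, and for each blind patient we report the fraction of folds that
// assigned it to group 1 as the averaged result for that run.
public class BlindTopN {
    /** assignments[fold][patient] is 1 or 2; returns per-patient fraction of folds voting group 1. */
    public static double[] averageAssignment(int[][] assignments) {
        int patients = assignments[0].length;
        double[] group1Fraction = new double[patients];
        for (int[] foldAssignment : assignments) {
            for (int p = 0; p < patients; p++) {
                if (foldAssignment[p] == 1) group1Fraction[p] += 1.0;
            }
        }
        for (int p = 0; p < patients; p++) group1Fraction[p] /= assignments.length;
        return group1Fraction;
    }
}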

Cutoff

Cutoff will use a subset of the winners based on an 80 percent threshold. For example, if marker 10 was a winner for 9 of 10 folds, it is used for testing; if marker 256 was a winner for only 7 of 10 folds, it is not. The scores used will be from either the training dataset or an average of the k-datasets.
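
A minimal sketch of the 80 percent filter, assuming the winner counts per fold are already collected into a map keyed by marker id:

import java.util.*;

// Hypothetical sketch of the Cutoff winner filter: keep a marker only if it was
// a winner in at least 80% of the cross-validation folds.
public class CutoffWinners {
    /** winnerCounts maps marker id -> number of folds (out of kFold) in which it won. */
    public static Set<Integer> select(Map<Integer, Integer> winnerCounts, int kFold) {
        double threshold = 0.8 * kFold;
        Set<Integer> selected = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : winnerCounts.entrySet()) {
            if (e.getValue() >= threshold) selected.add(e.getKey());
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = Map.of(10, 9, 256, 7);
        System.out.println(select(counts, 10));   // [10] -- marker 256 (7/10 folds) is dropped
    }
}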

Patient Assignment Report

For each run, keep a flag for each patient that records its group assignment. This will be used to create the graphic (see the sketch after the example).

Ex:
run p1 p2 p3 p4
1 1 1 2 2
2 1 1 2 2
3 1 1 2 1
4 1 2 2 1
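
A minimal sketch that accumulates and prints the flags in the layout shown above; the class name and array layout are assumptions:

// Hypothetical sketch of the patient assignment report: one row of group flags
// (1 or 2) per run, one column per patient, printed like the example above.
public class PatientAssignmentReport {
    public static void print(int[][] flags) {           // flags[run][patient]
        StringBuilder header = new StringBuilder("run");
        for (int p = 0; p < flags[0].length; p++) header.append(" p").append(p + 1);
        System.out.println(header);
        for (int run = 0; run < flags.length; run++) {
            StringBuilder row = new StringBuilder(String.valueOf(run + 1));
            for (int flag : flags[run]) row.append(' ').append(flag);
            System.out.println(row);
        }
    }

    public static void main(String[] args) {
        print(new int[][] {{1,1,2,2}, {1,1,2,2}, {1,1,2,1}, {1,2,2,1}});
    }
}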

Winner Id Count Report

For each run, keep a list of winner ids and a count of the number of times each id appeared as a winner across all of the k-datasets. If an id is not yet in the list, add it; otherwise, increment its count. An individual count should not exceed K. For topN, the sum of the counts for a run should equal K * run#, since each of the K folds contributes run# winners.
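
A minimal sketch of the counting logic, assuming the per-fold winner ids are delivered one fold at a time; the class name WinnerIdCounts is hypothetical:

import java.util.*;

// Hypothetical sketch of the winner id count report for one run: ids are added
// on first sight, counts incremented afterward; no count can exceed K, and for
// topN the counts for a run should sum to K * run#.
public class WinnerIdCounts {
    private final Map<Integer, Integer> counts = new TreeMap<>();

    /** Record the winner ids produced by one cross-validation fold of this run. */
    public void addFoldWinners(Collection<Integer> winnerIds) {
        for (int id : winnerIds) counts.merge(id, 1, Integer::sum);
    }

    public Map<Integer, Integer> getCounts() { return counts; }

    public static void main(String[] args) {
        WinnerIdCounts run1 = new WinnerIdCounts();
        for (int fold = 0; fold < 7; fold++) run1.addFoldWinners(List.of(245));  // K = 7, run 1
        System.out.println(run1.getCounts());   // {245=7}, matching the runID 1 example below
    }
}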

Examples:
runID 1, k = 7
id count
245 7

runID 5, k = 7
id count
245 7
246 3
484 3
510 3
977 3
95 2
430 2
639 2
904 2
205 1
238 1
256 1
385 1
399 1
473 1
895 1
1103 1

runID 10, k = 7
id count
245 7
246 5
430 5
531 5
904 5
977 5
484 4
1103 4
95 3
510 3
473 2
639 2
30 1
63 1
205 1
223 1
236 1
238 1
256 1
340 1
342 1
362 1
385 1
399 1
443 1
530 1
540 1
573 1
661 1
856 1
895 1

Special Considerations

Cross-validation requires up to K times more computing time to generate scores. Cross-validation distance generates a large number of DistanceResult objects, so memory is a concern.

Time

  • Each Score is written to handle cross-validation on its own. This saves time and computing power by reusing as much data as possible. In this case it is worth spending memory to save time, because the percentage of time saved increases with higher values of K.

Memory

  • Clear the FilteredScore from memory -- This may also improve performance for non-cross-validation distance. Store the number of winners (markers) in a new member of DistanceResult. This number is needed for summarization later, but the rest of the FilteredScore is not. Changing the DistanceResult class should not affect the saved program state because none of these objects ever get serialized.
  • Summarize results as data is collected, keeping only what is needed -- The current loop structure may prevent using this extensively. Currently, each of the K folds is run to completion. Averaging across folds occurs during the summary, after calculations are complete.
    • SOLUTION: Reorganize so that each TopN run is completed across all folds before the next run starts. This way, a winner report and output can be written for each run, and that run's data can then be released from memory (see the sketch below).
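
A minimal sketch of the reorganized loop, with the fold loop inside the run loop as described in the TODO at the top of this page. Every name in it (runTopN, summarize, writeWinnerReport, the Object fold placeholder) is a stand-in, not the real TopNManager or DistanceResult API:

import java.util.*;

// Hypothetical sketch of the reorganized loop: the cross-validation (fold) loop
// sits inside the run loop, so each run's results can be summarized, written
// out, and released before the next run starts.
public class ReorganizedTopNLoop {
    static double runTopN(Object fold, int run) { return 0.0; }        // placeholder
    static void summarize(int run, List<Double> results) { }           // placeholder
    static void writeWinnerReport(int run) { }                         // placeholder

    public static void run(List<Object> folds, int maxRun) {
        for (int run = 1; run <= maxRun; run++) {                      // TopN run loop (c loop)
            List<Double> foldResults = new ArrayList<>();
            for (Object fold : folds) {                                // kFold loop, now innermost
                foldResults.add(runTopN(fold, run));
            }
            summarize(run, foldResults);                               // average across folds here
            writeWinnerReport(run);
            foldResults.clear();                                       // run's results not retained
        }
    }
}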
