Cross-Validation

TODO

  • DONE Refactor Distance calculation.
    • Swap loops so the cross-validation (kFold) loop is inside the TopNManager (c) loop.
      • Separate cross-validation Scores by fold before the c loop.
      • Set up cross-validation DataGroupings by fold before the c loop.
    • DONE Modify summaries to handle the ArrayList of DistanceResults.
  • Add testing DataSet and DataGrouping.
    • DONE DistanceRun parameters.
    • DONE TopNManager will use all winners for each N and cross-validation fold.
    • CutoffManager will use winners that appear in 80% of cross-validation cutoffs.
      • DONE Have the winner list report writer return an X% list.

Score Generation

Generate K sets of scores (one per fold) as normal. For a normal situation, there will be 7 scores (12 columns): sam, asam, info, tvalue (ttest), prob_t (ttest), wga, cvalue (wilcoxon), prob_c (wilcoxon), fvalue (fisher), prob_f (fisher), dvalue (ks), prob_d (ks). Name each score column scoreName--kn, where n = 1 to K. If K > 9, pad n with leading 0's.

Ex:
  • sam--k01, wga--k01, etc
  • sam--k02, wga--k02, etc
  • ...
  • sam--k12, wga--k12, etc
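
As a rough sketch of this naming convention (the class and method names below are illustrative, not part of the actual codebase):

  import java.util.ArrayList;
  import java.util.List;

  public class FoldColumnNames {

      // Suffix for fold n of K: "--k" plus n, padded with leading 0's when K > 9.
      static String foldSuffix(int fold, int k) {
          int width = String.valueOf(k).length();
          return String.format("--k%0" + width + "d", fold);
      }

      static List<String> columnNames(String[] scoreNames, int k) {
          List<String> names = new ArrayList<>();
          for (int fold = 1; fold <= k; fold++) {
              for (String score : scoreNames) {
                  names.add(score + foldSuffix(fold, k));
              }
          }
          return names;
      }

      public static void main(String[] args) {
          String[] scores = { "sam", "asam", "wga" };
          columnNames(scores, 12).forEach(System.out::println);  // sam--k01 ... wga--k12
      }
  }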

Stratification

When dividing the data into folds before generating scores, it is a good idea to make sure a proportionate number of columns from each group is placed into each fold. Currently, stratification occurs only at a binary level, according to the group (1, 2) used. In some cases, more categories might be needed.
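
A minimal sketch of stratified fold assignment, assuming the grouping is available as an array of group labels (1 or 2) per column; all names are illustrative:

  import java.util.ArrayList;
  import java.util.List;

  public class StratifiedFolds {

      // groupOf[i] is the group (1 or 2) of column i; returns a fold index (0..k-1) per column.
      static int[] assignFolds(int[] groupOf, int k) {
          int[] fold = new int[groupOf.length];
          for (int group = 1; group <= 2; group++) {
              List<Integer> members = new ArrayList<>();
              for (int i = 0; i < groupOf.length; i++) {
                  if (groupOf[i] == group) members.add(i);
              }
              // Deal the group's columns out round-robin so each fold gets a proportionate share.
              for (int j = 0; j < members.size(); j++) {
                  fold[members.get(j)] = j % k;
              }
          }
          return fold;
      }

      public static void main(String[] args) {
          int[] groups = { 1, 1, 1, 1, 2, 2, 2, 2, 2, 2 };
          int[] folds = assignFolds(groups, 3);
          for (int i = 0; i < folds.length; i++) {
              System.out.println("column " + i + " (group " + groups[i] + ") -> fold " + folds[i]);
          }
      }
  }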

Randomization

Randomly reordering the columns within each group (1,2) of the grouping is also a good idea before dividing the data into folds. This occurs early in the process so the same random permutation is used to generate all scores. Otherwise, the scores are invalid. The current method is fairly simple and can be toggled before scores are generated.
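
A minimal sketch of the shuffle step, assuming the columns are reordered within each group by a single seeded Random so that every score sees the same permutation; names are illustrative:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;
  import java.util.Random;

  public class GroupShuffle {

      // Shuffle column indices within each group once, up front; one seed means one
      // permutation reused for every score.
      static List<Integer> shuffledColumns(int[] groupOf, long seed) {
          Random rng = new Random(seed);
          List<Integer> ordered = new ArrayList<>();
          for (int group = 1; group <= 2; group++) {
              List<Integer> members = new ArrayList<>();
              for (int i = 0; i < groupOf.length; i++) {
                  if (groupOf[i] == group) members.add(i);
              }
              Collections.shuffle(members, rng);   // reorder within the group only
              ordered.addAll(members);
          }
          return ordered;
      }

      public static void main(String[] args) {
          int[] groups = { 1, 1, 1, 2, 2, 2, 2 };
          System.out.println(shuffledColumns(groups, 42L));
      }
  }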

Score Ranking

Generate ranks for the scores as normal. Keep all ranks as normal. For a normal situation, there will be 6 scores that get ranked (asam, prob_t, wga, prob_w, prob_f, prob_ks). 6 * K + K ranking columns will be added. For the final rank, use rank--kn, where n = 1 to K. If K > 9, pad n with leading 0's. Each fold's scores are then divided into separate objects and saved with the --kn suffix stripped off.

Meta Rank

Generate the super rank from rank--kn using the normal method of summing the ranks and ordering the sums. Tie breaking will have to be discussed with Dr. Shyr. No overall rank across all folds will be needed.
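
A minimal sketch of the summing step, under the reading that the sum is taken over a fold's ranked scores; tie handling is left open as noted above, and all names are illustrative:

  import java.util.Arrays;
  import java.util.Comparator;

  public class SuperRank {

      // scoreRanks[i][s] = rank of marker i for ranked score s within one fold.
      // Returns marker indices ordered by their rank sum (best summed rank first).
      static Integer[] superRankOrder(int[][] scoreRanks) {
          int markers = scoreRanks.length;
          long[] sums = new long[markers];
          for (int i = 0; i < markers; i++) {
              for (int r : scoreRanks[i]) sums[i] += r;
          }
          Integer[] order = new Integer[markers];
          for (int i = 0; i < markers; i++) order[i] = i;
          Arrays.sort(order, Comparator.comparingLong(i -> sums[i]));   // ties keep input order
          return order;
      }

      public static void main(String[] args) {
          int[][] ranks = { { 1, 2, 1, 3, 2, 1 },      // marker 0: ranks for the 6 scores
                            { 3, 1, 2, 1, 1, 2 },
                            { 2, 3, 3, 2, 3, 3 } };
          System.out.println(Arrays.toString(superRankOrder(ranks)));
      }
  }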

Distance

For each fold, run topN. Combine and average results across folds for each run.

Distance Prefilter

Do a normal prefilter expanded to encompass all the scores. Take the minimum criteria needed and apply them to the generated scores. This can be skipped for a CutoffManager with fewer than 10 combinations.

Distance Analysis

Do all folds for dataset k. For each run, keep the excluded patients' group assignments and the winner ids. Treat the dataset as the training set and the excluded patients as the testing set. Generate the patient scores for the training set, then calculate the means for group 1 and group 2. Then generate the patient scores for the testing (excluded) dataset. Pass the means to the testing set and store the assignments. Update the winner id list with the winners from this dataset.
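
A rough sketch of the per-fold flow, assuming a nearest-group-mean assignment on a single combined patient score; the project's own DataSet, DataGrouping, and DistanceResult classes are not reproduced here, and all names are illustrative:

  public class FoldAssignment {

      // Mean of the given training patients' scores.
      static double mean(double[] scores, int[] patients) {
          double sum = 0;
          for (int p : patients) sum += scores[p];
          return sum / patients.length;
      }

      // Returns a group assignment (1 or 2) for each testing (excluded) patient.
      static int[] assign(double[] trainScores, int[] group1, int[] group2,
                          double[] testScores) {
          double m1 = mean(trainScores, group1);   // group 1 mean from the training fold
          double m2 = mean(trainScores, group2);   // group 2 mean from the training fold
          int[] assignment = new int[testScores.length];
          for (int p = 0; p < testScores.length; p++) {
              assignment[p] = Math.abs(testScores[p] - m1) <= Math.abs(testScores[p] - m2) ? 1 : 2;
          }
          return assignment;
      }

      public static void main(String[] args) {
          double[] train = { 0.9, 1.1, 1.0, 2.9, 3.1, 3.0 };
          int[] g1 = { 0, 1, 2 }, g2 = { 3, 4, 5 };
          double[] test = { 1.2, 2.8 };
          for (int a : assign(train, g1, g2, test)) System.out.println("group " + a);
      }
  }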

Blind Data

Cross-validation takes training data and uses part (10%, single patient/spectrum, etc.) for testing each time until all parts are used once. In this model, none of the training data is truly blind, but the testing dataset never gets used. For the testing dataset, [Yu Shyr] has come up with different methods for TopN and Cutoff criteria.

TopN

TopN will use the testing dataset for each fold, then take an average for each run. For example, for run 25 of top200 with 10-fold cross-validation, the entire blind set will be tested 10 times, once for each fold, using at most 25 features, and an average will be reported for run 25. The training data, scores, and grouping used will be from fold k, while the testing dataset and grouping will be the entire blind set.

Cutoff

Cutoff will use a subset of the winners based on an 80 percent threshold. For example, if marker 10 was a winner for 9 of 10 folds, it is used for testing, but if marker 256 was a winner for only 7 of 10 folds, it is not. The userValues used will be an average across all folds; the average includes the userValues for the 80-percent winners from each fold.
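
A minimal sketch of the 80-percent filter, assuming each fold reports a map from winner id to userValue, and reading the averaging rule as an average over the folds in which the marker actually won; names are illustrative:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class CutoffWinners {

      // foldWinners.get(n) maps winner id -> userValue for fold n.
      // Keep only ids that won in at least 80% of folds; average their userValues.
      static Map<Integer, Double> eightyPercentWinners(List<Map<Integer, Double>> foldWinners) {
          int k = foldWinners.size();
          Map<Integer, Integer> counts = new HashMap<>();
          Map<Integer, Double> sums = new HashMap<>();
          for (Map<Integer, Double> fold : foldWinners) {
              for (Map.Entry<Integer, Double> e : fold.entrySet()) {
                  counts.merge(e.getKey(), 1, Integer::sum);
                  sums.merge(e.getKey(), e.getValue(), Double::sum);
              }
          }
          Map<Integer, Double> kept = new HashMap<>();
          for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
              if (e.getValue() >= 0.8 * k) {                        // winner in >= 80% of folds
                  kept.put(e.getKey(), sums.get(e.getKey()) / e.getValue());
              }
          }
          return kept;
      }

      public static void main(String[] args) {
          List<Map<Integer, Double>> folds = List.of(
                  Map.of(10, 1.2, 256, 0.9),
                  Map.of(10, 1.4),
                  Map.of(10, 1.0, 256, 1.1));
          System.out.println(eightyPercentWinners(folds));          // only marker 10 survives
      }
  }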

Patient Assignment Report

For each run, keep a flag for each patient that records its group assignment. This will be used to create the graphic.

Ex:
run  p1  p2  p3  p4
  1   1   1   2   2
  2   1   1   2   2
  3   1   1   2   1
  4   1   2   2   1
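
As a small sketch of the bookkeeping, assuming the flags are held in a runs-by-patients matrix that later drives the graphic; names are illustrative:

  public class PatientAssignments {

      public static void main(String[] args) {
          int runs = 4, patients = 4;
          int[][] assignment = new int[runs][patients];   // assignment[run][patient] = 1 or 2

          assignment[0] = new int[] { 1, 1, 2, 2 };        // run 1
          assignment[1] = new int[] { 1, 1, 2, 2 };        // run 2
          assignment[2] = new int[] { 1, 1, 2, 1 };        // run 3
          assignment[3] = new int[] { 1, 2, 2, 1 };        // run 4

          for (int run = 0; run < runs; run++) {
              StringBuilder row = new StringBuilder("run " + (run + 1));
              for (int p = 0; p < patients; p++) row.append(" ").append(assignment[run][p]);
              System.out.println(row);
          }
      }
  }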

Winner Id Count Report

For each run, keep a list of ids and a count of the number of times each id appeared as a winner across all data sets. If an id does not exist in the list, add it; otherwise, increment its count. An individual count will not exceed K. For TopN, the sum of the counts for that run will not exceed K * run#.

Examples:
runID 1, k = 7
id count
245 7

runID 5, k = 7
id count
245 7
246 3
484 3
510 3
977 3
95 2
430 2
639 2
904 2
205 1
238 1
256 1
385 1
399 1
473 1
895 1
1103 1

runID 10, k = 7
id count (sorted ascending by count)
30 1
63 1
205 1
223 1
236 1
238 1
256 1
340 1
342 1
362 1
385 1
399 1
443 1
530 1
540 1
573 1
661 1
856 1
895 1
473 2
639 2
95 3
510 3
484 4
1103 4
246 5
430 5
531 5
904 5
977 5
245 7
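
A minimal sketch of the count bookkeeping for this report, assuming one set of winner ids per fold for a given run; names are illustrative:

  import java.util.Map;
  import java.util.TreeMap;

  public class WinnerIdCounts {

      // Add one fold's winner ids for this run to the running counts: insert an id
      // with count 1 on first sight, otherwise increment, so no count can exceed K.
      static void addFold(Map<Integer, Integer> counts, int[] foldWinnerIds) {
          for (int id : foldWinnerIds) {
              counts.merge(id, 1, Integer::sum);
          }
      }

      public static void main(String[] args) {
          Map<Integer, Integer> counts = new TreeMap<>();   // sorted by id for the report
          addFold(counts, new int[] { 245, 246, 484 });
          addFold(counts, new int[] { 245, 510 });
          addFold(counts, new int[] { 245, 246 });
          counts.forEach((id, count) -> System.out.println(id + " " + count));
      }
  }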

Special Considerations

Cross-validation requires up to K times more computing time to generate scores. Cross-validation distance generates a large number of DistanceResult objects, so memory is a concern.

Time

  • Each Score is written to handle the cross-validation on its own. This saves time/computing power by reusing as much data as possible. In this case, it is worth the memory to save time because the percentage of time saved increases with higher values of K.

Memory

  • Clear the FilteredScore from memory. (This may also improve performance for non-cross-validation distance.) Store the number of winners (markers) in a new member of DistanceResult. This number is needed in a summary later, but the rest of the FilteredScore is not. Changing the DistanceResult class should not affect the saved program state because none of these objects ever get serialized.
  • Summarize results as data is collected, keeping only what is needed. The current loop structure may prevent using this extensively. Currently, each of the K folds is run to completion. Averaging across folds occurs during the summary, after calculations are complete.
    • Reorganize so each TopN run is completed for each fold. This requires more memory, storing k DataSets, k Scores, 2*k DataGroupings, and possibly k blind DataSets in several ArrayLists. At this point, memory is more important, as there is no speed advantage or disadvantage.

