WfccmLargeDataSetCapability (23 May 2005, WillGray)
<noautolink>
---+ Large Data Set Capability
%TOC%
---+++ %BLUE%Detailed Time Line and Tasks%ENDCOLOR%
*Large DataSet Implementation* ==Apr 18-May 06== %GRAY%3 weeks%ENDCOLOR%
   1. Implement objects.
   1. Build the interface.
*Large DataSet Testing* ==May 09-May 20== %GRAY%2 weeks%ENDCOLOR%
   1. Test to ensure a complete, correct product.
---++ Goals
   * Ability to read very large datasets.
   * Ability to generate scores on very large datasets.
   * Ability to perform full cross-validation on very large datasets.
   * Transparent modifications to the current system.
---++ Obstructions
   * 1.2 GB per-process memory limit.
   * MemoryMap library: What are the limits?
---+ Methods
---++ Data Set MemoryMap
---+++ Read only
   * Read through the file at load to determine row beginning locations.
   * Store row beginning locations in memory.
      * Index by name, id, index, or all three?
   * Keep the last 20 rows accessed in a buffer queue.
      * Index by name, id, index, or all three?
      * Include a DiscardBuffer method?
   * When accessing a point not in memory, queue it.
   * MemoryMap mode specified at load.
      * DataSet.Load or program start?
      * Since some functionality will be disabled/lost, it may be better to use a subclass (inheritance & polymorphism) than to modify DataSet.
         * [[Wfccm Large DataSet SubClass ProsCons]] - [[WfccmIDataSet][IDataSet]]
   * Could be done with FileStream and StreamReader.
      * Use StreamReader.ReadLine().Length + 2 (DOS) or + 1 (UNIX/Mac) to get the number of characters/bytes (ASCII) in a single line, including the line terminator.
   * Combine with Partitioning to save on seeking and reading.
---+++ Read/Write
   * Must know the size of the data set.
   * Create memory map on disk.
   * Save to a standard file.
   * Must have free space to hold the memory map.
---+++ Large File abilities
   * Merge/Append two large files on disk with error checking.
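The read-only approach under "Read only" above (scan once for row beginning locations, then seek on demand and keep recently read rows in a queue) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name =RowOffsetIndex= and its members are hypothetical and not part of WFCCM, the file is assumed to be pure ASCII with one consistent line terminator, and the buffer evicts rows in the order they were loaded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper (not WFCCM code): records where each row starts,
// then reads single rows on demand through a small buffer queue.
public sealed class RowOffsetIndex : IDisposable
{
    private readonly FileStream _file;
    private readonly List<long> _rowStart = new List<long>();  // byte offset of each row
    private readonly List<int> _rowLength = new List<int>();   // row length without terminator
    private readonly Queue<int> _recent = new Queue<int>();    // buffered row numbers, oldest first
    private readonly Dictionary<int, string> _buffer = new Dictionary<int, string>();
    private readonly int _capacity;

    // newlineLength: 2 for DOS (CR LF), 1 for UNIX/Mac. Assumes ASCII, so
    // one character is one byte and ReadLine().Length is a byte count.
    public RowOffsetIndex(string path, int newlineLength, int capacity = 20)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _capacity = capacity;
        long offset = 0;
        using (var reader = new StreamReader(path, Encoding.ASCII))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                _rowStart.Add(offset);
                _rowLength.Add(line.Length);
                offset += line.Length + newlineLength;
            }
        }
        _file = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
    }

    public int RowCount { get { return _rowStart.Count; } }

    public string GetRow(int row)
    {
        string cached;
        if (_buffer.TryGetValue(row, out cached)) return cached;

        // Not buffered: seek to the stored offset and read exactly one row.
        var bytes = new byte[_rowLength[row]];
        _file.Seek(_rowStart[row], SeekOrigin.Begin);
        int read = 0;
        while (read < bytes.Length)
        {
            int n = _file.Read(bytes, read, bytes.Length - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        string text = Encoding.ASCII.GetString(bytes);

        // Drop the oldest buffered row once the queue is full.
        if (_recent.Count == _capacity) _buffer.Remove(_recent.Dequeue());
        _recent.Enqueue(row);
        _buffer[row] = text;
        return text;
    }

    public void DiscardBuffer() { _recent.Clear(); _buffer.Clear(); }

    public void Dispose() { _file.Dispose(); }
}
```

Only the offsets and at most =capacity= rows stay in memory, so the footprint grows with the row count rather than the file size. Indexing by name or id, as the open questions above suggest, would need an additional lookup table on top of this.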
---++ Data partitioning
---+++ DataSet Partitioning to Save Memory
   * Emulate SAS behavior of loading only part of a file
   * Implement in the application when DataSets are loaded and used
      * Make multiple calls to generate scores for a single test
      * Exception: SAM
   * Add a constructor/method for loading a partial DataSet
      * Using IEnumerable, a foreach loop works:%BR% PartitionedDataSet pds = new PartitionedDataSet(filename);%BR% %BLUE%foreach%ENDCOLOR% (DataSet ds %BLUE%in%ENDCOLOR% pds)%BR% {%BR% %GREEN%// do some work%ENDCOLOR%%BR% }
      * DataSet has a method LoadPart for loading a set number of rows.
      * Score has a method MergeSave for saving the stats from the parts of the DataSet loaded with LoadPart.
      * These two could be combined with some wrapper object like PartitionedDataSet that would determine a good number of rows for each partition.
   * Add methods to DataSet/Score for an Append and/or Merge Write.
   * Combine with MemoryMap for large data?
---+++ Questions/Problems
   * How many parts should the data be divided into?
      * Fixed number of rows (variables).
      * Number of rows based on number of columns (observations) so as not to exceed memory limits.
      * Solution: [[WfccmMemoryMapDataSet][MemoryMapDataSet]]
   * SAM uses the entire DataSet to compute s0 (fudge factor).
   * Distance requires the entire DataSet.
---+++ Other Memory Considerations
   * How can we save memory within the program?
      * Write scores to file and load when needed.
---++ Run partitioning
</noautolink>
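As a rough illustration of the DataSet partitioning idea above (a wrapper that is enumerated with foreach and picks a row count per partition from the column count), the following sketch may help. It is not the WFCCM implementation: each partition here is a plain list of row strings standing in for a real DataSet, the =RowsThatFit= sizing rule (one 8-byte double per cell) is an assumption, and the constructor takes an explicit row count, unlike the one-argument form shown in the notes.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the PartitionedDataSet wrapper discussed above.
// Yields blocks of rows so that only one block is held in memory at a time.
public sealed class PartitionedDataSet : IEnumerable<List<string>>
{
    private readonly string _path;
    private readonly int _rowsPerPartition;

    public PartitionedDataSet(string path, int rowsPerPartition)
    {
        if (rowsPerPartition < 1) throw new ArgumentOutOfRangeException("rowsPerPartition");
        _path = path;
        _rowsPerPartition = rowsPerPartition;
    }

    // Assumed sizing rule: every cell costs 8 bytes (one double), so the
    // number of rows that fit is budget / (8 * columns), but never below 1.
    public static int RowsThatFit(long memoryBudgetBytes, int columns)
    {
        if (columns < 1) throw new ArgumentOutOfRangeException("columns");
        long rows = memoryBudgetBytes / (8L * columns);
        return (int)Math.Max(1L, Math.Min(rows, (long)int.MaxValue));
    }

    public IEnumerator<List<string>> GetEnumerator()
    {
        using (var reader = new StreamReader(_path))
        {
            var part = new List<string>();
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                part.Add(line);
                if (part.Count == _rowsPerPartition)
                {
                    yield return part;
                    part = new List<string>();   // start the next partition
                }
            }
            if (part.Count > 0) yield return part;  // final, shorter partition
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

A caller would loop over the partitions with foreach, score each one, and merge the partial results, which is the role the notes assign to LoadPart and MergeSave. Tests that need the whole DataSet at once (SAM's s0, Distance) would not fit this pattern.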
Topic revision: r13 - 23 May 2005, WillGray