Large Data Set Capability

Detailed Time Line and Tasks

Large DataSet Implementation Apr 18-May 06 3 weeks
  1. Implement objects.
  2. Build the interface.

Large DataSet Testing May 09-May 20 2 weeks
  1. Test to ensure complete, correct product.

Network Specification May 23-Jun 03 2 weeks
  1. Create specification and place on wiki.

Network Design Jun 06-Jul 01 4 weeks
  1. Design objects needed from requirements.
  2. Create wiki pages containing objects, methods, members, and algorithms.

Network Implementation Jul 04-Aug 05 5 weeks
  1. Implement objects.

Network Testing Aug 08-Aug 26 3 weeks
  1. Test to ensure complete, correct product.

Goals

  • Ability to read very large datasets.
  • Ability to generate scores on very large datasets.
  • Ability to perform full cross-validation on very large datasets.
  • Transparent modifications to the current system.

Obstructions

  • 1.2 GB per process memory limit.
  • MemoryMap library: What are the limits?

Methods

Data Set MemoryMap

Read only

  • Read through the file at load to determine row beginning locations.
  • Store row beginning locations in memory.
    • Index by name, id, index, or all three?
  • Keep last 20 rows accessed in a buffer queue.
    • Index by name, id, index, or all three?
    • Include a DiscardBuffer method?
  • When accessing a point not in memory, queue it.
  • MemoryMap mode specified at load.
  • Could be done with FileStream and StreamReader.
    • Use StreamReader.ReadLine().Length + 2 (DOS) or + 1 (UNIX/Mac) to get the number of characters/bytes (ASCII) in a single line.
    • Combine with Partitioning to save on seeking and reading.
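The read-only bullets above could be sketched like this; RowIndexedReader and its members are hypothetical names for illustration, and the +2 (DOS) line-ending and ASCII assumptions come from the bullet above:

        using System.Collections;
        using System.IO;

        // Hypothetical sketch of the read-only approach; names are illustrative.
        public class RowIndexedReader
        {
            private ArrayList rowOffsets = new ArrayList(); // row beginning locations
            private Hashtable buffer = new Hashtable();     // last 20 rows accessed
            private Queue bufferOrder = new Queue();
            private string path;

            public RowIndexedReader(string path)
            {
                this.path = path;
                using (StreamReader reader = new StreamReader(path))
                {
                    long offset = 0;
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        rowOffsets.Add(offset);
                        offset += line.Length + 2; // +2 DOS (\r\n); +1 UNIX/Mac
                    }
                }
            }

            public string ReadRow(int index)
            {
                if (buffer.Contains(index))
                    return (string)buffer[index];
                string row;
                using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
                {
                    fs.Seek((long)rowOffsets[index], SeekOrigin.Begin);
                    using (StreamReader reader = new StreamReader(fs))
                        row = reader.ReadLine();
                }
                buffer[index] = row;        // queue the newly read row
                bufferOrder.Enqueue(index);
                if (bufferOrder.Count > 20) // keep only the last 20 rows
                    buffer.Remove(bufferOrder.Dequeue());
                return row;
            }
        }

Indexing here is by integer row index only; adding name and id lookup would mean keeping parallel Hashtables, which is the open question in the bullets above.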

Read/Write

  • Must know the size of the data set.
  • Create memory map on disk.
  • Save to a standard file.
  • Must have free space to hold memory map.
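If the MemoryMap library falls short, the read/write case might be sketched with standard APIs by preallocating the on-disk map to the known size and writing fixed-width rows at computed offsets; all names and the fixed-width layout are assumptions:

        using System.IO;

        // Hypothetical read/write sketch: the data set size must be known up
        // front so the on-disk map can be preallocated (needs free disk space).
        public class DiskMapWriter
        {
            private FileStream map;
            private int rowBytes;

            public DiskMapWriter(string path, int rowCount, int rowBytes)
            {
                this.rowBytes = rowBytes;
                map = new FileStream(path, FileMode.Create, FileAccess.ReadWrite);
                map.SetLength((long)rowCount * rowBytes);
            }

            public void WriteRow(int index, byte[] row)
            {
                // assumes row.Length == rowBytes (fixed-width layout)
                map.Seek((long)index * rowBytes, SeekOrigin.Begin);
                map.Write(row, 0, rowBytes);
            }

            public void Close()
            {
                map.Close(); // the map is itself a standard file on disk
            }
        }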

Large File abilities

  • Merge/Append two large files on disk with error checking.
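A minimal sketch of the append case, assuming plain byte-for-byte concatenation; names are illustrative and the error check only verifies the copied length:

        using System.IO;

        // Hypothetical sketch: append one large file to another in chunks,
        // checking that the number of bytes copied matches the source length.
        public static void AppendFile(string destPath, string srcPath)
        {
            using (FileStream src = new FileStream(srcPath, FileMode.Open, FileAccess.Read))
            using (FileStream dest = new FileStream(destPath, FileMode.Append, FileAccess.Write))
            {
                byte[] chunk = new byte[65536];
                long copied = 0;
                int read;
                while ((read = src.Read(chunk, 0, chunk.Length)) > 0)
                {
                    dest.Write(chunk, 0, read);
                    copied += read;
                }
                if (copied != src.Length)
                    throw new IOException("Append incomplete");
            }
        }

A real Merge would additionally need to verify that the two files' headers agree before appending rows.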

Network Client

  • Farm out jobs to machines. Using run partitioning, split the runs up and send them to the clients to do the work.
    • Node: runs continuously, waiting for jobs
    • Controller: Has a list of IP addresses of all clients
    • C: Sends single object to each client, consisting of
      1. the controller's IP address and open port for jobs
      2. the complete filtered data set in order of gene rank (i.e., the first gene is the top-ranked one, etc.)
      3. the data grouping associated with this data
      4. the run numbers that this node is expected to do
    • N: Accepts up to n controller connections, spawning a new thread to handle each connection. When finished running Distance, each thread will:
      1. create an object holding the distance result(s)
      2. send distance results back to the controller via the IP address and port specified
    • C: Will keep an open array list waiting for distance results (perhaps of DistanceResults?) and will spawn a thread to handle the actual distance result processing
      • will need some method to handle a down/slow node other than blocking and waiting for all nodes to finish.
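The controller-to-node message in step C above might be a single serializable object along these lines; field names and types are illustrative, to be fixed during Network Design:

        using System;

        // Hypothetical job object sent by the controller to each node.
        [Serializable]
        public class DistanceJob
        {
            public string ControllerAddress; // controller's IP address
            public int ControllerPort;       // open port for returning results
            public double[][] FilteredData;  // rows in order of gene rank
            public int[] DataGrouping;       // grouping associated with this data
            public int[] RunNumbers;         // runs this node is expected to do
        }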

Division of work

  • Score
    • Could give each node a Score, but some take longer to compute
    • Could divide DataSet with PartitionDataSet to level the time, but SAM and HuWright won't work that way.
  • Distance
    • Divide by run
    • Divide by cross-validation folds
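Dividing by run could be as simple as round-robin assignment, which also spreads slow and fast runs across nodes; a sketch with illustrative names:

        using System.Collections;

        // Hypothetical round-robin division of runs across nodes.
        public static int[][] PartitionRuns(int totalRuns, int nodeCount)
        {
            ArrayList[] parts = new ArrayList[nodeCount];
            for (int n = 0; n < nodeCount; n++)
                parts[n] = new ArrayList();
            for (int run = 0; run < totalRuns; run++)
                parts[run % nodeCount].Add(run);
            int[][] result = new int[nodeCount][];
            for (int n = 0; n < nodeCount; n++)
                result[n] = (int[])parts[n].ToArray(typeof(int));
            return result;
        }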

Queueing system

Description

Currently, the system only allows running Distance calculations for a single DataSet. Multiple CriteriaManagers were recently enabled, but being able to queue more calculations to run will allow the system to process more data. This is useful for overnight and weekend jobs that take a long time.

Break Point

If the system is left to run with a large queue, it may cut into productive hours. Should we allow a safe break or just force it out manually?

Multiple DataSets/Saved States

What should we do about using multiple DataSets that would normally exist in different directories and saved states? Using many DataSets means setting up the DataGroupings, adding Scores, and setting up CriteriaManagers. Having to do any of this twice is a lot of extra work.

Distance Run

An object for holding references to the TreeNodes for the DataSets, DataGroupings, and CriteriaManagers for Distance.

Queue

A basic array of DistanceRun objects.
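A sketch of DistanceRun and the queue; TreeNode is assumed to be the System.Windows.Forms type used in the interface, and a FIFO Queue may serve better than a bare array:

        using System.Collections;
        using System.Windows.Forms;

        // Hypothetical DistanceRun holding the TreeNode references above.
        public class DistanceRun
        {
            public TreeNode DataSetNode;
            public TreeNode DataGroupingNode;
            public TreeNode CriteriaManagerNode;
        }

        // FIFO queue of runs for overnight batches.
        Queue runQueue = new Queue();
        runQueue.Enqueue(new DistanceRun());
        DistanceRun next = (DistanceRun)runQueue.Dequeue();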

Multi-core support (threading)

  • Make DataSet thread safe
  • Separate each score generation into its own thread
  • Separate each Distance run into its own thread
  • Can we determine the number of processors that we have available?
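On the last question: Environment.ProcessorCount (available in .NET 2.0) reports the number of logical processors. A sketch spawning one worker thread per processor, with GenerateScore as a hypothetical placeholder method:

        using System;
        using System.Threading;

        int workers = Environment.ProcessorCount; // available processors
        for (int i = 0; i < workers; i++)
        {
            // GenerateScore is a hypothetical worker method
            Thread t = new Thread(new ThreadStart(GenerateScore));
            t.Start();
        }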

Data partitioning

DataSet Partitioning to Save Memory

  • Emulate SAS behavior of loading only part of a file
  • Implement in the application when DataSets are loaded and used
    • Make multiple calls to generate scores for a single test
      • Exception: SAM
  • Add a constructor/method for loading a partial DataSet
  • Using IEnumerable, a foreach loop works:
        PartitionedDataSet pds = new PartitionedDataSet(filename);
        foreach (DataSet ds in pds)
        {
              // do some work
        }
    • DataSet has a method LoadPart for loading a set number of rows.
    • Score has a method MergeSave for saving the stats from the parts of the DataSet loaded with LoadPart.
    • These two could be combined with some wrapper object like PartitionedDataSet that would determine a good number of rows for each partition.
  • Add methods to DataSet/Score for an Append and/or Merge Write.
  • Combine with MemoryMap for large data?
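The PartitionedDataSet wrapper above might look like this (a C# 2.0 iterator for brevity); DataSet.LoadPart is the proposed, not yet existing, method, assumed here to return the number of rows actually read, and the default partition size is an arbitrary placeholder:

        using System.Collections;

        public class PartitionedDataSet : IEnumerable
        {
            private string filename;
            private int rowsPerPart;

            public PartitionedDataSet(string filename) : this(filename, 1000) { }

            public PartitionedDataSet(string filename, int rowsPerPart)
            {
                this.filename = filename;
                this.rowsPerPart = rowsPerPart;
            }

            public IEnumerator GetEnumerator()
            {
                int start = 0;
                int loaded;
                do
                {
                    DataSet ds = new DataSet();
                    loaded = ds.LoadPart(filename, start, rowsPerPart); // proposed API
                    if (loaded > 0)
                        yield return ds;
                    start += loaded;
                } while (loaded == rowsPerPart);
            }
        }

A smarter version would pick rowsPerPart from the column count so each partition stays under the per-process memory limit, per the question below.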

Questions/Problems

  • How many parts should the data be divided into?
    • Fixed number of rows (variables).
    • Number of rows based on number of columns (observations) so as not to exceed memory limits.
  • Solution: MemoryMapDataSet
    • SAM uses the entire DataSet to compute s0 (fudge factor).
    • Distance requires the entire DataSet.

Other Memory Considerations

  • How can we save memory within the program?
    • Write scores to file and load when needed.

Run partitioning

Topic revision: r11 - 02 May 2005, WillGray
 
