Distributed Computing

Detailed Time Line and Tasks

Network Specification May 23-Jun 03 2 weeks

Create specification and place on wiki.

Network Design Jun 06-Jul 01 4 weeks

Design objects needed from requirements.
Create wiki pages containing objects, methods, members and algorithms.

Network Implementation Jul 04-Aug 05 5 weeks

Implement objects.

Network Testing Aug 08-Aug 26 3 weeks

Test to ensure complete, correct product.

Distributed computing

General synopsis

The distributed computing aspect of the project will allow the program to harness other machines to help finish large, long-running projects in a more timely manner. The Controller (master) farms out jobs to the Nodes (clients): using run partitioning, it splits the runs into work sets and sends them to the Nodes. Each Node runs and waits for jobs to be sent to it. The Controller keeps the IP addresses and ports of all Nodes and sends work objects to them. The Nodes return completed work when finished, and the Controller assembles the returned work.

Controller

Display Available machines
Select nodes to use for current analysis
Create work sets
- By run
- By cross-validation fold
Send work
Get results
- Update status through process
  - How can we do this while generating Scores other than the change in method?
  - Update Distance run number
- Data when finished
  - Score - txt file or serialized
  - DistanceResults - serialized
Error checking
- Detect unresponsive node
- Redistribute unfinished work to remaining active nodes at end
- Listen for a node to "cancel" the work it is doing
Will keep an open array list waiting for distance results (maybe holding DistanceResults objects?) and will spawn a thread to handle the actual distance result processing
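
The last Controller item (an open list that collects distance results while a separate thread processes them) could look roughly like the sketch below. It is only an illustration in Python, not the Wfccm2 implementation: the names start_result_processor and handle_result are hypothetical, a queue stands in for the array list, and the result dictionaries are placeholder values.

```python
import queue
import threading


def start_result_processor(handle_result):
    """Start a worker thread that drains an inbox of returned results.

    Returns (inbox, worker). Network code only has to call inbox.put(result);
    the actual processing happens on the worker thread.
    """
    inbox = queue.Queue()

    def drain():
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more results are expected
                break
            handle_result(item)

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()
    return inbox, worker


# Usage: three placeholder results arrive out of order and are processed
# in arrival order by the worker thread.
processed = []
inbox, worker = start_result_processor(processed.append)
for run_number in (3, 1, 2):
    inbox.put({"run": run_number, "distance": 0.0})  # placeholder payload
inbox.put(None)
worker.join()
```

Keeping the receive path down to a single put call means a slow result handler never blocks the connection that delivered the result.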

Node

Run work that is sent to it
Return results to the master
Keep a queue of work to be done
Show current work that is being done in a log style window
Accepts up to n controller connections and spawns a new thread to handle each connection. When finished running distance, each thread will:
1. create an object holding the distance result(s)
2. send distance results back to the controller via the IP address and port specified
Ability to cancel work that is being done.
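
A minimal Python sketch of the thread-per-connection behaviour described for the Node follows. Several things here are assumptions made for illustration: the limit MAX_CONTROLLERS, the line-based request format, the run_distance callback, and (as a simplification) replying on the same connection instead of opening a new connection to the controller's IP address and port.

```python
import socket
import threading

MAX_CONTROLLERS = 4  # the "n" in the text; the value is an assumption


def handle_controller(conn, run_distance):
    """Serve one controller: read one request line, send back one result line."""
    with conn, conn.makefile("r", encoding="utf-8") as reader:
        request = reader.readline().strip()
        result = run_distance(request)
        conn.sendall((result + "\n").encode("utf-8"))


def serve(port, run_distance):
    """Accept controllers forever, one thread each, at most MAX_CONTROLLERS at once."""
    slots = threading.BoundedSemaphore(MAX_CONTROLLERS)
    with socket.create_server(("", port)) as listener:
        while True:
            conn, _addr = listener.accept()
            if not slots.acquire(blocking=False):
                conn.close()  # already serving n controllers: refuse this one
                continue

            def worker(c=conn):
                try:
                    handle_controller(c, run_distance)
                finally:
                    slots.release()

            threading.Thread(target=worker, daemon=True).start()
```

serve is never called here because it blocks; handle_controller can be exercised on its own with any connected socket.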

Node List

List will contain for each node
- Name/Description
- IP
- TCP port
- Priority
- Reliability
- Last known good
- Save as txt file
Propagate that list of nodes to other workstations
- Send copy of txt file
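
As an illustration of the "save as txt file" idea, the sketch below stores the listed fields as tab-separated text and reads them back. The field names, the tab-separated layout and the ISO-style timestamp string are assumptions; the page does not fix a file format.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class NodeEntry:
    name: str
    ip: str
    tcp_port: int
    priority: int
    reliability: float
    last_known_good: str  # timestamp kept as text, e.g. "2005-12-07T10:00:00"


def save_nodes(path, nodes):
    """Write the node list as a tab-separated text file with a header row."""
    columns = [f.name for f in fields(NodeEntry)]
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t")
        writer.writeheader()
        for node in nodes:
            writer.writerow(asdict(node))


def load_nodes(path):
    """Read the file written by save_nodes back into NodeEntry objects."""
    with open(path, newline="") as handle:
        return [
            NodeEntry(
                row["name"],
                row["ip"],
                int(row["tcp_port"]),
                int(row["priority"]),
                float(row["reliability"]),
                row["last_known_good"],
            )
            for row in csv.DictReader(handle, delimiter="\t")
        ]
```

Because the file is plain text, propagating the list to another workstation is just a file copy, as the outline suggests.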

Work unit

Data
- DataSet - txt file
- Score - txt file or serialized
- CriteriaManager - serialized
- DataGrouping - serialized
Instructions
- based on type
Scores
- Could give each node a Score, but some take longer to compute
- Could divide DataSet with PartitionDataSet to level the time, but SAM and HuWright won't work that way.
- Run cross-validation folds separately. All nodes run the default case and an assigned range of folds.
Distance
- Divide by run (combination) number
- Divide by cross-validation folds
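
Dividing work by run number or by cross-validation fold both come down to splitting a numbered range as evenly as possible. A small Python sketch of that split is shown here; the function name split_range and the half-open (start, end) convention are assumptions for illustration.

```python
def split_range(total, parts):
    """Split items 0..total-1 into `parts` contiguous (start, end) ranges.

    `end` is exclusive. Sizes differ by at most one, with the larger
    ranges first.
    """
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for index in range(parts):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Ten folds over three nodes.
print(split_range(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each node would then run the default case plus its assigned range of folds, as described in the Scores item above.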

Namespace hierarchy

Wfccm2
- Distributed Computing
  - Jobs
  - WorkRequests
- Networking
- Forms

Classes

SlavePanel - Panel to display the information from the distance.
- Log display
- output log to file
- Display work to be done (work queue)
- Delete work in queue - send delete notification to master
- Change slave name

MasterPanel - Panel to display the farming out of jobs.
- display living nodes (node list)
- Select nodes to participate
- Display work log

Network - Full class description.
- Construct (ip, port) - starts connection, and communicationThread after connection established.
- Construct (port) - waits for connection on port, uses temp TcpListener
- GetObject()
- GetFile()
- SendObject()
- SendFile()
- Disconnect()
- DateTime LastComm - for keeping track of slow/down nodes

NetworkListener - listens for and accepts/rejects incoming work
- Thread
- TcpListener
- incoming IP address and port
- Start() - listens on the assigned port and generates events with Network objects
- Stop() - stops listening

WorkManager - manages network connections and work, assigns threads, and sends results
- master connections
- work waiting queue
- working queue
- manager thread - checks working status, sends results, starts new work
- QueueItem
  - Network master - Connection to send results back on.
  - WorkRequest - Work to be done.

DispatchManager
- SlaveSetList - List of slaves.
- Connections - List of connections by themselves.
- JobList - List of jobs to do.
- Stop - Tells everything to stop.
- StopLock - Object to lock on to stop.
- Network_ObjectWaiting - Grab the object, pass to job, Send new work request.
- Network_FileWaiting - Begin save file.
- SlaveSet
  - Connection
  - SentWorkDetails
  - CompletedWorkDetails
- WorkDetail
  - JobId
  - WorkId
  - DispatchTime
  - ReceievedTime

Job
- JobId
- CompletedWork - Queue to place completed work.
- WaitingForCompletedWork - Flag denoting that work already sent out must come back before the job can complete.
- Complete - Bool to tell whether or not the job is complete.
- RequiredFiles - List of files required to complete the job. Send to all slaves.
- AddCompletedWork(WorkRequest)
- NewSlaveWorkRequest() - May not be used for all jobs.
- Prepare() - A place to set up all the stuff that might be needed for the job.
- NextWorkRequest() - Generate the next work request.
- Finalize() - Finish work e.g. write to disk etc.
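
The Job members above suggest a lifecycle of handing out work requests until none remain, then waiting for outstanding work to come back. The Python sketch below illustrates that flow under stated assumptions: RangeJob, its model-number ranges and the tuple used as a work request are hypothetical stand-ins, and Prepare/Finalize are left out for brevity.

```python
from abc import ABC, abstractmethod
from collections import deque


class Job(ABC):
    def __init__(self, job_id):
        self.job_id = job_id
        self.completed_work = deque()  # queue of returned work
        self.outstanding = 0           # requests sent but not yet returned

    @property
    def waiting_for_completed_work(self):
        return self.outstanding > 0

    def add_completed_work(self, work):
        self.completed_work.append(work)
        self.outstanding -= 1

    @abstractmethod
    def next_work_request(self):
        """Return the next work request, or None when nothing is left to send."""


class RangeJob(Job):
    """Hypothetical job that hands out inclusive model-number ranges."""

    def __init__(self, job_id, first, last, span):
        super().__init__(job_id)
        self._next, self._last, self._span = first, last, span

    def next_work_request(self):
        if self._next > self._last:
            return None
        start = self._next
        end = min(start + self._span - 1, self._last)
        self._next = end + 1
        self.outstanding += 1
        return (self.job_id, start, end)

    @property
    def complete(self):
        return self._next > self._last and not self.waiting_for_completed_work
```

A dispatcher would call next_work_request each time a slave becomes free and add_completed_work each time a result object arrives.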

Score:Job
DistanceJob:Job
TopNJob:DistanceJob
CutoffJob:DistanceJob

WorkRequest - contains data and runs computations
- unique id - Built with ip, process, time to millisecond, job id.
- job id - Used to identify what job a work object is associated with, also the directory that will be used.
- Status - Waiting, Started, Aborted, Complete.
- ~~Multithreaded - Tells whether or not .work will take advantage of threads.~~ Send enough work for the machine (based on # of cores) and let the WorkManager manage all threads.
- AllowedThreads - Tells the object how many threads it can use.
- Stop() - Tells the Work function that it needs to stop regardless of whether or not it's complete.
- Work() - abstract
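
A rough Python rendering of the WorkRequest outline is given below to show how Status, Stop() and the abstract Work() might fit together. It is a sketch, not the project's class: the run wrapper, the boolean return value of work, the default ip value and the CountingRequest subclass are all assumptions.

```python
import os
import threading
import time
from abc import ABC, abstractmethod
from enum import Enum


class Status(Enum):
    WAITING = "Waiting"
    STARTED = "Started"
    ABORTED = "Aborted"
    COMPLETE = "Complete"


class WorkRequest(ABC):
    def __init__(self, job_id, ip="127.0.0.1", allowed_threads=1):
        millis = int(time.time() * 1000)
        # Unique id built from ip, process, time to the millisecond and job id.
        self.unique_id = f"{ip}-{os.getpid()}-{millis}-{job_id}"
        self.job_id = job_id
        self.allowed_threads = allowed_threads
        self.status = Status.WAITING
        self._stop = threading.Event()

    def stop(self):
        """Ask work() to stop, whether or not it has finished."""
        self._stop.set()

    @property
    def stop_requested(self):
        return self._stop.is_set()

    def run(self):
        """Assumed wrapper: track status around the abstract work() call."""
        self.status = Status.STARTED
        finished = self.work()
        self.status = Status.COMPLETE if finished else Status.ABORTED

    @abstractmethod
    def work(self):
        """Do the computation; return True if it ran to completion."""


class CountingRequest(WorkRequest):
    """Hypothetical request that performs `steps` units of work."""

    def __init__(self, job_id, steps):
        super().__init__(job_id)
        self.steps, self.done = steps, 0

    def work(self):
        for _ in range(self.steps):
            if self.stop_requested:  # cooperative cancellation point
                return False
            self.done += 1
        return True
```

Checking the stop flag between units of work is what lets a node cancel a long computation without killing the thread.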

FileInformation - To Be sent with the file
- Size - in bytes
- Name - File name including extension. Not the path, just the file's name.
- Tag - object to tag additional information along.

WorkTag
- JobId - Job Id the file is associated with.
- WorkId - Work object that generated the file.
- SubDirectory - Subdirectory that the file should reside in.
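
One way the receiving side might use FileInformation and WorkTag is to decide where an incoming file should be saved. The Python sketch below assumes a layout of root/JobId/SubDirectory/Name, based on the note that the job id is also the directory used; the destination function and the POSIX-style paths are illustrative assumptions.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class WorkTag:
    job_id: str
    work_id: str
    sub_directory: str


@dataclass
class FileInformation:
    size: int           # in bytes
    name: str           # file name with extension, no path
    tag: object = None  # optional extra information, e.g. a WorkTag


def destination(root, info):
    """Return the path under `root` where the described file should be stored."""
    name = PurePosixPath(info.name).name  # drop any path component that slipped in
    if isinstance(info.tag, WorkTag):
        return PurePosixPath(root) / info.tag.job_id / info.tag.sub_directory / name
    return PurePosixPath(root) / name


info = FileInformation(2048, "ks.txt", WorkTag("job7", "w3", "scores"))
print(destination("work", info))  # work/job7/scores/ks.txt
```

Stripping the name down to its final component keeps a malformed or hostile file name from writing outside the job directory.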

ScoreRequest:WorkRequest - not sure how to break up into separate scores, maybe send a list like the dialog generates or use subclasses.
- DataSet filename
- DataGrouping
- score list
- Work() - generates scores

CvScoreRequest:ScoreRequest
- kFold total
- kFold start
- kFold end (or span)

DistanceRequest:WorkRequest
- DataSet filename (training and testing)
- DataGrouping (training and testing)
- Score
- CriteriaManager
- model number start
- model number end (or span)
- Work() - generates distance for the given models

CvDistanceRequest:DistanceRequest
- kFold total
- kFold start
- kFold end (or span)
- Work()

CvScoreDistanceRequest:WorkRequest
- DataSet filename (training and testing)
- DataGrouping (training and testing)
- score list
- CriteriaManager
- model number start
- model number end (or span)
- kFold total
- kFold start
- kFold end (or span)
- Work()

PropogateNodeRequest:WorkRequest - not sure if this will work or not...

Changes to current classes

The function that currently runs distance needs to change to take work objects.
- This may not need to happen until we're ready to implement the "service" architecture.
- Instead, WorkRequest handles the computation
Somewhere, the master has to divide work....

Dialog Sample

Master to Slave
- Slave to Master
Connect
- Complete
Sending file
- Go
Send file
- Received file
[Repeat file if needed]
Sending work
- Go
Send work
- Received work
Idle
- Sending results
Go
- Send results
Received results
- Idle
[Repeat work & results as needed]
Disconnect
- Goodbye

Communication Messages

Connection start
Connected
Disconnect
Goodbye
Idle
Sending file
Sending work
Sending results
Received
Cancel
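
The message list above maps naturally onto an enumeration. The Python sketch below shows one possible encoding, one message per text line; the line-based wire format and the encode/decode helper names are assumptions, since the page does not say how messages are transmitted.

```python
from enum import Enum


class Message(Enum):
    CONNECTION_START = "Connection start"
    CONNECTED = "Connected"
    DISCONNECT = "Disconnect"
    GOODBYE = "Goodbye"
    IDLE = "Idle"
    SENDING_FILE = "Sending file"
    SENDING_WORK = "Sending work"
    SENDING_RESULTS = "Sending results"
    RECEIVED = "Received"
    CANCEL = "Cancel"


def encode(message):
    """Serialize a message as one ASCII line."""
    return (message.value + "\n").encode("ascii")


def decode(line):
    """Parse one received line; raises ValueError for an unknown message."""
    return Message(line.decode("ascii").strip())
```

Rejecting unknown text with an error gives the receiver a clear point at which to drop a confused or incompatible peer.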

Open questions

Should we use many different ports? In some situations where someone has a tight firewall, this may present a problem or at least an annoyance.
- Keep the number of ports necessary to a minimum, using one for slave-slave communication (propagate node list), one for master-slave communication (send/receive work and results), and one for slaves to send status (load, available, work progress)
- It seems to me that we should be able to do all this on one port. Do you know of a technical reason that we cannot send all of this information on the same port?
  - We will need different ports for sending data, receiving results, and updating status.
- Different ports for different uses. How else will the node know the difference in the data being sent? Messages!
  - Deserialization will give the object in its correct type. Using inheritance/polymorphism, we can differentiate by tasks.
- We can include more information than just the data. We could have a descriptor to denote the data type that is being sent. Or something else like that.
Will need some method to handle a down/slow node other than blocking and waiting for all nodes to finish.
If we have separate programs for Controller/Node, how are we going to propagate the work list?
Should we consider simply extending the current application to double as the node? E.g. it would live in two different "modes", controller or node?
- Use an extra command line parameter/switch to determine the mode - i.e. "wfccm2.exe -s" for slave
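
To make the single-port option from the first question concrete, here is a minimal Python sketch of a framed message with a type descriptor: one byte for the kind, four bytes for the payload length, then the payload. The header layout and the KIND_* constants are assumptions chosen for illustration, not a decided protocol.

```python
import io
import struct

HEADER = struct.Struct("!BI")  # kind (1 byte) + payload length (4 bytes), network order
KIND_WORK, KIND_RESULT, KIND_STATUS, KIND_NODE_LIST = 1, 2, 3, 4


def write_message(stream, kind, payload):
    """Write one framed message to a binary stream."""
    stream.write(HEADER.pack(kind, len(payload)) + payload)


def read_exact(stream, count):
    """Read exactly `count` bytes or raise EOFError."""
    data = stream.read(count)
    if len(data) != count:
        raise EOFError("stream ended in the middle of a message")
    return data


def read_message(stream):
    """Read one framed message and return (kind, payload)."""
    kind, length = HEADER.unpack(read_exact(stream, HEADER.size))
    return kind, read_exact(stream, length)


# Usage with an in-memory stream standing in for a buffered socket file:
# two different kinds of traffic share the same channel.
channel = io.BytesIO()
write_message(channel, KIND_STATUS, b"load=0.4")
write_message(channel, KIND_RESULT, b"serialized DistanceResults bytes")
channel.seek(0)
first = read_message(channel)
second = read_message(channel)
```

With a descriptor in every frame, the receiver can tell work, results and status apart on one port, which is friendlier to tight firewalls than three separate ports.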

Data File Propagation

Sending DataSet by serialization would work
Large data has to be transferred as a file to all nodes
- To finish in the fastest time, use the nodes to continue propagation
- Node N (N = 1 to t, where t is the total number of nodes used; Node 0 = Master)
  - for i = ceiling(ln N / ln 2) to ceiling(ln t / ln 2), send to node N + 2^i (skipping targets beyond node t)
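
The forwarding rule above can be checked with a few lines of Python. This sketch assumes that the master (Node 0) sends only to Node 1, which then starts the doubling pattern, and that targets past the last node are skipped; both points are interpretations of the rule, not statements from the page. The integer expression (node - 1).bit_length() equals ceiling(ln N / ln 2) for N >= 1 and avoids floating-point rounding.

```python
def forward_targets(node, total):
    """Return the nodes that `node` forwards the file to.

    Nodes are numbered 1..total; node 0 is the master.
    """
    if node == 0:
        return [1] if total >= 1 else []  # assumption: master seeds node 1 only
    targets = []
    i = (node - 1).bit_length()  # ceiling(log2(node)) computed with integers
    while node + 2 ** i <= total:
        targets.append(node + 2 ** i)
        i += 1
    return targets


for sender in range(0, 5):
    print(sender, forward_targets(sender, 8))
# 0 [1]
# 1 [2, 3, 5]
# 2 [4, 6]
# 3 [7]
# 4 [8]
```

With eight nodes, every node receives the file exactly once, and the number of nodes holding the file doubles each round after the master's first send.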

Detecting Dead Slaves

Monitor
- Run in a separate thread (current job manager is event-driven)
  - Check for slow/unresponsive node
  - Keep track of all times to completion
  - Compare time span to an acceptable limit based on others completed
- Network self-monitor
  - Periodic "ping"
Action
- Try to re-establish connection
  - Disconnect
  - Reconnect
- If disconnected, remove from list of connected nodes (events)
- Re-send work requests (before or after getting all out of job?)
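
The monitor's "compare time span to an acceptable limit based on others completed" step could be as simple as the Python sketch below. The specific rule (a multiple of the median completion time, with a minimum floor) and the parameter values are assumptions; the page leaves the limit undefined.

```python
from statistics import median


def overdue(sent, now, completed_durations, factor=3.0, floor=60.0):
    """Return ids of work requests that have been out longer than the limit.

    sent: mapping of work id -> dispatch time in seconds
    completed_durations: seconds taken by work that has already finished
    The limit is `factor` times the median duration, but never below `floor`.
    """
    if not completed_durations:
        return []  # nothing finished yet, so there is no baseline to compare with
    limit = max(floor, factor * median(completed_durations))
    return sorted(work_id for work_id, started in sent.items() if now - started > limit)


# Three finished requests took about 110 s each, so the limit is 330 s.
print(overdue({"w1": 0, "w2": 900}, now=1000, completed_durations=[100, 120, 110]))
# ['w1']
```

The monitor thread would run a check like this periodically and hand the overdue ids to the reconnect and re-send actions listed above.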

Future Goals

Apply network node paradigm to current system
- Register Wfccm processing engine as a service
- GUI application sends work to self and/or other node(s)

Queuing system

Description

Currently, the system only allows running Distance calculations for a single DataSet. Support for multiple CriteriaManagers was enabled recently, but being able to queue up more calculations will let the system work through more data unattended. This is great for overnight and weekend jobs that will take a long time.

Break Point

If the system is left to run with a large queue, it may cut into productive hours. Should we allow a safe break or just force it out manually?

Multiple DataSets/Saved States

What should we do about using multiple DataSets that would normally exist in different directories and saved states? Using a lot of DataSets means setting up the DataGroupings, adding Scores, and setting up CriteriaManagers. Having to do some of this twice is a lot of extra work.

Distance Run

An object for holding references to the TreeNodes for the DataSets, DataGroupings, and CriteriaManagers for Distance.

Queue

A basic array of DistanceRun objects.

Multi-core support (threading)

Make DataSet thread safe
- Add a "writing" mutex?
Separate each score generation to its own thread
Separate each to thread
Can we determine the number of processors that we have available?
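
The three ideas in this list (a write mutex, one thread per score, and sizing to the processor count) are combined in the Python sketch below. ScoreStore and generate_scores are hypothetical names, and the score functions are trivial stand-ins. Note also that CPython threads do not speed up CPU-bound work, so this shows the structure only.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


class ScoreStore:
    """Stand-in for a thread-safe DataSet: writes are guarded by a mutex."""

    def __init__(self):
        self._write_lock = threading.Lock()
        self.scores = {}

    def write(self, name, values):
        with self._write_lock:
            self.scores[name] = values


def generate_scores(store, score_functions, data):
    """Run each score function on its own pool thread and store the result."""
    workers = os.cpu_count() or 1  # number of processors, if it can be determined
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(lambda n=name, f=function: store.write(n, f(data)))
            for name, function in score_functions.items()
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker thread


store = ScoreStore()
generate_scores(store, {"sum": sum, "max": max}, [1, 2, 3])
```

On the question of processor count: os.cpu_count() is the Python call, and most runtimes expose an equivalent.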

Times

5000x1000 single score (no cross validation) 3.2GHz
- Info 48s
- KS 41s
- Sam 36s
- T-Test 36s
- WGA 15:11
- Wilcox 44s
- Fisher 35s
- Hu-Wright 3:30
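
These timings show why giving each node one Score does not balance well: WGA alone dominates. The Python sketch below works the arithmetic using a greedy longest-job-first assignment, which is only one possible scheduling rule and is an assumption here; the times are the ones listed above, converted to seconds.

```python
import heapq

TIMES = {  # seconds, from the list above
    "Info": 48, "KS": 41, "Sam": 36, "T-Test": 36,
    "WGA": 15 * 60 + 11, "Wilcox": 44, "Fisher": 35, "Hu-Wright": 3 * 60 + 30,
}


def makespan(times, nodes):
    """Wall-clock seconds when scores are assigned longest-first to the least loaded node."""
    loads = [0] * nodes
    heapq.heapify(loads)
    for seconds in sorted(times.values(), reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + seconds)
    return max(loads)


print(makespan(TIMES, 1))  # 1361 (22:41 run serially)
print(makespan(TIMES, 2))  # 911  (15:11, the WGA time)
print(makespan(TIMES, 8))  # 911  (more nodes do not help)
```

Past two nodes the total time is stuck at the WGA time, so any further gain has to come from splitting an individual score's work, for example by cross-validation fold.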

Topic revision: r41 - 07 Dec 2005, JeremyRoberts