Distributed Computing

Detailed Time Line and Tasks

Network Specification May 23-Jun 03 2 week
  1. Create specification and place on wiki.

Network Design Jun 06-Jul 01 4 week
  1. Design objects needed from requirments.
  2. Create wiki pages containing objcets, methods, members and algorithms.

Network Implementaiton Jul 04-Aug 05 5 weeks
  1. Implement objects.

Network Testing Aug 08-Aug 26 3 weeks
  1. Test to ensure complete, correct product.

Distributed computing

General synopsis

  • The distributed computing aspect of the project will allow the program to harness other machines to assist it in getting large, long running project done in a more timely manner. The master will farm out jobs to machines. Using run partitioning, split the runs up and send to the clients to do work. The nodes will run and wait for jobs to be sent to it. The Controller will keep IP addresses and ports to all clients. The Controller will send objects to the Nodes. The nodes will return completed work when finished. The Controller will assemble the returned work.

Controller

  • Display Available machines
  • Select nodes to use for current analysis
  • Create work sets
    • By run
    • By cross vas
  • Send work
  • Get results
    • Update status through process
      • How can we do this while generating Scores other than the change in method?
      • Update Distance run number
    • Data when finished
      • Score - txt file or serialized
      • DistanceResults - serialized
  • Error checking
    • Detect unresponsive node
    • Redistribute unfinished work to remaining active nodes at end
    • Listen for a node to "cancel" the work it is doing
  • Will have an open array list waiting for distance results. (Maybe for DistanceResults?) and will spawn thread to handle actual distance result processing

Node

  • Run work that is sent to it
  • Ruturn results to the master
  • Keep a queue of work to be done
  • Show current work that is being done in a log style window
  • Accepts upto n controller connection, spawns a new thread to handle each connection. When finished running distance, each thread will:
    1. create an object holding the distance result(s)
    2. send distance results back to the controller via the IP address and port specified
  • Ability to cancel work that is being done.

Node List

  • List will contain for each node
    • Name/Description
    • IP
    • TCP port
    • Priority
    • Reliability
    • Last known good
    • Save as txt file
  • Propogate that list of nodes to other workstations
    • Send copy of txt file

Work unit

  • Data
    • DataSet - txt file
    • Score - txt file or serialized
    • CriteriaManager - serialized
    • DataGrouping - serialized
  • Instructions
    • based on type
  • Scores
    • Could give each node a Score, but some take longer to compute
    • Could divide DataSet with PartitionDataSet to level the time, but SAM and HuWright won't work that way.
    • Run cross-validation folds separately. All nodes run the default case and an assigned range of folds.
  • Distance
    • Divide by run (combination) number
    • Divide by cross-validation folds

Namespace heirarchy

  • Wfccm2
    • Distributed Computing
      • Jobs
      • WorkRequests
    • Networking
    • Forms

Classes

  • SlavePanel - Panel to display the information from the distance.
    • Log display
    • output log to file
    • Dislpay work to be done (work queue)
    • Delete work in queue - send delete notification to master
    • Change slave name

  • MasterPanel - Panel to display the farming out of jobs.
    • display living nodes (node list)
    • Select nodes to participate
    • Display work log

  • Network - Full class description.
    • Construct (ip, port) - starts connection, and communicationThread after connection established.
    • Construct (port) - waits for connection on port, uses temp TcpListener
    • GetObject()
    • GetFile()
    • SendObject()
    • SendFile()
    • Disconnect()
    • TimeDate LastComm - for keeping track of slow/down nodes

  • NetworkListener - listens for and accepts/rejects incoming work
    • Thread
    • TcpListener
    • incoming IP address and port
    • Start() - listens on the assigned port and generates events with Network objects
    • Stop() - stops listening

  • WorkManager - manages network connections and work, assigns threads, and sends results
    • master connections
    • work waiting queue
    • working queue
    • manager thread - checks working status, sends results, starts new work
    • QueueItem
      • Network master - Connection to send results back on.
      • WorkRequest - Work to be done.

  • DispatchManager
    • SlaveSetList - List of slaves.
    • Connections - List of connections by themselves.
    • JobList - List of jobs to do.
    • Stop - Tells everything to stop.
    • StopLock - Object to lock on to stop.
    • Network_ObjectWaiting - Grab the object, pass to job, Send new work request.
    • Network_FileWaiting - Begin save file.
    • SlaveSet
      • Connection
      • SentWorkDetails
      • CompletedWorkDetails
    • WorkDetail
      • JobId
      • WorkId
      • DispatchTime
      • ReceievedTime

  • Job
    • JobId
    • CompletedWork - Queue to place completed work.
    • WaitingForCompletedWork - Flag to denote that work that has been sent out is needed to complete.
    • Complete - Bool to tell whether or not the job is complete.
    • RequiredFiles - List of files required to complete the job. Send to all slaves.
    • AddCompletedWork(WorkRequest)
    • NewSlaveWorkRequest() - May not be used for all jobs.
    • Prepare() - A place to set up all the stuff that might be needed for the job.
    • NextWorkRequest() - Generate the next work request.
    • Finalize() - Finish work e.g. write to disk etc.

  • Score:Job
  • DistanceJob:Job
  • TopNJob:DistanceJob
  • CutoffJob:DistanceJob

  • WorkRequest - contains data and runs computations
    • unique id - Built with ip, process, time to millisecond, job id.
    • job id - Used to idenify what job a work object is associated with, also the directory that will be used.
    • Status - Waiting, Started, Aborted, Complete.
    • Multithreaded - Tells whether or not .work will take advantage of threads. Send enough work for the machine (based on # of cores) and let the WorkManager manage all threads.
    • AllowedThreads - Tells the object how many threads it can use.
    • Stop() - Tells the Work function that it needs to stop regardless of whether or not it's complete.
    • Work() - abstract

  • FileInformation - To Be sent with the file
    • Size - in bytes
    • Name - File name including extension. Not the path, just the files name.
    • Tag - object to tag additional information along.

  • WorkTag
    • JobId - Job Id the file is associated with.
    • WorkId - Work object that generated the file.
    • SubDirectory - Subdirectory that the file should reside in.

  • ScoreRequest:WorkRequest - not sure how to break up into separate scores, maybe send a list like the dialog generates or use subclasses.
    • DataSet filename
    • DataGrouping
    • score list
    • Work() - generates scores

  • CvScoreRequest:ScoreRequest
    • kFold total
    • kFold start
    • kFold end (or span)

  • DistanceRequest:WorkRequest
    • DataSet filename (training and testing)
    • DataGrouping (training and testing)
    • Score
    • CriteriaManager
    • model number start
    • model number end (or span)
    • Work() - generates distance for the given models

  • CvDistanceRequest:DistanceRequest
    • kFold total
    • kFold start
    • kFold end (or span)
    • Work()

  • CvScoreDistanceRequest:WorkRequest
    • DataSet filename (training and testing)
    • DataGrouping (training and testing)
    • score list
    • CriteriaManager
    • model number start
    • model number end (or span)
    • kFold total
    • kFold start
    • kFold end (or span)
    • Work()

  • PropogateNodeRequest:WorkRequest - not sure if this will work or not...

Changes to current classes

  • The function that currently runs distance needs to change to take work objects.
    • This may not need to happen until we're ready to implement the "service" architecture.
    • Instead, WorkRequest handles the computation
  • Somewhere, the master has to divide work....

Dialog Sample

  • Master to Slave
    • Slave to Master
  • Connect
    • Complete
  • Sending file
    • Go
  • Send file
    • Received file
  • [Repeat file if needed]
  • Sending work
    • Go
  • Send work
    • Received work
  • Idle
    • Sending results
  • Go
    • Send results
  • Received results
    • Idle
  • [Repeat work & results as needed]
  • Disconnect
    • Goodbye

Communication Messages

  • Connection start
  • Connected
  • Disconnect
  • Goodbye
  • Idle
  • Sending file
  • Sending work
  • Sending results
  • Received
  • Cancel

Open questions

  • Should we use many different ports? In some situations where someone has a tight firewall, this may present a problem or at least an annoyance.
    • Keep the number of ports necessary to a minimum, using one for slave-slave communication (propgate node list), one for master-slave communication (send/receive work and results), and one for slaves to send status (load, available, work progress)
    • It seems to me that we should be able to do all this on one port. Do you know of a technical reason that we can not send all of this information on the same port?
      • We will need different ports for sending data, receiving results, and updating status.
    • Different ports for different uses. How else will the node know the difference in the data being sent? Messages!
      • Deserialization will give the object in its correct type. Using inheritance/polymorphism, we can differentiate by tasks.
    • We can include more information then just the data. We could have a descriptor to denote the data type that is being sent. Or something else like that.
  • Will need some method to handle a down/slow node other than blocking and waiting for all nodes to finish.
  • If we have seperate programs for Controller/Node, how are we going to propogate the work list?
  • Should we consider simply extending the current application to double as the node? E.g. it would live in two different "modes", controller or node?
    • Use an extra command line parameter/switch to determine the mode - i.e. "wfccm2.exe -s" for slave

Data File Propogation

  • Sending DataSet by serialization would work
  • Large data has to be transfered as a file to all nodes
    • To finish in the fastest time, use the nodes to continue propogation
    • NodeN (1 - total nodes used, Node0=Master)
      • for i = ceiling(ln n / ln 2) to ceiling(ln t / ln 2), send to node N+2^i

Detecting Dead Slaves

  • Monitor
    • Run in a separate thread (current job manager is event-driven)
      • Check for slow/unresponsive node
      • Keep track of all times to completion
      • Compare time span to an acceptable limit based on others completed
    • Network self-monitor
      • Periodic "ping"
  • Action
    • Try to re-establish connection
      • Disconnect
      • Reconnect
    • If disconnected, remove from list of connected nodes (events)
    • Re-send work requests (before or after getting all out of job?)

Future Goals

  • Apply network node paradigm to current system
    • Register Wfccm processing engine as a service
    • GUI application sends work to self and/or other node(s)

Queing system

Description

Currently, the system only allows running Distance calculations for a single DataSet. Recently, multiple CriteriaManagers was enabled, but being able to set more calculations to run will allow the system to run with more data. This is great for overnight and weekend jobs that will take a long time.

Break Point

If the system is left to run with a large queue, it may cut into productive hours. Should we allow a safe break or just force it out manually?

Multiple DataSets/Saved States

What should we do about using multiple DataSets that would normally exists in different directories and saved states? Using a lot of DataSets means setting up the DataGroupings, adding Scores, and setting up CriteriaManagers. Having to do some of this twice is a lot of extra work.

Distance Run

An object for holding references to the TreeNodes for the DataSets, DataGroupings, and CriteriaManagers for Distance.

Queue

A basic array of DistanceRun objects.

Multi-core support (threading)

  • Make DataSet thread safe
    • Add a "writing" MUTEX?
  • Seperate each score generation to its own thread
  • Seperate each to thread
  • Can we determine the number of processors that we have available?
Edit | Attach | Print version | History: r41 < r40 < r39 < r38 | Backlinks | View wiki text | Edit WikiText | More topic actions...
Topic revision: r40 - 09 Aug 2005, WillGray
 

This site is powered by FoswikiCopyright © 2013-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Vanderbilt Biostatistics Wiki? Send feedback