Documentation

Version 1.2.0 Beta 1

6. Distributed Computing with NanoHive-1

NanoHive-1 is able to distribute the computation of simulations across computing networks. The basic idea of distributed computing centers around dividing a large computation into smaller work units which are transferred to a number of computers for simultaneous processing. In a NanoHive-1 distributed computing network, each computer in the network has a NanoHive-1 instance running on it. Each N-H instance can be both master and slave. Masters:

Slaves respond to master commands with respect to running work units.

The following is a schematic of an example network.

Figure 4.3. Example NanoHive-1 Distributed Computing Network

Example NanoHive-1 Distributed Computing Network


This network uses the network file system (NFS), specifically, the /scratch_global/ mount, to transfer work units between the N-H Master and Slaves.

An important issue in distributed computing is determining the best cell size. If the cells are too small, more time can be spent packaging, transferring, and un-packaging them than on calculating them. If the cells are too big, it may take too long to calculate them, and no performance benefit is gained (unless the simulation simply needs to be broken up into cells just to be computable.) The following figure illustrates this issue.

Figure 4.4. Effects of Cell Size on Distributed Computation

Effects of Cell Size on Distributed Computation


Ideally, one can perform timing experiments to determine the best cell size and appropriate number of computers for the network. You may, however, be limited in terms of the number of atoms your chosen Physical Interaction Plugin can handle (AIREBO has such limits in the tens of thousands), or how many computers you have in your network.

The following table shows some timing experiment results for a ~20k atom, relatively sparse, hydrocarbon system calculated with the AIREBO plugin. It indicates that breaking up the system into 6 cells, computed across 6 computers is best.

Table 4.4. Simulation Parameters

Cells in SystemComputers in NetworkAverage Iteration Processing Time (s)
2223.35
3321.11
4416.35
5515.90
6615.07
7716.15


6.1. Configuration

A key to successful N-H distributed computing is configuration of the network. Each N-H instance, both master and slave, has a configuration file that describes its function and capabilities. The following is an explanation of the configuration files for master and slave instances in the above example network. The explanation focuses on DC-specific features of the configuration files.

6.1.1. The Master's Configuration File

  nanohive.baseDirectory=/home/bhelfric/hp-ux2/share/NanoHive-1
  nanohive.stagingDirectory=/scratch_local/bhelfric/wutemp
The nanohive.stagingDirectory specifies in which directory to initially write work units that are to be processed. This should be a fast, local directory to the master, not an NFS mount.
  # Logging
  logging.outputDirectory=/home/bhelfric/.NanoHive-1/log
  logging.outputLevel=700
  
  simspec.schema=data/simulation-1.0.1.xsd
We specify into which directory to write the log file, to emit debugging log entries, and to use the latest simulation specification schema. Nothing specific to DC here.
  # Simulation Control Plugins
  commandqueue.plugin.0=ConsoleCommand
  ConsoleCommand.resultCodesFile=data/local/en_resultCodes.txt
  commandqueue.plugin.1=SocketsControl
  SocketsControl.ipAddress=192.168.18.113
  SocketsControl.port=3000
  SocketsControl.clientTimeout=30000
We're choosing the ConsoleCommand plugin so we can control N-H directly from the console, and the SocketsControl plugin to control N-H remotely with a client such as HiveKeeper, or nhClient.py.
  # Entity Management Plugin
  entityManager.plugin=RAMEntityManager
  
  # Data Import/Export Plugins
  entityManager.importExport.0.plugin=nanoML_ImportExport
  entityManager.importExport.0.importFormats=nanoML
  entityManager.importExport.0.exportFormats=nanoML
  entityManager.importExport.1.plugin=OpenBabelImportExport
  entityManager.importExport.1.importFormats=OpenBabel
  entityManager.importExport.1.exportFormats=ALC,BS,CACCRT,CACINT,CACHE,CT,CSSR,
  BOX,DMOL,FEAT,FH,GAMIN,INP,GCART,GAU,MM1GP,GR96A,GR96N,HIN,JIN,BIN,MMD,MMOD,
  OUT,DAT,SDF,SD,MDL,MOL,MOPCRT,BGF,CSR,NW,PDB,REPORT,QCIN,SMI,FIX,MOL2,TXYZ,
  TXT,UNIXYZ,XED,XYZ,CML
  entityManager.importExport.2.plugin=nanorexMMP_ImportExport
  entityManager.importExport.2.importFormats=nanorexMMP
  entityManager.importExport.2.exportFormats=nanorexMMP
We use the in-memory entity manager, and specify a wide range of file formats which can be read/written.

Note

Beta version: For large, long running simulations, depending on a single master to maintain state in memory is not desirable because if it dies, recovering and restarting the simulation can be delayed and at worse, problematic. We're working on a Postgres-powered entity manager plugin that will maintain state in a database and allow for both a fail-over master, and automatic recovery and restarting of simulations.

In the meantime, we recommend using the NH_SimStateImportExport plugin to write out the simulation's state periodically. The simulation can then be manually recovered and restarted from the last state save. Details on how to do this are in the section describing simulation configuration.

  # Physical Interaction Calculators (PICs)
  pic.0.name=neu-farm
  pic.0.type=distributed
  pic.0.picPlugin=SocketsPIC_Control
The name of the Physical Interaction Calculator (PIC) is neu-farm, it's distributed, as opposed to running locally, and the DistributedPIC_ControlPlugin implementation is the SocketsPIC_Control plugin.
  pic.0.communicationTimeout=5
  pic.0.badSlaveRepoolInterval=60
  pic.0.transportMethod=NFS
  pic.0.slaveDescriptors=
  192.168.18.114;3001;/scratch_global/bhelfric/wutemp/slave1,
  192.168.18.115;3002;/scratch_global/bhelfric/wutemp/slave2,
  192.168.18.116;3003;/scratch_global/bhelfric/wutemp/slave3,
  192.168.18.117;3004;/scratch_global/bhelfric/wutemp/slave4,
  192.168.18.119;3005;/scratch_global/bhelfric/wutemp/slave5
These key/value pairs are parameters to the SocketsPIC_Control plugin. It gives N-H slaves 5 seconds to respond to commands, re-checks slaves that appear to be down or otherwise unavailable every 60 seconds, and uses the network file system (NFS) to transfer work units (as opposed to FTP.)

The pic.0.slaveDescriptors describes each of the slaves with:

  • the slave's IP address
  • and the port that the slave's SocketsControl plugin is listening to
  • the receiving directory for work units

  pic.0.pipPlugin.0=AIREBO
  pic.0.pipPlugin.1=MPQC_SClib
These pic.0.pipPlugin lines specify which plugins are available on the slaves.

6.1.2. The Slave Configuration Files

Slave configuration files are essentially the same as standalone N-H instance configuration files, except for the following things:
  • they must specify a remotely accessible nanohive.stagingDirectory
  • they must specify a remote control capable Simulation Control plugin implementation, SocketsControl, for example
  • and they usually don't need any Data Import/Export plugins because they normally only load work units, which have a single, internal file format
Here is the configuration file for Slave 1 (cousteau):
  nanohive.baseDirectory=/home/bhelfric/hp-ux2/share/NanoHive-1
  nanohive.stagingDirectory=/scratch_global/bhelfric/wutemp/slave1

  # Logging
  logging.outputDirectory=/home/bhelfric/.NanoHive-1/log/slave1
  logging.outputLevel=700

  simspec.schema=data/simulation-1.0.1.xsd

  # Simulation Control Plugins
  commandqueue.plugin.0=SocketsControl
  SocketsControl.ipAddress=192.168.18.114
  SocketsControl.port=3001
  SocketsControl.clientTimeout=30000

  # Entity Management Plugin
  entityManager.plugin=RAMEntityManager

  # Physical Interaction Calculators (PICs)
  pic.0.name=cousteau
  pic.0.type=local
  pic.0.pipPlugin.0=AIREBO
  pic.0.pipPlugin.1=MPQC_SClib

6.2. Simulation Specification and Execution

There is an element of the simulation specification that needs to be compatible with the creation and distribution of work units, namely the Entity Traversal plugin. A plugin that can be used for distributed computing is the BasicCellTraverser, for example, which simply divides the simulation space up into sub-cells.

Once the N-H Master and Slaves are up and running, simulations can be loaded and run per usual. The preceding section describes how to load and run simulations.


Last Modified: 5/17/2006