NanoHive-1 can distribute the computation of simulations across computing networks. The basic idea of distributed computing is to divide a large computation into smaller work units, which are transferred to a number of computers for simultaneous processing. In a NanoHive-1 distributed computing network, each computer runs a NanoHive-1 (N-H) instance, and each N-H instance can act as both master and slave. Masters issue commands to slaves; slaves respond to master commands with respect to running work units.
The following is a schematic of an example network.
This network uses the network file system (NFS), specifically, the /scratch_global/ mount, to transfer work units between the N-H Master and Slaves.
An important issue in distributed computing is determining the best cell size. If the cells are too small, more time can be spent packaging, transferring, and unpacking them than calculating them. If the cells are too big, each one may take so long to calculate that no performance benefit is gained (unless the simulation must be broken up into cells simply to be computable). The following figure illustrates this issue.
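The trade-off can be made concrete with a toy cost model. The sketch below assumes, purely for illustration, that the calculation time is split evenly across cells while each cell adds a fixed packaging and transfer overhead. Neither the model nor the numbers come from NanoHive-1, and `iteration_time` and `best_cell_count` are hypothetical helpers.

```python
def iteration_time(cells, compute_seconds, overhead_seconds):
    """Modelled time per iteration for a given number of cells.

    Toy model (an assumption, not NanoHive-1's measured behaviour):
    the calculation is split evenly across cells, and every cell adds
    a fixed packaging/transfer overhead.
    """
    return compute_seconds / cells + overhead_seconds * cells


def best_cell_count(compute_seconds, overhead_seconds, max_cells):
    """Return the cell count in 1..max_cells with the lowest modelled time."""
    return min(
        range(1, max_cells + 1),
        key=lambda n: iteration_time(n, compute_seconds, overhead_seconds),
    )


if __name__ == "__main__":
    # Illustrative numbers only: 72 s of calculation, 2 s overhead per cell.
    for n in range(1, 13):
        print(n, round(iteration_time(n, 72.0, 2.0), 2))
    print("best cell count:", best_cell_count(72.0, 2.0, 12))
```

In this model the time falls as cells are added, reaches a minimum, and then rises again as overhead dominates. Real systems will deviate from it, so measured timings remain the better guide.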
Ideally, one can perform timing experiments to determine the best cell size and appropriate number of computers for the network. You may, however, be limited in terms of the number of atoms your chosen Physical Interaction Plugin can handle (AIREBO has such limits in the tens of thousands), or how many computers you have in your network.
The following table shows some timing experiment results for a relatively sparse hydrocarbon system of ~20k atoms, calculated with the AIREBO plugin. It indicates that breaking the system into 6 cells, computed across 6 computers, is best; adding a seventh cell and computer increases the average iteration time again.
Table 4.4. Simulation Parameters
| Cells in System | Computers in Network | Average Iteration Processing Time (s) |
|---|---|---|
| 2 | 2 | 23.35 |
| 3 | 3 | 21.11 |
| 4 | 4 | 16.35 |
| 5 | 5 | 15.90 |
| 6 | 6 | 15.07 |
| 7 | 7 | 16.15 |
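As a small worked example, the timings in the table above can be compared programmatically. This is only a sketch: the numbers are copied from the table, and `fastest_configuration` and `relative_speedup` are hypothetical helper names, not part of NanoHive-1.

```python
# Timing results copied from the table above:
# cells (= computers) -> average iteration processing time in seconds.
timings = {2: 23.35, 3: 21.11, 4: 16.35, 5: 15.90, 6: 15.07, 7: 16.15}


def fastest_configuration(results):
    """Return the cell count with the lowest average iteration time."""
    return min(results, key=results.get)


def relative_speedup(results, baseline):
    """Speedup of each configuration relative to the baseline cell count."""
    base = results[baseline]
    return {cells: base / seconds for cells, seconds in results.items()}


if __name__ == "__main__":
    print("fastest:", fastest_configuration(timings), "cells")
    for cells, factor in relative_speedup(timings, baseline=2).items():
        print(cells, "cells:", round(factor, 2), "x vs. 2 cells")
```

The same two functions can be reused on timings from your own experiments by replacing the `timings` dictionary.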
The N-H Master's configuration begins with the base and staging directories:

    nanohive.baseDirectory=/home/bhelfric/hp-ux2/share/NanoHive-1
    nanohive.stagingDirectory=/scratch_local/bhelfric/wutemp

The nanohive.stagingDirectory key specifies the directory in which work units that are to be processed are initially written. This should be a fast directory local to the master, not an NFS mount.
    # Logging
    logging.outputDirectory=/home/bhelfric/.NanoHive-1/log
    logging.outputLevel=700
    simspec.schema=data/simulation-1.0.1.xsd

These lines specify the directory in which to write the log file, enable debugging log entries, and select the latest simulation specification schema. Nothing here is specific to distributed computing (DC).
    # Simulation Control Plugins
    commandqueue.plugin.0=ConsoleCommand
    ConsoleCommand.resultCodesFile=data/local/en_resultCodes.txt
    commandqueue.plugin.1=SocketsControl
    SocketsControl.ipAddress=192.168.18.113
    SocketsControl.port=3000
    SocketsControl.clientTimeout=30000

We're choosing the ConsoleCommand plugin so we can control N-H directly from the console, and the SocketsControl plugin to control N-H remotely with a client such as HiveKeeper, or nhClient.py.
    # Entity Management Plugin
    entityManager.plugin=RAMEntityManager

    # Data Import/Export Plugins
    entityManager.importExport.0.plugin=nanoML_ImportExport
    entityManager.importExport.0.importFormats=nanoML
    entityManager.importExport.0.exportFormats=nanoML
    entityManager.importExport.1.plugin=OpenBabelImportExport
    entityManager.importExport.1.importFormats=OpenBabel
    entityManager.importExport.1.exportFormats=ALC,BS,CACCRT,CACINT,CACHE,CT,CSSR,
      BOX,DMOL,FEAT,FH,GAMIN,INP,GCART,GAU,MM1GP,GR96A,GR96N,HIN,JIN,BIN,MMD,MMOD,
      OUT,DAT,SDF,SD,MDL,MOL,MOPCRT,BGF,CSR,NW,PDB,REPORT,QCIN,SMI,FIX,MOL2,TXYZ,
      TXT,UNIXYZ,XED,XYZ,CML
    entityManager.importExport.2.plugin=nanorexMMP_ImportExport
    entityManager.importExport.2.importFormats=nanorexMMP
    entityManager.importExport.2.exportFormats=nanorexMMP

We use the in-memory entity manager, and specify a wide range of file formats which can be read/written.
In the meantime, we recommend using the NH_SimStateImportExport plugin to write out the simulation's state periodically. The simulation can then be manually recovered and restarted from the last state save. Details on how to do this are in the section describing simulation configuration.
    # Physical Interaction Calculators (PICs)
    pic.0.name=neu-farm
    pic.0.type=distributed
    pic.0.picPlugin=SocketsPIC_Control

The Physical Interaction Calculator (PIC) is named neu-farm. It is distributed, as opposed to running locally, and its DistributedPIC_ControlPlugin implementation is the SocketsPIC_Control plugin.
    pic.0.communicationTimeout=5
    pic.0.badSlaveRepoolInterval=60
    pic.0.transportMethod=NFS
    pic.0.slaveDescriptors=
      192.168.18.114;3001;/scratch_global/bhelfric/wutemp/slave1,
      192.168.18.115;3002;/scratch_global/bhelfric/wutemp/slave2,
      192.168.18.116;3003;/scratch_global/bhelfric/wutemp/slave3,
      192.168.18.117;3004;/scratch_global/bhelfric/wutemp/slave4,
      192.168.18.119;3005;/scratch_global/bhelfric/wutemp/slave5

These key/value pairs are parameters to the SocketsPIC_Control plugin. They give N-H slaves 5 seconds to respond to commands, re-check slaves that appear to be down or otherwise unavailable every 60 seconds, and use the network file system (NFS) to transfer work units (as opposed to FTP).
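Each comma-separated entry in pic.0.slaveDescriptors describes one slave with three semicolon-separated fields: the slave's IP address, its port, and the directory used for its work units (in this example, the same path as the slave's own nanohive.stagingDirectory).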
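Judging from the example values, each descriptor appears to have the form `ip;port;directory`, with descriptors separated by commas. The following sketch splits such a value into structured records. It is not NanoHive-1's own parser; `SlaveDescriptor` and `parse_slave_descriptors` are hypothetical names, and the field meanings are inferred from the example configuration.

```python
from typing import List, NamedTuple


class SlaveDescriptor(NamedTuple):
    """One slave entry; field meanings are inferred from the example values."""
    ip_address: str
    port: int
    staging_directory: str


def parse_slave_descriptors(value: str) -> List[SlaveDescriptor]:
    """Split a comma-separated list of 'ip;port;directory' entries.

    Raises ValueError if an entry does not have exactly three fields
    or if the port is not an integer.
    """
    descriptors = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma or blank continuation line
        ip_address, port, directory = entry.split(";")
        descriptors.append(
            SlaveDescriptor(ip_address.strip(), int(port), directory.strip())
        )
    return descriptors


if __name__ == "__main__":
    example = """
      192.168.18.114;3001;/scratch_global/bhelfric/wutemp/slave1,
      192.168.18.115;3002;/scratch_global/bhelfric/wutemp/slave2
    """
    for slave in parse_slave_descriptors(example):
        print(slave.ip_address, slave.port, slave.staging_directory)
```

Parsing the descriptors this way makes it easy to spot a malformed entry before starting the master.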
    pic.0.pipPlugin.0=AIREBO
    pic.0.pipPlugin.1=MPQC_SClib

These pic.0.pipPlugin lines specify which plugins are available on the slaves.
The configuration for the first slave follows. Its nanohive.stagingDirectory, SocketsControl.ipAddress, and SocketsControl.port match the first entry in the master's pic.0.slaveDescriptors, and its PIC is local rather than distributed:

    nanohive.baseDirectory=/home/bhelfric/hp-ux2/share/NanoHive-1
    nanohive.stagingDirectory=/scratch_global/bhelfric/wutemp/slave1

    # Logging
    logging.outputDirectory=/home/bhelfric/.NanoHive-1/log/slave1
    logging.outputLevel=700
    simspec.schema=data/simulation-1.0.1.xsd

    # Simulation Control Plugins
    commandqueue.plugin.0=SocketsControl
    SocketsControl.ipAddress=192.168.18.114
    SocketsControl.port=3001
    SocketsControl.clientTimeout=30000

    # Entity Management Plugin
    entityManager.plugin=RAMEntityManager

    # Physical Interaction Calculators (PICs)
    pic.0.name=cousteau
    pic.0.type=local
    pic.0.pipPlugin.0=AIREBO
    pic.0.pipPlugin.1=MPQC_SClib
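Because the master's slave descriptor and the slave's own settings must agree, a quick consistency check can catch typos. The sketch below is not a NanoHive-1 tool. It assumes one key=value pair per line with `#` comments, and it assumes (from the example values only) that the three descriptor fields correspond to SocketsControl.ipAddress, SocketsControl.port, and nanohive.stagingDirectory on the slave. `parse_properties` and `descriptor_mismatches` are hypothetical helpers.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def descriptor_mismatches(descriptor, slave_props):
    """Return the slave keys whose values disagree with an 'ip;port;dir' entry.

    The key-to-field mapping is an assumption inferred from the example
    configuration, not documented NanoHive-1 behaviour.
    """
    ip_address, port, directory = descriptor.strip().split(";")
    expected = {
        "SocketsControl.ipAddress": ip_address,
        "SocketsControl.port": port,
        "nanohive.stagingDirectory": directory,
    }
    return [key for key, value in expected.items() if slave_props.get(key) != value]


# Excerpt of the slave configuration shown above.
SLAVE1_CONFIG = """
nanohive.stagingDirectory=/scratch_global/bhelfric/wutemp/slave1
# Simulation Control Plugins
commandqueue.plugin.0=SocketsControl
SocketsControl.ipAddress=192.168.18.114
SocketsControl.port=3001
"""

if __name__ == "__main__":
    slave = parse_properties(SLAVE1_CONFIG)
    entry = "192.168.18.114;3001;/scratch_global/bhelfric/wutemp/slave1"
    print(descriptor_mismatches(entry, slave))  # empty list when consistent
```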
Once the N-H Master and Slaves are up and running, simulations can be loaded and run per usual. The preceding section describes how to load and run simulations.
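Before loading a simulation, it can be handy to confirm that each slave is at least accepting TCP connections on its SocketsControl port. The following sketch only attempts a plain TCP connection using Python's standard library; it does not speak the NanoHive-1 command protocol, and `is_listening` is a hypothetical helper. The addresses are the ones from the example network.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Addresses and ports taken from the example pic.0.slaveDescriptors.
    slaves = [
        ("192.168.18.114", 3001),
        ("192.168.18.115", 3002),
        ("192.168.18.116", 3003),
        ("192.168.18.117", 3004),
        ("192.168.18.119", 3005),
    ]
    for host, port in slaves:
        state = "up" if is_listening(host, port) else "not reachable"
        print(f"{host}:{port} {state}")
```

A successful connection shows only that something is listening on the port, not that the slave is correctly configured.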