5. Running a Coco/Amber Workload

5.1. Introduction

CoCo (“Complementary Coordinates”) uses a PCA-based method to analyse trajectory data and identify potentially under-sampled regions of conformational space. For an introduction to CoCo, including how it can be used as a stand-alone tool, see here.

The ExTASY workflow uses cycles of CoCo and MD simulation to rapidly generate a diverse ensemble of conformations of a chosen molecule. A typical use is exploring the conformational flexibility of a protein’s ligand-binding site and generating diverse conformations for docking studies. The basic workflow is as follows:

  1. The input ensemble (typically the trajectory file from a short preliminary simulation, but potentially just a single structure) is analysed by CoCo, which identifies N possible but so-far-unsampled conformations.
  2. N independent short MD simulations are run, one starting from each of these points.
  3. The resulting trajectory files are added to the input ensemble, and CoCo analysis is performed on the full set, identifying N new points.
  4. Steps 2 and 3 are repeated, building up an ensemble of an increasing number of short but diverse trajectory files, for as many cycles as the user chooses (a schematic sketch of this loop is shown below).
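
The cycle can be summarised in pseudocode. The sketch below is purely illustrative of the control flow and is not the actual ExTASY implementation; coco_analysis and run_md are hypothetical placeholder functions.

# Illustrative sketch of the CoCo/MD cycle (not the real ExTASY code).

def coco_analysis(ensemble, n_points):
    """Placeholder: run CoCo on the ensemble, return n_points new start structures."""
    raise NotImplementedError

def run_md(start_point):
    """Placeholder: minimise the start structure, then run a short MD simulation."""
    raise NotImplementedError

def explore(initial_ensemble, n_points, n_cycles):
    ensemble = list(initial_ensemble)                  # trajectories analysed so far
    for cycle in range(n_cycles):
        # 1. CoCo picks n_points so-far-unsampled conformations from the whole ensemble
        start_points = coco_analysis(ensemble, n_points)
        # 2. run n_points independent short MD simulations, one per start point
        new_trajectories = [run_md(p) for p in start_points]
        # 3. add the new trajectories to the ensemble for the next round of analysis
        ensemble += new_trajectories
    return ensemble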

In common with the other ExTASY workflows, a user prepares the necessary input files and ExTASY configuration files on their local workstation, and launches the job from there, but the calculations are then performed on the execution host, which is typically an HPC resource.

This release of ExTASY has a few restrictions:

  1. The MD simulations can only be performed using AMBER or GROMACS.
  2. The system to be simulated cannot contain any non-standard residues (i.e., any not found in the default AMBER residue library).

5.2. Required Input files

The Amber/CoCo workflow requires the user to prepare four AMBER-style files, and two ExTASY configuration files. For more information about Amber-specific file formats see here.

  1. A topology file for the system (Amber .top format).
  2. An initial structure (Amber .crd format) or ensemble (any Amber trajectory format).
  3. A simple minimisation input script (.in format). This will be used to refine each structure produced by CoCo before it is used for MD.
  4. An MD input script (.in format).
  5. An ExTASY Resource configuration (.rcfg) file.
  6. An ExTASY Workload configuration (.wcfg) file.

Here is an example of a typical minimisation input script (min.in):

Basic minimisation, weakly restraining backbone so it does not drift too far
from CoCo-generated conformation
    &cntrl
        imin=1, maxcyc=500,
        ntpr=50,
        ntr=1,
        ntb=0, cut=25.0, igb=2,
    &end
Atoms to be restrained
0.1
FIND
CA * * *
N * * *
C * * *
O * * *
SEARCH
RES 1 999
END
END

Here is an example of a typical MD input script (mdshort.in). Note that nstlim=50000 steps with dt=0.002 ps corresponds to 100 ps (0.1 ns) of dynamics per simulation:

0.1 ns GBSA sim
    &cntrl
        imin=0, ntx=1,
        ntpr=1000, ntwr=1000, ntwx=500,
        ioutfm=1,
        nstlim=50000, dt=0.002,
        ntt=3, ig=-1, gamma_ln=5.0,
        ntc=2, ntf=2,
        ntb=0, cut=25.0, igb=2,
    &end

The resource and workload configuration files are specific to the execution resource and are discussed in the following sections: section 5.3 covers execution on Stampede, and section 5.4 covers execution on Archer.

5.3. Running on Stampede

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Stampede.
  2. The workload configuration file defines the CoCo/Amber workload itself. The configuration file given in this example is meant strictly for the CoCo/Amber use case.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the configuration files and the input files using the following commands.

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/coam-on-stampede.tar
tar xf coam-on-stampede.tar
cd coam-on-stampede

Step 3: In the coam-on-stampede folder, a resource configuration file stampede.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

REMOTE_HOST = 'xsede.stampede'            # Label/Name of the Remote Machine
UNAME       = 'username'                  # Username on the Remote Machine
ALLOCATION  = 'TG-MCB090174'              # Allocation to be charged
WALLTIME    = 20                          # Walltime to be requested for the pilot
PILOTSIZE   = 16                          # Number of cores to be reserved
WORKDIR     = None                        # Working directory on the remote machine
QUEUE       = 'development'               # Name of the queue in the remote machine

DBURL       = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'          #MongoDB link to be used for coordination purposes

Step 4: In the coam-on-stampede folder, a workload configuration file cocoamber.wcfg exists. Details and modifications required are as follows:

#-------------------------Applications----------------------
simulator                = 'Amber'          # Simulator to be loaded
analyzer                 = 'CoCo'           # Analyzer to be loaded

#-------------------------General---------------------------
num_iterations          = 2                 # Number of iterations of Simulation-Analysis
start_iter              = 0                 # Iteration number with which to start
num_CUs                 = 16                # Number of tasks or Compute Units
nsave                   = 2                 # Iterations after which output is transferred to local machine

#-------------------------Simulation-----------------------
num_cores_per_sim_cu    = 2                 # Number of cores per Simulation Compute Units
md_input_file           = './inp_files/mdshort.in'    # Entire path to MD Input file - Do not use $HOME or the likes
minimization_input_file = './inp_files/min.in'        # Entire path to Minimization file - Do not use $HOME or the likes
initial_crd_file        = './inp_files/penta.crd'     # Entire path to Coordinates file - Do not use $HOME or the likes
top_file                = './inp_files/penta.top'     # Entire path to Topology file - Do not use $HOME or the likes
ref_file                = './inp_files/penta.pdb'     # Path to file with reference coordinates that will be used as an auxiliary file to read the trajectory files
logfile                 = 'coco.log'        # Name of the log file created by pyCoCo
atom_selection          = 'protein'

#-------------------------Analysis--------------------------
grid                    = '5'               # Number of points along each dimension of the CoCo histogram
dims                    = '3'               # The number of projections to consider from the input pcz file

Note

All the parameters in the above example file are mandatory for amber-coco. There are no other parameters currently supported.
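
Because the .wcfg (and .rcfg) files are written as plain Python assignments, you can sanity-check your edits before launching. The snippet below is a minimal, hypothetical pre-flight check (it is not part of ExTASY) that loads cocoamber.wcfg and verifies that every input file it references exists:

# Hypothetical pre-flight check for cocoamber.wcfg (not part of ExTASY).
import os

cfg = {}
with open('cocoamber.wcfg') as f:
    exec(f.read(), cfg)                    # the .wcfg file is plain Python assignments

# every *_file entry should point at a file that exists on the local machine
for key in ('md_input_file', 'minimization_input_file',
            'initial_crd_file', 'top_file', 'ref_file'):
    path = cfg[key]
    status = 'ok' if os.path.isfile(path) else 'MISSING'
    print('%-25s %-35s %s' % (key, path, status))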

Step 5: You can find the executable script `extasy_amber_coco.py` in the coam-on-stampede folder.

You can now run the workload using:

python extasy_amber_coco.py --RPconfig stampede.rcfg --Kconfig cocoamber.wcfg

Note

The environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the Python script; this controls the verbosity of the output. For more verbose output, use INFO or DEBUG.
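
In Python, an environment variable is set from within a script via os.environ, so the relevant line in extasy_amber_coco.py will look something like the following (the exact placement in the script may differ):

# Illustration of how the verbosity is set from within a Python script.
import os
os.environ['RADICAL_ENMD_VERBOSE'] = 'REPORT'   # change to 'INFO' or 'DEBUG' for more detail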

Note

Time to completion: ~240 seconds (measured from the time the job passes through the LRMS)

5.4. Running on Archer

This section is to be done entirely on your laptop. The ExTASY tool expects two input files:

  1. The resource configuration file sets the parameters of the HPC resource we want to run the workload on, in this case Archer.
  2. The workload configuration file defines the CoCo/Amber workload itself. The configuration file given in this example is meant strictly for the CoCo/Amber use case.

Step 1: Create a new directory for the example,

mkdir $HOME/extasy-tutorial/
cd $HOME/extasy-tutorial/

Step 2: Download the configuration files and the input files using the following commands.

wget https://bitbucket.org/extasy-project/extasy-workflows/downloads/coam-on-archer.tar
tar xf coam-on-archer.tar
cd coam-on-archer

Step 3: In the coam-on-archer folder, a resource configuration file archer.rcfg exists. Details and modifications required are as follows:

Note

For the purposes of this example, you only need to change:

  • UNAME
  • ALLOCATION

The other parameters in the resource configuration are already set up to successfully execute the workload in this example.

REMOTE_HOST = 'epsrc.archer'              # Label/Name of the Remote Machine
UNAME       = 'username'                  # Username on the Remote Machine
ALLOCATION  = 'e290'                      # Allocation to be charged
WALLTIME    = 20                          # Walltime to be requested for the pilot
PILOTSIZE   = 24                          # Number of cores to be reserved
WORKDIR     = None                        # Working directory on the remote machine
QUEUE       = 'standard'                  # Name of the queue in the remote machine

DBURL       = 'mongodb://extasy:extasyproject@extasy-db.epcc.ed.ac.uk/radicalpilot'          #MongoDB link to be used for coordination purposes

Step 4: In the coam-on-archer folder, a workload configuration file cocoamber.wcfg exists. Details and modifications required are as follows:

#-------------------------Applications----------------------
simulator                = 'Amber'          # Simulator to be loaded
analyzer                 = 'CoCo'           # Analyzer to be loaded

#-------------------------General---------------------------
num_iterations          = 2                 # Number of iterations of Simulation-Analysis
start_iter              = 0                 # Iteration number with which to start
num_CUs                 = 16                # Number of tasks or Compute Units
nsave                   = 2                 # Iterations after which output is transferred to local machine

#-------------------------Simulation-----------------------
num_cores_per_sim_cu    = 2                 # Number of cores per Simulation Compute Units
md_input_file           = './inp_files/mdshort.in'    # Entire path to MD Input file - Do not use $HOME or the likes
minimization_input_file = './inp_files/min.in'        # Entire path to Minimization file - Do not use $HOME or the likes
initial_crd_file        = './inp_files/penta.crd'     # Entire path to Coordinates file - Do not use $HOME or the likes
top_file                = './inp_files/penta.top'     # Entire path to Topology file - Do not use $HOME or the likes
ref_file                = './inp_files/penta.pdb'     # Path to file with reference coordinates that will be used as an auxiliary file to read the trajectory files
logfile                 = 'coco.log'        # Name of the log file created by pyCoCo
atom_selection          = 'protein'

#-------------------------Analysis--------------------------
grid                    = '5'               # Number of points along each dimension of the CoCo histogram
dims                    = '3'               # The number of projections to consider from the input pcz file

Note

All the parameters in the above example file are mandatory for amber-coco. There are no other parameters currently supported.

Step 5: You can find the executable script `extasy_amber_coco.py` in the coam-on-archer folder.

You can now run the workload using:

python extasy_amber_coco.py --RPconfig archer.rcfg --Kconfig cocoamber.wcfg

Note

The environment variable RADICAL_ENMD_VERBOSE is set to REPORT in the Python script; this controls the verbosity of the output. For more verbose output, use INFO or DEBUG.

Note

Time to completion: ~600 seconds (measured from the time the job passes through the LRMS)

5.5. Running on localhost

The above two sections describe execution on XSEDE.Stampede and EPSRC.Archer, assuming you have access to these machines. This section describes the changes required to the EXISTING scripts in order to get CoCo-Amber running on your local machine (the label to be used is local.localhost, as in the generic examples).

Step 1: You might have already guessed the first step: you need to create a SingleClusterEnvironment object targeting the localhost machine. You can either make changes directly to the extasy_amber_coco.py script or create a separate resource configuration file and provide it as an argument.

Step 2: The MD tools require some tool-specific environment variables to be set up (AMBERHOME, PYTHONPATH, GCC, GROMACS_DIR, etc.). In addition, the PATH environment variable needs to point to the MD tool's binaries (if any). Once you have determined all the environment variables to be set, set them in your terminal and test the setup by executing the MD command (possibly on a sample case). For example, if Amber is installed in your home directory as $HOME/amber14, you will probably have to set AMBERHOME to $HOME/amber14 and append $HOME/amber14/bin to PATH. Please check the official documentation of the MD tool.
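
Before touching any kernel definitions, it can help to confirm that the environment is actually set up. The snippet below is a small, hypothetical check (not part of ExTASY), assuming an Amber installation under $HOME/amber14 as in the example above:

# Hypothetical sanity check for a local Amber installation (not part of ExTASY).
import os
from distutils.spawn import find_executable   # on Python 3.3+ shutil.which also works

print('AMBERHOME = %s' % os.environ.get('AMBERHOME'))

sander = find_executable('sander')
if sander:
    print('sander found at %s' % sander)
else:
    print('sander is not on PATH - check AMBERHOME and PATH before proceeding')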

Step 3: There are three options to proceed.

  • Once you have tested the environment setup, you need to add it to the relevant kernel definition. First, locate the file to be modified. All files related to the Ensemble Toolkit are located within the virtualenv (say “myenv”). Go to the following path: myenv/lib/python-2.7/site-packages/radical/ensemblemd/kernel_plugins/md. This directory contains all the kernels used for the MD examples. Open the amber.py file and add an entry for local.localhost (in "machine_configs") as follows:
..
..
"machine_configs":
{

    ..
    ..

    "local.localhost":
    {
        "pre_exec"    : ["export AMBERHOME=$HOME/amber14", "export PATH=$HOME/amber14/bin:$PATH"],
        "executable"  : ["sander"],
        "uses_mpi"    : False       # Could be True or False
    },

    ..
    ..

}
..
..

This would have to be repeated for all the kernels used by the workflow (a small helper for locating these kernel files is sketched after this list).

  • Another option is to perform the same steps as above, but leave the "pre_exec" value as an empty list and set all the environment variables in your bashrc ($HOME/.bashrc). Remember that you would still need to set the executable as above.
  • The third option is to create your own kernel plugin as part of your user script. This avoids the entire procedure of locating the existing kernel plugin files, and it also gets you comfortable with using kernels other than the ones currently available as part of the package. Creating your own kernel plugins is discussed here.
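
If you do not want to hunt for the kernel plugin path by hand, the following sketch locates the MD kernel plugin files inside whichever environment you are currently using. It only assumes that radical.ensemblemd is importable from that environment:

# Locate the Ensemble Toolkit MD kernel plugin files in the active environment.
import os
import glob
import radical.ensemblemd

pkg_dir    = os.path.dirname(radical.ensemblemd.__file__)
md_kernels = os.path.join(pkg_dir, 'kernel_plugins', 'md')

for path in sorted(glob.glob(os.path.join(md_kernels, '*.py'))):
    print(path)   # e.g. amber.py - each kernel used may need a local.localhost entry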

5.6. Understanding the Output of the Examples

On the local machine, an “output” folder is created, and at the end of every checkpoint interval (=nsave) an “iter*” folder is created which contains the files necessary to start the next iteration.

For example, in the case of CoCo-Amber on stampede, for 4 iterations with nsave=2:

coam-on-stampede$ ls
output/  cocoamber.wcfg  mdshort.in  min.in  penta.crd  penta.top  stampede.rcfg

coam-on-stampede/output$ ls
iter1/  iter3/

The “iter*” folder will not contain any of the initial files, such as the topology file, the minimisation input file, etc., since these already exist on the local machine. In CoCo-Amber, the “iter*” folder contains the NetCDF files required to start the next iteration and a logfile of the CoCo stage of the current iteration.

coam-on-stampede/output/iter1$ ls
1_coco.log    md_0_11.ncdf  md_0_14.ncdf  md_0_2.ncdf  md_0_5.ncdf  md_0_8.ncdf  md_1_10.ncdf  md_1_13.ncdf  md_1_1.ncdf  md_1_4.ncdf  md_1_7.ncdf
md_0_0.ncdf   md_0_12.ncdf  md_0_15.ncdf  md_0_3.ncdf  md_0_6.ncdf  md_0_9.ncdf  md_1_11.ncdf  md_1_14.ncdf  md_1_2.ncdf  md_1_5.ncdf  md_1_8.ncdf
md_0_10.ncdf  md_0_13.ncdf  md_0_1.ncdf   md_0_4.ncdf  md_0_7.ncdf  md_1_0.ncdf  md_1_12.ncdf  md_1_15.ncdf  md_1_3.ncdf  md_1_6.ncdf  md_1_9.ncdf
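
To inspect any of these trajectories, load the NetCDF file together with the original topology in an Amber-aware analysis package. A minimal sketch using MDAnalysis (assumed to be installed separately; it is not part of ExTASY) might look like this:

# Sketch: load one of the per-task NetCDF trajectories with MDAnalysis.
import MDAnalysis as mda

u = mda.Universe('penta.top', 'output/iter1/md_1_0.ncdf',
                 topology_format='PRMTOP', format='NCDF')
print('%d atoms, %d frames' % (len(u.atoms), len(u.trajectory)))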

It is important to note that, since in CoCo-Amber all the NetCDF files of previous and current iterations are transferred at each checkpoint, it may be useful to choose longer checkpoint intervals; smaller intervals lead to heavy transfer of redundant data.

On the remote machine, inside the pilot-* folder you can find a folder called “unit.00000”. This location is used to exchange/link/move intermediate data. The shared data is kept in “unit.00000/” and the iteration-specific inputs/outputs can be found in their specific folders (=”unit.00000/iter*”).

$ cd unit.00000/
$ ls
iter0/  iter1/  iter2/  iter3/  mdshort.in  min.in  penta.crd  penta.top  postexec.py