Building and Running GROMACS on Vesta/Mira

The GROMACS molecular dynamics package builds a large number of executables. Some of them, such as g_luck, are simple utilities that do not need to be built for the back end.

Begin by building the serial version of Gromacs (i.e., the version that runs within a single processor, with one or more threads) for the front end, then build the parallel version (i.e., with MPI) for the back end. This way, a full set of executables is available for the front end and another full set for the back end. The steps below demonstrate how to build with the IBM [mpi]xl<c | cxx | f77>_r compilers, with double precision enabled.

Step Zero:


For BG/P, download gromacs-4.5.5.tar.gz, untar/unzip it, and in the gromacs-4.5.5 directory create the following directories: BGP/fen, BGP/ben, and BGP/scaling.

For BG/Q, download gromacs-4.6.1.tar.gz, untar/unzip it, and verify that you have CMake version 2.8.x or later in /soft/buildtools/cmake/.

Step One (Blue Gene/P only): Building the serial version of Gromacs for the front end nodes (fen)


Get into the gromacs-4.5.5/BGP/fen directory and issue the configure command (note that multi-word flag values must be quoted so configure sees them as single arguments):

../../configure --prefix=`pwd` --disable-ppc-altivec --with-fft=fftw3 --without-x CC="xlc_r -q64" CFLAGS="-O3 -qarch=auto -qtune=auto" CXX="xlC_r -q64" CXXFLAGS="-O3 -qarch=auto -qtune=auto" CPPFLAGS=-I/soft/apps/fftw-3.1.2-double/include F77="xlf_r -q64" FFLAGS="-O3 -qnoprefetch -qarch=auto -qtune=auto" LDFLAGS="-L/soft/apps/ibmcmp/xlmass/bg/4.4/lib -L/soft/apps/fftw-3.1.2-double/lib" --enable-double --program-prefix=BGP_fen_ --program-suffix=_serial_d --enable-all-static

It is important to correct any errors in the above configure step at this point. Once the configure step completes satisfactorily, issue the following commands:

gmake; gmake install; gmake mdrun; gmake install-mdrun
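The four commands can also be chained in a short script so that a failure in any step stops the build immediately; a minimal sketch, to be run from the BGP/fen directory after configure succeeds:

```shell
#!/bin/sh
set -e               # abort on the first failing step
gmake                # build the tools and libraries
gmake install        # install them under the --prefix given to configure
gmake mdrun          # build mdrun
gmake install-mdrun  # install mdrun
```

After these complete, the listing of BGP/fen/bin should match the set below.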


BGP_fen_do_dssp_serial_d        BGP_fen_g_options_serial_d

BGP_fen_editconf_serial_d       BGP_fen_g_order_serial_d

BGP_fen_eneconv_serial_d        BGP_fen_g_pme_error_serial_d

BGP_fen_g_anadock_serial_d      BGP_fen_g_polystat_serial_d

BGP_fen_g_anaeig_serial_d       BGP_fen_g_potential_serial_d

BGP_fen_g_analyze_serial_d      BGP_fen_g_principal_serial_d

BGP_fen_g_angle_serial_d        BGP_fen_g_protonate_serial_d

BGP_fen_g_bar_serial_d          BGP_fen_g_rama_serial_d

BGP_fen_g_bond_serial_d         BGP_fen_g_rdf_serial_d

BGP_fen_g_bundle_serial_d       BGP_fen_g_rmsdist_serial_d

BGP_fen_g_chi_serial_d          BGP_fen_g_rmsf_serial_d

BGP_fen_g_cluster_serial_d      BGP_fen_g_rms_serial_d

BGP_fen_g_clustsize_serial_d    BGP_fen_grompp_serial_d

BGP_fen_g_confrms_serial_d      BGP_fen_g_rotacf_serial_d

BGP_fen_g_covar_serial_d        BGP_fen_g_rotmat_serial_d

BGP_fen_g_current_serial_d      BGP_fen_g_saltbr_serial_d

BGP_fen_g_density_serial_d      BGP_fen_g_sas_serial_d

BGP_fen_g_densmap_serial_d      BGP_fen_g_select_serial_d

BGP_fen_g_densorder_serial_d    BGP_fen_g_sgangle_serial_d

BGP_fen_g_dielectric_serial_d   BGP_fen_g_sham_serial_d

BGP_fen_g_dih_serial_d          BGP_fen_g_sigeps_serial_d

BGP_fen_g_dipoles_serial_d      BGP_fen_g_sorient_serial_d

BGP_fen_g_disre_serial_d        BGP_fen_g_spatial_serial_d

BGP_fen_g_dist_serial_d         BGP_fen_g_spol_serial_d

BGP_fen_g_dos_serial_d          BGP_fen_g_tcaf_serial_d

BGP_fen_g_dyndom_serial_d       BGP_fen_g_traj_serial_d

BGP_fen_genbox_serial_d         BGP_fen_g_tune_pme_serial_d

BGP_fen_genconf_serial_d        BGP_fen_g_vanhove_serial_d

BGP_fen_g_enemat_serial_d       BGP_fen_g_velacc_serial_d

BGP_fen_g_energy_serial_d       BGP_fen_g_wham_serial_d

BGP_fen_genion_serial_d         BGP_fen_g_wheel_serial_d

BGP_fen_genrestr_serial_d       BGP_fen_g_x2top_serial_d

BGP_fen_g_filter_serial_d       BGP_fen_make_edi_serial_d

BGP_fen_g_gyrate_serial_d       BGP_fen_make_ndx_serial_d

BGP_fen_g_h2order_serial_d      BGP_fen_mdrun_serial_d

BGP_fen_g_hbond_serial_d        BGP_fen_mk_angndx_serial_d

BGP_fen_g_helixorient_serial_d  BGP_fen_pdb2gmx_serial_d

BGP_fen_g_helix_serial_d        BGP_fen_tpbconv_serial_d

BGP_fen_g_hydorder_serial_d     BGP_fen_trjcat_serial_d

BGP_fen_g_kinetics_serial_d     BGP_fen_trjconv_serial_d

BGP_fen_g_lie_serial_d          BGP_fen_trjorder_serial_d

BGP_fen_g_luck_serial_d         BGP_fen_xpm2ps_serial_d

BGP_fen_g_mdmat_serial_d        completion.bash

BGP_fen_g_membed_serial_d       completion.csh

BGP_fen_g_mindist_serial_d      completion.zsh


BGP_fen_g_msd_serial_d          GMXRC

BGP_fen_gmxcheck_serial_d       GMXRC.bash

BGP_fen_gmxdump_serial_d        GMXRC.csh

BGP_fen_g_nmeig_serial_d        GMXRC.zsh




Once the build and install complete (this takes some time), the executables listed above will be available in the BGP/fen/bin directory.

NOTE: These executables may look unfamiliar because the BGP_fen_ program prefix has been added to show that they are built for the BG/P fen. The _serial_d program suffix denotes that they are built without MPI and with double precision enabled.

To confirm that the executables have been built correctly, select one, such as BGP_fen_g_luck_serial_d, and run it from the bin directory:

<your_prompt>./BGP_fen_g_luck_serial_d

If it runs and prints its usual output (for g_luck, a short GROMACS quote) with no errors, you will know the executables were built correctly.
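Beyond spot-checking one binary, you can verify that everything installed under bin/ is actually executable. A sketch (the directory argument is whatever bin/ your --prefix produced):

```shell
# check_bin DIR: report any file in DIR that is not executable.
check_bin() {
    ok=1
    for f in "$1"/*; do
        [ -x "$f" ] || { echo "not executable: $f"; ok=0; }
    done
    [ "$ok" -eq 1 ] && echo "all executables in $1 look OK"
}

# Example (assumed install location):
# check_bin /path/to/gromacs-4.5.5/BGP/fen/bin
```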


Step Two: Building the MPI version of Gromacs for the back-end nodes (ben)


On BG/P, building Gromacs for the back end requires going to the gromacs-4.5.5/BGP/ben directory and typing:

../../configure --prefix=`pwd` --host=ppc --build=ppc64 --enable-bluegene --enable-fortran --enable-mpi --with-fft=fftw3 --without-x --program-prefix=BGP_ --program-suffix=_mpi_d CC=mpixlc_r CFLAGS="-O3 -qarch=450d -qtune=450" MPICC=mpixlc_r CXX=mpixlC_r CXXFLAGS="-O3 -qarch=450d -qtune=450" CPPFLAGS=-I/soft/apps/fftw-3.1.2-double/include F77=mpixlf77_r FFLAGS="-O3 -qarch=auto -qtune=auto" LDFLAGS=-L/soft/apps/fftw-3.1.2-double/lib --enable-double

To build Gromacs for the back-end nodes on BG/Q, use CMake:

/soft/buildtools/cmake/current/gnu/fen/bin/cmake -DGMX_CPU_ACCELERATION=None -DFFTWF_INCLUDE_DIR=/soft/libraries/alcf/current/xl/FFTW3/include -DFFTWF_LIBRARY=/soft/libraries/alcf/current/xl/FFTW3/lib/libfftw3f.a -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C -DBUILD_SHARED_LIBS=OFF -DCMAKE_C_FLAGS="-O3 -qsmp=omp -qarch=qp -qtune=qp" -DCMAKE_INSTALL_PREFIX=/home/username/gromacs-4.6.1/exe

Once the configure stage is completed, type:

make -j8 mdrun

make install-mdrun

make install
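For reference, the whole BG/Q configure-and-build sequence can be kept in one script. A sketch, assuming an out-of-source build directory inside the gromacs-4.6.1 tree (the build directory, install prefix, and trailing source path are assumptions; adjust for your account):

```shell
#!/bin/sh
set -e
# Assumed: build out-of-source in a build/ subdirectory of the tree.
mkdir -p "$HOME/gromacs-4.6.1/build" && cd "$HOME/gromacs-4.6.1/build"
/soft/buildtools/cmake/current/gnu/fen/bin/cmake \
    -DGMX_CPU_ACCELERATION=None \
    -DFFTWF_INCLUDE_DIR=/soft/libraries/alcf/current/xl/FFTW3/include \
    -DFFTWF_LIBRARY=/soft/libraries/alcf/current/xl/FFTW3/lib/libfftw3f.a \
    -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_C_FLAGS="-O3 -qsmp=omp -qarch=qp -qtune=qp" \
    -DCMAKE_INSTALL_PREFIX="$HOME/gromacs-4.6.1/exe" \
    ..   # path to the GROMACS source tree (assumed layout)
make -j8 mdrun
make install-mdrun
make install
```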

This build takes even longer than the front-end one. Once it completes, the following executables should be visible in the BGP/ben/bin directory:


BGP_do_dssp_mpi_d       BGP_g_hbond_mpi_d        BGP_g_sgangle_mpi_d

BGP_editconf_mpi_d      BGP_g_helix_mpi_d        BGP_g_sham_mpi_d

BGP_eneconv_mpi_d       BGP_g_helixorient_mpi_d  BGP_g_sigeps_mpi_d

BGP_g_anadock_mpi_d     BGP_g_hydorder_mpi_d     BGP_g_sorient_mpi_d

BGP_g_anaeig_mpi_d      BGP_g_kinetics_mpi_d     BGP_g_spatial_mpi_d

BGP_g_analyze_mpi_d     BGP_g_lie_mpi_d          BGP_g_spol_mpi_d

BGP_g_angle_mpi_d       BGP_g_luck_mpi_d         BGP_g_tcaf_mpi_d

BGP_g_bar_mpi_d         BGP_g_mdmat_mpi_d        BGP_g_traj_mpi_d

BGP_g_bond_mpi_d        BGP_g_membed_mpi_d       BGP_g_tune_pme_mpi_d

BGP_g_bundle_mpi_d      BGP_g_mindist_mpi_d      BGP_g_vanhove_mpi_d

BGP_g_chi_mpi_d         BGP_g_morph_mpi_d        BGP_g_velacc_mpi_d

BGP_g_cluster_mpi_d     BGP_g_msd_mpi_d          BGP_g_wham_mpi_d

BGP_g_clustsize_mpi_d   BGP_gmxcheck_mpi_d       BGP_g_wheel_mpi_d

BGP_g_confrms_mpi_d     BGP_gmxdump_mpi_d        BGP_g_x2top_mpi_d

BGP_g_covar_mpi_d       BGP_g_nmeig_mpi_d        BGP_make_edi_mpi_d

BGP_g_current_mpi_d     BGP_g_nmens_mpi_d        BGP_make_ndx_mpi_d

BGP_g_density_mpi_d     BGP_g_nmtraj_mpi_d       BGP_mdrun_mpi_d

BGP_g_densmap_mpi_d     BGP_g_options_mpi_d      BGP_mk_angndx_mpi_d

BGP_g_densorder_mpi_d   BGP_g_order_mpi_d        BGP_pdb2gmx_mpi_d

BGP_g_dielectric_mpi_d  BGP_g_pme_error_mpi_d    BGP_tpbconv_mpi_d

BGP_g_dih_mpi_d         BGP_g_polystat_mpi_d     BGP_trjcat_mpi_d

BGP_g_dipoles_mpi_d     BGP_g_potential_mpi_d    BGP_trjconv_mpi_d

BGP_g_disre_mpi_d       BGP_g_principal_mpi_d    BGP_trjorder_mpi_d

BGP_g_dist_mpi_d        BGP_g_protonate_mpi_d    BGP_xpm2ps_mpi_d

BGP_g_dos_mpi_d         BGP_g_rama_mpi_d         completion.bash

BGP_g_dyndom_mpi_d      BGP_g_rdf_mpi_d          completion.csh

BGP_genbox_mpi_d        BGP_g_rmsdist_mpi_d      completion.zsh

BGP_genconf_mpi_d       BGP_g_rmsf_mpi_d

BGP_g_enemat_mpi_d      BGP_g_rms_mpi_d          GMXRC

BGP_g_energy_mpi_d      BGP_grompp_mpi_d         GMXRC.bash

BGP_genion_mpi_d        BGP_g_rotacf_mpi_d       GMXRC.csh

BGP_genrestr_mpi_d      BGP_g_rotmat_mpi_d       GMXRC.zsh

BGP_g_filter_mpi_d      BGP_g_saltbr_mpi_d

BGP_g_gyrate_mpi_d      BGP_g_sas_mpi_d

BGP_g_h2order_mpi_d     BGP_g_select_mpi_d


These executables may look unfamiliar because the program-prefix and program-suffix have been set differently.

Step Three: Running the d.dppc benchmark


Once you have built the executables, do a trial run of the d.dppc benchmark. Download the benchmark tarball and untar/unzip it in the BGP/scaling directory. The following directories should then be visible in BGP/scaling:


d.dppc  d.lzm  d.poly-ch2  d.villin  src


Change into the d.dppc directory and change nsteps in the grompp.mdp file from 50000 to 150000 time steps.
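The same edit can be done with sed; a minimal sketch, shown here on a sample line rather than touching the real file (apply with sed -i to grompp.mdp):

```shell
# Substitute the nsteps value; the pattern assumes the line starts
# with "nsteps" followed by "=".
echo "nsteps = 50000" | sed 's/^nsteps[[:space:]]*=.*/nsteps = 150000/'
```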

Next, run BGP_fen_grompp_serial_d from your prompt to generate the topo.tpr file:

<your_prompt>./BGP_fen_grompp_serial_d -v

Now you are set to run the BGP_mdrun_mpi_d executable on 128, 256, 512 and 1024 cores (not nodes) in vn mode on the BG/P.

While in the BGP/scaling/d.dppc directory, issue the following command from your prompt:

<your_prompt>qsub -t 60 -n 32 --proccount 128 --mode vn -A <your_project> --env GMX_MAXBACKUP=-1 /your/path/to/gromacs-4.5.5/BGP/ben/bin/BGP_mdrun_mpi_d

On BG/Q, the command line is:

<your_prompt>qsub -t 60 -n 8 --mode c16 -A <your_project> --env OMP_NUM_THREADS=4 /your/path/to/gromacs-4.6.1/exe/bin/mdrun

Note that each A2 core on BG/Q has 4 hardware threads, so a pure MPI run does not show acceptable performance; OMP_NUM_THREADS=4 runs 4 OpenMP threads per MPI rank.

Once the 128-core run completes its 150,000 time steps, look at the md.log file; you should see something like this in the last 100 (or so) lines:

M E G A - F L O P S   A C C O U N T I N G

RF=Reaction-Field  FE=Free Energy  SCFE=Soft-Core/Free Energy

T=Tabulated        W3=SPC/TIP3p    W4=TIP4p (single or pairs)

NF=No Forces

Computing:                               M-Number         M-Flops  % Flops


 LJ                                  411833.412957    13590502.628    11.1

 Coulomb                             358444.507842     9678001.712     7.9

 Coulomb [W3]                         44714.234925     3577138.794     2.9

 Coulomb [W3-W3]                      69260.780760    16207022.698    13.2

 Coulomb + LJ                        209386.799862     7956698.395     6.5

 Coulomb + LJ [W3]                    84552.748183     7694300.085     6.3

 Coulomb + LJ [W3-W3]                167815.302964    41114749.226    33.6

 Outer nonbonded loop                 77700.700763      777007.008     0.6

 1,4 nonbonded interactions            4454.429696      400898.673     0.3

 NS-Pairs                            706395.089088    14834296.871    12.1

 Reset In Box                           906.300416        2718.901     0.0

 CG-CoM                                1828.083712        5484.251     0.0

 Angles                                8755.258368     1470883.406     1.2

 Propers                               2611.217408      597968.786     0.5

 Impropers                              460.803072       95847.039     0.1

 RB-Dihedrals                          3686.424576      910546.870     0.7

 Virial                                1914.367616       34458.617     0.0

 Stop-CM                               1827.961856       18279.619     0.0

 Calc-Ekin                            18278.643712      493523.380     0.4

 Lincs                                12554.228420      753253.705     0.6

 Lincs-Mat                           170431.189692      681724.759     0.6

 Constraint-V                         33424.167548      267393.340     0.2

 Constraint-Vir                        2201.252999       52830.072     0.0

 Settle                                3886.174208     1255234.269     1.0


 Total                                               122470763.103   100.0



av. #atoms communicated per step for force:  2 x 483463.3

av. #atoms communicated per step for LINCS:  2 x 30272.7

Average load imbalance: 2.2 %

Part of the total run time spent waiting due to load imbalance: 0.6 %

Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %


Computing:         Nodes     Number     G-Cycles    Seconds     %


 Domain decomp.       128      15001    21397.683    25173.8     5.8

 DD comm. load        128      15000      174.248      205.0     0.0

 DD comm. bounds      128      15000      943.465     1110.0     0.3

 Comm. coord.         128     150001    12460.434    14659.4     3.4

 Neighbor search      128      15001   130229.923   153211.8    35.4

 Force                128     150001   122351.990   143943.7    33.3

 Wait + Comm. F       128     150001    29369.389    34552.3     8.0

 Write traj.          128          4       10.849       12.8     0.0

 Update               128     150001     7760.715     9130.3     2.1

 Constraints          128     150001    37378.869    43975.2    10.2

 Comm. energies       128      15002     1224.683     1440.8     0.3

 Rest                 128                4093.892     4816.3     1.1


 Total                128              367396.140   432231.2   100.0


Parallel run - timing based on wallclock.

               NODE (s)   Real (s)      (%)

       Time:   3376.807   3376.807    100.0


               (Mnbf/s)   (GFlops)   (ns/day)  (hour/ns)

Performance:   1036.823     36.268      7.676      3.127

Finished mdrun on node 0 Sun Apr  1 07:16:11 2012
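The remaining BG/P runs differ from the 128-core submission only in core count, so they can be generated with a loop; a sketch (the mdrun path and project name are placeholders, and vn mode places 4 MPI ranks per BG/P node, so -n is cores/4):

```shell
#!/bin/sh
# Print one qsub per core count; drop the leading "echo" to submit
# for real. Path and project below are placeholders.
MDRUN=/your/path/to/gromacs-4.5.5/BGP/ben/bin/BGP_mdrun_mpi_d
for cores in 128 256 512 1024; do
    nodes=$((cores / 4))    # vn mode: 4 ranks per node
    echo qsub -t 60 -n "$nodes" --proccount "$cores" --mode vn \
         -A "<your_project>" --env GMX_MAXBACKUP=-1 "$MDRUN"
done
```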


Now, do the same runs on 256, 512 and 1024 cores (all for 150,000 time steps). You should see something like this:

#cores      (GFlops)     Real(s)     (ns/day)   (hour/ns)


128         36.268     3376.807       7.676      3.127

256         61.994     1979.547      13.094      1.833

512        106.085     1159.337      22.358      1.073

1024       174.744      700.908      36.981      0.649
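As a quick sanity check on these numbers, speedup and parallel efficiency relative to the 128-core run can be computed from the Real(s) column, for example with awk (the times below are copied from the table):

```shell
# Speedup s = t(128)/t(c); efficiency = s / (c/128), as a percentage.
awk 'BEGIN {
    t[128] = 3376.807; t[256] = 1979.547
    t[512] = 1159.337; t[1024] = 700.908
    for (c = 128; c <= 1024; c *= 2) {
        s = t[128] / t[c]
        printf "%5d cores: speedup %5.2f, efficiency %5.1f%%\n", \
               c, s, 100 * s / (c / 128)
    }
}'
```

Efficiency falls as the core count grows, which is the expected strong-scaling behavior for a fixed-size benchmark system.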