Changes from Tukey to Cooley

In most cases, Cooley behaves substantially like Tukey, with the exceptions noted below.

 

Hardware

The Cooley hardware specifications are detailed on the main Cooley page; please note that Cooley has 12 cores per node, compared to 16 cores per node on Tukey.  If you intend to run one process per core, you will need to adjust your mpirun/mpiexec arguments appropriately.
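For example, a job that filled a Tukey node with 16 ranks would use 12 ranks per node on Cooley. A hypothetical two-node launch at one rank per core might look like this (the executable name is a placeholder):

# Two Cooley nodes at one rank per core: 2 x 12 = 24 ranks
mpirun -np 24 -f $COBALT_NODEFILE ./my_app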

Software

User software should be recompiled
While the new environment may run user code as it currently exists, it is strongly recommended that all users recompile their code with newer libraries wherever possible.
 
Intel Compilers
The *-intel versions of mpich2 and mvapich2 are compiled against the latest version of the Intel compilers (+intel-composer-xe in softenv). If your .soft.cooley references an older compiler (e.g., +intel-11.0), we recommend replacing it with +intel-composer-xe. Please note that we have removed any hardcoded compiler paths from the mvapich2-intel and mpich2-intel softenv definitions, so to use these MPI builds you will need to add both the key for the MPI version you wish to use and the +intel-composer-xe key to your .soft.cooley file.
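As an illustration only (your other keys will differ), a .soft.cooley that selects the Intel-built MVAPICH2 would carry both keys; the trailing @default macro is the usual SoftEnv convention:

# Example $HOME/.soft.cooley entries for the Intel-built MVAPICH2
+mvapich2-intel
+intel-composer-xe
@default

Run resoft (or log in again) after editing the file so the new keys take effect.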
 
SoftEnv configuration
Similar to Tukey, Cooley requires a specific SoftEnv configuration file, located at $HOME/.soft.cooley. To ease the transition from Tukey to Cooley, the login scripts check whether you have a $HOME/.soft.cooley; if not, they first try to copy your existing $HOME/.soft.tukey, and if that does not exist they install a default. In addition, any references to +mvapich2-1.8 are automatically changed to +mvapich2, since MVAPICH2 1.8 is not available on Cooley (the newer version 2.1 is installed instead).
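To confirm what the login scripts produced, you can inspect the file and reload it in your current shell; a quick sanity-check sketch:

# Verify the migrated SoftEnv configuration on a Cooley login node
cat $HOME/.soft.cooley   # should reference +mvapich2 rather than +mvapich2-1.8
resoft                   # re-read the file without logging out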
 
Installed software
Cooley has its own copy of the /soft partition, separate from Tukey and Mira. As noted above, we strongly recommend recompiling your code with the newer libraries wherever possible. With the following exceptions, Cooley's /soft is a copy of Tukey's /soft as of April 16, 2015:
 
  • +mvapich2-2.1 is available; the older +mvapich2-1.8 is not available due to IB incompatibilities
  • +cuda-6.5.14 and +cuda-7.0.28 are available (newer versions than those installed on Tukey); see the example after this list
  • Visit (2.9.1) and ParaView (4.1.0 and 4.3.1) are now available on Cooley
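For instance, to pick up one of the newer CUDA installs, add the corresponding key to your .soft.cooley and run resoft; assuming the key prepends the toolkit to your PATH (the usual softenv behavior), you can then confirm the version:

# After adding +cuda-7.0.28 to $HOME/.soft.cooley and running resoft
which nvcc
nvcc --version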
 
Job Scheduling
There are some minor changes to the job scheduling system compared to Tukey:
  • The maximum job size has been raised to 124 nodes to reflect the current size of the cluster (the remaining two nodes are dedicated to the debug queue)
  • /usr/bin/qsubi has been deprecated in favor of the officially supported interface, /usr/bin/qsub -I (see the example below)
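A hypothetical interactive submission under Cobalt might look like the following (node count, walltime, project name, and queue are placeholders):

# Request a 2-node, 30-minute interactive session; -A names your project
qsub -I -n 2 -t 30 -A MyProject -q debug
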
MVAPICH2 quirks
MVAPICH2 requires the environment variable MV2_IBA_HCA to be set to "mlx5_0". This is set automatically for you if you are using the +mvapich2 softenv key. If the variable is unset, MVAPICH2 round-robins across the available adapters and may select the Mellanox 10GbE interface for MPI communication, which causes it to segfault, as shown below:
 
cc122:~> mpirun -np 4 -f $COBALT_NODEFILE ./mpihello
[cc122:mpi_rank_2][error_sighandler] Caught error: Segmentation fault (signal 11)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 121233 RUNNING AT cc122.cooley
=   EXIT CODE: 139 
=   CLEANING UP REMAINING PROCESSES 
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES 
===================================================================================
[proxy:0:1@cc062] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:912): assert (!closed) failed
[proxy:0:1@cc062] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@cc062] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@cc122] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@cc122] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@cc122] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@cc122] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
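If you are not using the +mvapich2 softenv key, a minimal workaround sketch is to export the variable yourself before launching (reusing the example program above):

# Pin MVAPICH2 to the InfiniBand HCA rather than the 10GbE interface
export MV2_IBA_HCA=mlx5_0
mpirun -np 4 -f $COBALT_NODEFILE ./mpihello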
 
X11
The X server on Tukey's compute nodes was not run with "-nolisten tcp" and as a result was open to the rest of the ALCF network. This has been corrected on Cooley, so the X servers running on the Cooley compute nodes will not accept direct connections from outside of the node.  We do not expect this to affect usage of the nodes, since the DISPLAY is normally set to a local-only address (:0.0 or localhost:0.0) when accessing the GPU X server in a compute job.
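For example, a job-script fragment that renders against the node-local X server can set the display explicitly; glxinfo is used here only for illustration and assumes the utility is installed on the node:

# Point GL/X clients at the local X server on the compute node
export DISPLAY=:0.0
glxinfo | grep "OpenGL renderer"   # quick check that the GPU-backed server responds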