Getting Started on Sunspot


*** ACCESS TO SUNSPOT IS ENABLED FOR ESP AND ECP TEAMS ONLY ***

Overview

  • The Sunspot Test and Development System (TDS) consists of 2 racks, each with 64 nodes, for a total of 128 nodes
  • Each node consists of 2x Intel Xeon CPU Max Series (codename Sapphire Rapids or SPR) and 6x Intel Data Center GPU Max Series (codename Ponte Vecchio or PVC).
    • Each Xeon has 52 physical cores supporting 2 hardware threads per core
  • Interconnect is provided via 8x HPE Slingshot-11 NICs per node.

Sharing of any results from Sunspot publicly no longer requires a review or approval from Intel. However, anyone publishing these results should include the following in their materials: "This work was done on a pre-production supercomputer with early versions of the Aurora software development kit." In addition, users should acknowledge the ALCF. Refer to the acknowledgement policy page for details : https://docs.alcf.anl.gov/policies/alcf-acknowledgement-policy/#alcf-only-acknowledgement. Please note that certain information on Sunspot hardware and software is considered NDA and cannot be shared publicly.

Sunspot is a Test and Development System, and it is extremely early in the deployment of the system. Do not expect a production environment!

Expect to experience:

  • Hardware instabilities – possible frequent downtimes
  • Software instabilities – non-optimized compilers, libraries, and tools; frequent software updates
  • Non-final configurations (e.g. storage, OS versions, etc.)
  • Short notice for downtimes (scheduled downtimes will be announced with 4 hours' notice, but some downtimes may occur with only an email notice). Notices go to the sunspot-notify@alcf.anl.gov email list. All users with access are added to the list initially.

Prerequisites for Access to Sunspot/Aurora

*** ACCESS TO SUNSPOT (and AURORA) IS ENABLED FOR ESP AND ECP TEAMS ONLY ***

ECP:

Exascale Computing Project (ECP) team members must:

  1. Request Aurora early hardware/software access through ECP by filling out the Jira* form: https://jira.exascaleproject.org/servicedesk/customer/portal/10/create/254. If you have already put in a request and it was not rejected or you did not change institutions, please skip this step as you do not need to put in a 2nd request. Note that user access to the ECP Atlassian/Jira tool ends on December 31, 2023. After December 31st, the ECP project office will no longer accept Sunspot account requests. All requests must be submitted before December 31, 2023.

    If you don’t have an ECP Atlassian/Jira account, follow the steps below. Questions regarding ECP Jira account or access should be emailed to ecp-support@exascaleproject.org. Proceed to step 2 once you have submitted the ECP Jira form.

    1. Ask your PI or his/her representative to complete the onboard form https://jira.exascaleproject.org/servicedesk/customer/portal/20/create/189 and be sure they select “Jira Project” in the tools access list (Optional to also select “Confluence”).
    2. Once submitted, notifications are sent to initiate the ECP Atlassian account creation process. PI approval and PAS (personnel access system) approval must be completed before the account is created. PAS processing for foreign nationals can take 7-10 days or more after receipt of required materials.
    3. Requestor will be notified when the ECP Atlassian account is created.
  2. Please read and acknowledge the latest Terms of Use by filling out the form below. You are responsible for ensuring you are authorized by your institution to read and acknowledge the TOU: https://events.cels.anl.gov/event/147/surveys/7.
  3. Have an active ALCF account and be a member of all the appropriate ECP projects on Polaris.
    1. Request an account if you do not have one: https://accounts.alcf.anl.gov/#/accountRequest. Search for your project(s) by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.
    2. Re-activate your account if it is inactive: https://accounts.alcf.anl.gov/#/accountReactivate. Search for your project by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.
    3. If you have an active account but you are not on all the ESP/ECP projects on Theta/Polaris, request to join the projects that are missing: https://accounts.alcf.anl.gov/#/joinProject. Search for your project by the WBS number (for ECP) or name with the right PI. Do not choose projects ending in _CNDA.

Team members that satisfy all the pre-requisites listed above should then email support@alcf.anl.gov requesting access to Sunspot/Aurora.

ESP:

Refer to this page for instructions: https://docs.alcf.anl.gov/aurora/getting-started-on-aurora/#for-aurora-early-science-program-esp-team-members

Getting Help:

  • Email ALCF Support : support@alcf.anl.gov for bugs, technical questions, software requests, reservations, priority boosts, etc.
    • ALCF’s user support team will triage and forward the tickets to the appropriate technical SME as needed
    • Expect turnaround times to be slower than on a production system as the technical team will be focused on stabilizing and debugging the system
  • For faster assistance, consider contacting your project’s POC at ALCF (project catalyst or liaison)
    • They are an excellent source of assistance during this early period and will be aware of common bugs and known issues
  • ECP and ESP users will be added to a CNDA Slack workspace, where CNDA discussions may occur. An invite to the Slack workspace will be sent when a user is added to the Sunspot resource.

Known Issues

A known issues page can be found in the JLSE Wiki space used for NDA content. Note that this page requires a JLSE Aurora early hw/sw resource account for access: https://wiki.jlse.anl.gov/display/inteldga/Known+Issues

Logging into Sunspot user access nodes

You can access the system by SSH'ing to 'bastion.alcf.anl.gov'. This bastion is merely a pass-through erected for security purposes and is not meant to host files. Once on the bastion, SSH to 'sunspot.alcf.anl.gov', which assigns connections round-robin to the UANs (user access nodes).

Note that Sunspot uses ALCF credentials (same as Theta, ThetaGPU, Polaris and https://accounts.alcf.anl.gov website) and not JLSE credentials (used for Arcticus/Florentia, JLSE wiki, https://accounts.cels.anl.gov website, and CNDA Slack workspace)

Home and project directories

  1. Home is mounted as /home and shared across the UANs and compute nodes. Bastions have a different /home, which is on Swift (shared with Polaris, Theta, Cooley). Default quota is 50 GB.

  2. Project directories are on /lus/gila/projects

    • ALCF staff should use the /lus/gila/projects/Aurora_deployment project directory. ESP and ECP project members should use their corresponding project directories. The project name is similar to the name on Theta/Polaris with a _CNDA suffix (e.g., projectA_aesp_CNDA, CSC250ADABC_CNDA). Default quota is 1 TB. The project PI should email support@alcf.anl.gov if their project requires additional storage.

Home and Project directories are on a Lustre file system called Gila.

Quotas

Default home quota is 50 GB.  Use this command to view your home directory quota usage: 

/soft/tools/alcf_quota/bin/myquota

Default quota for the project directories is 1 TB. The project PI should email support@alcf.anl.gov if their project requires additional storage. Use this command to check your project quota usage:

/soft/tools/alcf_quota/bin/myprojectquotas

Scheduling

Sunspot has PBSPro. For more information on using PBSPro for job scheduling, see PBSPro at ALCF.

There are two production execution queues "workq" and "diag" and one debug queue called "debug" on Sunspot. In addition, there is a routing queue called "workq-route", that can be used to hold multiple jobs which get routed to workq. Note that users can submit jobs to workq directly and do not have to use the routing queue (workq-route) if they don't need to.

  • The diag queue is a lower-priority queue, intended for operational diagnostics, that will run jobs when there are no jobs queued in workq. Access to the diag queue is restricted. Email support@alcf.anl.gov if you have a need to use this queue and provide a write-up of your use case.


For example, a one-node interactive job on workq can be requested for 30 minutes with:

qsub -l select=1 -l walltime=30:00 -A Aurora_deployment -q workq -I

Queue Policies:

For workq queue:

  1. max job length: 2 hr 
  2. max job size : 128 - (nodes that are down) - (nodes that have broken/validation flags set on them [currently 4]) - (4 debug nodes)
  3. interactive jobs have a shell time out of 30 mins which will cause idle interactive shells to exit if idle for more than 30 minutes
  4. max number of jobs: 1 running and 1 queued
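
The job-size formula in item 2 can be evaluated with plain shell arithmetic. The following is a minimal sketch: the number of down nodes is a made-up example value, and the flagged-node count of 4 is the figure quoted above, which may change over time.

```shell
#!/bin/bash
# Sketch: largest job that workq can accept, per the formula above.
total_nodes=128     # 2 racks x 64 nodes
down_nodes=2        # hypothetical example value; check the current system status
flagged_nodes=4     # nodes with broken/validation flags (value quoted above)
debug_nodes=4       # nodes reserved for the debug queue

max_job_size=$(( total_nodes - down_nodes - flagged_nodes - debug_nodes ))
echo "Largest workq job right now: ${max_job_size} nodes"
```

With these example numbers the script reports 118 nodes; with no nodes down the limit would be 120.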

For workq-route queue (routing queue):

  1. max job length: 2 hr 
  2. max job size : 128 - (nodes that are down) - (nodes that have broken/validation flags set on them [currently 4]) - (4 debug nodes)
  3. max number of jobs queued: 30

For diag queue:

There are no restrictions for the diag queue. It is a lower-priority queue, intended for operational diagnostics, that will run jobs when there are no jobs queued in the workq or debug queues. Access to the diag queue is restricted. Email support@alcf.anl.gov if you have a need to use this queue and provide a write-up of your use case.

For debug queue:

  1. max job length: 1 hr
  2. max job size :  1 node  (a total of 4 nodes are reserved for the debug queue)
  3. interactive jobs have a shell time out of 30 mins which will cause idle interactive shells to exit if idle for more than 30 minutes
  4. max number of jobs : 1 running and 1 queued

Submission Options:

For jobs running in the production queues, the following default settings will be applied unless otherwise changed by the user:

  • hbm_mode=flat
  • numa_mode=quad

The following is an example of how to specify flat mode:

-l  select=16:ncpus=208:hbm_mode=flat 

To submit a job entirely in flat mode when it has multiple chunks, the setting must be applied to each chunk in the select statement, along the lines of this example:

-l select=1:vnode=x1921c0s0b0n0:hbm_mode=flat+1:vnode=x1921c1s0b0n0:hbm_mode=flat
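
Because the mode has to be repeated on every chunk, it can be convenient to generate the select statement rather than type it by hand. The sketch below is one way to do that; build_select is a hypothetical helper name, and the vnode names are only examples.

```shell
#!/bin/bash
# Sketch: build a PBS select statement in which every chunk carries
# the same hbm_mode setting. build_select is a hypothetical helper.
build_select() {
    local mode=$1
    shift
    local stmt="" chunk vnode
    for vnode in "$@"; do
        chunk="1:vnode=${vnode}:hbm_mode=${mode}"
        if [ -z "$stmt" ]; then
            stmt=$chunk
        else
            stmt="${stmt}+${chunk}"   # chunks are joined with '+'
        fi
    done
    printf '%s\n' "$stmt"
}

# Example vnode names; substitute the nodes you actually need.
echo "-l select=$(build_select flat x1921c0s0b0n0 x1921c1s0b0n0)"
```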

Allocation usage

The allocation accounting system, sbank, is installed on Sunspot.

  • To obtain the usage information for all your projects, issue the sbank command on sunspot: sbank-list-allocations.

For more information, see this page: https://docs.alcf.anl.gov/account-project-management/allocation-management/allocation-management/

Data Transfer

Currently, scp and SFTP are the only ways to transfer data to/from Sunspot. 

To simplify SSH sessions to the Sunspot login nodes through the bastion nodes, and to enable scp from remote hosts to the Sunspot login nodes, add the following lines to your ~/.ssh/config file on the remote host. Do this on the laptop/desktop from which you initiate SSH login sessions to Sunspot via the bastion, and on any other non-ALCF host from which you want to copy files to the Sunspot login nodes using scp.

Replace id_rsa with the name of your own private ssh key file. When you run the ssh command on your laptop/desktop, you'll be prompted for two ALCF authentication one-time passwords (Mobilepass+ or Cryptocard passcodes) - one for bastion, and the other for the sunspot login node. Likewise, when you run scp from a remote host to copy files to sunspot login nodes, you'll be prompted for two ALCF authentication one-time passwords.

File: ~/.ssh/config

Host bastion.alcf.anl.gov
    user <your_ALCF_username>

 

Host *.sunspot.alcf.anl.gov sunspot.alcf.anl.gov
    ProxyJump bastion.alcf.anl.gov
    DynamicForward 3142
    IdentityFile ~/.ssh/id_rsa
    user <your_ALCF_username>

 

Proxy Settings

export HTTP_PROXY=http://proxy.alcf.anl.gov:3128

export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128

export http_proxy=http://proxy.alcf.anl.gov:3128

export https_proxy=http://proxy.alcf.anl.gov:3128

git config --global http.proxy http://proxy.alcf.anl.gov:3128
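
If you set these variables often, the four exports can be collected in a small shell function, for example in your ~/.bashrc. This is only a convenience sketch; set_alcf_proxy is a hypothetical name, not a command provided on Sunspot, and the git setting above still has to be applied separately.

```shell
#!/bin/bash
# Sketch: set all four proxy variables in one place.
# set_alcf_proxy is a hypothetical helper name.
set_alcf_proxy() {
    local proxy="http://proxy.alcf.anl.gov:3128"
    local var
    for var in HTTP_PROXY HTTPS_PROXY http_proxy https_proxy; do
        export "$var=$proxy"
    done
}

set_alcf_proxy
# Show what was set
env | grep -i '_proxy=' | sort
```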

Git with SSH protocol

The default SSH port 22 is blocked on Sunspot; by default, this prevents communication with Git remotes that are SSH URLs such as:

git clone [user@]server:project.git

For a workaround for GitLab, GitHub, and Bitbucket, edit ~/.ssh/config to include:

Host github.com
     User git
     hostname ssh.github.com

Host gitlab.com
     User git
     hostname altssh.gitlab.com

Host bitbucket.org
     User git
     hostname altssh.bitbucket.org

Host github.com gitlab.com bitbucket.org
     Port 443
     ProxyCommand /usr/bin/socat - PROXY:proxy.alcf.anl.gov:%h:%p,proxyport=3128

Your environment variable Proxy Settings must be set as above.

Using Non-Default SSH Key for GitHub

If you need to use something besides your default SSH key on sunspot for authentication to GitHub in conjunction with the SSH workaround, you may set

export GIT_SSH_COMMAND="ssh -i ~/.ssh/specialGitKey -F /dev/null"

where specialGitKey is the name of the private key in your .ssh subdirectory, for which you have uploaded the public key to GitHub.

Programming Environment Setup

Loading Intel OneAPI SDK + Aurora optimized MPICH

The modules are located in /soft/modulefiles and are set up by default in the user path. The default set of modules is deliberately kept to a minimum on Sunspot.


If you do a module list  and don't see the oneapi module loaded, you can reset it to default by following the instructions below: 

uan-0001:~$ module purge

uan-0001:~$ module restore

Cray PE for GNU compilers, PALS, etc., are located in /opt/cray/pe/lmod/modulefiles/core. Module path should already be set in your user env.

 

If you would like to explicitly load the fabric/network stack after you modify the default SDK/UMD, load append-deps/default at the end:

 

uan-0001:~$ module load append-deps/default

Note that the Cray-PALS modulefile should be loaded last, as it is important that the correct mpiexec from PALS is the default MPI launcher. This can be confirmed with the type -a command, as below:

uan-0001:~$ type -a mpiexec

 

mpiexec is /opt/cray/pe/pals/1.2.4/bin/mpiexec

You can also use other modules thanks to spack (see Spack and E4S for details).

For example, for cmake:

uan-0001:~$ module load spack 

uan-0001:~$ module load cmake

For iprof (the module name is THAPI) :

uan-0001:~$ module load spack thapi

OpenMP Stack Size on the CPU

Note that the default stack size per CPU OpenMP thread with Intel OpenMP is 4 MB (https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-0/supported-environment-variables.html). It can also be queried at runtime by running with OMP_DISPLAY_ENV=T set. If you see a segfault in a code that uses OpenMP CPU threads, you can try increasing the stack size with the OMP_STACKSIZE environment variable.
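
For example, the stack size can be raised through the standard OpenMP variable OMP_STACKSIZE before launching the application. The 16M value below is an arbitrary illustration, not a site recommendation.

```shell
#!/bin/bash
# Sketch: raise the per-thread OpenMP stack size above the 4 MB default.
export OMP_STACKSIZE=16M      # example value; tune for your application
export OMP_DISPLAY_ENV=true   # ask the OpenMP runtime to print its settings at startup

echo "OMP_STACKSIZE=${OMP_STACKSIZE}"
```

Place these exports in the job script before the mpiexec line so that they are in the environment passed to the ranks.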

GPU Validation Check

In some cases a workload might hang on the GPU. In such situations, it is possible to use the included gpu_check script (FLR in JLSE), which is set up when you load the runtime, to verify that all the GPUs are okay, kill any hung or running workloads on the GPU, and, if necessary, reset the GPUs.

x1922c6s6b0n0:~$ gpu_check -rq 

Checking 6 GPUs  . . . . . . .

All 6 GPUs are okay!!!

MPI

Various ways to use MPI.

Aurora MPICH

Aurora MPICH is what will be the primary MPI on Aurora. It is jointly developed by Intel and Argonne. It allows GPU-aware communication.

You should have access to it with the default oneAPI module loaded. 

Use the associated compiler wrappers mpicxx, mpifort, mpicc, etc., as opposed to the Cray wrappers CC, ftn, cc. As always, the MPI compiler wrappers automatically link in MPI libraries when you use them to link your application.

Use mpiexec to invoke your binary, or a wrapper script around your binary. You will generally need to use a wrapper script to control how MPI ranks are placed within and among GPUs. Variables set by the HPE PMIX system provide hooks to things like node counts and rank counts.

The following job script and wrapper script illustrate:

Example job script: jobscript.pbs

#!/bin/bash
#PBS -l select=32:system=sunspot,place=scatter
#PBS -A MyProjectAllocationName
#PBS -l walltime=01:00:00
#PBS -N 32NodeRunExample
#PBS -k doe
  
export TZ='/usr/share/zoneinfo/US/Central'
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=8
unset OMP_PLACES

cd /path/to/my/run/directory

echo Jobid: $PBS_JOBID
echo Running on host `hostname`
echo Running on nodes `cat $PBS_NODEFILE`

NNODES=`wc -l < $PBS_NODEFILE`
NRANKS=12          # Number of MPI ranks per node
NDEPTH=16          # Number of hardware threads per rank, spacing between MPI ranks on a node
NTHREADS=$OMP_NUM_THREADS # Number of OMP threads per rank, given to OMP_NUM_THREADS

NTOTRANKS=$(( NNODES * NRANKS ))

echo "NUM_NODES=${NNODES}  TOTAL_RANKS=${NTOTRANKS}  RANKS_PER_NODE=${NRANKS}  THREADS_PER_RANK=${OMP_NUM_THREADS}"
echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"

mpiexec -np ${NTOTRANKS} -ppn ${NRANKS} -d ${NDEPTH} --cpu-bind depth -envall gpu_tile_compact.sh ./myBinaryName

Here gpu_tile_compact.sh should be in your path; it is located at /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh. It will round-robin GPU tiles between ranks.
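
To illustrate the idea of assigning GPU tiles to ranks, the sketch below maps a node-local rank to a GPU.tile pair of the kind accepted by ZE_AFFINITY_MASK. It is not the contents of gpu_tile_compact.sh. It assumes 6 GPUs with 2 tiles each (consistent with the 12 ranks per node in the job script) and that the launcher supplies the node-local rank in PALS_LOCAL_RANKID; both are assumptions to verify on the system.

```shell
#!/bin/bash
# Illustrative sketch only, NOT the site-provided gpu_tile_compact.sh.
# Assumptions: 6 GPUs per node, 2 tiles per GPU.
num_gpu=6
num_tile=2

# Print "<gpu>.<tile>" for a node-local rank, filling both tiles of a GPU
# before moving to the next GPU and wrapping after the last tile.
tile_mask_for_rank() {
    local local_rank=$1
    local gpu=$(( (local_rank / num_tile) % num_gpu ))
    local tile=$(( local_rank % num_tile ))
    echo "${gpu}.${tile}"
}

# A real wrapper script would typically finish with something like
# (PALS_LOCAL_RANKID is assumed to hold the node-local rank):
#   export ZE_AFFINITY_MASK=$(tile_mask_for_rank "$PALS_LOCAL_RANKID")
#   exec "$@"

# Print the mapping for 12 ranks on one node.
for r in $(seq 0 11); do
    echo "local rank ${r} -> ZE_AFFINITY_MASK=$(tile_mask_for_rank "$r")"
done
```

With these assumptions, local ranks 0 and 1 share GPU 0 (tiles 0 and 1), ranks 2 and 3 share GPU 1, and so on up to rank 11 on tile 1 of GPU 5.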

 

The example job script includes everything needed except the queue name, which will default accordingly. Invoke it using qsub:

qsub jobscript.pbs
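
When choosing NRANKS and NDEPTH, their product has to fit within the hardware threads of one node: 2 sockets x 52 cores x 2 hardware threads = 208. A quick check along those lines, using the values from the example job script, might look like this:

```shell
#!/bin/bash
# Sketch: check that ranks-per-node x depth fits on one node.
sockets=2
cores_per_socket=52
threads_per_core=2
hw_threads=$(( sockets * cores_per_socket * threads_per_core ))   # 208

NRANKS=12    # MPI ranks per node (from the example job script)
NDEPTH=16    # hardware threads reserved per rank

used=$(( NRANKS * NDEPTH ))
if [ "$used" -le "$hw_threads" ]; then
    echo "OK: ${used} of ${hw_threads} hardware threads used per node"
else
    echo "Too many: ${used} > ${hw_threads} hardware threads per node" >&2
fi
```

For the example values, 12 ranks with a depth of 16 occupy 192 of the 208 hardware threads.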

CrayMPI (WIP)

CrayMPI is the MPI provided by HPE, which is a derivative of MPICH. It is optimized for Slingshot but provides no integration with Intel GPUs.

This is set up for CrayPE 22.10.

Check CPE Version

> ls -l /opt/cray/pe/cpe

total 0

drwxr-xr-x 2 root root 264 Jun  1 21:56 22.10

lrwxrwxrwx 1 root root   5 Jun  1 21:41 default -> 22.10

Building on UAN

Configure the modules to bring in support for CPE and expected PALS environment.

UAN Build

#If still using oneapi SDK

> module unload mpich

#Purge env if you want to use Cray PE GNU compilers

#module purge

> module load craype PrgEnv-gnu cray-pmi craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.9 cray-libpals/1.2.9 cray-mpich

You can use the Cray HPE wrappers to compile MPI code that is CPU-only.

CPU-only compile/link

 

> cc -o test test.c

> ldd test | grep mpi

    libmpi_gnu_91.so.12 => /opt/cray/pe/lib64/libmpi_gnu_91.so.12 (0x00007ff2f3329000)

Code that utilizes offload should be built with the Intel compiler suite; otherwise, linking with cc could result in SPIR-V code getting stripped from the binary.

Add the specific MPI compiler and linker flags to link within your Makefile and use the Intel compiler of choice.

Makefile

CXX=icpx

CMPIFLAGS=-I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include 

CXXOMPFLAGS=-fiopenmp -fopenmp-targets=spir64

CXXSYCLFLAGS=-fsycl -fsycl-targets=spir64

CMPILIBFLAGS=-D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2

TARGETS=mpi-omp mpi-sycl


all: $(TARGETS)


mpi-omp.o: mpi-omp.cpp

    $(CXX) -c $(CXXOMPFLAGS) $(CMPIFLAGS) $^


mpi-sycl.o: mpi-sycl.cpp

    $(CXX) -c $(CXXSYCLFLAGS) $(CMPIFLAGS) $^


mpi-omp: mpi-omp.o

    $(CXX) -o $@ $^ $(CXXOMPFLAGS) $(CMPILIBFLAGS)


mpi-sycl: mpi-sycl.o

    $(CXX) -o $@ $^ $(CXXSYCLFLAGS) $(CMPILIBFLAGS)


clean::

    rm -f *.o $(TARGETS)

Expected output

Build Output

> make

icpx -c -fiopenmp -fopenmp-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include  mpi-omp.cpp
icpx -o mpi-omp mpi-omp.o -fiopenmp -fopenmp-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2
icpx -c -fsycl -fsycl-targets=spir64 -I/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/include -I/opt/cray/pe/pmi/6.1.6/include  mpi-sycl.cpp
icpx -o mpi-sycl mpi-sycl.o -fsycl -fsycl-targets=spir64 -D__TARGET_LINUX__ -L/opt/cray/pe/mpich/8.1.20/ofi/gnu/9.1/lib -L/opt/cray/pe/pmi/6.1.6/lib -Wl,--as-needed,-lmpi_gnu_91,--no-as-needed -Wl,--as-needed,-lpmi,--no-as-needed -Wl,--as-needed,-lpmi2

Running on Compute Nodes

The job script must also set the appropriate modules. It must also set the path to find the correct libpals as an older version gets picked up by default regardless of module selection.

run.sh

#!/bin/bash

#PBS -A Aurora_deployment

#PBS -q workq

#PBS -l select=1

#PBS -l walltime=10:00

#PBS -l filesystems=home


rpn=6

#Count the allocated nodes from the PBS node file

nnodes=$(wc -l < $PBS_NODEFILE)

ranks=$((nnodes * rpn))


#If still using oneapi SDK

module unload mpich

#Purge env if you want to use Cray PE GNU compilers

#module purge

module load craype PrgEnv-gnu cray-pmi cray-pmi-lib craype-network-ofi craype-x86-spr craype/2.7.17 cray-pals/1.2.4 cray-libpals/1.2.4 cray-mpich

module list


cd $PBS_O_WORKDIR


mpiexec -n $ranks -ppn $rpn ./mpi-omp

Submit the job from the UAN

Job submission

> qsub ./run.sh

1123.amn-0001

Output from the test cases

OMP Output

> mpiexec -n 6 -ppn 6 ./mpi-omp

hi from device 2 and rank 2

hi from device 0 and rank 0

hi from device 3 and rank 3

hi from device 4 and rank 4

hi from device 1 and rank 1

hi from device 5 and rank 5

SYCL Output

> mpiexec -n 6 -ppn 6 ./mpi-sycl

World size: 6

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 4 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 3 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 0 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 1 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 2 ! 

Running on Intel(R) Graphics [0x0bd6]

Hello, World from 5 !

The programs used to generate these outputs are mpi-omp.cpp and mpi-sycl.cpp.

 

mpi-omp.cpp

#include <mpi.h>

#include <omp.h>

#include <stdio.h>


int main(int argc, char** argv) {

  // Initialize the MPI environment

  MPI_Init(NULL, NULL);

  // Get the rank of the process

  int world_rank;

  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);


#pragma omp target device( world_rank % omp_get_num_devices())

  {

    printf( "hi from device %d and rank %d\n", omp_get_device_num(), world_rank );

  }


  // Finalize the MPI environment.

  MPI_Finalize();

}

mpi-sycl.cpp

#include <mpi.h>

#include <sycl/sycl.hpp>

#include <stdio.h>

#include <stdlib.h>   // putenv

#include <string.h>

#include <iostream>   // std::cout


int main(int argc, char** argv) {

  // Initialize the MPI environment

  MPI_Init(NULL, NULL);

  // Get the rank of the process

  int world_rank;

  int world_size;

  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

   

  char zemask[256];

  snprintf(zemask, sizeof(zemask), "ZE_AFFINITY_MASK=%d", world_rank % 6);

  putenv(zemask);


  if (world_rank == 0) std::cout << "World size: " << world_size << std::endl;


  sycl::queue Q(sycl::gpu_selector{});


  std::cout << "Running on "

            << Q.get_device().get_info<sycl::info::device::name>()

            << "\n";


  Q.submit([&](sycl::handler &cgh) {

    // Create an output stream

    sycl::stream sout(1024, 256, cgh);

    // Submit a unique task, using a lambda

    cgh.single_task([=]() {

      sout << "Hello, World from "  << world_rank << " ! " << sycl::endl;

    }); // End of the kernel function

  });   // End of the queue commands. The kernel is now submitted

  Q.wait();

 

  // Finalize the MPI environment.

  MPI_Finalize();

}

Kokkos

There is one central build of Kokkos in place now, with {Serial,OpenMP,SYCL} execution spaces, with AoT for PVC.

module use /soft/modulefiles

module load kokkos

will load it. If you're using cmake to build your Kokkos app, it's the usual drill (note that cmake is available via module load spack cmake). Otherwise, loading this module will set the KOKKOS_HOME environment variable, which you can use in Makefiles etc. to find include files and libraries.

Debugging Applications

Running gdb-oneapi in batch mode

In batch mode, gdb-oneapi can attach to each MPI rank to obtain stack traces. The standard output and error can go to individual files distinguished by the environment variables PBS_JOBID and PALS_RANKID. The example command below uses mpiexec to launch bash, which gives access to the environment variables of each MPI rank and redirects the output. The bash process calls gdb-oneapi, which launches ./the_executable with optional arguments. The gdb commands "run" and "thread apply all bt" run the executable and print a backtrace when the application receives an erroneous signal. More gdb commands can be added, each prefixed by "-ex", for example to set breakpoints or extra signal handlers. Note that the command below follows the Bourne shell's quoting rules: the whole gdb-oneapi ... command is in single quotes, so the environment variables are only interpreted by the bash process launched by mpiexec.

mpiexec [mpiexec_args ...] bash -c '
    gdb-oneapi -batch \
        -ex run \
        -ex "thread apply all bt" \
        --args ./the_executable [executable_args ...] \
        >out.${PBS_JOBID%.*}.$PALS_RANKID 2>err.${PBS_JOBID%.*}.$PALS_RANKID'
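
The output file names rely on the shell expansion ${PBS_JOBID%.*}, which strips the shortest trailing match of ".*" from the job ID. The short demonstration below uses stand-in values for the two variables; in a real job they are set by PBS and PALS.

```shell
#!/bin/bash
# Sketch: how the per-rank output file names are formed.
# The values below are stand-ins; in a job they come from PBS and PALS.
PBS_JOBID="1123.amn-0001"
PALS_RANKID=3

job_number=${PBS_JOBID%.*}              # drops ".amn-0001", leaving "1123"
out_file="out.${job_number}.${PALS_RANKID}"
err_file="err.${job_number}.${PALS_RANKID}"

echo "$out_file $err_file"    # prints "out.1123.3 err.1123.3"
```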

Conda

source $IDPROOT/etc/profile.d/conda.sh

Spack and E4S

Spack is a package manager used to manage HPC software environments.

The Extreme-Scale Scientific Software Stack (E4S) is a project of ECP which provides an open-source scientific software stack.

The ALCF provides Spack-managed software on Sunspot via modules, including E4S deployments.

Using Spack packages

Currently, three Spack metamodules are available: spack/linux-sles15-x86_64-ldpath, e4s/22.08, and e4s/22.11. Loading a metamodule will make additional software modules available:

uan-0001:~$ module load spack


uan-0001:~$ module avail

--------------------- /soft/packaging/spack/gnu-ldpath/modules/linux-sles15-x86_64 ----------------------

 autoconf/2.69-gcc-11.2.0-mfogo75 ninja/1.11.1-gcc-11.2.0-6biwuw5

 autoconf/2.71-gcc-11.2.0-ofpl6wv (D) numactl/2.0.14-gcc-11.2.0-nzqw57c

 automake/1.15.1-gcc-11.2.0-2kuz3tx openssl/1.1.1d-gcc-11.2.0-amlvxob

 babeltrace2/2.0.4-gcc-11.2.0-xfjn3pn patchelf/0.17.0-gcc-11.2.0-rsf5nuy

 bzip2/1.0.6-gcc-11.2.0-gs35ttl perl/5.26.1-gcc-11.2.0-pqmes6b

 cmake/3.24.2-gcc-11.2.0-pcasswq pkg-config/0.29.2-gcc-11.2.0-cchn55a

 ...

uan-0001:~$ module load cmake

uan-0001:~$ which cmake

/soft/packaging/spack/gnu-ldpath/build/linux-sles15-x86_64/gcc-11.2.0/cmake-3.24.2-pcasswqhzb3tyew7ujqyxxvvwdsvnyqd/bin/cmake

The spack module loads basic libraries and utilities, while the e4s modules load more specialized scientific software. Packages in the e4s modules are not optimized when installed; however, individual package builds may be customized by request to provide alternative variants or improve performance.

The available packages for each Spack deployment are listed at the bottom of this page.

Using Spack to build packages

You may find Spack useful for your own software builds, particularly if there is a large dependency tree associated with your software. In order to do so, you will need to install a user instance of Spack. We recommend using the latest develop branch of Spack since it includes some necessary patches for Sunspot's environment. See Spack's Getting Started Guide for installation details.

You can also copy the Spack configuration files used for the E4S deployment - this may simplify the process of using the OneAPI compilers as well as any external libraries and dependencies. Copy the files in /soft/packaging/spack/settings/ into your spack installation at $spack/etc/spack to apply the configurations to all environments using your Spack instance.
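
The copy step can be sketched as a small shell function. copy_spack_settings is a hypothetical helper name, and the location of your own Spack clone (shown as $HOME/spack in the comment) is an assumption; adjust both to your setup.

```shell
#!/bin/bash
# Sketch: copy the site Spack settings into a user Spack instance.
# copy_spack_settings is a hypothetical helper; $1 is the settings
# directory and $2 is the root of your own Spack clone.
copy_spack_settings() {
    local settings_dir=$1
    local spack_root=$2
    mkdir -p "${spack_root}/etc/spack" &&
        cp -r "${settings_dir}/." "${spack_root}/etc/spack/"
}

# On Sunspot this might be invoked as (the clone path is an assumption):
#   copy_spack_settings /soft/packaging/spack/settings "$HOME/spack"
```

Files placed in etc/spack under the Spack root apply to every environment that uses that Spack instance, so keep a backup if you have already customized those files.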

Package lists

--------------------- /soft/packaging/spack/gnu-ldpath/modules/linux-sles15-x86_64 ----------------------

-------------------------------- /soft/packaging/spack/e4s/22.11/modules --------------------------------

VTune

Please refer to the JLSE testbed VTune documentation (Note that this page requires JLSE Aurora early hw/sw resource account for access).

Because of the two-step process to login to the Sunspot login nodes, going first through the bastion nodes, the instructions for VTune Profiler as a Web Server should be augmented: On your desktop/laptop, where you initiate the ssh session for port forwarding to the vtune gui backend you have started on sunspot, you should make this addition into your ~/.ssh/config file:

Insert this into your ~/.ssh/config file

Host *.sunspot.alcf.anl.gov sunspot.alcf.anl.gov

    ProxyJump bastion.alcf.anl.gov

    DynamicForward 3142

    IdentityFile ~/.ssh/id_rsa

where you replace id_rsa with the name of your own private ssh key file. When you run the port-forwarding ssh command on your laptop/desktop, you'll be prompted for two ALCF authentication one-time passwords - one for bastion, and the other for the sunspot login node.

DAOS

Users should submit a request as noted below to have their DAOS pool created. Once created, users may create and manage containers within the pool as they wish. At this time, we ask users to avoid creating data using erasure coding (EC) data protection. The current release of DAOS has an issue during rebuild of EC protected data. This will be resolved in the next DAOS release.

 

Note: When DAOS is upgraded to 2.4, the system will be reformatted, which will lead to data loss. Any critical data should be backed up to $HOME. Notification will be provided before the update happens.

Using DAOS:

     Your pool will be named by the short name of your project. You will have permissions to create and manage containers within the pool.

  1. Request a storage allocation between 1 and 50 TB for your project by emailing support@alcf.anl.gov with the following information:
    • Sunspot DAOS Pool
      • Username for owner 
      • Unix group for read/write access
      • Storage capacity
  2.  Load the daos/base module. (This should be a default module)

 

module load daos/base
module list
  3. Confirm access to pool

 

Pool Example

daos pool query <pool name>

harms@uan-0002:~> daos pool query software
Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=640, disabled=0, leader=2, version=131
Pool space info:
- Target(VOS) count:640
- Storage tier 0 (SCM):
  Total size: 6.0 TB
  Free: 4.4 TB, min:6.5 GB, max:7.0 GB, mean:6.9 GB
- Storage tier 1 (NVMe):
  Total size: 200 TB
  Free: 194 TB, min:244 GB, max:308 GB, mean:303 GB
Rebuild done, 4 objs, 0 recs
  4. Create a container

     The container is your basic unit of storage. A POSIX container can contain hundreds of millions of files, so you can use it to store all of your data. You only need a small set of containers, perhaps just one per major unit of project work.

 

Container Example

mkcont --type POSIX --pool <pool name> --user $USER --group <group> <container name>

harms@uan-0002:~> mkcont --type=POSIX --pool iotest --user harms --group users random
  Container UUID : 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
  Container Label: random                              
  Container Type : POSIX                               

Successfully created container 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
0
  5. Mount the container

     Currently, you must manually mount your container prior to use on any node you are working on.

    For the UAN, mount it at a convenient mount point using the default dfuse parameters. This enables full caching of both metadata and data for best interactive performance.

Mount Example

dfuse --pool=<pool name> --cont=<cont name> -m $HOME/daos/<pool>/<cont>
mkdir -p $HOME/daos/iotest/random
dfuse --pool=iotest --cont=random -m $HOME/daos/iotest/random

harms@uan-0002:~> mount | grep iotest
dfuse on /home/harms/daos/iotest/random type fuse.daos (rw,nosuid,nodev,noatime,user_id=4211,group_id=100,default_permissions)

From a compute node (CN), you need to mount the container on all compute nodes. We provide some scripts to help perform this from within your job script.

More examples are available in /soft/daos/examples. The following example uses two support scripts to start up dfuse on each compute node and then shut it down at job end.

Job Submission

qsub -v DAOS_POOL=<name>,DAOS_CONT=<name> ./job-script.sh

Job Script Example

#!/bin/bash
#PBS -A <project>
#PBS -lselect=1
#PBS -lwalltime=30:00
#PBS -k doe
#
# Test case for MPI-IO code example

# ranks per node
rpn=4

# threads per rank
threads=1

# nodes per job
nnodes=$(cat $PBS_NODEFILE | wc -l)

# Verify the pool and container are set
if [ -z "$DAOS_POOL" ];
then
    echo "You must set DAOS_POOL"
    exit 1
fi

if [ -z "$DAOS_CONT" ];
then
    echo "You must set DAOS_CONT"
    exit 1
fi

# load daos/base module (if not loaded)
module load daos/base
module unload mpich/50.1/icc-all-pmix-gpu
module use /soft/restricted/CNDA/updates/modulefiles
module load mpich/50.2-daos/icc-all-pmix-gpu

# print your module list (useful for debugging)
module list

# print your environment (useful for debugging)
#env

# turn on output of what is executed
set -x

#
# clean previous mounts (just in case)
#
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

# launch dfuse on all compute nodes
# will be launched using pdsh
# arguments:
#   pool:container
# may list multiple pool:container arguments
# will be mounted at:
#   /tmp/<pool>/<container>
launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

# change to submission directory
cd $PBS_O_WORKDIR

# run your job(s)
# these test cases assume 'testfile' is in the CWD
cd /tmp/${DAOS_POOL}/${DAOS_CONT}

echo "write"

# --no-vni enables DAOS access
mpiexec -np $((rpn*nnodes)) \
-ppn $rpn \
-d $threads \
--cpu-bind numa \
--no-vni \
-genvall \
/soft/daos/examples/src/posix-write

echo "read"
# --no-vni enables DAOS access
mpiexec -np $((rpn*nnodes)) \
-ppn $rpn \
-d $threads \
--cpu-bind numa \
--no-vni \
-genvall \
/soft/daos/examples/src/posix-read

# cleanup dfuse mounts
clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT}

exit 0
  6. Usage

      Applications can use POSIX I/O as normal with a DAOS POSIX container. MPI-IO based codes can use the DAOS MPICH ADIO driver by prepending the 'daos:' string to the path passed to MPI_File_open().

     Additionally, improved performance can be had by using a special preloaded library. Adding LD_PRELOAD=$DAOS_PRELOAD to the mpiexec command will enable kernel bypass of most POSIX I/O calls, while metadata is still handled via the FUSE mount point.