HTCondor-CE Interface to Cooley

To accommodate workflows that may require job submissions to Cooley from outside of ANL, we provide a basic HTCondor-CE interface to the Cobalt job scheduler.  This allows simple job submissions from an outside host running the HTCondor job management system.  Please note: this system is a standalone HTCondor interface intended to provide remote job submission access for individuals who already have ALCF accounts.  It is not associated with the Open Science Grid and does not accept pilot jobs, nor does it provide a means for non-ALCF account holders to submit jobs.  In this sense, it is simply an alternate interface to the Cobalt job scheduler that removes the need to SSH into a login node and allows interoperability with workflow managers that integrate with HTCondor.

For more information about how to use HTCondor, you can view the HTCondor documentation at UW Madison (the Quick Start Guide is a good introduction):

https://research.cs.wisc.edu/htcondor/manual/

There are a few limitations to be aware of when using the HTCondor interface to Cooley:

  • Our HTCondor-CE instance is a standalone installation and is not associated with any OSG VO (i.e. there is no institutional level trust with outside entities and it cannot accept jobs from non-ALCF users)
  • Only basic job submission parameters are supported: nodecount, queue, walltime, and project.  We do not currently have support for the full range of submission parameters available via qsub.
  • As with any other Cobalt job on Cooley, there is no built-in support for MPI job launching -- MPI launch must be handled within your job script, just as with jobs submitted directly via Cobalt (see the example job script after this list).
  • While we do support native Condor file transfer, this is intended primarily for smaller file transfers.  If you need to transfer a very large volume of data, we recommend you use Globus to stage your data in ahead of time.
  • By default, job working directories are in a special Condor spool area that counts against your home directory quota.
  • HTCondor only has visibility to jobs that have been queued via HTCondor -- condor_q will not show any jobs that have been submitted by other users using qsub from the Cooley login nodes.  To see the current status of the system including all jobs, you will need to log into a login node and use qstat, or view the Cooley status page at https://status.alcf.anl.gov/cooley/activity
  • You must have myproxy-logon installed on your client system in order to retrieve a credential for submitting to HTCondor, authenticating with your CryptoCard credentials.  Personal grid certificates are not supported.
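
As noted above, MPI launch must be handled inside the job script itself.  The following is a minimal sketch of what such a script might look like; the application name, the ranks-per-node count, and the use of mpirun with the $COBALT_NODEFILE host file are illustrative assumptions that should be adapted to the MPI stack and application you actually use:

#!/bin/sh
# Hypothetical Cooley job script -- adjust names and counts for your application.
# Cobalt sets $COBALT_NODEFILE to a file listing the nodes allocated to this job.
NODES=$(wc -l < $COBALT_NODEFILE)
PROCS=$((NODES * 12))   # assumes 12 ranks per node; tune for your code
mpirun -f $COBALT_NODEFILE -n $PROCS ./my_mpi_app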

Prior to submitting your job, you will need to run myproxy-logon -s myproxy.alcf.anl.gov to authenticate to our MyProxy server and retrieve a credential.  When prompted to enter your MyProxy pass phrase, use your PIN+CryptoCard.  If you are using the mobile token app instead of a physical CryptoCard, supply the token provided by the app (with no PIN).  For example:

% myproxy-logon -s myproxy.alcf.anl.gov
Enter MyProxy pass phrase: <type PIN+CryptoCard>

A credential has been received for user acherry in /tmp/x509up_u3648.

The default lifetime for a MyProxy credential is 12 hours.  If you anticipate your job will take longer than this (including queue wait time), you should request a longer credential lifetime using the -t option to myproxy-logon.  You can specify a certificate lifetime in hours, up to a maximum of 176 hours (one week plus 8 hours).
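
For example, to request a credential with a 48-hour lifetime:

% myproxy-logon -s myproxy.alcf.anl.gov -t 48
Enter MyProxy pass phrase: <type PIN+CryptoCard>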

Make note of the location of the credential provided in the output -- you will need it for your Condor job submission.
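
If you did not note the path at retrieval time, you can recover it later with grid-proxy-info (assuming the Globus proxy utilities are installed on your client alongside myproxy-logon); the -path option prints just the credential location:

% grid-proxy-info -path
/tmp/x509up_u3648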

The class ad (job description) that you supply to condor_submit should be similar to the following example:

universe                = grid
use_x509userproxy       = true
grid_resource           = condor cooley-ce.pub.alcf.anl.gov cooley-ce.pub.alcf.anl.gov:9619
executable              = <your job script>
output                  = $(Cluster).$(Process).out
error                   = $(Cluster).$(Process).err
log                     = $(Cluster).$(Process).log
input                   = <input files>
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
notification            = NEVER
+HostNumber             = <node count>
+Queue                  = <queue name>
+BatchProject           = <your project>
+BatchRuntime           = <walltime in minutes>
x509userproxy           = <path to your myproxy credential>
queue

Items in angle brackets above should be replaced with your job parameters:

  • <your job script> - path to your job script on your local filesystem
  • <input files> - a comma-separated list of input files to be staged to your job working directory -- this parameter can be omitted if your job script does not require any additional input, or if you are using input data located elsewhere on the filesystem that has been staged in by other means (e.g. Globus)
  • <node count> - the number of nodes needed for your job (equivalent to the -n option to qsub)
  • <queue name> - the queue on Cooley you are submitting to (equivalent to the -q option to qsub).  Note: a queue must be explicitly specified, even if you are submitting to the default queue.
  • <your project> - the name of your project (equivalent to the -A option to qsub)
  • <walltime in minutes> - desired walltime in minutes (equivalent to the -t option to qsub)
  • <path to your myproxy credential> - the path to your retrieved myproxy credential (typically /tmp/x509up_u<your_numeric_uid>).  This is supplied in the output of myproxy-logon, or can be retrieved afterwards by running grid-proxy-info.

If you base your job description on the example above, your job's stdout will be written to <cluster_number>.<process_number>.out and its stderr to <cluster_number>.<process_number>.err.  The cluster number is the Condor equivalent of a job ID (note that this will not be the same as the Cobalt job ID), and for Cooley jobs the process number is generally 0.  Any new files created in your job's working directory during the run will be copied back to your client after job completion.
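
Putting the pieces together, a typical client-side session might look like the sketch below.  The submit file name cooley_job.sub is a hypothetical example; condor_submit, condor_q, and condor_rm are standard HTCondor client commands:

% myproxy-logon -s myproxy.alcf.anl.gov     # retrieve a credential (PIN+CryptoCard)
% condor_submit cooley_job.sub              # submit the job description shown above
% condor_q                                  # check the status of your HTCondor-submitted jobs
% condor_rm <cluster_number>                # remove a job if needed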

Running in a Project Directory

Temporary job working directories for HTCondor jobs are created (and removed) dynamically in a special location in /gpfs/mira-home.  As a result, these are subject to home directory quota limitations.  If you need your job to use a project directory instead of the default spool location, you will need to craft your job script to change into the project directory first (or directly reference files within the project directory instead of the job working directory).

When using a project directory, you will need to copy any output files back to the Condor spool directory if you want Condor to automatically retrieve the output after the job is done.  You'll also need to save the Condor-assigned working directory before changing to the project directory, so that your script knows where to copy the files back to.  There are a number of ways to do this, but the easiest is to use the pushd and popd commands.  For example:

JOBDIR=/projects/myproject/myjobdir
mkdir $JOBDIR                          # create a working directory in your project space
cp input1 input2 $JOBDIR               # stage inputs from the Condor spool directory
pushd $JOBDIR                          # save the spool directory and switch to the project directory
<code to run job>
popd                                   # return to the Condor spool directory
cp $JOBDIR/output1 $JOBDIR/output2 .   # copy outputs back so Condor transfers them to your client

Note, if your output files are too large to fit within your home directory quota and your quota cannot be increased, you will need to use another means to retrieve your job output files from your project directory later (e.g. scp or Globus).
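
For example, once the job has finished you could pull large results directly from the project directory over SSH; the host name below assumes the standard Cooley login host, and the path is the hypothetical project directory used in the script above:

% scp <your_username>@cooley.alcf.anl.gov:/projects/myproject/myjobdir/output1 .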