Jobscripts
The jobscripts supplied when creating a stage are shell scripts which the wrapper jobs execute for the user, on the worker nodes matched to that stage.
They are started in an empty workspace directory. Several environment variables are made available to the scripts, all prefixed with JUSTIN_, including $JUSTIN_WORKFLOW_ID, $JUSTIN_STAGE_ID and $JUSTIN_SECRET, which allows the jobscript to authenticate to the allocator service. $JUSTIN_PATH is used to reference files and scripts provided by justIN.
To get the details of an input file to work on, the jobscript executes the command $JUSTIN_PATH/justin-get-file. This produces a single line of output with the Rucio DID of the chosen file, its PFN on the optimal RSE, and the name of that RSE, all separated by spaces. This code fragment shows how the DID, PFN and RSE can be put into shell variables:
did_pfn_rse=`$JUSTIN_PATH/justin-get-file`
did=`echo $did_pfn_rse | cut -f1 -d' '`
pfn=`echo $did_pfn_rse | cut -f2 -d' '`
rse=`echo $did_pfn_rse | cut -f3 -d' '`
If no file is available to be processed, justin-get-file produces no output on stdout. The jobscript should check for this so that it can stop at that point. justin-get-file logs errors to stderr.
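The empty-output check can be sketched as a small shell function. The name parse_file_line is invented for this example and is not part of justIN. It relies only on the behaviour described above: one line with three space-separated fields, or nothing at all.

```shell
# parse_file_line LINE
# Splits one line of justin-get-file output into the variables did, pfn
# and rse. Returns 1 if LINE is empty, meaning no file was allocated.
# (parse_file_line is a hypothetical helper, not part of justIN.)
parse_file_line() {
  [ -n "$1" ] || return 1
  did=${1%% *}        # text before the first space
  rest=${1#* }        # text after the first space
  pfn=${rest%% *}     # second field
  rse=${rest#* }      # third field
}

# Typical use in a jobscript:
#   did_pfn_rse=$("$JUSTIN_PATH/justin-get-file")
#   parse_file_line "$did_pfn_rse" || exit 0   # nothing to do, stop cleanly
```

This uses shell parameter expansion instead of cut, so no extra processes are started. It assumes the fields are separated by exactly one space.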
justin-get-file can be called multiple times to process more than one file in the same jobscript. This can be done all at the start or repeatedly during the lifetime of the job. justin-get-file is itself a simple wrapper around the curl command, and it would also be possible to access the allocator service's REST API directly from an application.
justin-get-file has a single option which may also be given: --seconds-needed NNNN, where NNNN is the maximum number of wallclock seconds which will be needed by the jobscript to process another file and finish. If there is not enough time left, based on the --wall-seconds option used when defining the stage, then justin-get-file returns an empty result, just as if no more files were available for processing. This can be used to create adaptable jobscripts which process a series of input files without running out of time on the last one.
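One way such an adaptable loop might look is sketched below. The functions get_file and process_until_done, and the handler you pass in, are names invented for this example. Only justin-get-file and its --seconds-needed option come from the text above.

```shell
# Sketch only: get_file, process_until_done and the handler argument are
# names invented for this example, not part of justIN.

# Thin wrapper so the request command lives in one place.
get_file() {
  "$JUSTIN_PATH/justin-get-file" "$@"
}

# process_until_done SECONDS_NEEDED HANDLER
# Requests files one at a time and calls HANDLER DID PFN RSE for each.
# Stops with status 0 when justin-get-file returns nothing, or with the
# handler's status if the handler fails.
process_until_done() {
  seconds_needed=$1
  handler=$2
  while :; do
    did_pfn_rse=$(get_file --seconds-needed "$seconds_needed")
    # Empty output: no files left, or too little wallclock time remains.
    [ -n "$did_pfn_rse" ] || return 0
    did=$(printf '%s\n' "$did_pfn_rse" | cut -f1 -d' ')
    pfn=$(printf '%s\n' "$did_pfn_rse" | cut -f2 -d' ')
    rse=$(printf '%s\n' "$did_pfn_rse" | cut -f3 -d' ')
    "$handler" "$did" "$pfn" "$rse" || return $?
  done
}
```

For example, process_until_done 3600 my_handler would keep requesting files only while each request can still be given an hour to finish, where my_handler is a shell function you define to do the actual processing.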
Marking input files as processed
Each file returned by justin-get-file is marked as allocated and will not be processed by any other jobs. When the jobscript finishes, it must leave files with lists of the files it processed in its workspace directory. These lists are sent to the allocator service by the wrapper job which marks those input files as being successfully processed. Any allocated files which are not listed are treated as unprocessed, and the allocator service resets their state to unallocated, ready for matching by another job.
Files can be referred to either by DID or PFN, one per line, in the appropriate list file:
justin-processed-dids.txt
justin-processed-pfns.txt
It is not necessary to create list files which would otherwise be empty. You can refer to each processed file either by its DID or PFN (or both!) as long as they are put in the correct list file.
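Writing the list files needs nothing more than appending one line per file. The two helper names below are invented for this sketch. The list filenames are the ones given above, and the helpers assume they are called from the jobscript's workspace directory.

```shell
# Hypothetical helpers: append one entry per line to the list file that
# matches the kind of identifier. The files are created on first use, so
# an unused list is simply never written.
record_processed_did() {
  printf '%s\n' "$1" >> justin-processed-dids.txt
}

record_processed_pfn() {
  printf '%s\n' "$1" >> justin-processed-pfns.txt
}
```

A jobscript would call one of these only after it has confirmed that the corresponding input file was processed successfully.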
Output files which are to be uploaded with Rucio by the wrapper job must be created in the jobscript's workspace directory and have filenames matching the patterns given by --output-pattern or --output-pattern-next-stage when the stage was created. The suffix .json is appended to find the corresponding metadata files for MetaCat.
Jobscript exit codes
All shell scripts return an exit code, either explicitly with the command exit N, where N is the code, or implicitly, in which case the exit code of the last command executed is returned.
You can give explicit exit codes in the range 0 to 127. They are visible on the status page for each job and in the JOB_SCRIPT_ERROR events for your jobs.
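The difference between implicit and explicit exit codes can be seen with two one-line scripts run through sh -c. This is a generic shell illustration, not justIN-specific behaviour, and the variable names are chosen just for the example.

```shell
# Implicit: the script's exit code is that of its last command, so the
# earlier failure of `false` is hidden.
sh -c 'false; true'
implicit=$?

# Explicit: the script stops at `exit` and returns 57.
sh -c 'exit 57; true'
explicit=$?

echo "implicit=$implicit explicit=$explicit"
# prints: implicit=0 explicit=57
```

This is why a jobscript that relies on the implicit exit code can report success even though an earlier step failed.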
Jobscripts checklist
If you are writing your own jobscript or modifying one from someone else, please check the following are true:
- You check that the application you are running has "worked" somehow. Usually this will involve checking the lar executable has returned zero. Its exit code is available as the shell variable $? immediately after the command you are checking. If you put other commands between your application's command and checking $?, you will be checking if those other commands succeeded instead.
- When your jobscript has "worked", return 0. You can do this with exit 0
- When your jobscript has failed, return a non-zero value. This is logged by justIN so you can quickly see what is happening, and also stops justIN from uploading any of the output files (which are presumably wrong in some way) to storage. Exit codes can be between 1 and 127. For example: exit 57
- Whenever you run justin-get-file in your jobscript, check if the output is empty. In that case, there are no more input files to process and your jobscript should stop immediately with exit 0, as it's ok, there's just nothing more to do.
- Once an input file has been processed successfully, add its PFN (like an xroot URL) to justin-processed-pfns.txt or its DID (Rucio scope:name) to justin-processed-dids.txt. This tells justIN that the file can be marked as processed in its database, and does not need to be given to another job to try again.
- If your jobscript processes multiple input files, do not leave output files resulting from input files you fail to process successfully: those output files will be uploaded to storage too. A good pattern is to check the processing worked, and then rename the output file with mv to its final name if and only if the processing worked.
- If your jobscript produces metadata files for your output files, they must have exactly the same name as the output file they are about plus .json. This is how justIN finds the metadata, and if none is found, then only very basic metadata generated by justIN itself is stored in MetaCat.
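Several of the checklist items can be combined in one helper, sketched here under stated assumptions. The function process_one and its arguments are invented for this example. The temporary output name is assumed not to match any output pattern of the stage, and the application is assumed to write its output to that temporary name.

```shell
# process_one PFN TMP_OUT FINAL_OUT COMMAND [ARGS...]
# Runs COMMAND, which is expected to write TMP_OUT. Only if COMMAND
# returns zero is TMP_OUT renamed to FINAL_OUT and PFN recorded as
# processed. (process_one and its arguments are invented for this sketch.)
process_one() {
  pfn=$1
  tmp_out=$2
  final_out=$3
  shift 3
  "$@"
  status=$?                      # read $? immediately after the command
  if [ "$status" -ne 0 ]; then
    rm -f "$tmp_out"             # leave no partial output to be uploaded
    return "$status"
  fi
  mv "$tmp_out" "$final_out" || return 1
  printf '%s\n' "$pfn" >> justin-processed-pfns.txt
}
```

A jobscript might call this once per input file and finish with a non-zero exit code of its choosing, such as exit 57, if the call fails. Any metadata file would be written alongside the final output name with .json appended.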