Any job can be set up to be resubmitted automatically if it fails. Typical failures are a node shutdown or a program abort. In a cluster environment it is useful to submit jobs to a generic batch queue: if a node goes down, the job is resubmitted to the generic queue, which can then route it to another node.
To enable restarts on any particular job, set the restart count to a number greater than zero.
Or use the following command:
C:\> schedule modify \demo\a\start/general=restart_count=4
The restart count indicates how many times a job can be restarted if it fails. After this number of restarts has occurred, the job is considered to have failed. Once a job has terminated (whether it succeeded or failed), the initiate list associated with the job is examined and any appropriate jobs on the list are started.
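The restart limit described above can be modelled in a few lines of Python. This is only a sketch of the behaviour as the text describes it, not the scheduler's own implementation; the function name `run_with_restarts` is hypothetical, and it assumes that the initial run does not count as a restart, so a restart count of 4 allows up to five attempts in total.

```python
def run_with_restarts(job, restart_limit):
    """Run job() until it succeeds or the restart limit is exhausted.

    Returns a (status, restarts_used) tuple. Assumption: the first run is
    not counted as a restart.
    """
    restarts = 0
    while True:
        try:
            job()
            return "success", restarts
        except Exception:
            if restarts >= restart_limit:
                # No restarts left: the job is considered to have failed.
                return "failed", restarts
            restarts += 1


if __name__ == "__main__":
    attempts = []

    def flaky():
        # Hypothetical job that fails on its first two runs.
        attempts.append(1)
        if len(attempts) < 3:
            raise RuntimeError("simulated failure")

    print(run_with_restarts(flaky, restart_limit=4))
```

In this model a job that always fails ends with the status "failed" after the initial run plus `restart_limit` restarts.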
A group of environment variables is automatically set up whenever a job is submitted. These environment variables can be used by the job to determine whether or not the job is restarting. The environment variables used are listed below:
Environment variable | Description
SCHEDULE_STEP | current step number
SCHEDULE_ENTRY | scheduling queue entry number
SCHEDULE_RESTARTING | 1 if the job is restarting, otherwise 0
SCHEDULE_RESTART_COUNT | number of restarts that have occurred
SCHEDULE_RESTART_LIMIT | allowed number of restarts
The basic idea is to update the STEP number as the job proceeds. At the beginning of the job, a check is made for the RESTARTING flag; if it is set, a GOTO is done to the correct step. A typical command sequence that uses this process would appear as follows:
C:\> if schedule_restarting then goto step_'schedule_step'
C:\>REM
C:\>step_1:
C:\> schedule modify/queue/entry='schedule_entry'/step=1
C:\> run program1
C:\>REM
C:\>step_2:
C:\> schedule modify/queue/entry='schedule_entry'/step=2
C:\> run program2
C:\>REM
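If the job itself is a program rather than a command script, the same resume logic can be applied by reading the environment variables directly. The Python sketch below is illustrative only: the helper names `read_restart_context` and `run_job` and the `record_step` callback are hypothetical, and it assumes the variables hold decimal strings with SCHEDULE_RESTARTING set to "1" on a restart. The callback stands in for the `schedule modify/queue/entry=.../step=N` command shown above; the sketch does not invoke the scheduler itself.

```python
import os


def read_restart_context(env=None):
    """Collect the scheduler-provided variables (values assumed to be strings)."""
    env = os.environ if env is None else env
    return {
        "step": int(env.get("SCHEDULE_STEP", "1")),
        "entry": env.get("SCHEDULE_ENTRY", ""),
        "restarting": env.get("SCHEDULE_RESTARTING", "0") == "1",
    }


def run_job(steps, env=None, record_step=None):
    """Run the steps in order, skipping completed ones when restarting.

    steps: list of callables; step numbers are 1-based.
    record_step: hypothetical callback(entry, step_number) that stands in
    for recording the current step with the scheduler.
    Returns the list of step numbers that were executed.
    """
    ctx = read_restart_context(env)
    first = ctx["step"] if ctx["restarting"] else 1
    executed = []
    for number, step in enumerate(steps, start=1):
        if number < first:
            continue  # this step finished before the failure
        if record_step is not None:
            record_step(ctx["entry"], number)
        step()
        executed.append(number)
    return executed
```

Recording the step number before each step runs mirrors the command sequence above: if the job fails during step 2, the restarted job finds SCHEDULE_STEP set to 2 and resumes there instead of repeating step 1.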
The above method is excellent for handling certain types of failures. Another way to handle failures is to assign an initiate list to the job. On that initiate list, provide a job with a condition level of FATAL and/or ERROR. This error job is then started whenever the job fails (or, if restarts are enabled, still fails after exhausting all allowed restarts).