Retry Limit

Any job can be configured to be resubmitted automatically if it fails. Typical failures are a node shutdown or a program aborting. In a cluster environment it is useful to submit jobs to a generic batch queue. That way, if a node goes down, the job is resubmitted to the generic queue, which can then route it to another node.

To enable restarts for a particular job, set the restart count to a number greater than zero.

Open the Job Properties page by right-clicking the job in either the layout window or the tree view.
Select the Settings tab.
Enter the number of required restarts in the Retry limit is: field.

Or use the following command:

C:\> schedule modify \demo\a\start/general=restart_count=4

The restart count indicates how many times a job can be restarted if it fails. Once this number of restarts has occurred and the job fails again, the job is considered to have failed. When a job has terminated (either successfully or with a failure), the initiate list associated with the job is examined and any appropriate jobs on the list are started.
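The restart-count rule can be illustrated with a small simulation. This is only a sketch of the behavior described above, not the scheduler's implementation; `run_with_restarts` and `make_flaky_job` are hypothetical names introduced for the example, and a job is modeled as a function that returns True on success.

```python
from typing import Callable, Tuple


def run_with_restarts(job: Callable[[], bool], restart_limit: int) -> Tuple[bool, int]:
    """Run `job` until it succeeds or the restart limit is exhausted.

    Returns (succeeded, restarts_used). Illustrative model only.
    """
    restarts_used = 0
    while True:
        if job():                          # the job reported success
            return True, restarts_used
        if restarts_used >= restart_limit:
            return False, restarts_used    # limit exhausted: job has failed
        restarts_used += 1                 # resubmit the job and try again


def make_flaky_job(failures_before_success: int) -> Callable[[], bool]:
    """Build a stand-in job that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def job() -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return job


if __name__ == "__main__":
    # With a restart count of 4, a job that fails twice still succeeds.
    print(run_with_restarts(make_flaky_job(2), restart_limit=4))
```

Under this model a restart count of 4 allows up to five attempts in total: the original run plus four restarts.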

A group of environment variables is automatically set up whenever a job is submitted. These environment variables can be used by the job to determine whether or not the job is restarting. The environment variables used are listed below:

Environment variable	Description
SCHEDULE_STEP	current step number
SCHEDULE_ENTRY	scheduling queue entry number
SCHEDULE_RESTARTING	1 if the job is restarting, otherwise 0
SCHEDULE_RESTART_COUNT	number of restarts that have occurred
SCHEDULE_RESTART_LIMIT	allowed number of restarts
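A job written in a scripting language could read these variables at startup. The following is a minimal sketch, assuming the values are decimal integers and that a missing variable can be treated as 0; `RestartInfo` and `read_restart_info` are hypothetical helpers, and only the variable names come from the table above.

```python
import os
from typing import Mapping, NamedTuple


class RestartInfo(NamedTuple):
    step: int            # SCHEDULE_STEP
    entry: str           # SCHEDULE_ENTRY, kept as text; it is only passed back to commands
    restarting: bool     # SCHEDULE_RESTARTING == 1
    restart_count: int   # SCHEDULE_RESTART_COUNT
    restart_limit: int   # SCHEDULE_RESTART_LIMIT


def read_restart_info(env: Mapping[str, str] = os.environ) -> RestartInfo:
    """Collect the restart-related variables from the job's environment."""

    def as_int(name: str) -> int:
        # Assumption: unset or empty variables are treated as 0.
        return int(env.get(name, "0") or "0")

    return RestartInfo(
        step=as_int("SCHEDULE_STEP"),
        entry=env.get("SCHEDULE_ENTRY", ""),
        restarting=as_int("SCHEDULE_RESTARTING") == 1,
        restart_count=as_int("SCHEDULE_RESTART_COUNT"),
        restart_limit=as_int("SCHEDULE_RESTART_LIMIT"),
    )


if __name__ == "__main__":
    info = read_restart_info()
    if info.restarting:
        print(f"restart {info.restart_count} of {info.restart_limit}, resuming at step {info.step}")
    else:
        print("first run")
```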

The basic idea is to update the STEP number as the job proceeds. At the beginning of the job, the RESTARTING flag is checked; if it is set, a GOTO jumps to the step recorded in SCHEDULE_STEP. A typical command sequence that uses this process would appear as follows:

C:\> if schedule_restarting then goto step_'schedule_step'
C:\>REM
C:\>step_1:
C:\> schedule modify/queue/entry='schedule_entry'/step=1
C:\> run program1
C:\>REM
C:\>step_2:
C:\> schedule modify/queue/entry='schedule_entry'/step=2
C:\> run program2
C:\>REM
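The same checkpoint-and-resume pattern can be sketched in Python. This is an illustration under stated assumptions rather than a supported interface: `run_steps` is a hypothetical helper, the environment is passed in as a mapping, and `record_step` stands in for whatever command updates the step number on the queue entry (the `schedule modify` lines above).

```python
from typing import Callable, List, Mapping


def run_steps(
    steps: List[Callable[[], None]],
    env: Mapping[str, str],
    record_step: Callable[[int], None],
) -> List[int]:
    """Run numbered steps in order, skipping completed ones on a restart.

    Returns the step numbers that were actually executed.
    """
    restarting = env.get("SCHEDULE_RESTARTING", "0") == "1"
    # On a restart, resume at the recorded step; otherwise start at step 1.
    first = int(env.get("SCHEDULE_STEP", "1")) if restarting else 1
    first = max(first, 1)

    executed = []
    for number, step in enumerate(steps, start=1):
        if number < first:
            continue              # finished before the failure, so skip it
        record_step(number)       # record the step BEFORE running it
        step()
        executed.append(number)
    return executed


if __name__ == "__main__":
    import os

    steps = [lambda: print("program1"), lambda: print("program2")]
    run_steps(steps, os.environ, lambda n: print(f"now at step {n}"))
```

Recording the step number before the program runs means that a failure inside step 2 resumes at step 2, so the step that failed is run again rather than skipped.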

The above method works well for failures that a restart from the last recorded step can recover from. Another way to handle failures is to assign an initiate list to the job and place on that list a job with a condition level of FATAL and/or ERROR. This error job is then started whenever the job fails (or, if restarts are enabled, when it still fails after exhausting all allowed restarts).
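The selection of jobs from an initiate list can be pictured as a simple filter on the final condition of the job. The sketch below is a conceptual model only: the data layout, the `jobs_to_start` helper, the job names, and the SUCCESS condition name are assumptions made for illustration; only the FATAL and ERROR levels come from the text above.

```python
from typing import List, Set, Tuple

# (job name, condition levels that cause the job to be started) -- hypothetical layout
InitiateEntry = Tuple[str, Set[str]]


def jobs_to_start(initiate_list: List[InitiateEntry], final_condition: str) -> List[str]:
    """Return the jobs whose condition levels match the terminated job's final condition."""
    return [name for name, conditions in initiate_list if final_condition in conditions]


if __name__ == "__main__":
    initiate_list = [
        ("next_job", {"SUCCESS"}),          # assumed name for a normal completion
        ("error_job", {"FATAL", "ERROR"}),  # started only when the job fails
    ]
    print(jobs_to_start(initiate_list, "ERROR"))
```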