Please check the status of this specification in Launchpad before editing it. If it is Approved, contact the Assignee or another knowledgeable person before making changes.
Launchpad entry: https://launchpad.net/products/upstart/+spec/job-failed-event
When a job fails, there is currently no way for another job to be run as a result of that failure. This specification proposes adding arguments to the stopping and stopped events that indicate a job failed.
It's often useful to know when a particular job fails; another job may be able to handle that case and either fix the problem or alert the user in some meaningful manner.
- If the job that checks or mounts a filesystem fails, we want to start an emergency console.
- Some jobs may be run when another stops, but only if the task succeeded.
The scope of this specification is limited to the addition of arguments to the existing events, and a change of behaviour to the normalexit stanza necessary to use this to its full advantage.
When the job's main process or binary unexpectedly terminates with an exit code not listed in normalexit, the job will be considered failed.
By default, normalexit will contain the exit code 0, unless the job is marked as respawn, in which case it will remain empty.
normalexit will be extended to permit signal names.
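The normalexit rules above can be sketched in C as follows. This is a minimal illustration and not the upstart implementation: the NormalExitEntry type and the normalexit_default and exit_is_normal functions are hypothetical names, and signal names are assumed to have already been parsed into signal numbers.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical representation of one normalexit entry: either an exit
 * code or a signal number (a parsed signal name such as SIGTERM). */
typedef struct {
    int is_signal;  /* non-zero if value is a signal number */
    int value;      /* exit code or signal number */
} NormalExitEntry;

/* Fill in the default list described in the text: exit code 0 for
 * ordinary jobs, nothing for respawn jobs.  Returns the entry count. */
size_t normalexit_default(NormalExitEntry *list, size_t cap, int respawn)
{
    assert(list != NULL);

    if (respawn || cap == 0)
        return 0;

    list[0].is_signal = 0;
    list[0].value = 0;
    return 1;
}

/* Return non-zero if the termination matches an entry in the list.
 * Any other termination means the job is considered failed. */
int exit_is_normal(const NormalExitEntry *list, size_t len,
                   int killed_by_signal, int value)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (!!list[i].is_signal == !!killed_by_signal
            && list[i].value == value)
            return 1;
    }
    return 0;
}
```

With the default list of an ordinary job, an exit code of 0 counts as normal, while an exit code of 1 or death by SIGSEGV counts as a failure. A job that adds SIGTERM to its list would treat being killed by that signal as normal too.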
If the job terminates normally, the argument ok will be appended to both the stopping and stopped events.
stopping mysql ok
When a job terminates unexpectedly, the argument failed will be appended to both the stopping and stopped events.
stopping apache failed
- If the process exited, its exit status is placed in the EXIT_STATUS environment variable; if it was killed by a signal, the name of the signal is placed in the EXIT_SIGNAL environment variable instead.
stopped apache failed EXIT_SIGNAL=SIGSEGV
When any job script fails, these arguments will also be appended; in addition, the script name (e.g. pre-start) will be passed as the third positional argument.
stopped tomcat failed pre-start EXIT_STATUS=1
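To make the argument order concrete, here is a small C sketch that produces lines in the same form as the examples above. The JobResult type and format_job_event function are hypothetical and exist only for illustration. The sketch prints the arguments and the environment variable on one line purely for display, as the examples do.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of how a job ended. */
typedef struct {
    int         failed;       /* non-zero if the job failed */
    const char *script;       /* failing script, or NULL for the main process */
    int         by_signal;    /* non-zero if killed by a signal */
    int         status;       /* exit status, used when by_signal is zero */
    const char *signal_name;  /* e.g. "SIGSEGV", used when by_signal is set */
} JobResult;

/* Write "<event> <job> ok" or "<event> <job> failed [script] VAR=value"
 * into buf.  Returns the value of the final snprintf call. */
int format_job_event(char *buf, size_t size, const char *event,
                     const char *job, const JobResult *res)
{
    char detail[64];
    int  has_script;

    assert(buf != NULL && event != NULL && job != NULL && res != NULL);

    if (!res->failed)
        return snprintf(buf, size, "%s %s ok", event, job);

    if (res->by_signal) {
        assert(res->signal_name != NULL);
        snprintf(detail, sizeof detail, "EXIT_SIGNAL=%s", res->signal_name);
    } else {
        snprintf(detail, sizeof detail, "EXIT_STATUS=%d", res->status);
    }

    /* The script name, when present, is the third positional argument. */
    has_script = res->script != NULL && strlen(res->script) > 0;
    return snprintf(buf, size, "%s %s failed%s%s %s", event, job,
                    has_script ? " " : "",
                    has_script ? res->script : "", detail);
}
```

Called with the event stopped, the job tomcat, and a result describing a pre-start script that exited with status 1, the function writes the same line as the tomcat example above.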
The Job structure in init/job.h will need to be extended to record whether a job has failed, which state it failed in, and details about the exit status.
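int      failed;        /* whether the job has failed */
JobState failed_state;  /* state the job was in when it failed */
int      exit_status;   /* details about the exit status */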
These are set to FALSE when a job enters the starting state, and are set appropriately in job_child_reaper in init/job.c if the job fails.
In that function, the JOB_RUNNING case, which currently checks normalexit only if the job is to be respawned, will be moved up a level so that it is checked for all jobs. If the job is not to be respawned but exited normally, only the goal is changed and the job is not considered to have failed.
Otherwise, any script that is killed or exits with a non-zero status is considered to have failed, unless the goal is already JOB_STOP, in which case it is reasonable to assume the death was caused by reasons outside its control.
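The decision logic described above might look roughly like the following C sketch. It is not the code in init/job.c: the Job structure is reduced to the fields discussed here, the enum values other than JOB_RUNNING and JOB_STOP are assumptions, and job_enter_starting, job_mark_failed and job_process_exited are hypothetical names. The text does not say whether the JOB_STOP exemption also covers the main process or how respawning interacts with failure, so this sketch applies the exemption to scripts only and leaves respawning out.

```c
#include <assert.h>
#include <stddef.h>

#ifndef FALSE
#define FALSE 0
#define TRUE  1
#endif

/* Simplified stand-ins for the real types; only JOB_STOP and
 * JOB_RUNNING are named in the text, the rest are assumed. */
typedef enum { JOB_STOP, JOB_START } JobGoal;
typedef enum { JOB_WAITING, JOB_STARTING, JOB_RUNNING, JOB_STOPPING } JobState;

typedef struct {
    JobGoal  goal;
    JobState state;
    int      failed;        /* whether the job has failed */
    JobState failed_state;  /* state the job was in when it failed */
    int      exit_status;   /* details about the exit status */
} Job;

/* Clear the failure record when the job enters the starting state. */
void job_enter_starting(Job *job)
{
    assert(job != NULL);
    job->state = JOB_STARTING;
    job->failed = FALSE;
    job->failed_state = JOB_WAITING;
    job->exit_status = 0;
}

/* Record the failure; stopping the job afterwards is an assumption. */
void job_mark_failed(Job *job, int status)
{
    job->failed = TRUE;
    job->failed_state = job->state;
    job->exit_status = status;
    job->goal = JOB_STOP;
}

/* Called when a process of the job terminates.  is_normal_exit is the
 * result of checking the termination against normalexit. */
void job_process_exited(Job *job, int killed, int status, int is_normal_exit)
{
    assert(job != NULL);

    if (job->state == JOB_RUNNING) {
        /* Main process: normalexit is consulted for every job. */
        if (is_normal_exit)
            job->goal = JOB_STOP;   /* only the goal changes */
        else
            job_mark_failed(job, status);
        return;
    }

    /* Script: fails if killed or non-zero, unless already stopping. */
    if ((killed || status != 0) && job->goal != JOB_STOP)
        job_mark_failed(job, status);
}
```

Under these assumptions, a main process that exits with a status listed in normalexit only flips the goal to JOB_STOP, whereas a pre-start script that exits with status 1 while the goal is still JOB_START records the failure together with the state it occurred in.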
Data preservation and migration
These changes are largely backwards compatible with the previous behaviour.
The new meaning of normalexit to decide whether a job succeeded or failed, along with the new default value, will not adversely affect any existing job; the only additional change will be an alteration in the message for non-respawn jobs.