Differences between revisions 4 and 5
Revision 4 as of 2006-12-13 19:00:09
Size: 2845
Editor: scott
Comment: extend normalexit to include signal names
Revision 5 as of 2006-12-15 15:34:14
Size: 3707
Editor: scott
Comment: redraft without using a separate event
Line 9: Line 9:
When a job fails, there is currently no way for another job to be run as a result of that failure. This specification proposes adding a new event that indicates a job failed. When a job fails, there is currently no way for another job to be run as a result of that failure. This specification proposes adding arguments to the `stopping` and `stopped` events that indicate a job failed.
Line 19: Line 19:
 * Some jobs may be run when another stops, but only if the task succeeded.
Line 21: Line 23:
The scope of this specification is limited to the addition of the new event, and a change of behaviour to the `normalexit` stanza necessary to use this to its full advantage. The scope of this specification is limited to the addition of arguments to the existing events, and a change of behaviour to the `normalexit` stanza necessary to use this to its full advantage.
Line 25: Line 27:
 * When the job's main process or binary terminates with an exit code not listed in `normalexit`, a new `failed` event will be emitted.  * When the job's main process or binary unexpectedly terminates with an exit code not listed in `normalexit`, the job will be considered `failed`.
Line 34: Line 36:
 * This will receive the job name as the first positional argument.  * If the job terminates normally, the argument `ok` will be appended to both the `stopping` and `stopped` events.
Line 36: Line 38:
failed ssh stopping mysql ok
}}}

 * When a job terminates unexpectedly, the argument `failed` will be appended to both the `stopping` and `stopped` events.
 {{{
stopping apache failed
Line 41: Line 48:
failed apache
  EXIT_SIGNAL=SIGTERM
stopped apache failed
  EXIT_SIGNAL=SIGSEGV
Line 45: Line 52:
 * When any job script fails, the `failed` event will also be emitted.

 * The script name (e.g. `pre-start`) will be received as the second position argument.
 * When any job script fails, these arguments will also be appended; in addition, the script name (e.g. `pre-start`) will be received as the third position argument.
Line 49: Line 54:
failed tomcat pre-start stopped tomcat failed pre-start
  EXIT_STATUS=1
Line 56: Line 62:
This can be implemented in `job_child_reaper` in `init/job.c`. The `Job` structure in `init/job.h` will need to be extended to record whether a job has failed, which goal it failed in, and details about the exit status.
 {{{
int failed;
JobState failed_state;
int exit_status;
}}}
Line 58: Line 69:
The `JOB_RUNNING` case that only checks `normalexit` if the job is to be respawned will be moved up a level so that this is checked for all main processes. If the job is not to be respawned, but exited normally, then only the goal is changed and the new event not emitted. These are set to FALSE when a job enters the `starting` state, and are set appropriately in `job_child_reaper` in `init/job.c` if the job fails.
Line 60: Line 71:
Otherwise, if the goal is not already `JOB_STOP`, the new event will be emitted before changing the goal. In that function, the `JOB_RUNNING` case that only checks `normalexit` if the job is to be respawned will be moved up a level; such that it is checked for all such jobs.. If the job is not to be respawned, but exited normally, then only the goal is changed and not considered to have failed.

Otherwise any script that is killed, or exits with a non-zero status is considered to have failed unless the goal is already `JOB_STOP`; in which case it's reasonable to assume the death was caused by reasons out of its own control.

Please check the status of this specification in Launchpad before editing it. If it is Approved, contact the Assignee or another knowledgeable person before making changes.

Summary

When a job fails, there is currently no way for another job to be run as a result of that failure. This specification proposes adding arguments to the stopping and stopped events that indicate a job failed.

Rationale

It's often useful to know when a particular job fails: another job may be able to handle that case and either fix the problem or alert the user in some meaningful manner.

Use cases

  • If the job that checks or mounts a filesystem fails, we want to start an emergency console.
  • Some jobs may be run when another stops, but only if the task succeeded.

Scope

The scope of this specification is limited to the addition of arguments to the existing events, and a change of behaviour to the normalexit stanza necessary to use this to its full advantage.

Design

  • When the job's main process or binary unexpectedly terminates with an exit code not listed in normalexit, the job will be considered failed.

  • By default, normalexit will contain the exit code 0, unless the job is marked as respawn, in which case it will remain empty.

  • normalexit will be extended to permit signal names.

    normalexit SIGUSR1

  • If the job terminates normally, the argument ok will be appended to both the stopping and stopped events.

    stopping mysql ok

  • When a job terminates unexpectedly, the argument failed will be appended to both the stopping and stopped events.

    stopping apache failed

  • If the job failed, its exit status is placed in the EXIT_STATUS environment variable; if it was killed by a signal, the name of the signal is placed in the EXIT_SIGNAL variable instead.

    stopped apache failed
      EXIT_SIGNAL=SIGSEGV

  • When any job script fails, these arguments will also be appended; in addition, the script name (e.g. pre-start) will be received as the third positional argument.

    stopped tomcat failed pre-start
      EXIT_STATUS=1
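
The event forms listed above can be sketched in C as follows. This is only an illustration of the proposed argument layout, not code from Upstart: the helpers `format_event` and `format_exit_env` are hypothetical names, and the signal name is assumed to be supplied by the caller as a string.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of Upstart): writes the event name and
 * its positional arguments in the layout proposed above.
 *   failed - non-zero if the job is considered failed
 *   script - name of the failed script (e.g. "pre-start"), or NULL if
 *            the main process failed
 */
static int
format_event (char *buf, size_t len, const char *event,
              const char *job, int failed, const char *script)
{
	if (! failed)
		return snprintf (buf, len, "%s %s ok", event, job);

	if (script)
		return snprintf (buf, len, "%s %s failed %s",
				 event, job, script);

	return snprintf (buf, len, "%s %s failed", event, job);
}

/* Hypothetical helper: writes the environment entry that carries the
 * exit details.  signame is non-NULL when the process was killed by a
 * signal; otherwise status is the exit code.
 */
static int
format_exit_env (char *buf, size_t len, int status, const char *signame)
{
	if (signame)
		return snprintf (buf, len, "EXIT_SIGNAL=%s", signame);

	return snprintf (buf, len, "EXIT_STATUS=%d", status);
}
```

Under these assumptions, calling `format_event` with the event `stopped`, the job `tomcat`, a non-zero `failed` flag and the script `pre-start` yields the string `stopped tomcat failed pre-start`, and `format_exit_env` with status 1 and no signal name yields `EXIT_STATUS=1`.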

Implementation

Code

The Job structure in init/job.h will need to be extended to record whether a job has failed, which state it failed in, and details about the exit status.

    int      failed;
    JobState failed_state;
    int      exit_status;

These are set to FALSE when a job enters the starting state, and are set appropriately in job_child_reaper in init/job.c if the job fails.

In job_child_reaper, the JOB_RUNNING case that only checks normalexit if the job is to be respawned will be moved up a level, so that it is checked for all such jobs. If the job is not to be respawned but exited normally, then only the goal is changed and the job is not considered to have failed.

Otherwise, any script that is killed or exits with a non-zero status is considered to have failed, unless the goal is already JOB_STOP, in which case it's reasonable to assume the death was caused by reasons outside its own control.
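
The decision described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual job_child_reaper: the types are cut-down stand-ins for those in init/job.h, `ExitEntry`, `job_clear_failure` and `job_record_exit` are hypothetical names, every enum value other than JOB_STOP and JOB_RUNNING is assumed, the goal check is applied to the main process as well as to scripts, and respawning is left out entirely. How a signal is told apart from an exit code inside exit_status is not specified by the text, so the sketch simply stores the raw value.

```c
#include <signal.h>
#include <stddef.h>

#ifndef FALSE
#define FALSE 0
#define TRUE  1
#endif

/* Cut-down stand-ins; only JOB_STOP and JOB_RUNNING come from the text. */
typedef enum { JOB_STOP, JOB_START } JobGoal;
typedef enum { JOB_WAITING, JOB_STARTING, JOB_RUNNING } JobState;

/* Hypothetical normalexit entry: an exit code or a signal number. */
typedef struct {
	int is_signal;
	int value;
} ExitEntry;

typedef struct job {
	JobGoal          goal;
	JobState         state;
	const ExitEntry *normalexit;
	size_t           normalexit_len;
	int              failed;        /* fields proposed by the spec */
	JobState         failed_state;
	int              exit_status;
} Job;

/* Reset the failure record, as done when entering the starting state. */
static void
job_clear_failure (Job *job)
{
	job->failed = FALSE;
	job->failed_state = FALSE;
	job->exit_status = FALSE;
}

/* Record the death of a process belonging to job.
 *   is_main - TRUE for the main process, FALSE for a script
 *   killed  - TRUE if terminated by a signal (value is the signal
 *             number), FALSE if it exited (value is the exit code)
 */
static void
job_record_exit (Job *job, int is_main, int killed, int value)
{
	int    normal = FALSE;
	size_t i;

	if (is_main) {
		/* normalexit is consulted for every main process */
		for (i = 0; i < job->normalexit_len; i++)
			if ((! job->normalexit[i].is_signal) == (! killed)
			    && job->normalexit[i].value == value)
				normal = TRUE;
	} else {
		/* scripts succeed only by exiting with status zero */
		normal = (! killed) && (value == 0);
	}

	/* A death while already stopping is assumed to come from outside */
	if (! normal && job->goal != JOB_STOP) {
		job->failed = TRUE;
		job->failed_state = job->state;
		job->exit_status = value;
	}

	job->goal = JOB_STOP;
}
```

With a normalexit list holding exit code 0 and SIGUSR1, a main process killed by SIGSEGV while the goal is JOB_START would be recorded as failed in the JOB_RUNNING state, whereas one killed by SIGUSR1 would only have its goal changed.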

Data preservation and migration

These changes are largely backwards compatible with the previous behaviour.

The new use of normalexit to decide whether a job succeeded or failed, along with the new default value, will not adversely affect any existing job; the only additional change will be an alteration in the message for non-respawn jobs.


CategorySpec

JobFailedEvent (last edited 2006-12-15 15:34:14 by scott)