Please check the status of this specification in Launchpad before editing it. If it is Approved, contact the Assignee or another knowledgeable person before making changes.
At the November vUDS we discussed adding support for cgroups in Upstart.
Before going into details about the proposed stanza and overall behaviour, note that, contrary to some other init systems, our intent is solely resource control, which is the main goal of cgroups. Process grouping and tracking will remain unaffected by the addition of cgroup support.
Quite a few of our users expressed interest in applying resource restriction to Upstart jobs, both for system jobs and user jobs.
The scope of this work is limited to adding the new cgroup stanza and its integration with the cgroup manager API. We don't want to add too much cgroup-specific knowledge into Upstart and want the new stanza to feel as natural as possible.
Cgroup support will be implemented by adding a new "cgroup" stanza which will control the application of cgroup based restrictions to the job. The limits will be applied to all scripts
The format for the stanza is:
cgroup <controller> [<cgroup-name>] [<key> <value>]
This allows the stanza to be specified in the following ways:
cgroup <controller>                              # Implicit name, no setting.
cgroup <controller> <cgroup-name>                # Explicit name, no setting.
cgroup <controller> <key> <value>                # Implicit name with setting.
cgroup <controller> <cgroup-name> <key> <value>  # Explicit name with setting.
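As an illustration only, the four forms can be told apart by counting the arguments after the "cgroup" keyword. The following Python sketch is not Upstart's parser (Upstart is written in C); the names CgroupStanza and parse_cgroup_stanza are hypothetical, and it assumes shell-style double quoting and "#" comments.

```python
import shlex
from typing import NamedTuple, Optional


class CgroupStanza(NamedTuple):
    """Hypothetical in-memory form of one cgroup stanza."""
    controller: str
    name: Optional[str]   # None means "use the implicit default name"
    key: Optional[str]
    value: Optional[str]


def parse_cgroup_stanza(line: str) -> CgroupStanza:
    # shlex keeps a double-quoted string such as "c 1:2 rwm" as one token
    # and, with comments=True, drops anything after a '#'.
    tokens = shlex.split(line, comments=True)
    if not tokens or tokens[0] != "cgroup":
        raise ValueError("not a cgroup stanza: %r" % line)
    args = tokens[1:]
    if len(args) == 1:                       # cgroup <controller>
        return CgroupStanza(args[0], None, None, None)
    if len(args) == 2:                       # cgroup <controller> <cgroup-name>
        return CgroupStanza(args[0], args[1], None, None)
    if len(args) == 3:                       # cgroup <controller> <key> <value>
        return CgroupStanza(args[0], None, args[1], args[2])
    if len(args) == 4:                       # all four fields given
        return CgroupStanza(args[0], args[1], args[2], args[3])
    raise ValueError("wrong number of arguments: %r" % line)
```

Counting tokens keeps the parser free of any controller-specific knowledge, which matches the stated goal of not hardcoding cgroup details into Upstart.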
Detail on the fields:
<controller>: Name of one of the cgroup controllers.
Currently the valid values are (but won't be hardcoded into upstart):
<cgroup-name>: Name of the cgroup to use (it will be created if it does not exist).
The name may contain a slash (e.g. "db/pgsql"), indicating that a sub-cgroup is being requested. The name may also contain spaces, in which case it must be quoted.
The name may contain any of the usual Upstart variables. An extra variable, $UPSTART_CGROUP, is also defined. It is only valid in the scope of the cgroup stanza and is the equivalent of $UPSTART_JOB-$UPSTART_INSTANCE with any "/" in those variables replaced by an "_" (similar to what's done for logging).
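A minimal sketch of that expansion, assuming plain string substitution. The function names are hypothetical and this is not Upstart code; note that only slashes inside the two variables are replaced, not slashes written literally in the stanza.

```python
def upstart_cgroup_name(job: str, instance: str) -> str:
    """Value of $UPSTART_CGROUP: job and instance joined by '-', '/' -> '_'."""
    def sanitise(part: str) -> str:
        return part.replace("/", "_")
    return sanitise(job) + "-" + sanitise(instance)


def expand_cgroup_name(name: str, job: str, instance: str) -> str:
    """Expand $UPSTART_CGROUP inside a cgroup name taken from a stanza."""
    return name.replace("$UPSTART_CGROUP", upstart_cgroup_name(job, instance))
```

For a job named "db/pgsql" with instance "main", "db/$UPSTART_CGROUP" would expand to "db/db_pgsql-main": the literal "db/" still requests a sub-cgroup, while the slash from the job name does not.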
If a name is not specified, Upstart will create a default cgroup for the job (equivalent to specifying a name of "$UPSTART_CGROUP") and move the job processes into it. Note that if a name is specified and that name does not begin with "$UPSTART_CGROUP", the job can join an existing, non-Upstart-created cgroup.
The main use of that field is for cases where a set of jobs should share limits. In such a case, the main job should declare the various values and the other jobs should just refer to the cgroup by name without defining any values.
The name may be different for the various controllers but may not differ within the same controller. Example:
valid =>
cgroup memory group1 limit_in_bytes 52428800
cgroup cpuset group2 cpus 0-1
invalid =>
cgroup memory group1 limit_in_bytes 52428800
cgroup memory group2 soft_limit_in_bytes 1024
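The rule can be checked mechanically once the stanzas of a job are reduced to (controller, name) pairs. The following is a sketch under that assumption; find_name_conflicts is a hypothetical helper, not part of Upstart.

```python
def find_name_conflicts(stanzas):
    """Return (controller, first_name, conflicting_name) for each violation.

    `stanzas` is an iterable of (controller, cgroup_name) pairs taken from
    a single job, in the order they appear in the job file.
    """
    first_name = {}
    conflicts = []
    for controller, name in stanzas:
        # Remember the first name seen for each controller.
        expected = first_name.setdefault(controller, name)
        if name != expected:
            conflicts.append((controller, expected, name))
    return conflicts
```

Applied to the examples above, the valid job yields no conflicts, while the invalid job yields one conflict for the memory controller (group1 versus group2).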
<key>: The cgroup control file name minus the controller prefix. For example, memory.soft_limit_in_bytes becomes soft_limit_in_bytes.
<value>: Any value valid for the given control file. Upstart itself won't perform any validation.
If the value contains spaces, it should be enclosed in double quotes, for example:
cgroup devices $UPSTART_CGROUP allow "c 1:2 rwm"
Upstart won't have any controller-aware logic in its code. Instead, it will simply talk over D-Bus (using a private D-Bus socket) to the cgroup manager, which will take care of applying the various limits. That cgroup manager will be started very early in the boot sequence. Any job containing a cgroup stanza will be held until the manager is started.
The cgroup will be destroyed when a job is stopped and the cgroup isn't shared with another job (task count is 0 and it has no child cgroup).
It'll be possible to disable cgroup support entirely by either building upstart without it (needed for non-Linux systems) or by passing --no-cgroup as a parameter to upstart. In that case, the cgroup stanza will simply be ignored and the jobs will start without limitations.
All of the above is also meant to apply to user sessions. The cgroup manager will allow unprivileged cgroup configuration, so as long as the user has write access to a sub-section of a controller, it'll be allowed to write entries there. Similarly to other restriction stanzas, failure to apply a cgroup limit in a user session won't be fatal.
New initctl for cgroup manager notification
A new initctl command tentatively called notify-cgroup-manager-address will be added. Example usage:
initctl notify-cgroup-manager-address $CGROUP_MANAGER_DBUS_ADDRESS
This follows the pattern of notify-disk-writeable and notify-dbus-address and will allow the cgroup manager job's post-start to notify Upstart of the D-Bus address it should connect to in order to talk to the cgroup manager.
Upstart will use this connection to request that the cgroup manager create and destroy cgroups as and when jobs require them.
Cgroup creation, deletion, entry and setting values
Cgroups will be created at the point they are needed: that is if a job specifies the "cgroup" stanza, Upstart will fork the child process, perform the initial child setup, create the cgroups, move the pid into the group and exec the requested program now running in the correct group.
Deletion will be handled by the cgmanager automatically when the cgroup becomes empty (see https://bugs.launchpad.net/ubuntu/+source/cgmanager/+bug/1281683). This avoids the need for PID 1 to retain state on whether Upstart created the cgroup on behalf of a job, and thus whether the group should be deleted once all the jobs using it end. (This is especially convenient since the design below does not nominally allow sufficient feedback from the child to the parent on whether cgroups existed before the child called Create().)
The setup phase in more detail:
PID 1 connects to the cgmanager when notified of its address (via initctl notify-cgroup-manager-address).
- This connection is only used by the parent to identify if the cgmanager is available and therefore to allow a job that needs cgroup support to be started. All the cgmanager calls are handled by the child.
PID 1 calls fork().
- PID 1 will check the pipe fd connected to the child as usual.
- Child will perform the initial child setup:
- On failure, it will write the appropriate error to the fd and then exit as usual.
On success, it will manually close the fd so PID 1 sees that the (initial) child setup was successful and mark the job as "JOB_SETUP" (meaning cgroup setup in progress).
- Child opens a new connection to the cgmanager.
- This might block the child, but that's OK since this cannot block PID 1.
- On failure, the child may hang forever or exit.
On success, when the child calls exec(), Upstart will be notified since it's being ptraced.
Child calls the cgmanager's MovePidAbs() D-Bus method to escape its current controller cgroup (to ensure created groups are absolute, rather than below the current group).
Child calls the cgmanager's Create() D-Bus method to create the appropriate cgroups.
- Note that since the creation is handled in the child, the parent has no way of knowing when to delete the cgroup.
Child calls the cgmanager's RemoveOnEmpty() D-Bus method to have the cgmanager auto-delete the cgroup once it has become empty.
If appropriate, the child calls the cgmanager's SetValue() D-Bus method.
If the job specifies setuid or setgid, chown() the cgroup to the appropriate user/group.
Child raises SIGSTOP on itself.
Parent will see the SIGSTOP and since the job is in state JOB_SETUP know that the cgroup creation was successful and thus move the state on.
Parent will send SIGCONT to the child to allow it to continue.
Child drops privs using setgid() and setuid() if the job specifies the corresponding stanzas.
Child calls the cgmanager's MovePid() D-Bus method as the final uid the job will run as.
Child calls exec().
Breaking with the existing model, if the exec() fails, PID 1 won't be notified directly via the pipe (since it's now closed). However, Upstart will still learn of the failure because the child will exit and that exit will be seen by the waitid() handler. As the job is marked "JOB_CGROUP" and is being ptraced, PID 1 will know that the cgroup setup phase failed.
PID 1 will be notified of one of its job processes calling exec() (via the ptrace) and will then drop the ptrace, allowing the job process to continue in the cgroup it was moved into.
So in the worst possible case where the cgmanager is not responding, we'll get a build-up of jobs in state "JOB_CGROUP" which will not block PID 1, but which can be killed.
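The SIGSTOP/SIGCONT handshake in the steps above can be sketched in isolation. This Python example is illustrative only: it leaves out the pipe, ptrace and all cgmanager D-Bus calls, uses waitpid() rather than Upstart's waitid() handler, and the function name and callbacks are hypothetical stand-ins for the child's cgroup setup and final exec().

```python
import os
import signal


def run_with_stop_handshake(child_setup, child_main):
    """Fork; the child runs child_setup(), stops itself, then runs child_main().

    Returns (setup_ok, exit_code). setup_ok is False if the child exited
    before raising SIGSTOP, which the parent treats as a failed setup.
    """
    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            child_setup()                          # stand-in for cgroup setup
            os.kill(os.getpid(), signal.SIGSTOP)   # tell the parent setup is done
            code = int(child_main())               # stand-in for exec()
        except Exception:
            code = 1
        finally:
            os._exit(code)                         # never return into the parent's code

    # Parent: WUNTRACED makes waitpid() report a stopped child as well.
    _, status = os.waitpid(pid, os.WUNTRACED)
    if not os.WIFSTOPPED(status):
        # Child died during setup; no SIGSTOP was seen.
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
        return False, code

    os.kill(pid, signal.SIGCONT)                   # allow the child to continue
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return True, code
```

Because the parent only ever waits on child state changes, a child that hangs during setup leaves the job in the setup state without blocking the parent's other work, which is the property the design relies on.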
Additional work item
- Move the setgid() and setuid() setup code to the end of the setup phase (for cgroup jobs and non-cgroup jobs) so that it is handled just prior to the exec().
- Ensure the code is reviewed by the Security Team.
It is not an error if the cgroup already exists. However, if the cgroup does not exist and cannot be created prior to starting the job process, the job will fail.
Upstart will maintain state such that once the last consumer of a particular cgroup has finished, Upstart will request that the cgroup be deleted.
If a job specifies a sub-group that does not yet exist, Upstart will attempt to create any missing levels in the hierarchy. For example, if a job specifies:
cgroup cpuset db/$UPSTART_CGROUP cpus 1
If, when that job starts, the cgroup "db/" does not already exist, this will not be considered an error: Upstart will attempt to create it, and will fail the job if any level in the hierarchy cannot be created for a reason other than EEXIST.
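The level-by-level creation can be sketched as follows. To keep the example runnable, plain directories under a caller-supplied root stand in for cgroups; the actual design creates cgroups through the cgmanager, and the function names here are hypothetical.

```python
import os


def cgroup_levels(name):
    """'db/job-1' -> ['db', 'db/job-1'] (every prefix of the path)."""
    parts = [p for p in name.split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def create_hierarchy(root, name):
    """Create each missing level under root; return the levels created.

    An already existing level (EEXIST) is skipped silently. Any other
    OSError propagates, which corresponds to failing the job.
    """
    created = []
    for level in cgroup_levels(name):
        try:
            os.mkdir(os.path.join(root, level))
            created.append(level)
        except FileExistsError:
            pass  # EEXIST is not an error
    return created
```

Creating parents before children means a second job that names "db/..." simply finds "db" already present and only creates its own leaf.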
D-Bus does not currently allow a DBusConnection to be serialised. Until that limitation is overcome, Upstart must serialise:
- $CGROUP_MANAGER_DBUS_ADDRESS, to allow it to reconnect to the cgroup manager post re-exec.
- Sufficient state that, post re-exec, it will know which of the existing cgroups it requested be created should be deleted once all consumers of that cgroup end.
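As a rough sketch of what that serialised state might contain, the following round-trips the manager address and a map from each cgroup to the jobs still using it. JSON, the key names and the "controller:name" identifiers are all assumptions made for illustration; this spec does not define the format.

```python
import json


def serialise_cgroup_state(manager_address, consumers):
    """Serialise the address plus a {cgroup_id: set_of_job_names} map.

    cgroup_id is a hypothetical 'controller:name' string such as 'memory:db'.
    """
    return json.dumps(
        {
            "cgroup_manager_address": manager_address,
            # Sets are not JSON types, so store sorted lists instead.
            "cgroup_consumers": {k: sorted(v) for k, v in consumers.items()},
        },
        sort_keys=True,
    )


def deserialise_cgroup_state(blob):
    """Inverse of serialise_cgroup_state(); returns (address, consumers)."""
    data = json.loads(blob)
    consumers = {k: set(v) for k, v in data["cgroup_consumers"].items()}
    return data["cgroup_manager_address"], consumers
```

After the re-exec, the new process would reconnect using the restored address and could request deletion of a cgroup once its consumer set becomes empty.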
Now a few examples to try and illustrate the thoughts behind that proposal:
Single job simple example
cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800
The job will only start once the manager is up and running and will have a 50MB memory limit. If the system has less than 50MB, the job will fail to start.
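The byte values used in these examples are binary multiples, so the "50MB" above is 50 × 1024 × 1024 bytes. A small helper (hypothetical, for checking the numbers only) makes the arithmetic explicit:

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to the byte count written in a cgroup stanza."""
    return mib * MIB


# The values that appear in the examples of this specification.
assert mib_to_bytes(50) == 52428800     # limit_in_bytes for a 50MB limit
assert mib_to_bytes(100) == 104857600   # limit_in_bytes for a 100MB limit
assert mib_to_bytes(1) == 1048576       # 1MB/s throttle value
```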
Single job complex example
cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800
cgroup cpuset $UPSTART_CGROUP cpus 0-1
cgroup blkio slowio throttle.write_bps_device "8:16 1048576"
The job will only start once the manager is up and running and will have a 50MB memory limit, be restricted to CPU ids 0 and 1 and have a 1MB/s write limit to the block device 8:16. The job will fail to start if the system has less than 50MB of RAM or less than 2 CPUs.
Multiple jobs complex example
# Job 1
cgroup cpuset db cpus 0-1
cgroup memory db limit_in_bytes 104857600
cgroup blkio db throttle.write_bps_device "8:16 1048576"

# Job 2
cgroup cpuset db/$UPSTART_CGROUP cpus 1
cgroup memory db/$UPSTART_CGROUP limit_in_bytes 52428800
cgroup blkio db/$UPSTART_CGROUP throttle.write_bps_device "8:17 1048576"

# Job 3
cgroup cpuset db
cgroup memory db

# Job 4
cgroup cpuset db/$UPSTART_CGROUP cpus 2
This is rather complex, so let's go job by job:
- Job 1 will start bound to CPU 0 and 1 with a 100MB memory limit and 1MB/s write limit to the 8:16 block device. It'll fail to start if the system has less than 2 CPUs or less than 100MB of RAM.
- Job 2 will start bound to CPU 1 and with a 50MB memory limit. It'll inherit the 1MB/s write limit to 8:16 and, on top of that, rate limit writes to 8:17 at 1MB/s as well. The job will fail to start if the system has less than 50MB of RAM or fewer than 2 CPUs.
- Job 3 will start in the "db" cpuset and memory cgroups. If it starts before Job 1, no limit will be applied at startup time. As soon as Job 1 starts however Job 3 will be limited to 2 CPUs and 100MB of memory. As it doesn't have a blkio statement, it won't have rate limited I/Os.
- Job 4 if started after Job 1 will fail to start as it's requesting a CPU that the parent cgroup doesn't have access to. If started before Job 1 however, it won't have a parent value set so will inherit the default and so will start so long as the system has at least 3 CPUs.
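The Job 2 and Job 4 outcomes follow from the cpuset rule that a child cgroup's CPUs must be a subset of its parent's. The check below is a sketch for illustration; in the proposed design Upstart performs no such validation itself and the rejection comes from the kernel via the cgroup manager. The function names are hypothetical.

```python
def parse_cpu_list(spec: str) -> set:
    """Parse a cpuset list such as '0-1' or '0,2-3' into a set of CPU ids."""
    cpus = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if "-" in chunk:
            low, high = chunk.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def child_cpus_allowed(parent_spec: str, child_spec: str) -> bool:
    """True if every CPU requested by the child is available to the parent."""
    return parse_cpu_list(child_spec) <= parse_cpu_list(parent_spec)
```

With Job 1's "db" cpuset set to "0-1", Job 2's request for CPU 1 is allowed, while Job 4's request for CPU 2 is not.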
This feature depends on the implementation of the cgroup manager which is currently in progress.
It should be possible to disable the feature both at build time and at run time. When disabled, any cgroup stanza will be ignored without an error being raised.
When enabled, Upstart will wait for a signal (to be determined) to be emitted by the cgroup manager once it's ready to process requests, at which point Upstart will allow any job that has its startup condition met and uses cgroups to be started.
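The holding behaviour can be pictured as a simple gate. This is a sketch with hypothetical names; since the readiness signal is still to be determined, it is modelled here as a plain method call.

```python
class CgroupGate:
    """Holds cgroup-using jobs until the cgroup manager reports ready."""

    def __init__(self):
        self.manager_ready = False
        self.held = []          # jobs waiting for the manager, in arrival order

    def request_start(self, job: str, uses_cgroups: bool) -> bool:
        """Return True if the job may start now, False if it is held."""
        if uses_cgroups and not self.manager_ready:
            self.held.append(job)
            return False
        return True

    def manager_is_ready(self):
        """Mark the manager ready and return the jobs that can now start."""
        self.manager_ready = True
        released, self.held = self.held, []
        return released
```

Jobs without a cgroup stanza never pass through the held list, so early boot is unaffected for them.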
Communication with the cgroup manager is done over D-Bus using a private socket (it doesn't depend on the system bus).
cgroup stanza syntax
As mentioned on the LXC mailing list, the cgroup syntax may change. We need to be very aware of this and ensure that a suitable abstraction for the cgroup stanza values is used if appropriate. Since the cgmanager authors are already discussing this issue with the cgroup kernel subsystem maintainer, we should however get this "for free" once the cgmanager spec is finalised.
An important consideration from the Upstart side is to ensure that Upstart does not block when requesting services from cgmanager. Ideally, the cgmanager would offer a callback-type interface to allow Upstart to handle cgroup creation/deletion events (both requested and indirectly notified) via its main loop.
initctl notify-cgroup-manager-address command
This follows the existing pattern used by notify-disk-writeable and notify-dbus-address. Dmitrijs has suggested we consider a generic "initctl notify <name> <value>" facility, which is a good idea as:
- it could simplify re-exec handling internally
- it's a more elegant solution to the problem of notifying upstart of important changes (we could alternatively use events, or implement namespaced events).
Inotify notification for cgmanager socket
It would be cleaner if we could use inotify to avoid the need for notify-cgroup-manager-address. However, if the well-known socket the cgmanager creates is abstract, we can't watch for it.
This specification has been developed with the input of the following people:
- Stéphane Graber (original author)
- James Hunt
- Serge Hallyn
- Steve Langasek