At the November vUDS we discussed adding support for cgroups in Upstart.
Before going into the details of the proposed stanza and its overall behaviour, I'd like to begin by saying that, contrary to some other init systems, our intent is solely resource control, which is the main goal of cgroups. Process grouping and tracking will remain unaffected by the addition of cgroup support.
Quite a few of our users expressed interest in applying resource restrictions to Upstart jobs, both for system jobs and user jobs.
The scope of this work is limited to adding the new cgroup stanza and its integration with the cgroup manager API. We don't want to add too much cgroup-specific knowledge into upstart and want the new stanza to feel as natural as possible.
Cgroup support will be implemented by adding a new "cgroup" stanza, which will control the application of cgroup-based restrictions to the job. The limits will be applied to all scripts.
The format for the stanza is:
cgroup <controller> [<cgroup-name>] [<key> <value>]
This allows the stanza to be specified in the following ways:
cgroup <controller>                              # Implicit name, no setting.
cgroup <controller> <cgroup-name>                # Explicit name, no setting.
cgroup <controller> <key> <value>                # Implicit name with setting.
cgroup <controller> <cgroup-name> <key> <value>  # Explicit name with setting.
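The four forms can be told apart by the number of arguments alone. The following is a minimal Python sketch of that idea, not Upstart's actual parser: the names `CgroupStanza` and `parse_cgroup_stanza` are hypothetical, and it assumes shell-style quoting and `#` comments.

```python
import shlex
from typing import NamedTuple, Optional


class CgroupStanza(NamedTuple):
    """Hypothetical container for the fields of one cgroup stanza."""
    controller: str
    name: Optional[str]   # None means "use the implicit default name"
    key: Optional[str]
    value: Optional[str]


def parse_cgroup_stanza(line: str) -> CgroupStanza:
    """Split a stanza line and map its arguments onto the four forms."""
    # shlex handles double-quoted names/values and strips '#' comments.
    tokens = shlex.split(line, comments=True)
    if not tokens or tokens[0] != "cgroup":
        raise ValueError("not a cgroup stanza: %r" % line)
    args = tokens[1:]
    if len(args) == 1:    # cgroup <controller>
        return CgroupStanza(args[0], None, None, None)
    if len(args) == 2:    # cgroup <controller> <cgroup-name>
        return CgroupStanza(args[0], args[1], None, None)
    if len(args) == 3:    # cgroup <controller> <key> <value>
        return CgroupStanza(args[0], None, args[1], args[2])
    if len(args) == 4:    # cgroup <controller> <cgroup-name> <key> <value>
        return CgroupStanza(args[0], args[1], args[2], args[3])
    raise ValueError("wrong number of arguments: %r" % line)
```

Because a key is always followed by a value, two arguments can only mean a controller plus a name, and three can only mean a controller plus a key/value pair, so the grammar is unambiguous.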
Details on the fields:
<controller>: Name of one of the cgroup controllers.
Currently the valid values are (but won't be hardcoded into upstart):
<cgroup-name>: Name of the cgroup to use (it is created if it does not exist).
The name may contain a slash (e.g. "db/pgsql"), indicating that a sub-cgroup is requested. The name may also contain spaces, in which case it must be quoted.
The name may contain any of the usual upstart variables. An extra variable, $UPSTART_CGROUP, is also defined. It is only valid within the scope of the cgroup stanza and is equivalent to $UPSTART_JOB-$UPSTART_INSTANCE with any "/" in those variables replaced by "_" (similar to what's done for logging).
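As an illustration of the substitution described above, here is a minimal Python sketch. The function names are hypothetical, and it follows the description literally (so an empty instance would leave a trailing "-"); how Upstart will actually treat that case is not stated here.

```python
def upstart_cgroup_name(job: str, instance: str) -> str:
    """Build the value of $UPSTART_CGROUP from the job and instance names."""
    # Join as $UPSTART_JOB-$UPSTART_INSTANCE, then make it a single path level.
    return "{}-{}".format(job, instance).replace("/", "_")


def expand_cgroup_name(name: str, job: str, instance: str) -> str:
    """Expand $UPSTART_CGROUP inside a cgroup name from the stanza."""
    return name.replace("$UPSTART_CGROUP", upstart_cgroup_name(job, instance))
```

For example, a job "db/pgsql" with instance "main" would get "db_pgsql-main", so the "/" in the job name does not accidentally request a sub-cgroup.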
If a name is not specified, Upstart will create a default cgroup for the job (equivalent to specifying a name of "$UPSTART_CGROUP") and move the job processes into it. Note that if a name is specified and that name does not begin with "$UPSTART_CGROUP", the job can join an existing, non-Upstart-created cgroup.
The main use of that field is for cases where a set of jobs should share limits. In such a case, the main job should declare the various values and the other jobs should just refer to the cgroup by name without defining any values.
The name may be different for the various controllers but may not differ within the same controller. Example:
valid =>
cgroup memory group1 limit_in_bytes 52428800
cgroup cpuset group2 cpus 0-1
invalid =>
cgroup memory group1 limit_in_bytes 52428800
cgroup memory group2 soft_limit_in_bytes 1024
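The rule illustrated by this example can be checked mechanically. The sketch below is a hedged Python illustration, not Upstart code: `find_name_conflicts` is a hypothetical helper that takes (controller, cgroup name) pairs for one job, with any omitted name already replaced by its default.

```python
def find_name_conflicts(stanzas):
    """Return the controllers for which a job uses more than one cgroup name.

    `stanzas` is an iterable of (controller, cgroup_name) pairs.
    """
    names = {}
    for controller, name in stanzas:
        names.setdefault(controller, set()).add(name)
    # A controller is in conflict when it was given two different names.
    return sorted(c for c, seen in names.items() if len(seen) > 1)
```

With the valid example the result is an empty list; with the invalid one it is `["memory"]`, since "group1" and "group2" are both used for the memory controller.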
<key>: The name of the cgroup control file minus the controller name. For example, memory.soft_limit_in_bytes becomes soft_limit_in_bytes.
<value>: Any value valid for the given control file. Upstart itself won't perform any validation.
If the value contains spaces, it should be enclosed in double quotes, e.g.:
cgroup devices $UPSTART_CGROUP allow "c 1:2 rwm"
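The relationship between the controller, the key and the control file name is a simple prefix rule. A minimal Python sketch follows; both function names are hypothetical and only illustrate the naming convention.

```python
def control_file(controller: str, key: str) -> str:
    """Rebuild the control file name from the stanza's controller and key."""
    return controller + "." + key


def split_control_file(filename: str):
    """Split a control file name into (controller, key) at the first dot."""
    controller, key = filename.split(".", 1)
    return controller, key
```

Splitting at the first dot only matters for keys that themselves contain dots, such as throttle.write_bps_device for the blkio controller.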
Upstart won't have any controller-aware logic in its code. Instead, it will simply talk over D-Bus (using a private D-Bus socket) to the cgroup manager, which will take care of applying the various limits. That cgroup manager will be started very early in the boot sequence. Any job containing a cgroup stanza will be held until the manager is started.
The cgroup will be destroyed when a job is stopped and the cgroup isn't shared with another job (task count is 0 and it has no child cgroup).
It'll be possible to disable cgroup support entirely by either building upstart without it (needed for non-Linux systems) or by passing --no-cgroup as a parameter to upstart. In that case, the cgroup stanza will simply be ignored and the jobs will start without limitations.
All of the above is also meant to apply to user sessions. The cgroup manager will allow unprivileged cgroup configuration, so as long as the user has write access to a sub-section of a controller, they will be allowed to write entries there. As with other restriction stanzas, failure to apply a cgroup limit in a user session won't be fatal.
New initctl for cgroup manager notification
A new initctl command tentatively called notify-cgroup-manager-address will be added. Example usage:
initctl notify-cgroup-manager-address $CGROUP_MANAGER_DBUS_ADDRESS
This follows the pattern of notify-disk-writeable and notify-dbus-address and will allow the cgroup manager job's post-start to notify Upstart of the D-Bus address it should connect to in order to talk to the cgroup manager.
Upstart will use this connection to ask the cgroup manager to create and destroy cgroups as and when jobs require them.
Cgroup creation and deletion
Cgroups will be created at the point they are needed: if a job specifies the "cgroup" stanza, Upstart (running as the forked child process to avoid blocking the main thread of execution) will attempt to create the cgroup immediately before any job process (exec, pre-start, et cetera) runs. It is not an error if the cgroup already exists. However, if the cgroup does not exist and cannot be created prior to starting the job process, the job will fail.
Upstart will maintain state such that once the last consumer of a particular cgroup has finished, Upstart will request that the cgroup be deleted.
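One way to picture the state involved is a map from each cgroup to the set of jobs currently using it. The Python sketch below is only an illustration under that assumption; `CgroupTracker` and its methods are hypothetical names, not part of Upstart.

```python
class CgroupTracker:
    """Track which jobs are consumers of each (controller, cgroup name) pair."""

    def __init__(self):
        self._consumers = {}  # (controller, name) -> set of job names

    def acquire(self, controller, name, job):
        """Record that `job` now uses the given cgroup."""
        self._consumers.setdefault((controller, name), set()).add(job)

    def release(self, controller, name, job):
        """Drop `job` as a consumer.

        Returns True when the last consumer has gone, i.e. when deletion of
        the cgroup should be requested from the cgroup manager.
        """
        key = (controller, name)
        users = self._consumers.get(key)
        if users is None:
            return False
        users.discard(job)
        if users:
            return False
        del self._consumers[key]
        return True
```

With two jobs sharing the memory cgroup "db", the first `release` returns False and only the second returns True.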
If a job specifies a sub-group that does not yet exist, Upstart will attempt to create any missing levels in the hierarchy. For example, if a job specifies:
cgroup cpuset db/$UPSTART_CGROUP cpus 1
If, when that job starts, the cgroup "db/" does not already exist, this will not be considered an error: Upstart will attempt to create it and will fail the job if any level in the hierarchy cannot be created for a reason other than EEXIST.
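The level-by-level creation can be sketched as follows. This Python example is an assumption-laden stand-in: it uses plain directories under a caller-supplied root to mimic the behaviour, whereas Upstart would send the requests to the cgroup manager over D-Bus. The function names are hypothetical.

```python
import os


def hierarchy_levels(name):
    """List every level of a cgroup name, outermost first."""
    parts = [p for p in name.split("/") if p]
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def create_hierarchy(root, name):
    """Create each missing level; an already existing level is not an error."""
    for level in hierarchy_levels(name):
        try:
            os.mkdir(os.path.join(root, level))
        except FileExistsError:
            # Equivalent of EEXIST: the level is already there, carry on.
            pass
        # Any other OSError propagates, which corresponds to failing the job.
```

For the name "db/pgsql-main" the levels are "db" followed by "db/pgsql-main", so the parent is always created before the child.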
D-Bus does not currently allow a DBusConnection to be serialised. Until that limitation is overcome, Upstart must serialise:
- $CGROUP_MANAGER_DBUS_ADDRESS, to allow it to reconnect to the cgroup manager post re-exec.
- Sufficient state such that post re-exec it will know which existing cgroups that it requested be created should be deleted once all consumers of that cgroup end.
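To make the two items above concrete, here is a hedged Python sketch of a round trip through JSON. The field names and both function names are hypothetical, and the address used in any example would be a placeholder; the actual serialisation format is not specified here.

```python
import json


def serialise_cgroup_state(manager_address, consumers):
    """Serialise the manager address and the consumers of each cgroup.

    `consumers` maps (controller, cgroup_name) to a set of job names.
    """
    return json.dumps({
        "cgroup_manager_address": manager_address,
        "cgroups": [
            {"controller": controller, "name": name, "jobs": sorted(jobs)}
            for (controller, name), jobs in sorted(consumers.items())
        ],
    })


def deserialise_cgroup_state(text):
    """Rebuild (manager_address, consumers) after a re-exec."""
    data = json.loads(text)
    consumers = {
        (entry["controller"], entry["name"]): set(entry["jobs"])
        for entry in data["cgroups"]
    }
    return data["cgroup_manager_address"], consumers
```

After deserialising, the address is enough to open a fresh connection, and the consumer map is enough to decide when a cgroup can be deleted.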
Now a few examples to try and illustrate the thoughts behind that proposal:
Single job simple example
cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800
The job will only start once the manager is up and running and will have a 50MB memory limit. If the system has less than 50MB of memory, the job will fail to start.
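The byte values in these examples are binary multiples, so "50MB" here means 50 x 1024 x 1024 bytes. A small Python sketch of the conversion (the helper name is hypothetical):

```python
def mib_to_bytes(mib: int) -> int:
    """Convert a size in binary megabytes to bytes."""
    return mib * 1024 * 1024


print(mib_to_bytes(50))   # 52428800, the limit_in_bytes value used above
print(mib_to_bytes(1))    # 1048576
```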
Single job complex example
cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800
cgroup cpuset $UPSTART_CGROUP cpus 0-1
cgroup blkio slowio throttle.write_bps_device "8:16 1048576"
The job will only start once the manager is up and running and will have a 50MB memory limit, be restricted to CPU ids 0 and 1 and have a 1MB/s write limit to the block device 8:16. The job will fail to start if the system has less than 50MB of RAM or less than 2 CPUs.
Multiple jobs complex example
Job 1:
cgroup cpuset db cpus 0-1
cgroup memory db limit_in_bytes 104857600
cgroup blkio db throttle.write_bps_device "8:16 1048576"
Job 2:
cgroup cpuset db/$UPSTART_CGROUP cpus 1
cgroup memory db/$UPSTART_CGROUP limit_in_bytes 52428800
cgroup blkio db/$UPSTART_CGROUP throttle.write_bps_device "8:17 1048576"
Job 3:
cgroup cpuset db
cgroup memory db
Job 4:
cgroup cpuset db/$UPSTART_CGROUP cpus 2
This is rather complex, so let's go job by job:
- Job 1 will start bound to CPU 0 and 1 with a 100MB memory limit and 1MB/s write limit to the 8:16 block device. It'll fail to start if the system has less than 2 CPUs or less than 100MB of RAM.
- Job 2 will start bound to CPU 1 with a 50MB memory limit. It'll inherit the 1MB/s write limit to 8:16 and, on top of that, rate limit writes to 8:17 at 1MB/s as well. The job will fail to start if the system has less than 50MB of RAM or less than 2 CPUs.
- Job 3 will start in the "db" cpuset and memory cgroups. If it starts before Job 1, no limit will be applied at startup time. As soon as Job 1 starts, however, Job 3 will be limited to 2 CPUs and 100MB of memory. As it doesn't have a blkio stanza, its I/O won't be rate limited.
- Job 4, if started after Job 1, will fail to start as it's requesting a CPU that the parent cgroup doesn't have access to. If started before Job 1, however, there won't be a parent value set, so it will inherit the default and will start so long as the system has at least 3 CPUs.
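The cpuset outcomes described for Job 2 and Job 4 come down to a subset check between the child's CPU list and the parent's. The Python sketch below illustrates that check only; the function names are hypothetical, and in practice the kernel enforces the rule and Upstart performs no such validation itself.

```python
def parse_cpu_list(spec):
    """Parse a cpuset list such as "0-1" or "0,2-3" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def child_cpus_allowed(parent_spec, child_spec):
    """True if every CPU requested by the child is available to the parent."""
    return parse_cpu_list(child_spec) <= parse_cpu_list(parent_spec)
```

With the parent "db" set to "0-1", a child asking for "1" (Job 2) is accepted, while a child asking for "2" (Job 4) is not.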
This feature depends on the implementation of the cgroup manager which is currently in progress.
It should be possible to disable the feature both at build time and at run time. When disabled, any cgroup stanza will be ignored without an error being raised.
When enabled, upstart will wait for a signal (to be determined) emitted by the cgroup manager once it's ready to process requests, at which point upstart will allow any job that uses cgroups and has its startup condition met to be started.
Communication with the cgroup manager is done over D-Bus using a private socket (it doesn't depend on the system bus).
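The holding behaviour, together with the build-time or run-time disable switch, can be sketched as a small gate. This Python example is illustrative only: `CgroupGate` and its methods are hypothetical names, and the readiness signal itself is still to be determined.

```python
class CgroupGate:
    """Hold jobs that use cgroups until the cgroup manager is ready."""

    def __init__(self, enabled=True):
        self.enabled = enabled        # False models a disabled feature
        self.manager_ready = False
        self._held = []

    def request_start(self, job, uses_cgroups):
        """Return True if `job` may start now, otherwise hold it."""
        if self.enabled and uses_cgroups and not self.manager_ready:
            self._held.append(job)
            return False
        # Feature disabled, no cgroup stanza, or manager already up.
        return True

    def manager_started(self):
        """Mark the manager as ready and return the jobs to release."""
        self.manager_ready = True
        released, self._held = self._held, []
        return released
```

Jobs without a cgroup stanza are never held, and when the feature is disabled the stanza has no effect on when a job starts.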
cgroup stanza syntax
As mentioned on the LXC mailing list, the cgroup syntax may change. We need to be very aware of this and ensure that a suitable abstraction for the cgroup stanza values is used if appropriate. Since the cgmanager authors are already discussing this issue with the cgroup kernel subsystem maintainer, we should however get this "for free" once the cgmanager spec is finalised.
An important consideration from the Upstart side is to ensure that Upstart does not block when requesting services from cgmanager. Ideally, the cgmanager would offer a callback-type interface to allow Upstart to handle cgroup creation/deletion events (both requested and indirectly notified) via its main loop.
initctl notify-cgroup-manager-address command
This follows the existing pattern used by notify-disk-writeable and notify-dbus-address. Dmitrijs has suggested we consider a generic "initctl notify <name> <value>" facility, which is a good idea as:
- it could simplify re-exec handling internally
- it's a more elegant solution to the problem of notifying upstart of important changes (we could alternatively use events, or implement namespaced events).
Inotify notification for cgmanager socket
It would be cleaner if we could use inotify to avoid the need for notify-cgroup-manager-address. However, if the well-known socket the cgmanager creates is abstract, we can't watch for it.
This specification has been developed with the input of the following people:
- Stéphane Graber (original author)
- James Hunt
- Serge Hallyn
- Steve Langasek