Summary

At the November vUDS we discussed adding support for cgroups in Upstart.

Before going into the details of the proposed stanza and its overall behaviour, note that, contrary to some other init systems, our intent is solely to provide resource control, which is the main goal of cgroups. Process grouping and tracking will remain unaffected by the addition of cgroup support.

Rationale

Quite a few of our users expressed interest in applying resource restriction to Upstart jobs, both for system jobs and user jobs.

Scope

The scope of this work is limited to adding the new cgroup stanza and its integration with the cgroup manager API. We don't want to add too much cgroup-specific knowledge to Upstart, and we want the new stanza to feel as natural as possible.

Design

cgroup stanza

Cgroup support will be implemented by adding a new "cgroup" stanza which will control the application of cgroup-based restrictions to the job. The limits will be applied to all scripts.

The format for the stanza is:

cgroup <controller> [<cgroup-name>] [<key> <value>]

This allows the stanza to be specified in the following ways:

cgroup <controller>                               # Implicit name, no setting.
cgroup <controller> <cgroup-name>                 # Explicit name, no setting.
cgroup <controller> <key> <value>                 # Implicit name with setting.
cgroup <controller> <cgroup-name> <key> <value>   # Explicit name with setting.
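
As an illustration of how the four forms can be told apart, here is a minimal Python sketch. It assumes the forms are distinguished purely by the number of arguments after the `cgroup` keyword, and that quoting follows shell-like rules. The `CgroupStanza` type and `parse_cgroup_stanza` function are hypothetical names, not part of Upstart.

```python
import shlex
from typing import NamedTuple, Optional


class CgroupStanza(NamedTuple):
    """Hypothetical container for the fields of one cgroup stanza."""
    controller: str
    name: Optional[str]   # None means "use the implicit per-job cgroup"
    key: Optional[str]
    value: Optional[str]


def parse_cgroup_stanza(line: str) -> CgroupStanza:
    """Split one 'cgroup ...' line into its fields (illustrative only)."""
    # comments=True drops a trailing '# ...' comment; quotes group words,
    # so "c 1:2 rwm" stays a single value. Variables are left unexpanded.
    tokens = shlex.split(line, comments=True)
    if not tokens or tokens[0] != "cgroup":
        raise ValueError("not a cgroup stanza")
    args = tokens[1:]
    if len(args) == 1:      # cgroup <controller>
        return CgroupStanza(args[0], None, None, None)
    if len(args) == 2:      # cgroup <controller> <cgroup-name>
        return CgroupStanza(args[0], args[1], None, None)
    if len(args) == 3:      # cgroup <controller> <key> <value>
        return CgroupStanza(args[0], None, args[1], args[2])
    if len(args) == 4:      # cgroup <controller> <cgroup-name> <key> <value>
        return CgroupStanza(*args)
    raise ValueError("expected 1 to 4 arguments after 'cgroup'")
```

Under this assumption, a two-argument stanza is always read as controller plus name, and a three-argument stanza as controller plus key and value, so no knowledge of individual controllers is needed to parse it.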

Detail on the fields:

controller

Name of one of the cgroup controllers.

Currently the valid values are (but won't be hardcoded into upstart):

cgroup-name

Name of the cgroup to use (and create if non-existing)

The name may contain a slash (e.g. "db/pgsql"), indicating that it is requesting a sub-cgroup. The name may also contain spaces, in which case it must be quoted.

The name may contain any of the usual Upstart variables. An extra variable, $UPSTART_CGROUP, is also defined. It is only considered valid in the scope of the cgroup stanza and is equivalent to $UPSTART_JOB-$UPSTART_INSTANCE with any "/" in those variables replaced by "_" (similar to what's done for logging).
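
The substitution rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated; the function names are hypothetical, and how Upstart treats an empty instance name is not covered here.

```python
from string import Template


def default_cgroup_name(job: str, instance: str) -> str:
    """Value of $UPSTART_CGROUP: job and instance joined by '-', '/' -> '_'."""
    return f"{job}-{instance}".replace("/", "_")


def expand_cgroup_name(name: str, job: str, instance: str) -> str:
    """Expand $UPSTART_CGROUP inside a cgroup name (other variables untouched)."""
    return Template(name).safe_substitute(
        UPSTART_CGROUP=default_cgroup_name(job, instance)
    )
```

For a job named "db/pgsql" with instance "main", `expand_cgroup_name("db/$UPSTART_CGROUP", "db/pgsql", "main")` gives "db/db_pgsql-main": the slash written in the stanza still requests a sub-cgroup, while the slash from the job name has been replaced.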

If a name is not specified, Upstart will create a default cgroup for the job (equivalent to specifying a name of "$UPSTART_CGROUP") and move the job processes into it. Note that if a name is specified and that name does not begin with "$UPSTART_CGROUP", the job can join an existing, non-Upstart-created cgroup.

The main use of this field is for cases where a set of jobs should share limits. In such cases the main job should declare the various values, and the other jobs should just refer to the cgroup by name without defining values.

The name may be different for the various controllers but may not differ within the same controller. Example:

valid =>    cgroup memory group1 limit_in_bytes 52428800
            cgroup cpuset group2 cpus 0-1

invalid =>  cgroup memory group1 limit_in_bytes 52428800
            cgroup memory group2 soft_limit_in_bytes 1024
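
The "one name per controller" rule lends itself to a simple check. The following Python sketch assumes the stanzas of a job have already been reduced to (controller, name) pairs; `check_cgroup_names` is a hypothetical helper, and it does not attempt to treat an omitted name and an explicit "$UPSTART_CGROUP" as equal.

```python
from typing import Dict, Iterable, Tuple


def check_cgroup_names(stanzas: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return {controller: name}, or raise if a controller uses two names."""
    names: Dict[str, str] = {}
    for controller, name in stanzas:
        # setdefault stores the first name seen and returns it afterwards.
        first = names.setdefault(controller, name)
        if first != name:
            raise ValueError(
                f"controller '{controller}' uses both '{first}' and '{name}'"
            )
    return names
```

With the examples above, `[("memory", "group1"), ("cpuset", "group2")]` passes, while `[("memory", "group1"), ("memory", "group2")]` raises an error.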

key

The cgroup control file minus the controller name, so for example memory.soft_limit_in_bytes will become soft_limit_in_bytes.
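
In other words, the control file name is recovered by joining the controller and the key with a dot. A one-function Python sketch (the function name is hypothetical):

```python
def control_file(controller: str, key: str) -> str:
    """Rebuild the cgroup control file name from a stanza's controller and key."""
    return f"{controller}.{key}"
```

So `cgroup blkio slowio throttle.write_bps_device "8:16 1048576"` refers to the control file `blkio.throttle.write_bps_device`; dots inside the key are kept as they are.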

value

Any value valid for the given control file. Upstart itself won't perform any validation.

If the value contains spaces, it should be enclosed in double quotes, for example:

cgroup devices $UPSTART_CGROUP allow "c 1:2 rwm"

Upstart won't have any controller-aware logic in its code. Instead, it'll simply talk over D-Bus (using a private D-Bus socket) to the cgroup manager, which will take care of applying the various limits. That cgroup manager will be started very early in the boot sequence. Any job containing a cgroup stanza will be held until the manager is started.

The cgroup will be destroyed when a job is stopped and the cgroup isn't shared with another job (task count is 0 and it has no child cgroup).

It'll be possible to disable cgroup support entirely by either building upstart without it (needed for non-Linux systems) or by passing --no-cgroup as a parameter to upstart. In that case, the cgroup stanza will simply be ignored and the jobs will start without limitations.

All of the above is also meant to apply to user sessions. The cgroup manager will allow unprivileged cgroup configuration, so as long as the user has write access to a sub-section of a controller, it'll be allowed to write entries there. Similarly to other restriction stanzas, failure to apply a cgroup limit in a user session won't be fatal.

New initctl for cgroup manager notification

A new initctl command tentatively called notify-cgroup-manager-address will be added. Example usage:

initctl notify-cgroup-manager-address $CGROUP_MANAGER_DBUS_ADDRESS

This follows the pattern of notify-disk-writeable and notify-dbus-address and will allow the Cgroup controller's post-start to notify Upstart of the D-Bus address it should connect to in order to talk to the Cgroup controller.

Upstart will use this connection to request the controller create and destroy cgroups as and when jobs require them.

Cgroup creation, deletion, entry and setting values

Cgroups will be created at the point they are needed: that is if a job specifies the "cgroup" stanza, Upstart will fork the child process, perform the initial child setup, create the cgroups, move the pid into the group and exec the requested program now running in the correct group.

Deletion will be handled by the cgmanager automatically when the cgroup becomes empty (see https://bugs.launchpad.net/ubuntu/+source/cgmanager/+bug/1281683). This avoids the need for PID 1 to retain state on whether Upstart created the cgroup on behalf of a job, and thus whether the group should be deleted once all the jobs that are using the group end. (This is especially convenient since the design below does not nominally allow sufficient feedback from the child to the parent on whether cgroups existed before the child called Create() or not.)

The setup phase in more detail:

So in the worst possible case where the cgmanager is not responding, we'll get a build-up of jobs in state "JOB_CGROUP" which will not block PID 1, but which can be killed.

Additional work item

It is not an error if the cgroup already exists. However, if the cgroup does not exist and cannot be created prior to starting the job process, the job will fail.

Upstart will maintain state such that once the last consumer of a particular cgroup has finished, Upstart will request that the cgroup be deleted.

Sub-cgroups

If a job specifies a sub-group that does not yet exist, Upstart will attempt to create any missing levels in the hierarchy. For example, if a job specifies:

cgroup cpuset db/$UPSTART_CGROUP cpus 1

If the cgroup "db/" does not already exist when that job starts, this will not be considered an error. Upstart will attempt to create it and will fail the job if any level in the hierarchy cannot be created for a reason other than EEXIST.
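
The level-by-level creation with EEXIST tolerated can be sketched as follows. This is only an analogy: it uses plain directories and `os.mkdir` as a stand-in, whereas in this design Upstart asks the cgroup manager over D-Bus to create the groups. `ensure_cgroup_path` and the "myjob-main" name are hypothetical.

```python
import os
import tempfile


def ensure_cgroup_path(root: str, name: str) -> None:
    """Create every missing level of 'name' under 'root', one level at a time."""
    path = root
    for level in name.split("/"):
        if not level:
            continue  # ignore empty components such as a trailing '/'
        path = os.path.join(path, level)
        try:
            os.mkdir(path)
        except FileExistsError:
            pass  # EEXIST: the level is already there, which is not an error
        # Any other OSError propagates; the caller would fail the job.


# Usage: the second call finds both levels present and still succeeds.
with tempfile.TemporaryDirectory() as root:
    ensure_cgroup_path(root, "db/myjob-main")
    ensure_cgroup_path(root, "db/myjob-main")
    created = os.path.isdir(os.path.join(root, "db", "myjob-main"))
```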

re-exec considerations

D-Bus does not currently allow a DBusConnection to be serialised. Until that limitation is overcome, Upstart must serialise:

Use cases

Now a few examples to try and illustrate the thoughts behind that proposal:

Single job simple example

Job

cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800

Result

The job will only start once the manager is up and running and will have a 50MB memory limit. If the system has less than 50MB, the job will fail to start.
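
The byte values used in these examples are whole multiples of 1024 × 1024. A short Python check of the arithmetic (the helper name is hypothetical):

```python
BYTES_PER_MIB = 1024 * 1024  # 1048576


def mib_to_bytes(mib: int) -> int:
    """Convert a size in units of 1024 * 1024 bytes to a byte count."""
    return mib * BYTES_PER_MIB


limit_50 = mib_to_bytes(50)    # 52428800, the limit_in_bytes value above
limit_100 = mib_to_bytes(100)  # 104857600
one_mib = mib_to_bytes(1)      # 1048576
```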

Single job complex example

Job

cgroup memory $UPSTART_CGROUP limit_in_bytes 52428800
cgroup cpuset $UPSTART_CGROUP cpus 0-1
cgroup blkio slowio throttle.write_bps_device "8:16 1048576"

Result

The job will only start once the manager is up and running and will have a 50MB memory limit, be restricted to CPU ids 0 and 1 and have a 1MB/s write limit to the block device 8:16. The job will fail to start if the system has less than 50MB of RAM or fewer than 2 CPUs.

Multiple jobs complex example

Job 1

cgroup cpuset db cpus 0-1
cgroup memory db limit_in_bytes 104857600
cgroup blkio db throttle.write_bps_device "8:16 1048576"

Job 2

cgroup cpuset db/$UPSTART_CGROUP cpus 1
cgroup memory db/$UPSTART_CGROUP limit_in_bytes 52428800
cgroup blkio db/$UPSTART_CGROUP throttle.write_bps_device "8:17 1048576"

Job 3

cgroup cpuset db
cgroup memory db

Job 4

cgroup cpuset db/$UPSTART_CGROUP cpus 2

Result

This is rather complex, so let's go job by job:

Development

This feature depends on the implementation of the cgroup manager which is currently in progress.

It should be possible to disable the feature both at build time and at run time. When disabled, any cgroup stanza will be ignored without an error being raised.

When enabled, Upstart will wait for a signal (to be determined) to be emitted by the cgroup manager once it's ready to process requests, at which point Upstart will allow any job that has its startup condition met and uses cgroups to be started.

Communication to the cgroup manager is done over DBus using a private socket (doesn't depend on the system bus).

Potential Issues

cgroup stanza syntax

As mentioned on the LXC mailing list, the cgroup syntax may change. We need to be very aware of this and ensure that a suitable abstraction for the cgroup stanza values is used if appropriate. Since the cgmanager authors are already discussing this issue with the cgroup kernel subsystem maintainer, we should however get this "for free" once the cgmanager spec is finalised.

Non-blocking calls

An important consideration from the Upstart side is to ensure that Upstart does not block when requesting services from cgmanager. Ideally, the cgmanager would offer a callback-type interface to allow Upstart to handle cgroup creation/deletion events (both requested and indirectly notified) via its main loop.

initctl notify-cgroup-manager-address command

This follows the existing pattern used by notify-disk-writeable and notify-dbus-address. Dmitrijs has suggested we consider a generic "initctl notify <name> <value>" facility, which is a good idea as:

Inotify notification for cgmanager socket

It would be cleaner if we could use inotify to avoid the need for notify-cgroup-manager-address. However, if the well-known socket the cgmanager creates is abstract, we can't watch for it.

References

Contributors

This specification has been developed with the input of the following people:


Cgroup (last edited 2014-04-01 18:54:37 by jamesodhunt)