Garbage Collection for Python

Portable Garbage Collection

Summary

Reference cycles involving lists, tuples, instances, classes, dictionaries, and functions are found. Instances with __del__ methods are handled in a sane way. It is easy to add GC support to new types. GC enabled Python is binary compatible with regular Python.

Generational collection works (currently three generations). The overhead measured by pybench is about 4 percent. Virtually all extension modules should work unchanged (I had to modify new and cPickle in the standard distribution). A new module called gc is available for tuning the collector and setting debugging options.

The collector should be portable across all platforms. The patched version of Python passes all regression tests and runs Grail, Idle and Sketch without problems.

The portable garbage collection has been included in Python since version 2.0. It is enabled by default. Be happy.

Why do we need it?

The current version of Python uses reference counting to keep track of allocated memory. Each object in Python has a reference count which indicates how many objects are pointing to it. When this reference count reaches zero the object is freed. This works well for most programs. However, there is one fundamental flaw with reference counting and it is due to something called reference cycles. The simplest example of a reference cycle is one object that refers to itself. For example:
    >>> l = []
    >>> l.append(l)
    >>> del l
The reference count for the list created is now one. However, since it cannot not be reached from inside Python and cannot possibly be used again, it should be considered garbage. In the current version of Python, this list will never be freed.

Creating reference cycles is usually not good programming practice and can almost always be avoided. However, sometimes it is difficult to avoid creating reference cycles and other times the programmer does not even realize it is happening. For long running programs such as servers this is especially troublesome. People do not want their servers to run out of memory because reference counting failed to free unreachable objects. For large programs it is difficult to find how reference cycles are being created.

What about "traditional" Garbage Collection?

Traditional garbage collection (eg. mark and sweep or stop and copy) usually works as follows:
  1. Find the root objects of the system. These are things like the global environment (like the __main__ module in Python) and objects on the stack.
  2. Search from these objects and find all objects reachable from them. This objects are all "alive".
  3. Free all other objects.
Unfortunately this approach cannot be used in the current version of Python. Because of the way extension modules work, Python can never fully determine the root set. If the root set cannot be determined accurately we risk freeing objects still referenced from somewhere. Even if extension modules were designed differently, the is no portable way of finding what objects are currently on the C stack. Also, reference counting provides some nice benefits in terms of locality of memory reference and finalizer semantics that Python programmers have come to expect. What would be best is if we could find a way of still using reference counting but also free reference cycles.

How does this Approach Work?

Conceptually, this approach does the opposite of traditional garbage collection. Instead of trying to find all reachable objects it tries to find unreachable objects. This is much safer because if the algorithm fails we are no worse off than with no garbage collection (except for the time and space wasted).

Since we are still using reference counting, the garbage collector only has to find reference cycles. The reference counting will handle freeing other types of garbage. First we observe that reference cycles can only be created by container objects. These are objects which can hold references to other objects. In Python lists, dictionaries, instances, classes, and tuples are all examples of container objects. Integers and strings are not containers. With this observation we realize that non-container objects can be ignored for the purposes of garbage collection. This is a useful optimization because things like integers and strings should be fast and small.

Our idea now is to keep track of all container objects. There are several ways that this can be done but one of the best is using doubly linked lists with the link fields inside the objects structure. This allows objects to be quickly inserted and removed from the set as well as not requiring extra memory allocations. When a container is created it is inserted into this set and when deleted it is removed.

Now that we have access to all the container objects, how to we find reference cycles? First we add another field to container objects in addition to the two link pointers. We will call this field gc_refs. Here are the steps to find reference cycles:

  1. For each container object, set gc_refs equal to the object's reference count.
  2. For each container object, find which container objects it references and decrement the referenced container's gc_refs field.
  3. All container objects that now have a gc_refs field greater than one are referenced from outside the set of container objects. We cannot free these objects so we move them to a different set.
  4. Any objects referenced from the objects moved also cannot be freed. We move them and all the objects reachable from them too.
  5. Objects left in our original set are referenced only by objects within that set (ie. they are inaccessible from Python and are garbage). We can now go about freeing these objects.

The trouble with Finalizers

One problem with our grand scheme is the use of finalizers. In Python these are __del__ methods on instances. With reference counting finalizers work fine. When an object's reference count goes to zero the finalizer is called before the object is freed. This is straight forward and easily understood by programmers.

With garbage collection, calling finalizers becomes a tricker problem, especially in the face of reference cycles. If two objects in a reference cycle both have finalizers, what should be done? Which should be called first? After calling the first finalizer the object cannot be freed as the second finalizer still may access it.

Since there is no good solution to this problem, cycles that are referenced from objects with finalizers are not freed. Instead these objects are added to a global list of uncollectable garbage. It should almost always be possible for the program to be rewritten so that this does not occur. As a last resort, the program can access the global list and free cycles in a way that makes sense for application.

What are the costs?

As some people say, there is no such thing as a free lunch. However, this form of garbage collection is fairly cheap. One of the biggest costs is the three extra words of memory required for each container object. There is also the overhead of maintaining the set of containers. For the current version of the collector this overhead accounts for about a 4% slowdown according to pybench.

The collector currently keeps track of three generations of objects. By adjusting the collection parameters the time spent collecting can be made as small as desired. For some applications it may make sense to disable automatic collection and explicitly call the collection routine. However, with the default collection parameters, the time spent collecting does not seem to be significant when running pybench. Obviously applications that allocate container objects extensively will cause more time to be spent collecting.

The current patch adds a new configure option to enable the collector. Python with the collector is binary compatible with standard Python. If the option is disabled there is no effect on the workings of the Python interpreter.

How can I use it?

Just download a current version of Python. The garbage collector has been included since version 2.0 and is enabled by default. If you are using Python 1.5.2 there is an old patch available that may work. If you are on Windows you can download a replacement python15.dll.

Boehm-Demers Conservative Garbage collection

This patch adds modifies Python 1.5.2 to use the Boehm-Demers conservative garbage collector. You must apply this patch first however. Reference counting is still used. The garbage collector only frees memory that the reference counting does not (ie. reference cycles). This seems to provide the best performance. To use:
    $ cd Python-1.5.2
    $ patch -p1 < ../gc-malloc-cleanup.diff
    $ patch -p1 < ../gc-boehm.diff
    $ autoconf
    $ ./configure --with-gc
The patch assumes you have libgc.a installed somewhere so that linking with -lgc works (/usr/local/lib should be okay). If you don't have the library, download and install it before compiling.

Currently the patch is only tested on Linux. It will probably work on other Unix machines as well. On my Linux machine, the GC version of Python passes all regression tests.

Other Stuff

If you are trying to fix a memory leaking Python program, Tim Peter's Cyclops module may also be useful. It is available on the Python ftp site.

Thanks to Narihiro Nakamura, there is a Japanese translation of this page available.

Updated: Wed, 06 Dec 2000 10:36:38 -0800
Neil Schemenauer <nas@arctrix.com>