Memory Management and Circular References in Python – Part 1

Coming from the Java world, it was a big, unpleasant surprise to discover that Python and circular references are not friends. Sure, you can always find your way around them, but in general, it’s a PITA to deal with circular references. This series of posts will cover my trip into the wonderful world of garbage collection, circular references and finalizers in Python.

Finalizers
Finalizers in Python go wild when there is a circular reference around. Roughly speaking, a finalizer is a function that is called when an object is about to be destroyed/garbage collected. These are special methods that should be used with care, especially because there is no guarantee when they will be called (even in Java). Finalizers can be implemented by overriding the __del__ method of a class, but objects that override the __del__ method and that have a circular reference are not garbage collected until the circular references are manually broken. The recommended way of implementing a finalizer is to create a weak reference to the object that will call a function once there is no longer a strong reference to the object. But even this solution is not trouble-free. Read along to learn why.
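
To make the weak reference approach concrete, here is a minimal sketch. The names Resource, track and finalized are hypothetical and only serve the illustration. The callback closes over the object's name rather than the object itself, because a strong reference held by the callback would keep the object alive forever. The weak reference must also be kept alive by someone (here, the ref variable), otherwise the callback is never invoked.

```python
import gc
import weakref


class Resource(object):
    """Hypothetical class standing in for any object that needs cleanup."""

    def __init__(self, name):
        self.name = name


finalized = []  # names of the objects whose callback has run


def track(obj):
    # Copy what the callback needs: capturing obj itself would create a
    # strong reference and prevent the object from ever being collected.
    name = obj.name

    def on_finalize(ref):
        finalized.append(name)

    return weakref.ref(obj, on_finalize)


res = Resource('r1')
ref = track(res)   # keep the weak reference alive until the callback runs
del res            # drop the last strong reference
gc.collect()       # not needed with CPython reference counting, but harmless
print(finalized)
```

With CPython, the callback runs as soon as the last strong reference goes away; other interpreters may only run it at the next garbage collection cycle.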

Py4J Model
To illustrate how memory management can be tricky with Python, I will use a simplified representation of the Py4J model. Py4J enables Python programs to access objects residing in a Java Virtual Machine. The Java objects are represented by a JavaObject instance in Python while Java Methods are represented by a JavaMember instance in Python. Here is one possible implementation of JavaObject and JavaMember:

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}

    def __getattr__(self, name):
        if name not in self._methods:
            self._methods[name] = JavaMember(name)
        return self._methods[name]

class JavaMember(object):
    def __init__(self, name):
        self.name = name

    def __call__(self, *args):
        j = 0
        # Do some work. In Py4J, make a remote call to the Java method.
        for i in range(1, 10):
            j += i

# Example of use:
# my_object = JavaObject1('oid123')
# my_object.someJavaMethod() 
#   --> calls: JavaObject1.__getattr__, then calls JavaMember.__call__

In the previous example, JavaObject1 represents a Java object with an identifier. The first call to a method creates a JavaMember instance for that method and caches it in _methods; later calls reuse the cached instance.

Garbage Collection
Here comes the fun part. In object brokering systems like Py4J, there must be a way from one side to tell the other side that an object is no longer used.

Assume you created an instance in Python like this: my_obj = JavaObject1('oid123'). In the (real) Py4J, this has the effect of creating a reference to an object on the corresponding JVM. When my_obj is no longer referenced by Python, it will be garbage collected on the Python side, but the JVM will keep its own reference alive, creating a leak on the JVM side. There must therefore be a way to link Python’s garbage collection process with the JVM’s.

This is where finalizers come in handy: when Python is ready to collect a JavaObject instance, the instance’s finalizer should tell the JVM that it no longer needs to keep a reference to the corresponding Java object. This is what we do in the new version of JavaObject1. Instead of communicating with the JVM, we will simply increment an accumulator to keep track of the destroyed instances:

import time
import weakref

accumulator1 = 0

def inc1():
    global accumulator1
    accumulator1 += 1 

class JavaObject1(object):
    def __init__(self, id):
        self._id = id
        self._methods = {}
        self._wr = weakref.ref(self, lambda wr : inc1())  

    ...

def m1():
    for j in range(100):
        java_object = JavaObject1('o' + str(j))
        for i in range(10000):
            java_object.method1()

def timer(func,run):
    start = time.time()
    func()
    print(run + str(time.time() - start))
    print('acc1:' + str(accumulator1))

if __name__ == '__main__':
    timer(m1,'With JavaObject1: ')

When we create a new instance of JavaObject1, this instance creates a weak reference to itself with a callback that will be invoked when the instance is about to be garbage collected. We also defined a function that times the execution of (1) creating 100 JavaObject1 instances and (2) calling a method 100 * 10 000 times. This will be useful when we compare various implementations of JavaObject. If we run our timer function:

With JavaObject1: 1.8631708622
acc1:100

We see that the finalizers worked: 100 JavaObject1 instances were “destroyed”. But there is a problem with this scheme. Can you spot it? Here again is how one would use Py4J (see the next post for the solution):

my_obj = JavaObject1('oid123')
my_obj.method1() # Calls: JavaObject1.__getattr__, then calls JavaMember.__call__

Go to Part 2 of this series.

Context Resuming

Poor souls like me who work on their open source projects in their spare time sometimes suffer from a form of contextus resumis. You know, when you sit down, think of all the time you can finally spend on your favorite project, and then realize, horror-struck, that you don’t know how to resume the task you were working on two weeks ago?

This is particularly an issue when you are working on core tasks that affect most parts of the project and that have deep design implications. My guess is that they are also the kind of tasks that cause contextus resumis: small tasks can (and should) generally be completed in one coding session.

Sure, there are software solutions like Mylyn that can make your IDE look the way it did when you started to work on your task, but I always found that this kind of solution did not work well for system-wide tasks. What would Mylyn do? Open up all source files of Py4J? Anyway, Mylyn is not an option right now because it cannot connect to SourceForge’s Trac installations, a problem that has been known for 9 months now.

One obvious solution is to divide your big task into smaller tasks (I know, you wanted to shout this from the beginning). But I’m currently changing the network and threading models of Py4J, and these two models cannot be separated from each other. They also impact both the Java and the Python sides, and these changes are part of a bigger redesign effort to enable Java code to call back into Python code (more on this in the next post).

Do you have any tips or tricks to share?

Interacting with Eclipse through Py4J

One of the reasons I created Py4J is that I want to reuse the two Java projects I developed in recent years: Semdiff and Partial Program Analysis (PPA). These two technologies are built on top of Eclipse and I wanted a way to access Eclipse from a Python interpreter. Jython was not an option because I use a library, LXML, that is not compatible with Jython.

I created an Eclipse plug-in/feature/update site that embeds Py4J and that enables developers to access Eclipse. The update site will be released with Py4J 0.3, but early adopters can check out the relevant projects from the Subversion repository (look for projects starting with net.sf.py4j).

Once you include the net.sf.py4j plug-in in your dependencies, you can just create a GatewayServer instance like in the example on the front page.

Then, in Python, you can interact with Eclipse:

>>> from py4j.java_gateway import JavaGateway
>>> gateway = JavaGateway()
>>> ResourcePlugin = gateway.jvm.org.eclipse.core.resources.ResourcesPlugin
>>> workspaceRoot = ResourcePlugin.getWorkspace().getRoot()
>>> project1 = workspaceRoot.getProject('Project1')
>>> project1.isOpen()
True
>>> gateway.help(ResourcePlugin)
Help on class ResourcesPlugin in package org.eclipse.core.resources:

ResourcesPlugin extends org.eclipse.core.runtime.Plugin {
|  
|  Methods defined here:
|  
|  start(BundleContext) : void
...

Note

You do not need to add any other plug-in to your plug-in dependencies (e.g., org.eclipse.core.resources): Py4J can access any class defined in any plug-in loaded in Eclipse.

Instead of using the Py4J plug-in, you could just add the Py4J jar file to your Eclipse plug-in. If you use the jar file, you need to add the following property to your plug-in manifest file to make sure that Py4J can access the classes declared in other plug-ins:

Eclipse-BuddyPolicy: global

Indeed, in Eclipse, every plug-in has its own class loader, so Py4J cannot load the classes of other plug-ins by default. Adding this property enables Py4J to load plug-in classes, and you can even access plug-ins that are not in your plug-in’s dependencies.

Experimenting with protobuf

I just spent a couple of hours experimenting with Protobuf to see if it could replace the current text-based protocol used by Py4J. Protobuf is a library from Google that makes it easy to serialize a structure composed of native fields (e.g., boolean, integer, double, string in UTF-8) into a binary stream. The structure can then be serialized/deserialized by programs written in Java, C++, Python, and .NET. I’ve been considering moving to Protobuf since the first version of Py4J but I wanted to invest my effort in user-visible features first.

After looking at the documentation of Protobuf and trying it, I found that the Java API is well developed, but that the Python API still lags behind (e.g., there is no built-in way in Python to send and receive a message over a stream with the size of the message first… This is a required feature if messages are exchanged over sockets). Although this is not a show stopper, I also did a small performance test where I serialized and deserialized 1000*3 messages using Protobuf and my custom text protocol using a Java Client and a Java Server over a local socket. I repeated this little experiment 10 times.
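
For readers who have not run into the problem, here is a minimal sketch of what “sending the size of the message first” means. It is not part of the Protobuf API: the helper names (write_message, read_message, read_exactly) are hypothetical, and the 4-byte big-endian length prefix is an assumption chosen for simplicity. An in-memory stream stands in for the socket.

```python
import io
import struct

# Assumption: every message is preceded by its size as a 4-byte,
# big-endian unsigned integer.
HEADER = struct.Struct('>I')


def write_message(stream, payload):
    """Write the size of the payload, then the payload itself."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def read_exactly(stream, size):
    """Read exactly size bytes; a stream may return less per read call."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError('stream closed before the message was complete')
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


def read_message(stream):
    """Read the size prefix, then the number of bytes it announces."""
    (size,) = HEADER.unpack(read_exactly(stream, HEADER.size))
    return read_exactly(stream, size)


# An in-memory buffer plays the role of the socket in this sketch.
buffer = io.BytesIO()
write_message(buffer, b'first')
write_message(buffer, b'second message')
buffer.seek(0)
messages = [read_message(buffer), read_message(buffer)]
print(messages)
```

Without the prefix, the receiver has no way to know where one serialized message ends and the next one begins, because the bytes of consecutive messages simply follow each other on the stream.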

To my surprise, there was no significant time difference between the two: the text protocol was 500 ms faster, but over 44 seconds, this does not matter much to me right now. It should be noted that I tried to serialize worst-case messages for the text protocol. For example, a large integer is represented as multiple characters in the text protocol (e.g., 2’000’000 would take 7 bytes) whereas it takes no more than 4 bytes with Protobuf.
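
The size difference can be checked with a few lines of Python. This sketch assumes that the integer is stored in a field that uses Protobuf’s base-128 varint encoding, and it only encodes the value itself, not the tag that precedes each field. The encode_varint function is a hypothetical helper written for this illustration, not a function of the Protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError('this sketch only handles non-negative integers')
    encoded = bytearray()
    while True:
        low_bits = value & 0x7F   # take the 7 least significant bits
        value >>= 7
        if value:
            encoded.append(low_bits | 0x80)  # more bytes follow
        else:
            encoded.append(low_bits)         # last byte
            return bytes(encoded)


number = 2000000
as_text = str(number).encode('ascii')
as_varint = encode_varint(number)

print(len(as_text))    # number of bytes in the text representation
print(len(as_varint))  # number of bytes in the varint representation
```

Small values tell a different story: any integer below 10 takes one byte in both representations, which is consistent with typical messages not being larger in the text protocol.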

Before doing this performance test, I also tried to serialize typical messages that are sent with Py4J in my unit tests and they were always smaller in size when serialized with my text protocol.

I have little doubt that, in the long term, Protobuf could outperform my text protocol. But as long as the Python API does not improve, I don’t have the motivation to spend long hours converting my protocol for no obvious performance gain instead of developing more useful features (e.g., callbacks).

Welcome to Py4J’s development blog

Welcome to Py4J’s development blog. We will post news and stories about Py4J here. Because Py4J is written in Java and Python, expect a fair amount of ramblings about the difference between these two languages 🙂

If you want to comment on Py4J or discuss the direction the project is taking, do not hesitate to post a comment on this blog, write an email to the mailing list or file a feature request.