Experimenting with protobuf

I just spent a couple of hours experimenting with Protobuf to see if it could replace the current text-based protocol used by Py4J. Protobuf is a library from Google that makes it easy to serialize a structure composed of native fields (e.g., boolean, integer, double, string in UTF-8) into a binary stream. The structure can then be serialized/deserialized by programs written in Java, C++, Python, and .NET. I’ve been considering moving to Protobuf since the first version of Py4J but I wanted to invest my effort in user-visible features first.

After looking at the documentation of Protobuf and trying it, I found that the Java API is well developed, but that the Python API still lags behind (e.g., there is no built-in way in Python to send and receive a message over a stream with the size of the message first… This is a required feature if messages are exchanged over sockets). Although this is not a show stopper, I also did a small performance test where I serialized and deserialized 1000*3 messages using Protobuf and my custom text protocol using a Java Client and a Java Server over a local socket. I repeated this little experiment 10 times.

To my surprise, there was no significant time difference between the two: the text protocol was 500 ms faster, but over 44 seconds, this does not matter much to me right now. It should be noted that I tried to serialize worst-case messages for the text protocol. For example, a large integer is represented as multiple characters in the text protocol (e.g., 2’000’000 would take 7 bytes) whereas it takes no more than 4 bytes with Protobuf.

Before doing this performance test, I also tried to serialize typical messages that are sent with Py4J in my unit tests and they were always smaller in size when serialized with my text protocol.

There is no doubt in my mind that in the long term, Protobuf might outperform my text protocol. But as long as the Python API does not improve, I don’t have the motivation to spend long hours converting my protocol for no obvious performance gain instead of developing more useful features (e.g., callbacks).

Welcome to Py4J’s development blog

Welcome to Py4J’s development blog. We will post news and stories about Py4J here. Because Py4J is written in Java and Python, expect a fair amount of ramblings about the difference between these two languages 🙂

If you want to comment on Py4J or discuss the direction the project is taking, do not hesitate to post a comment on this blog, write an email to the mailing list or fill a feature request.