Py4J, a bidirectional bridge between Python and Java, has come a long way since the first release in December 2009 and yet, almost 7 years later, it still hasn’t reached the mythical 1.0 release. Let’s make sure that by December 2016, we reach this important milestone!
I released Py4J 0.10.0 in April 2016 and this marked an important milestone from a project maintenance perspective: Py4J now has coding conventions and a reliable build process with automated code quality checks for both Python and Java. The Java API also relies more on interfaces, and the architecture of what Py4J 1.0 should look like is finally stabilizing. Here are the two final steps I am planning to go through to get to 1.0:
Py4J 0.11 – Planned release date: August 2016
In addition to small features and bug fixes, there are two main features I want to add in this release:
Efficient transfer of binary data
A contributor already added a feature that allows the Python side to read raw bytes sent by the Java side through the same socket used to exchange commands between Python and Java. I want to build on this feature by improving its API and by providing the equivalent in the reverse direction (Java reading raw bytes sent by Python). My hope is that other contributors can then extend it to implement things such as transferring large numpy arrays between Python and Java.
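To make the idea concrete, here is a minimal sketch of one common way to move raw bytes over a socket: prefix each payload with its length. This is not the Py4J protocol or API. The 4-byte header and the names send_bytes, recv_bytes and demo_roundtrip are assumptions made up for this illustration, and both ends run in the same Python process instead of a Python/Java pair.

```python
import socket
import struct
import threading

# Hypothetical framing: a 4-byte big-endian length, then the raw payload.
HEADER = struct.Struct("!I")


def send_bytes(sock, payload):
    """Send one length-prefixed binary payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exactly(sock, size):
    """Read exactly `size` bytes, looping because recv may return less."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("socket closed before the payload arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_bytes(sock):
    """Receive one length-prefixed binary payload."""
    (size,) = HEADER.unpack(_recv_exactly(sock, HEADER.size))
    return _recv_exactly(sock, size)


def demo_roundtrip(payload):
    """Send `payload` through a local socket pair and read it back."""
    left, right = socket.socketpair()
    # Send from a thread so a large payload cannot block the reader.
    sender = threading.Thread(target=send_bytes, args=(left, payload))
    sender.start()
    try:
        return recv_bytes(right)
    finally:
        sender.join()
        left.close()
        right.close()
```

Because the payload never goes through a text encoding, the same framing would work for any byte buffer, including the memory backing a large array.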
Stress testing – Memory leaks, thread leaks, connectivity testing
A few bugs reported in past releases (0.9.2, 0.10.0, and 0.10.1) made me realize (1) how diverse the use cases of Py4J were and (2) how easy it was to make programming mistakes that would create memory leaks or negatively impact connectivity and performance. For example, one user reported he was quickly creating and shutting down ClientServer instances and another reported that calling a method with a 10 MB parameter was a standard use case for him. I already started creating a small benchmark to track the progression or regression of performance between releases, but I need to push this idea further by creating test suites that measure whether a release introduced a leak. I currently run these tests manually by modifying the code and I need a way to automate them, outside of the regular test suite.
I do not want to focus on performance too much for now, but having a robust benchmark and stress testing suite will be very helpful for a 1.0 release and will allow me or other contributors to work on performance once the API and architecture are stable.
Py4J 1.0 – Planned release date: December 2016
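As an illustration of what an automated leak check could look like, the sketch below runs an action many times and reports how much the thread count and the traced Python memory grew. It uses only the standard library. The measure_leaks helper and the two toy actions are hypothetical stand-ins: a real suite would call Py4J operations, such as creating and shutting down ClientServer instances, and would also have to watch the JVM side.

```python
import gc
import threading
import tracemalloc


def measure_leaks(action, iterations=200):
    """Run `action` repeatedly and report thread and memory growth."""
    action()  # warm-up, so one-time allocations are not counted
    gc.collect()
    threads_before = threading.active_count()
    tracemalloc.start()
    memory_before, _ = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        action()
    gc.collect()
    memory_after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "thread_growth": threading.active_count() - threads_before,
        "memory_growth_bytes": memory_after - memory_before,
    }


# Toy actions standing in for real Py4J calls (assumptions for this sketch).
_retained = []


def leaky_action():
    # Keeps a reference to every buffer, so memory grows on each call.
    _retained.append(bytearray(10_000))


def clean_action():
    # Allocates a buffer that is released as soon as the call returns.
    bytearray(10_000)


leaky_report = measure_leaks(leaky_action, iterations=100)
clean_report = measure_leaks(clean_action, iterations=100)
```

A stress test could then fail a release when either number exceeds a threshold chosen for that scenario.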
After 0.11.0 is released, I hope that users will try out the new binary/stream feature and I expect I’ll have to work on some of its kinks. There are three main areas I want to work on for 1.0:
Moving to the org.py4j namespace
Currently, the Java package is “py4j” and the Maven artifact is “net.sf.py4j” because artifacts are usually fully qualified and Py4J started its life on SourceForge. Between 0.8.2.1 and 0.9, I moved the Py4J project from SourceForge to py4j.org, so when I released Py4J as two OSGi bundles in 0.10.1, I selected names that were in line with the new project namespace: org.py4j.java and org.py4j.python.
1.0 is thus a good time to change the main Java package to org.py4j and move from net.sf.py4j to org.py4j.java on Maven. This should help standardize the Py4J name and it is more in line with usual Java coding conventions. This will unfortunately break the code of anyone using Py4J, but I expect the change to be straightforward for most users (add org. in your import statements or just use the Optimize/Organize Import feature in your IDE).
Deprecating some of the current API
One comment I often hear from users is that it’s difficult to wrap their head around the Py4J process and memory model. When users want to contribute, they also struggle with the names of the main classes. What is a “Gateway”? Why does the Python side have a “callback server” when Python is initiating the calls and Java is making the callbacks to Python? I experimented with a few names and I believe that the terms “PythonClient” and “JavaServer” for the Java side and “PythonServer” and “JavaClient” for the Python side are easier to understand for users than “JavaGateway”, “GatewayServer”, “CallbackClient”, and “CallbackServer”. They are also highly representative of the Py4J model compared to Jython or JPype.
With 1.0, I want the API to be more approachable and I want the classes that everyone uses to have more meaningful names. BUT I do not want to put too much pressure on existing users and I’ll do my best to keep the existing classes and just deprecate them with pointers to the new ones.
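One way to keep an old class while pointing users to its replacement is a thin subclass that emits a DeprecationWarning. The sketch below uses toy classes that only borrow the names discussed above. They are not the actual Py4J classes, and pairing “JavaGateway” with “JavaClient” is an assumption for illustration, since the final names are not decided.

```python
import warnings


class JavaClient:
    """Toy stand-in for a class under its new, clearer name."""

    def __init__(self, port=25333):
        self.port = port


class JavaGateway(JavaClient):
    """Old name, kept as a deprecated alias of the new class."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "JavaGateway is deprecated; use JavaClient instead",
            DeprecationWarning,
            stacklevel=2,  # report the caller's line, not this one
        )
        super().__init__(*args, **kwargs)


# Usage: existing code keeps working but now surfaces a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = JavaGateway(port=1234)
```

Existing code keeps running unchanged, and the warning message tells users which class to move to.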
If you have ideas or opinions on how to name the various classes, do not hesitate to hop on the mailing list. I’ll also be posting the names I believe are the best and gather feedback before making a final decision.
A new web site and new documentation
I believe that Py4J has relatively good documentation that strikes a balance between reference documentation (API doc, Javadoc) and a manual with how-tos and examples. But it utterly fails in guiding new users, especially if they are new to Java or Python, in creating their first program with Py4J.
I want to focus on the “getting started” experience more so that new users can do cool things very quickly in a few lines of code. Most of the existing documentation will be reused and reorganized, but having clear walkthroughs for beginners and the smallest possible working code example on the front page will help grow the user base AND hopefully decrease the number of questions I get related to how to run javac!
And let’s face it, responsive web sites were just beginning in 2009, but they are now the norm and py4j.org is unusable on a mobile device while RTD has already solved this problem 🙂
What does 1.0 mean?
I want releases after 1.0 to maintain backward compatibility for the main classes so users do not have to adapt their code. Py4J has been mostly backward compatible throughout its history, but the latest releases broke interfaces, a change needed to make the codebase more extensible.
After 1.0, there are a few areas I want to work on, but it will depend on the needs and interests of the community:
Performance
This is a very large topic because the use cases vary a lot and optimizing one use case might penalize another one. I have a few ideas though: using small caches to reduce the number of times we use the Java reflection API, exploring the use of a binary protocol (protobuf performance has improved a lot in Python since I last tried it), taking a few public use cases of Py4J and profiling them to find what is slow, and relying on efficient byte transfer to pass large integers, floats, etc.
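To illustrate the small-cache idea, here is a Python analogue: a bounded cache in front of a reflective lookup, so repeated lookups of the same member are resolved only once. In Py4J the cache would live on the Java side, in front of the reflection API. The resolve_method function and the lookup_count counter are hypothetical names used only in this sketch.

```python
from functools import lru_cache

lookup_count = 0  # counts how many real (uncached) lookups happened


@lru_cache(maxsize=256)  # small, bounded cache keyed by (class, name)
def resolve_method(cls, name):
    """Resolve a member by name, as a reflective lookup would."""
    global lookup_count
    lookup_count += 1
    return getattr(cls, name)


# The second call is served from the cache and skips the lookup.
first = resolve_method(str, "upper")
second = resolve_method(str, "upper")
```

Bounding the cache size matters here because, as noted above, optimizing one use case should not penalize another, for example by holding on to memory in long-running processes.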
Better support for JVM languages and Java 8
I sometimes get reports of people writing programs in Groovy or Scala and having a hard time using Py4J. I also get questions about “new” features of Java that do not have a clear mapping with Py4J. Having a few examples of programs in Groovy, Scala and programs using new Java features will go a long way toward increasing the feature set of Py4J.
The role of funding on Py4J
Let’s close this roadmap by mentioning the role of funding on Py4J. I work on Py4J in my spare time and with a young kid and a relatively new company, I don’t have much spare time! When a company like kichwacoders funds Py4J to see a feature implemented, it allows me to spend continuous hours at my work on Py4J and I can tackle difficult problems that cannot be divided into 30-minute/1-hour buckets. It benefits all users because I often have to think about the overall architecture and then, when I resume my work on Py4J in my spare time, I believe my contributions look more focused and directed toward a structured goal instead of looking like a bunch of unrelated patches and quick fixes.
I am not asking for donations: keep those for people who really need them. But if your company or institution uses Py4J and you want a new feature or you want to make sure a feature on the roadmap is implemented, consider contracting Resulto, the company I work for. We are an incorporated business and produce invoices, so it’s easy for a company to expense.
Comments? Questions on the roadmap?
If you have any comments to share about this roadmap or if you believe important features should be part of 1.0, do not hesitate to share your thoughts on the mailing list or privately at barthelemy at infobart dot com.
Thanks for your interest in Py4J: the community has grown a lot and I am trying my best to be a good project steward.