Sunday, August 4, 2013

How to support both Python 2 and 3

I'll start with the conclusion: making a backwards incompatible version of a language is a terrible idea, and it was a bad mistake. This mistake was somewhat corrected over the years by eventually adding features to both Python 2.7 and 3.3 that actually make it possible to run a single code base on both Python versions, an approach that, as I show below, was discouraged by both Guido and the official Python documents (though the latest docs mention it). Nevertheless, a single code base fixes pretty much all the problems, and it actually is fun to use Python again. The rest of this post explains my conclusion in great detail. My hope is that it will be useful to other Python projects by providing tips and examples of how to support both Python 2 and 3, as well as to future language designers as an argument for keeping languages backwards compatible.

When Python 3.x got released, it was pretty much a new language, backwards incompatible with Python 2.x, as it was not possible to run the same source code in both versions. I was extremely unhappy about this situation, because I simply didn't have time to port all my Python code to a new language.

I read the official documentation about how the transition should be done, quoting:

You should have excellent unit tests with close to full coverage.

  1. Port your project to Python 2.6.
  2. Turn on the Py3k warnings mode.
  3. Test and edit until no warnings remain.
  4. Use the 2to3 tool to convert this source code to 3.0 syntax. Do not manually edit the output!
  5. Test the converted source code under 3.0.
  6. If problems are found, make corrections to the 2.6 version of the source code and go back to step 3.
  7. When it's time to release, release separate 2.6 and 3.0 tarballs (or whatever archive form you use for releases).

I've also read Guido's blog post, which repeats the above list and adds an encouraging comment:

Python 3.0 will break backwards compatibility. Totally. We're not even aiming for a specific common subset.

In other words, one has to maintain a Python 2.x code base and then run the 2to3 tool to get it converted. If you want to develop using Python 3.x, you can't, because all code must be developed using 2.x. As to the actual porting, Guido says in the above post:

If the conversion tool and the forward compatibility features in Python 2.6 work out as expected, steps (2) through (6) should not take much more effort than the typical transition from Python 2.x to 2.(x+1).

So sometime in 2010 or 2011 I started porting SymPy, which is now a pretty large code base (sloccount says over 230,000 lines of code, and in January 2010 it said almost 170,000 lines). I remember spending a few full days on it, and I just gave up, because it wasn't just changing a few things, but pretty fundamental things inside the code base. One cannot do it half-way: one has to get all the way through and then polish it up. We ended up using one full Google Summer of Code project for it; you can read the final report. I should mention that we use metaclasses and other things that make such porting harder. Conclusion: this was definitely not "the typical transition from Python 2.x to 2.(x+1)".

Ok, after months of hard work by a lot of people, we finally have a Python 2.x code base that can be translated using the 2to3 tool, and the result works and passes the tests in Python 3.x.

The next problem is that Python 3.x is pretty much like a ghetto -- you can use it as a user, but you can't develop in it. The 2to3 translation takes over 5 minutes on my laptop, so any interactivity is gone. It is true that the tool can cache results, so the next pass is somewhat faster, but in practice this still turns out to be much much worse than any compilation of C or Fortran programs (done for example with cmake), both in terms of time and in terms of robustness. And I am not even talking about pip issues or issues regarding calling 2to3. What a big mess... Programming should be fun, but this is not fun.

I'll be honest, this situation killed a lot of my enthusiasm for Python as a platform. I learned modern Fortran in the meantime, and with admiration I noticed that it still compiles old F77 programs without modification. I even managed to compile a 40 year old pre-F77 code with just minimal modifications (I had to port the code to F77). Yet modern Fortran is pretty much a completely different language, with all the fancy features that one would want. Together with my colleagues I created a website where you can compare Python/NumPy side by side with modern Fortran. It's pretty much a 1:1 translation with a similar syntax (for numerical code), except that you need to add types, of course. Yet Fortran is fully backwards compatible. What a pleasure to work with!

Fast forward to last week. A heroic effort by Sean Vig, who ported SymPy to a single code base (#2318), was merged. Earlier this year, similar pull requests by other people converted the NumPy (#3178, #3191, #3201, #3202, #3203, #3205, #3208, #3216, #3223, #3226, #3227, #3231, #3232, #3235, #3236, #3237, #3238, #3241, #3242, #3244, #3245, #3248, #3249, #3257, #3266, #3281, #3191, ...) and SciPy (#397) code bases as well. Now all these projects have just one code base, and it works in all Python versions (2.x and 3.x) without the need to call the 2to3 tool.

With a single code base, programming in Python is fun again. You can choose any Python version, be it 2.x or 3.x, and simply submit a patch. The patch is then tested using Travis-CI to make sure that it works in all Python versions. Installation has been simplified (no need to call any 2to3 tools and no more hacks to get it working).
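To give a flavor of what such code looks like, here is a minimal sketch (not taken from SymPy, NumPy or SciPy; the function names `mean` and `describe` are made up for illustration). It relies only on `__future__` imports that are available in Python 2.6+ and are accepted as no-ops in Python 3:

```python
# The same file runs unchanged on Python 2.7 and 3.3+; no 2to3 step involved.
from __future__ import division, print_function, unicode_literals


def mean(values):
    """Arithmetic mean. `/` is true division on both versions because of
    the `division` future import."""
    values = list(values)
    return sum(values) / len(values)


def describe(mapping):
    """Format key/value pairs. `.items()` exists on both versions, unlike
    the Python 2 only `.iteritems()`."""
    return ", ".join("{0}={1}".format(k, v)
                     for k, v in sorted(mapping.items()))


if __name__ == "__main__":
    # print is a function on both versions because of `print_function`.
    print(mean([1, 2]))
    print(describe({"b": 2, "a": 1}))
```

The idea is to restrict yourself to the syntax and library calls that both versions understand, and let the future imports smooth over the most visible differences.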

In other words, this is how it should be: you write your code once, and you can use any supported language version to run it, compile it, or develop in. But for some reason, this obvious solution has been discouraged by Guido and the official Python documents, as seen above. I just looked up the latest official Python docs, and they are not outright negative about a single code base. But they still do not recommend it as the preferred approach. So let me fix that: I do recommend a single code base as the solution.

The newest Python documentation mentioned in the last paragraph also says:

Regardless of which approach you choose, porting is not as hard or time-consuming as you might initially think.

Well, I encourage you to browse through the pull requests that I linked to above for SymPy, NumPy or SciPy. I think it is very time consuming, and that's just converting from 2to3 to a single code base, which is the easy part. The hard part was to actually get SymPy to work with Python 3 (as I discussed above, that took a couple of months of hard work), and I am pretty sure it was pretty hard to port NumPy and SciPy as well.

The docs also say:

It /single code base/ does lead to code that is not entirely idiomatic Python

That is true, but our experience has been that with every Python version that we drop, we also delete lots of ugly hacks from our code base. This has been true for dropping support for 2.3, 2.4 and 2.5, and I expect it will also be true for dropping 2.6 and especially 2.7, when we can simply use the Python 3.x syntax. So not a big deal overall.
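One common way to keep such hacks contained is to put the version-dependent names into one small compatibility module, so that dropping a Python version later mostly means deleting a branch there. The following is a hedged sketch of that pattern, not the actual module of any of the projects above; the names `string_types`, `integer_types` and `text_type` follow the ones used by the six library, and `is_string` is a made-up helper:

```python
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    string_types = (str,)
    integer_types = (int,)
    text_type = str
else:
    # This branch is only executed on Python 2, where these names exist.
    string_types = (basestring,)  # noqa: F821
    integer_types = (int, long)   # noqa: F821
    text_type = unicode           # noqa: F821


def is_string(obj):
    """Hypothetical helper: True for any text-like object on either version."""
    return isinstance(obj, string_types)
```

The rest of the code base then imports these names instead of testing the Python version itself. Once Python 2 support is dropped, the `else` branch can be deleted and the aliases replaced by the built-in types.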

To sum this blog post up, as far as I am concerned, pretty much all the problems with supporting Python 2.x and 3.x are fixed by having a single code base. You can read the pull requests above to see how to implement things (like metaclasses and other fancy stuff). Python is still pretty much the same language: you write your code, you use a Python version of your choice, and things will just work. Not a big deal overall. The official documentation should be fixed to recommend this approach and deprecate the other approaches.
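As an illustration of the metaclass case: Python 2 uses a `__metaclass__` attribute and Python 3 uses a `metaclass=` keyword, and neither spelling works in the other version. One workaround is to create a temporary base class by calling the metaclass directly. The sketch below is a simplified version of that idea, similar in spirit to the `with_metaclass` helper in the six library; it is not necessarily what the pull requests above do, and the `Registry` metaclass and `Expr` class are made up for the example:

```python
class Registry(type):
    """Hypothetical metaclass that records the name of every class it creates."""
    classes = []

    def __new__(mcs, name, bases, namespace):
        cls = super(Registry, mcs).__new__(mcs, name, bases, namespace)
        mcs.classes.append(name)
        return cls


def with_metaclass(meta, *bases):
    """Return a temporary base class whose metaclass is `meta`.

    Simplified sketch: the temporary class stays in the MRO of subclasses.
    """
    return meta("NewBase", bases or (object,), {})


# The same class statement is valid syntax on Python 2 and Python 3.
class Expr(with_metaclass(Registry)):
    pass
```

Subclasses inherit the metaclass from the temporary base, so `type(Expr)` is `Registry` on both versions.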

I think that Python is great and I hope it will be used more in the future.

Written with StackEdit.
