Ondřej Čertík

Sunday, August 4, 2013

How to support both Python 2 and 3

I'll start with the conclusion: making backwards incompatible version of a language is a terrible idea, and it was bad a mistake. This mistake was somewhat corrected over the years by eventually adding features to both Python 2.7 and 3.3 that actually allow to run a single code base on both Python versions --- which, as I show below, was discouraged by both Guido and the official Python documents (though the latest docs mention it)... Nevertheless, a single code base fixes pretty much all the problems and it actually is fun to use Python again. The rest of this post explains my conclusion in great detail. My hope is that it will be useful to other Python projects to provide tips and examples how to support both Python 2 and 3, as well as to future language designers to keep languages backwards compatible.

When Python 3.x got released, it was pretty much a new language, backwards incompatible with Python 2.x, as it was not possible to run the same source code in both versions. I was extremely unhappy about this situation, because I simply didn't have time to port all my Python code to a new language.

I read the official documentation about how the transition should be done, quoting:

You should have excellent unit tests with close to full coverage.

Port your project to Python 2.6.

Turn on the Py3k warnings mode.

Test and edit until no warnings remain.

Use the 2to3 tool to convert this source code to 3.0 syntax. Do not manually edit the output!

Test the converted source code under 3.0.

If problems are found, make corrections to the 2.6 version of the source code and go back to step 3.

When it's time to release, release separate 2.6 and 3.0 tarballs (or whatever archive form you use for releases).

I've also read Guido's blog post, which repeats the above list and adds an encouraging comment:

Python 3.0 will break backwards compatibility. Totally. We're not even aiming for a specific common subset.

In other words, one has to maintain a Python 2.x code base, then run 2to3 tool to get it converted. If you want to develop using Python 3.x, you can't, because all code must be developed using 2.x. As to the actual porting, Guido says in the above post:

If the conversion tool and the forward compatibility features in Python 2.6 work out as expected, steps (2) through (6) should not take much more effort than the typical transition from Python 2.x to 2.(x+1).

So sometime in 2010 or 2011 I started porting SymPy, which is now a pretty large code base (sloccount says over 230,000 lines of code, and in January 2010 it said almost 170,000 lines). I remember spending a few full days on it, and I just gave up, because it wasn't just changing a few things, but pretty fundamental things inside the code base, and one cannot just do it half-way, one has to get all the way through and then polish it up. We ended up using one full Google Summer of Code project for it, you can read the final report. I should mention that we use metaclasses and other things, that make such porting harder. Conclusion: this was definitely not "the typical transition from Python 2.x to 2.(x+1)".

Ok, after months of hard work by a lot of people, we finally have a Python 2.x code base that can be translated using the 2to3 tool and it works and tests pass in Python 3.x.

The next problem is that Python 3.x is pretty much like a ghetto -- you can use it as a user, but you can't develop in it. The 2to3 translation takes over 5 minutes on my laptop, so any interactivity is gone. It is true that the tool can cache results, so the next pass is somewhat faster, but in practice this still turns out to be much much worse than any compilation of C or Fortran programs (done for example with cmake), both in terms of time and in terms of robustness. And I am not even talking about pip issues or setup.py issues regarding calling 2to3. What a big mess... Programming should be fun, but this is not fun.

I'll be honest, this situation killed a lot of my enthusiasm for Python as a platform. I learned modern Fortran in the meantime and with admiration I noticed that it still compiles old F77 programs without modification and I even managed to compile a 40 year old pre-F77 code with just minimal modifications (I had to port the code to F77). Yet modern Fortran is pretty much a completely different language, with all the fancy features that one would want. Together with my colleagues I created a fortran90.org website, where you can compare Python/NumPy side by side with modern Fortran, it's pretty much 1:1 translation and a similar syntax (for numerical code), except that you need to add types of course. Yet Fortran is fully backwards compatible. What a pleasure to work with!

Fast forward to last week. A heroic effort by Sean Vig who ported SymPy to single code base (#2318) was merged. Earlier this year similar pull requests by other people have converted NumPy (#3178, #3191, #3201, #3202, #3203, #3205, #3208, #3216, #3223, #3226, #3227, #3231, #3232, #3235, #3236, #3237, #3238, #3241, #3242, #3244, #3245, #3248, #3249, #3257, #3266, #3281, #3191, ...) and SciPy (#397) codes as well. Now all these projects have just one code base and it works in all Python versions (2.x and 3.x) without the need to call the 2to3 tool.

Having a single code base, programming in Python is fun again. You can choose any Python version, be it 2.x or 3.x, and simply submit a patch. The patch is then tested using Travis-CI, so that it works in all Python versions. Installation has been simplified (no need to call any 2to3 tools and no more hacks to get setup.py working).

In other words, this is how it should be, that you write your code once, and you can use any supported language version to run it/compile it, or develop in. But for some reason, this obvious solution has been discouraged by Guido and other Python documents, as seen above. I just looked up the latest official Python docs, and that one is not upfront negative about a single code base. But it still does not recommend this approach as the one. So let me fix that: I do recommend a single code base as the solution.

The newest Python documentation from the last paragraph also mentions

Regardless of which approach you choose, porting is not as hard or time-consuming as you might initially think.

Well, I encourage you to browse through the pull requests that I linked to above for SymPy, NumPy or SciPy. I think it is very time consuming, and that's just converting from 2to3 to single code base, which is the easy part. The hard part was to actually get SymPy to work with Python 3 (as I discussed above, that took couple months of hard work), and I am pretty sure it was pretty hard to port NumPy and SciPy as well.

The docs also says:

It /single code base/ does lead to code that is not entirely idiomatic Python

That is true, but our experience has been, that with every Python version that we drop, we also delete lots of ugly hacks from our code base. This has been true for dropping support for 2.3, 2.4 and 2.5, and I expect it will also be true for dropping 2.6 and especially 2.7, when we can simply use the Python 3.x syntax. So not a big deal overall.

To sum this blog post up, as far as I am concerned, pretty much all the problems with supporting Python 2.x and 3.x are fixed by having a single code base. You can read the pull requests above to see how to implemented things (like metaclasses, and other fancy stuff...). Python is still quite the same language, you write your code, you use a Python version of your choice and things will just work. Not a big deal overall. The official documentation should be fixed to recommend this approach, and deprecate the other approaches.

I think that Python is great and I hope it will be used more in the future.

Written with StackEdit.

Monday, July 1, 2013

My impressions from the SciPy 2013 conference

I have attended the SciPy 2013 conference in Austin, Texas. Here are my impressions.

Number one is the fact that the IPython notebook was used by pretty much everyone. I use it a lot myself, but I didn't realize how ubiquitous it has become. It is quickly becoming the standard now. The IPython notebook is using Markdown and in fact it is better than Rest. The way to remember the "[]()" syntax for links is that in regular text you put links into () parentheses, so you do the same in Markdown, and append [] for the text of the link. The other way to remember is that [] feel more serious and thus are used for the text of the link. I stressed several times to +Fernando Perez and +Brian Granger how awesome it would be to have interactive widgets in the notebook. Fortunately that was pretty much preaching to the choir, as that's one of the first things they plan to implement good foundations for and I just can't wait to use that.

It is now clear, that the IPython notebook is the way to store computations that I want to share with other people, or to use it as a "lab notebook" for myself, so that I can remember what exactly I did to obtain the results (for example how exactly I obtained some figures from raw data). In other words --- instead of having sets of scripts and manual bash commands that have to be executed in particular order to do what I want, just use IPython notebook and put everything in there.

Number two is that how big the conference has become since the last time I attended (couple years ago), yet it still has the friendly feeling. Unfortunately, I had to miss a lot of talks, due to scheduling conflicts (there were three parallel sessions), so I look forward to seeing them on video.

+Aaron Meurer and I have done the SymPy tutorial (see the link for videos and other tutorial materials). It's been nice to finally meet +Matthew Rocklin (very active SymPy contributor) in person. He also had an interesting presentation
about symbolic matrices + Lapack code generation. +Jason Moore presented PyDy.
It's been a great pleasure for us to invite +David Li (still a high school student) to attend the conference and give a presentation about his work on sympygamma.com and live.sympy.org.

It was nice to meet the Julia guys, +Jeff Bezanson and +Stefan Karpinski. I contributed the Fortran benchmarks on the Julia's website some time ago, but I had the feeling that a lot of them are quite artificial and not very meaningful. I think Jeff and Stefan confirmed my feeling. Julia seems to have quite interesting type system and multiple dispatch, that SymPy should learn from.

I met the VTK guys +Matthew McCormick and +Pat Marion. One of the keynotes was given by +Will Schroeder from Kitware about publishing. I remember him stressing to manage dependencies well as well as to use BSD like license (as opposed to viral licenses like GPL or LGPL). That opensource has pretty much won (i.e. it is now clear that that is the way to go).

I had great discussions with +Francesc Alted, +Andy Terrel, +Brett Murphy, +Jonathan Rocher, +Eric Jones, +Travis Oliphant, +Mark Wiebe, +Ilan Schnell, +Stéfan van der Walt, +David Cournapeau, +Anthony Scopatz, +Paul Ivanov, +Michael Droettboom, +Wes McKinney, +Jake Vanderplas, +Kurt Smith, +Aron Ahmadia, +Kyle Mandli, +Benjamin Root and others.

It's also been nice to have a chat with +Jason Vertrees and other guys from Schrödinger.

One other thing that I realized last week at the conference is that pretty much everyone agreed on the fact that NumPy should act as the default way to represent memory (no matter if the array was created in Fortran or other code) and allow manipulations on it. Faster libraries like Blaze or ODIN should then hook themselves up into NumPy using multiple dispatch. Also SymPy would then hook itself up so that it can be used with array operations natively. Currently SymPy does work with NumPy (see our tests for some examples what works), but the solution is a bit fragile (it is not possible to override NumPy behavior, but because NumPy supports general objects, we simply give it SymPy objects and things mostly work).

Similar to this, I would like to create multiple dispatch in SymPy core itself, so that other (faster) libraries for symbolic manipulation can hook themselves up, so that their own (faster) multiplication, expansion or series expansion would get called instead of the SymPy default one implemented in pure Python.

Other blog posts from the conference:

Aaron's post
Fernando's post

Sunday, September 2, 2012

How to make pdflatex accept .eps images

Unfortunately, pdflatex does not support .eps images, for example the following code fails:

\usepackage{graphicx}
...
\includegraphics{qc.eps}

The fix is to put the following lines at the top of the document:

\usepackage{epstopdf}
\epstopdfsetup{suffix=}
\DeclareGraphicsRule{.eps}{pdf}{.pdf}{`epstopdf #1}

and compile with the -shell-escape flag:

pdflatex -shell-escape my_file.tex

Then the above code works. You need the epstopdf program that is run behind the scene.

Update: Apparently, it's enough to just include the following line at the top of the document:

\usepackage{epstopdf}

One still has to compile with the -shell-escape flag. This works for me in Ubuntu 12.04.

Thanks to: http://chi3x10.wordpress.com/2009/06/18/eps-and-pdflatex-no-more-converting-eps-to-pdf/

Monday, June 4, 2012

How to convert scanned images to pdf

From time to time I need to convert scanned documents to a pdf format.

Usage scenario 1: I scan part of a book (i.e. some article) on a school's scanner that sends me 10 big separate color pdf files (one pdf per page). I want to get one nice, small (black and white) pdf file with all the pages.

Usage scenario 2: I download a web form, print it, fill it in, sign it, scan it on my own scanner using Gimp and now I want to convert the image into a nice pdf file (either color or black & white) to send back over email.

Solution: I save the original files (be it pdf or png) into a folder and use git to track it. Then create a simple reduce script to convert it to the final format (view it as a pipeline). Often I need to tweak one or two parameters in the pipeline.

Here is a script for scenario 1:

And here for scenario 2:

There can be several unexpected surprises along the way. From my experience:

If I convert png directly to tiff, sometimes the resolution can be wrong. The solution is to always convert to ppm (color) or pbm (black and white) first, which is just a simple file format containing the raw pixels. This is the "starting" format (so first I need to convert the initial pdf or png into ppm/pbm) and then do anything else. That proved to be very robust.
The tiff2pdf utility proved to be the most robust way to convert an image to a pdf. All other ways that I have tried failed in one way or another (resolution, positioning, paper format and other things were wrong....). It can create multiple pages pdf files, set paper format (US Letter, A4, ...) and so on.
The linux convert utility is a robust tool for cropping images, converting color to black and white (using a threshold for example) and other things. As long as the image is first converted to ppm/pbm. In principle it can also produce pdf files, but that didn't work well for me.
I sometimes use the unpaper program in the pipeline for some automatic polishing of the images.

In general, I am happy with my solution. So far I was always able to get what I needed using this "pipeline" method.

Sunday, January 29, 2012

Discussion about global warming

I have read the text The Truth About Greenhouse Gases by William Happer, a physicist at Princeton. I liked it, so I posted it to my Google+. I was surprised by so many emotional responses. The post also made Michael Tobis to write a blog post with his opinion.

I was not satisfied about the overall tone of the discussion. I am really just interested in factual arguments. As such, I browsed through all the arguments against William's paper from the above discussion, and chose one well formulated question, that I think represents the most important objection: "In your paper you state at several places, that doubling the CO2 concentrations will only increase the temperature by 1 C. However, it is claimed (see for example Michael's post above) that the increase will be around 2.5 C. Which number is correct and where is it coming from?" I wrote to William and asked whether he would be willing to answer it. With his permission I am posting his answer here.

Answer: sensitivity.pdf
Images referenced in the answer: image001.png, image002.png.
Link referenced in the answer: http://www.thegwpf.org/best-of-blogs/4247-steve-mcintyre-closing-thoughts-on-best.html

Edit: I was told that Blogger makes it really hard to comment under the article. You can discuss it at my G+ post about this: https://plus.google.com/u/0/104039945248245758823/posts/PJeqx7GKtLg

Thursday, January 26, 2012

When double precision is not enough

I was doing some finite element (FE) calculation and I needed the sum of the lowest 7 eigenvalues of a symmetric matrix (that comes from the FE assembly) to converge to at least 1e-8 accuracy (so that I can check calculation done by some other solver of mine, that calculates the same but doesn't use FE). In reality I wanted the rounded value to 8 decimal digits to be correct, so I really needed 1e-9 accuracy (but it's ok if it is let's say 2e-9, but not ok if it is 9e-9). With my FE solver, I couldn't get it to converge more than to roughly 5e-7 no matter how hard I tried. Now what?

When doing the convergence, I take a good mesh and keep increasing "p" (the polynomial order) until it converges. For my particular problem, it is fully converged for about p=25 (the solver supports the order up to 64). Increasing "p" further will not increase the accuracy anymore, and the accuracy stays at the level 5e-7 for the sum of the lowest 7 eigenvalues. For optimal meshes, it converges at p=25, for not optimal meshes, it converges for higher "p", but in all cases, it doesn't get below 5e-7.

I know from experience, that for simpler problems, the FE solver can easily converge to 1e-10 or more using double precision. So I know it is doable, now the question is what the problem is: there
are a few possible reasons:

The FE quadrature is not accurate enough
The condition number of the matrix is high, thus LAPACK doesn't return very accurate eigenvalues
Bug in the assembly/solver (like single/double corruption in Fortran, or some other subtle bug)

When using the same solver for simpler potential, it converged nicely to 1e-10. So this suggests there is no bug in the assembly or solver itself. It is possible that the quadrature is not accurate enough, but again, if it converges for simple problem, it's probably not it. So it seems it is the ill conditioned matrix, that causes this. So I printed the residuals (that I simply calculated in Fortran using the matrix and the eigenvectors returned by LAPACK), and it only showed 1e-9. For simpler problems, it can go to 1e-14 easily. So that must be it. How do we fix it?

Obviously by making the matrix less ill conditioned, which is caused by the mesh for the problem (the ratio of the longest/shortest elements is 1e9) but for my problem I really needed such a mesh. So the other option is to increase the real number accuracy.

In Fortran all real variables are defined as real(dp), where dp is an integer defined at a single place in the project. There are several ways to define it, but it's value is 8 for gfortran and it means double precision. So I increased it to 16 (quadruple precision), recompiled. Now the whole program calculates in quadruple precision (more than 30 significant digits). I had to recompile LAPACK using the "-fdefault-real-8" gfortran option, that promotes all double precision numbers to quadruple precision, and I used the "d" versions (double precision, now promoted to quadruple) of LAPACK routines.

I rerun the calculation ---- and suddenly LAPACK residuals are around 1e-13, and the solver converges to 1e-10 easily (for the sum of the lowest 7 eigenvalues). Problem solved.

Turning my Fortran program to quadruple precision is as easy as changing one variable and recompiling. Turning LAPACK to quadruple precision is easy with a single gfortran flag (LAPACK uses the old f77 syntax for double precision, if it used real(dp), then I would simply change it as for my program). The whole calculation got at least 10x slower with quadruple. The reason is that gfortran runtime uses the libquadmath library, that simulates quadruple precision (as current CPUs only support double precision natively).

I actually discovered a few bugs in my program (typically some constants in older code didn't use the "dp" syntax, but had the double precision hardwired). Fortran warns about all such cases, when the real variables have incompatible precision.

It is amazing how easy it is to work with different precision in Fortran (literally just one change and recompile). How could this be done with C++? This wikipedia page suggests, that "long double" is only 80bit in most cases (quadruple is 128bit), but gcc offers __float128, so it seems I would have to manually change all "double" to "__float128" in the whole C++ program (this could be done with a single "sed" command).

Thursday, November 18, 2010

Google Code vs GitHub for hosting opensource projects

Cython is now considering options where to move the main (mercurial) repository, and Robert Bradshaw (one of the main Cython developers) has asked me about my experience with regards to Google Code and GitHub, since we use both with SymPy.

Google Code is older, and it was the first service that provided free (virtually unlimited) number of projects that you could easily and immediately setup. At that time (4 years ago?) that was something unheard of. However, the GitHub guys in the meantime not only made this available too, but also implemented features, that (as far as I know) no one offers at all, in particular hosting your own pages at your own domain (but at GitHub's servers, some examples are sympy.org and docs.sympy.org), commenting on git branches and pull requests before the code gets merged in (I am 100% convinced that this is the right approach, as opposed to comment on the code after it gets in), allow to easily fork the repository and it has simply more social features, that the Google Code doesn't have.

I believe that managing an opensource project is mainly a social activity, and GitHub's social features really make so many things easier. From this point of view, GitHub is clearly the best choice today.

I think there is only one (but potentially big) problem with GitHub, that its issue tracker is very bad, compared to the Google Code one. For that reason (and also because we already use it), we keep our issues at Google Code with SymPy.

The above are the main things to consider. Now there are some little things to keep in mind, that I will briefly touch below: Google Code doesn't support git and blocks access from Cuba and other countries, when you want to change the front page, you need to be an admin, while at GitHub I simply add push access to all sympy developers, so anyone just pushes a patch to this repository: https://github.com/sympy/sympy.github.com, and it automatically appears on our front page (sympy.org), with Google Code we had to write long pages (in our docs) about how to send patches, with GitHub we just say, send us a pull request, and point to: http://help.github.com/pull-requests/. In other words, GitHub takes care of teaching people how to use git and figure out how to send patches, and we can concentrate on reviewing the patches and pushing them in.

Wikipages at github are maintained in git, and they provide the webfrontend to it as opensource, so there is no vendor lock-in. Anyone with github account can modify our wiki pages, while the Google Code pages can only be modified by people that I add to the Google Code project, which forced us to install mediawiki on my linode server (hosted at linode.com, which by the way is an excellent VPS hosting service, that I have been using for couple of years already and I can fully recommend it), and I had to manage it all the time, and now we are moving our pages to the github wiki, so that I have one less thing to worry about.

So as you can see, I, as admin, have less things to worry about, as github manages everything for me now, while with Google Code, I had to manage lots of things on my linodes.

One other thing to consider is that GitHub is only for git, but they also provide svn and hg access (both push and pull, they translate the repository automatically between git and svn/hg), I never really used it much, so I don't know how stable this is. As I wrote before, I think that git is the best tool now for maintaining a project, and I think that github is now the best choice to host it (except the issue tracker, where Google Code is better).