Saturday, August 16, 2008

I am switching from Mercurial to git

After a long night of debugging and while preparing, reviewing and pushing final patches before a SymPy release, I got this:

$ hg qpush
applying 1645.diff
Unable to read 1645.diff
** unknown exception encountered, details follow
** report bug details to http://www.selenic.com/mercurial/bts
** or mercurial@selenic.com
** Mercurial Distributed SCM (version 1.0.1)
Traceback (most recent call last):
  File "/usr/bin/hg", line 20, in <module>
    mercurial.dispatch.run()
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 20, in run
    sys.exit(dispatch(sys.argv[1:]))
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 29, in dispatch
    return _runcatch(u, args)
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 364, in _dispatch
    ret = _runcommand(ui, options, cmd, d)
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 417, in _runcommand
    return checkargs()
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 373, in checkargs
    return cmdfunc()
  File "/var/lib/python-support/python2.5/mercurial/dispatch.py", line 356, in <lambda>
    d = lambda: func(ui, repo, *args, **cmdoptions)
  File "/var/lib/python-support/python2.5/hgext/mq.py", line 1942, in push
    mergeq=mergeq)
  File "/var/lib/python-support/python2.5/hgext/mq.py", line 833, in push
    top = self.applied[-1].name
IndexError: list index out of range

I am using:

$ hg --version
Mercurial Distributed SCM (version 1.0.1)

Copyright (C) 2005-2008 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


And I just got pissed off, and I am switching to git for good. I've been using Mercurial every day since about November 2007, and I think I am quite proficient in it. But we were missing some features, like recording a patch by hunks (git add -p), so Kirill Smelkov (another SymPy developer) implemented this for Mercurial queues; it is now included in hg 1.0. But I am constantly bitten by stupid bugs: Mercurial has broken into pieces several times already (last time with hg bisect, before that with Mercurial queues and file renaming), and that just should not happen in a production version. Another problem is that hgweb constantly uses 100% of the CPU on my server whenever someone clicks to see the contents of the README file. Again, I could spend time debugging it, but I will just use git instead, and if it works, I'll stay with it.

I started learning git recently and it is just better in every aspect: a bigger community, and it has all the features I was always missing in Mercurial (like rebase, recording by hunks, git diff --color-words, a better gitk). And it's superfast.

Unless I find some showstopper bugs in git too, I don't think I am coming back.

We'll have to create a live Mercurial mirror of SymPy, so that people can continue using Mercurial, but I myself will just be using git. We wrote a simple translation table to help anyone coming from Mercurial get started:

http://wiki.sympy.org/wiki/Git_hg_rosetta_stone

I love Python and I like that Mercurial is just a small C core with the rest in Python, but for some reason it still has baby bugs that should have been fixed years ago. And why should we spend our time fixing and improving Mercurial if other people have already done the job for us in git (and usually a better job than I could do)?

Another reason for switching is that in Debian, Mercurial is not used much for maintaining packages, while git is used a lot.

Thursday, August 14, 2008

SymPy core in Cython, general CAS design thoughts

We managed to write a new core of SymPy that is an order of magnitude faster than the current sympy, but unlike sympycore (which is Pearu's attempt to write a fast core), it uses the same architecture as sympy, so we can now merge it back and speed everything up.

It started like this: I was dissatisfied with the current sympy speed, and also with the fact that only Kirill, and sometimes Mateusz, were submitting speedup patches to sympy, while unfortunately no one was making any effort to get sympycore merged. I (and I think many other people) hoped that when sympycore got started, the aim was to get it merged back. That was not happening. So I said to myself: ok, no one will do the job for you, you need to get your hands dirty and get things moving yourself.

My thoughts went back three years, to a rainy weekend in Portland, Oregon. I was sitting in my office at the university, having just finished playing with swiginac, which we wrote with Ola Skavhaug, and I said to myself: no, there must be a better way. First there was all the bloated SWIG machinery, but that could have been fixed by hand (today we just use Cython). The bigger problem was how things were done inside GiNaC: too many classes, too difficult to write new functions. So I wrote my own CAS in pure Python just to see if things could be done better. Then a year later, at the Space Telescope Science Institute, I was again thinking: ok, a good CAS in Python is really needed.

So I wrote to the GiNaC developers, describing my concerns and exchanging ideas about how to do things better. I read this email from July 2006 again now, and it seems I was right. Quoting:

"I think there are people, who would implement for example the factorization (you can copy it from eigenmath for example) and other stuff like limits and integration, if the code weren't so complex (or am I wrong?)"

I was right: except for limits (which I wrote myself), all of the above was implemented by someone else in SymPy (Mateusz, but others too, did a stellar job here). It's actually pretty amusing for me to read my other emails from the GiNaC list, for example this. In that email I basically described the design of SymPy. In one of the replies I was told to consider an education in C++. As I also found out later, the GiNaC developers very easily start attacking me over nothing. But if one ignores that, I had several interesting discussions with them about CAS design (we also discussed something later, but I cannot find it now). Anyway, flaming people is something I would never tolerate on a sympy list, because that is exactly what drives people away, and I am not interested in any flame; I just want to get things done. In any case, with sympy I did exactly what I said in those emails on the GiNaC list, and it seems to have worked really well. One remaining piece of the puzzle, though, is speed. In a Linus Torvalds talk I watched today, he said: "well, maybe 30s is not much for you, but if people are used to do the same thing in 0.1s, trust me, it is way too slow." He is of course 100% right. So we need to speed sympy up drastically if we want to be competitive.

So those were my thoughts recently. I wrote a SymPy core in pure C over the last weekend; see my emails to the sympy list. I achieved some pretty competitive benchmarks, but there was one "minor" issue -- memory management. I wasn't freeing any memory, as I thought I could fix that later easily. It turned out to be a nightmare. I hoped I could use boehm-gc, but it doesn't work with glib (it gives nasty segfaults), which I was using for dictionaries and string management. So I hacked up reference counting just like in Python, but I didn't manage to make it work without segfaults. Segfaults can be debugged, but it's a lot of effort, especially when the crash happens deep in some libc function, far away from the code that actually causes it. Valgrind helps here too, but I want to spend my time on other things than debugging segfaults. Then Robert Bradshaw told me: why not use Cython? (BTW, Kirill had been telling me this from the beginning, but I am stubborn; if I believe in something, I want to try it.) So I tried and failed. Anyway, I must say I really enjoyed coding in C; it's a very good language.

So then I threw the code away and started from scratch. I think this was the 4th time I have written the "x+y+x -> 2*x + y" simplification algorithm. The first time, two years ago, it took me maybe a week to get it right. In C (the 3rd time) it took me maybe an hour. And now (the 4th time), in Python, it took just a couple of minutes. It makes me feel good to see that I am improving after all. Anyway, I wrote a core in pure Python, but using all my two years of experience and the know-how gained from discussing with Kirill, Mateusz, Fredrik, Pearu and many others how to do things. Our new core, which we call sympyx, is less than 600 lines long, it uses the same architecture as sympy, and it is amazingly fast.
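
To give an idea of what that algorithm boils down to, here is a minimal sketch in plain Python (illustrative only; the Symbol class and collect_terms function are my own names for this post, not the actual sympyx API): collecting x+y+x into 2*x + y essentially means accumulating a coefficient for each distinct term in a dictionary.

# Minimal sketch of the "x+y+x -> 2*x + y" collection step; the names
# here are illustrative, not the actual sympyx code.

class Symbol:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def __hash__(self):
        return hash(self.name)

    def __eq__(self, other):
        return isinstance(other, Symbol) and self.name == other.name

def collect_terms(terms):
    """Sum a list of terms, merging duplicates into coefficients."""
    coeffs = {}                        # term -> accumulated coefficient
    for term in terms:
        coeffs[term] = coeffs.get(term, 0) + 1
    # A real core would wrap these pairs back into Add/Mul objects.
    return [(c, t) for t, c in coeffs.items()]

x, y = Symbol("x"), Symbol("y")
print(collect_terms([x, y, x]))        # e.g. [(2, x), (1, y)]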

Then Kirill and I reserved a couple of evenings and cythonized it (neither of us has time during the day; I need to work on my thesis and Kirr has a job). We logged into one machine, set up screen so that we could see each other typing and also type into each other's terminal, and started cythonizing. Kirr is very good at it, he has submitted a couple of patches to Cython and he's just a much better developer than I am. He used his approach of rewriting everything directly to Cython, while I used my iterative approach of always trying to keep the tests passing. So when he completely broke the core in the middle and I saw on his terminal something like:

In [1]: x+x
[huge stacktrace with Cython errors]

and then I saw the cursor stop moving, I wrote "haha" to his terminal and was laughing. Probably a minute later I screwed something up in my terminal and saw my cursor write "haha".
A really good moment came when I managed to get all the tests passing first. I got hold of his terminal and wrote in his vim session: "look at my terminal". And there he saw:

$ py.test
============================= test process starts ==============================
executable: /usr/bin/python (2.5.2-final-0)
using py lib: /usr/lib/python2.5/site-packages/py

test_basic.py[13] .............
test_basic2.py[4] ....

================== tests finished: 17 passed in 0.04 seconds ===================

You know, all tests passing, with the core in Cython! And then I saw my cursor start to move, typing "git diff" and "git cherry -v origin" and "git log" and other things, as Kirr was checking that what he saw was correct. But I hadn't cdef'd the classes yet, I had just made it work (there were problems with the __new__ method not yet being supported in Cython, and other minor things). We both went home from work and continued: I kept bringing more and more classes and methods to Cython, but then I got this message from Kirr on my jabber: "all tests pass. Speedup is 3x". So I said to myself -- yes! He beat me to it. I pulled his changes and indeed, everything was cythonized and it was superfast. So his approach produced faster results after all.
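
To show the kind of rearrangement I mean, here is a hypothetical Python sketch (my own illustration, not the actual sympyx change): construction logic that sympy traditionally puts into __new__ can be moved into an explicit classmethod factory, which maps more directly onto a Cython extension type.

# Hypothetical illustration: construction logic moved out of __new__ into
# an explicit factory, a pattern that is easier to turn into a cdef class.

class Add:
    def __init__(self, args):
        self.args = tuple(args)

    @classmethod
    def new(cls, args):
        # Trivial canonicalization that would otherwise live in __new__.
        if len(args) == 1:
            return args[0]
        return cls(args)

    def __repr__(self):
        return "Add(%s)" % ", ".join(map(repr, self.args))

print(Add.new(["x"]))         # x -- collapsed to its single argument
print(Add.new(["x", "y"]))    # Add('x', 'y')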

We polished things, did a couple of benchmarks and announced it on the sympy and sage-devel lists.
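
To give a flavour of what such a micro-benchmark can look like (a sketch only; the expression and repeat counts below are made up for illustration and are not the numbers from the announcement), one can simply time a symbolic operation with timeit against whichever core is installed:

# Illustrative micro-benchmark; the expression and counts are made up,
# not the ones from the sympy/sage-devel announcement.
import timeit

setup = """
from sympy import Symbol, expand
x, y, z = Symbol('x'), Symbol('y'), Symbol('z')
"""

stmt = "expand((x + y + z)**15)"

# Best of 5 runs, 10 expansions per run.
best = min(timeit.repeat(stmt, setup=setup, repeat=5, number=10))
print("%.3f s per 10 expansions" % best)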

GiNaC is still faster, but not 100x faster; depending on the benchmark it is usually 2x to 10x faster (and sometimes slower). There is still a lot of room for both design and technical optimizations, so I think we'll improve for sure. I very strongly believe it is possible to be simple in design, maintainable, and fast.