
The Python Packaging Problem

At PyCon 2009 the fact that Python needs to solve the "packaging problem" came up a few times. This is not a new discussion. However, the problem is still not completely solved, so here I'll point out the details of the problem: the unsolved parts, the solved parts, and how the solved parts could be solved better.

#1 Gimme A Module

You want to install a module that someone else built into your Python installation. Easy: download the module from the Python Package Index (PyPI), untar it, and run

$ sudo python setup.py install
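
Spelled out end to end, with a hypothetical tarball name, the whole sequence is just

$ tar -xzf docutils-0.4.tar.gz
$ cd docutils-0.4
$ sudo python setup.py install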

Or, if you don't mind having setuptools installed, you can do all that in one command. For a concise example, imagine you want to install docutils

$ sudo easy_install docutils

If you have a module but you haven't made it available on PyPI yet, simply create a standard setup.py script and run

$ python setup.py sdist register upload -s
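
If you've never written one, a minimal setup.py only needs a few lines. This sketch uses plain distutils with hypothetical project details:

from distutils.core import setup

setup(
    name='mymodule',              # hypothetical package name
    version='0.1',
    description='An example module',
    author='Your Name',
    py_modules=['mymodule'],      # installs mymodule.py
)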

Let's go back to easy_install for a minute. All this script does is look up the package by name on PyPI. In its most straightforward form, it downloads the source package, patches distutils so that it can run python setup.py egg_info within the downloaded source (for Python < 2.5), then runs python setup.py install just as you would manually, as shown above. If an egg is available, it downloads the egg instead (more on egg_info and eggs later).

Why don't people like easy_install? Judging from the first couple of hits on Google, it's because no one understands easy_install. But there are two things that I don't like:

  • It adds an easy-install.pth file (this is standard Python, see .pth files), but this file contains a hack that alters sys.path so that once you install a package globally with easy_install, you can no longer easily override it locally with PYTHONPATH or sys.path. (A sketch of such a file follows this list.)
  • It installs module directories within a version-stamped directory. This only works because of the easy-install.pth file, which ties you to the site directories and makes it hard to work with local packages (more on that below). Setuptools uses this convention so that you can have simultaneous versions of a package, and to support namespace packages. There are better solutions for those use cases now.
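
To make the first point concrete, an easy-install.pth file looks roughly like this (the paths are illustrative). A normal .pth file just lists directories to append to sys.path; the import lines here are the hack, executed at interpreter startup to move the egg directories to the front of the path:

import sys; sys.__plen = len(sys.path)
./docutils-0.4-py2.5.egg
import sys; new=sys.path[sys.__plen:]; del sys.path[sys.__plen:]; p=getattr(sys,'__egginsert',0); sys.path[p:p]=new; sys.__egginsert = p+len(new)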

#2 Gimme A Module For Just This One Project

Installing everything into your global system makes it hard to work on multiple projects on one machine. This is mostly a development problem but it's also a deployment problem because it's generally overkill to build a new machine (or new virtual machine) for each Python project you want to deploy.

The vanilla Python solution to this is PYTHONPATH or sys.path. Pretty straightforward.
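
For example, to point one shell session at a project-local copy of a package (the path and script name here are hypothetical):

$ export PYTHONPATH=/home/me/projects/myapp/lib
$ python myapp.py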

However, due to the easy_install problems I pointed out above, the vanilla solution is not sufficient if you want to mix and match. Instead you need to use virtualenv, which works well for both development and deployment. Still, it's a little bit of overkill: the vanilla approach is simple (just tell Python where the other modules are) and shouldn't require symlinking your entire Python installation into a new location, which is how virtualenv works.
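
For reference, the virtualenv workflow looks like this (the environment name is arbitrary):

$ virtualenv myenv
$ source myenv/bin/activate
(myenv)$ easy_install docutils

Anything installed while the environment is active lands inside myenv instead of the global site-packages.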

#3 Gimme A Module Greater Than Version X But Less Than Version Y

But wait! It's all fine and well to download and install a module into your global system, but how do you upgrade it? And what version did that other-developer-who-no-longer-works-here install, anyway? Some modules define a version attribute, __version__, in __init__.py (and Django defines VERSION), but there is no standard, and most modules don't define a version at all except in their setup.py script. Between setuptools and PyPI this part is solved.
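
For modules that do expose a version attribute, you can at least check it interactively (the exact string depends on what you have installed):

>>> import docutils
>>> docutils.__version__
'0.4'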

In easy_install you can manage versions like this

$ easy_install docutils==0.4
$ easy_install "docutils>=0.4"
$ easy_install "docutils>=0.4,<0.5"

(The quotes keep the shell from treating > as output redirection.)

Those do what you'd expect. You can also upgrade a module like this

$ easy_install -U docutils

The way this all works is by storing metadata on disk in the egg-info format and by making simple HTTP requests to PyPI. To avoid the easy-install.pth problem, there is now a new tool for this called pip. It works the same way, but instead of using version-stamped subdirectories it installs modules "flat," just as if you had manually run python setup.py install. Pip also adds a module.egg-info file next to the flat module so that the currently installed version can be detected (for upgrading, requirements, etc.). Pip even handles namespace packages by preserving the egg-info dirs and simply stitching each one together into a flat, Python-compatible module. Pip does not support installing multiple versions of the same module in the same place, but you can use virtualenv for that.
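
To illustrate the difference, the two layouts in site-packages look roughly like this (the version and Python numbers are illustrative):

# easy_install: version-stamped egg directory, activated via easy-install.pth
site-packages/docutils-0.4-py2.5.egg/docutils/...

# pip: flat and importable with no .pth magic, metadata alongside
site-packages/docutils/...
site-packages/docutils-0.4-py2.5.egg-info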

#3 b. Gimme A Module Greater Than Version X But I Don't Want an Alpha Release

The version request against the PyPI site gets tricky when people release "alpha" or "beta" versions. For example

$ easy_install "SQLAlchemy>=0.5,<0.6"

This will work as expected unless a package named SQLAlchemy-0.6-alpha exists; then it will download 0.6-alpha even though your code is only compatible with the 0.5 series. This may be a bug in easy_install and pip, but there is a lot of ambiguity around these kinds of version numbers. This is an unsolved problem.

#4 Gimme A Module At Version X For Just This One Project

This is the most important use case. When you start to work with lots of projects that have lots of dependencies (e.g. Pylons), you need a way to specify the different versions each one requires and keep them independent of each other so the dependencies do not conflict. You can do that with install_requires=['SQLAlchemy>=0.5,<0.6'] in a setuptools-enabled setup.py script, but then you need to use easy_install and virtualenv, or pip and virtualenv.
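
A setuptools-enabled setup.py that pins the dependency that way might look like this (the project details are hypothetical):

from setuptools import setup, find_packages

setup(
    name='myproject',
    version='0.1',
    packages=find_packages(),
    # only versions in the 0.5 series satisfy this requirement
    install_requires=['SQLAlchemy>=0.5,<0.6'],
)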

That's fine, but what if you want to provide your users with all dependencies right out of the box? Of all the projects that want to distribute dependencies (e.g. the Google App Engine SDK, Django, Pinax, and others), I have not seen one adopt egg-info. So how can you be sure of what versions they are distributing? (I think these projects all document the version numbers in human-readable form, but you see my point.) It also seems that pip and easy_install are already disagreeing on an egg-info format (see PEP 376). Sigh.
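
For what it's worth, pip's requirements files are at least a machine-readable way to record exact versions, though they are not the metadata standard this paragraph is asking for:

$ pip freeze > requirements.txt
$ pip install -r requirements.txt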

Conclusions

I think flattening modules to make them fully Python-compatible and tacking on an egg-info directory is the way to go. This is how pip does it. Pip is not yet a drop-in replacement for easy_install, though, because it does not support binary packages. That's a problem if you don't keep build tools (gcc) on your production server, since pip can only download source and build it there.

What should we do? For starters, apply a patch to pip for binary handling (in other words, so that it can download eggs). Next, we need better tools for managing a directory of modules that can be committed to version control and distributed. I'm working on a pip wrapper named eco for that, but I'm still working out some kinks. Feel free to play around with it.

Why Not Just Use Nifty-Package-Manager-Foo ?

I have heard that the answer to all of this is to use rpm, apt-get, MacPorts, Fink, yum, BSD ports, or whatever. I don't really see how this is a solution, since each package manager still has to decide how to install the package and where to store the version metadata.

Did I miss anything? Any other suggestions?

UPDATE: Tarek Ziadé is working on the metadata standardization process and posting drafts and links to PEPs here: http://wiki.python.org/moin/Distutils