Saturday, April 9, 2011

Managing Python packages: virtualenv, pip and yolk

I've recently been playing with the Python virtualenv package - along with pip and yolk - as a way of managing third-party packages. This post is my brief introduction to the basics of these three tools.

virtualenv lets you create isolated self-contained "virtual environments" which are separate from the system Python. You can then install and manage the specific Python packages that you need for a particular application - safe from potential problems due to version incompatibilities, and without needing superuser privileges - using the pip package installer. yolk provides an extra utility to keep track of what's installed.

1. virtualenv: building virtual Python environments

virtualenv can either be installed via your system's package manager (for example, synaptic on Ubuntu), or by using the easy_install tool, i.e.:

$ easy_install virtualenv

(If you don't have the setuptools package which provides easy_install, then you can download the "bootstrap" install script from http://peak.telecommunity.com/dist/ez_setup.py. Save it as ez_setup.py and run it using /path/to/python ez_setup.py.)
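
For example, the bootstrap install might look something like this (a minimal sketch, assuming wget is available and that the Python interpreter you want is on your path - otherwise substitute the full path to python):

$ wget http://peak.telecommunity.com/dist/ez_setup.py
$ python ez_setup.py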

Once virtualenv is installed you can create a new virtual environment (called in this example, "myenv") as follows:

$ virtualenv --no-site-packages myenv

This makes a new directory myenv in the current directory (which will contain bin, include and lib subdirectories) based on the system version of Python. The --no-site-packages option tells virtualenv not to include any third-party packages which might have been installed into the system Python (see the virtualenv documentation for details of other options).

To start using the new environment, run the environment's "activate" command e.g.:

$ source myenv/bin/activate

The shell command prompt will change from e.g. $ to (myenv)$, indicating that the "myenv" environment (and any packages installed in it) will be used instead of the system Python for applications run in this shell. (Note that the Python application code doesn't need to live inside the virtual environment directory; in fact that directory is just used for the packages associated with the virtual environment.)
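
As a quick sanity check, you can confirm which interpreter is now being used (the paths shown here are just illustrative - yours will differ):

(myenv)$ which python
/home/me/myenv/bin/python
(myenv)$ python -c "import sys; print sys.prefix"
/home/me/myenv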

Finally, when you've finished working with the virtual environment you can leave it by running the deactivate command (which becomes available in the shell once the environment has been activated).
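
For example:

(myenv)$ deactivate
$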

(On Windows you may have to specify the full path to the "Scripts" directory of your Python installation when invoking the easy_install and virtualenv commands above, e.g. C:\Python27\Scripts\virtualenv. Also, note that when a virtual environment is created it won't contain a "bin" directory - instead it's activated by invoking the Scripts\activate batch file in the virtual environment directory. Invoking the deactivate command exits the environment as before.)
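
For example, a Windows session might look something like this (assuming Python 2.7 is installed in C:\Python27 - adjust the paths to suit your installation):

C:\> C:\Python27\Scripts\easy_install virtualenv
C:\> C:\Python27\Scripts\virtualenv --no-site-packages myenv
C:\> myenv\Scripts\activate
(myenv) C:\> deactivate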

2. pip: installing Python packages

Once you've created a virtual environment you can start to add packages (which is really the point of doing this in the first place). virtualenv automatically includes both easy_install and an alternative package installer called pip (at least for virtualenv 1.4.1 and up; earlier versions only have easy_install, so you'll need to run easy_install pip within the virtual environment in order to get it).

Most packages that are easy_installable can also be installed using pip, and it's designed to work well with virtualenv. However, I think its main advantage is that it offers some useful functionality that's missing from easy_install - most significantly, the ability to uninstall previously installed packages. (Other useful features include the ability to explicitly control and export the versions of third-party package dependencies via "requirements files" - see the pip documentation for more details.)
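
For example, a typical round trip with a requirements file looks something like this (requirements.txt is just the conventional file name, not something pip insists on):

(myenv)$ pip freeze > requirements.txt   # record the currently installed packages and versions
(anotherenv)$ pip install -r requirements.txt   # recreate the same set of packages elsewhere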

Basic pip usage looks like this:

(myenv)$ pip install python-dateutil # install latest version of a package

(myenv)$ pip uninstall python-dateutil # remove package

(myenv)$ pip install python-dateutil==1.5 # install specific version


(As an aside, the python-dateutil package is illustrative of one of the advantages of using pip over easy_install: after installing the latest version of python-dateutil, I discovered that it's only compatible with Python 3 - an earlier 1.* version is required to work with Python 2. pip let me uninstall the newer version and reinstall the older one.)

3. yolk: checking Python packages installed on your system

The final utility I'd recommend is yolk, which provides a way of querying which packages (and versions) have been installed in the current environment. It also has options to query PyPI (the Python Package Index). Installing it is easy:

(myenv)$ pip install yolk

Running it with the -l option (for "list") then shows which packages are installed in the current environment:

(myenv)$ yolk -l
Python - 2.6.4 - active development (/usr/lib/python2.6/lib-dynload)
pip - 1.0 - active
python-dateutil - 1.5 - active
setuptools - 0.6c9 - active
wsgiref - 0.1.2 - active development (/usr/lib/python2.6)
yolk - 0.4.1 - active
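
yolk can also query PyPI directly - for example (if I remember the options correctly), -U checks your installed packages against PyPI for newer releases, and -V reports the latest released version of a named package:

(myenv)$ yolk -U
(myenv)$ yolk -V python-dateutil
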
(See the yolk documentation to learn more about its other features.)

Summary

Obviously the above is just an introduction to the basics of virtualenv, pip and yolk for managing and working with third-party packages - but hopefully it's enough to get started. If you're interested in using virtualenv in practice then Doug Hellmann's article about working with multiple virtual environments (and his virtualenvwrapper project, which provides tools to help) is recommended as a starting point for further reading.

Monday, April 4, 2011

Richard Stallman: "A Free Digital Society?"

About a month ago I was fortunate to attend an IET-hosted lecture by Richard Stallman, entitled "A Free Digital Society?". Probably most famous as the originator of the GNU project (out of which came GNU/Linux) and initiator of the free software movement, Stallman has for many years been an active and vocal advocate for free software, and has campaigned against excessive extension of copyright laws.

He began the talk with the observation that there is an implicit assumption in the recent movement towards "digital inclusion", that using computers and the internet is inherently good and beneficial. However, as the question mark in the title of his talk indicated, this assumption merits closer attention, as (in his opinion) there are various issues and threats associated with these technologies. These include:
  • Surveillance: technology now makes it possible for ISPs, websites and other organisations to monitor and analyse what individuals do online (e.g. the sites they visit, the things they buy, the search terms they use, etc.) to an extent of which (in Stallman's words) "Stalin could only dream".
  • Censorship: for example, governments or corporations blocking access to particular websites (think Google in China), or even forcing them to close.
  • Restrictions on users imposed by data formats: both proprietary (e.g. Silverlight) and patented data formats (e.g. MP3) restrict what the end user is able to do with the data they encode.
  • Non-free software: here "free" is in the sense of "freedom", rather than price. Non-free software is essentially software that isn't under the control of you, the user - in the case of proprietary software, it's controlled by the owner (for example Microsoft, Apple, Amazon) who is able to insert features (e.g. to track user behaviour) that serve their interests rather than those of the user. By contrast, free software - which by the way you can still charge money for - gives the user four basic freedoms: 0. to run the software for any purpose; 1. to study how the software works, and make changes to it; 2. to redistribute the software as-is; 3. to redistribute the software with your changes (see the free software definition). In this way malicious features can be detected and removed, and control is returned to the user.
  • "Software as a service" (SaaS): in Stallman's definition, "software as a service" is anything where the computation is done by programs that you can't control - this is like non-free software above, because someone else has control and can change how your computing is done at any time without your permission. He made a distinction between things like e-commerce, online storage storage (e.g. Dropbox), publishing (e.g. Twitter) and search (which are about "data" or "communication", and so are not SaaS), and e.g. Google Docs (which does do computation for you, and so is SaaS). (See Stallman's article Who does that server really serve?)
  • Misuse of an individual's data: essentially doing something with your data without your permission, or even your knowledge - for example, passing on personal data to the authorities, unilaterally modifying your data, or even (for example in the case of Facebook) using it for commercial purposes.
  • "The War on Sharing": according to Stallman, sharing is "using the internet for what it's best at", and the war on sharing - whether digital rights management (DRM) technology or threatening internet users with disconnection (as under the UK's Digital Economy Act) - is an attempt by commercial interests to unfairly restrict what users are allowed to do (see Stallman's article Ending the War on Sharing).
  • Users don't have a positive right to do things on the internet: essentially, all the activities that users perform on the internet - communications, payment etc - are dependent on organisations who have no obligation to continue providing those services to you.
This is a pretty long list of issues (hopefully I've accurately captured the essence of each), and while many of them can be mitigated by moving to free software, others (for example, monitoring by ISPs) require other solutions - and Stallman admitted that he's quite pessimistic about the future. Aside from that, it was a fascinating and entertaining talk (including the auctioning of a GNU gnu soft toy to raise funds for the Free Software Foundation), and the subsequent audience Q&A session provided opportunities for elaboration and clarification on many of the issues.

I'm still mulling over many of the issues raised. On the one hand there is a fundamental question about what moral rights you believe individuals should have, both generally and with specific regard to the digital world; and on the other there is the question of what you should do if you feel those rights are not being upheld. Stallman's position is clear and uncompromising: for example, not owning a mobile phone and not using a key card to enter his office (to avoid the possibility of being tracked), and using a netbook that allows him to run 100% free software (down to the BIOS level). It's certainly given me plenty to think about, and I'm looking forward to reading his book of collected essays "Free Software, Free Society" - which might be a good place to start if you're also interested in learning more.

Sunday, April 3, 2011

Book review: "Python 2.6 Text Processing: Beginner’s Guide" by Jeff McNeil

Jeff McNeil’s “Python 2.6 Text Processing: Beginner’s Guide” is a practical introduction to a wide range of methods for reading, processing and writing textual data from a variety of structured and unstructured data formats. Aimed primarily at novice Python programmers who have some elementary knowledge of the language basics but without prior experience in text processing, the book offers hands-on examples for each of the techniques it discusses – ranging from Python’s built-in libraries for handling strings, regular expressions, and formats such as JSON, XML and HTML, through to more advanced topics such as parsing custom grammars, and efficiently searching large text archives. In addition it contains a great deal of general supporting material on working with Python, including installing packages and third-party libraries, and working with Python 3.

The first three chapters lay the foundations, covering a number of Python basics including a crash course in file and URL I/O, and the essentials of Python’s built-in string handling functions. Useful background topics – such as installing packages with easy_install, and using virtualenv – are also introduced here. (A sample of the first chapter can be freely downloaded from the book’s website at https://www.packtpub.com/python-2-6-text-processing-beginners-guide/book). The next three cover: using the standard library to work with simple structured data formats (delimited “CSV” data, “ini”-style configuration files, and JSON-formatted data); working with Python regular expressions (a stand-out chapter for me); and handling structured markup (specifically, XML and HTML). Subsequent chapters on using the Mako templating package (the default system for the Pylons web framework) to generate emails and web pages, and on writing more advanced data formats (PDF, Excel and OpenDocument), are separated by an excellent overview of understanding and working with Unicode, encodings and application internationalization (“i18n”).

The remaining two chapters cover more advanced topics, with some good background theory supplementing the practical examples: using the PyParsing package to create parsers for custom grammars (with a brief nod to the basics of natural language processing using the Natural Language Toolkit, NLTK); and the Nucular package for indexing large quantities of textual data (not necessarily just plain text) to enable highly efficient searching. Finally, an appendix offers a grab-bag of general Python resources, references to some more advanced text processing tools (such as Apache’s Lucene/Solr), and an excellent overview of the differences between Python 2 and 3 (including a hands-on example of migrating code from 2 to 3).

The book covers a lot of ground and moves fairly quickly; however it adopts a largely successful hands-on approach, engaging the reader with working examples at each stage to illustrate the key points, and this certainly helped me keep up. I was also impressed by the clear and concise quality of code in the examples, and the very natural way that general Python concepts and principles – generators, duck typing, packaging and so on – were introduced as asides. (One very minor criticism is that the layout of the example code could have been improved, as the indentation levels weren’t always immediately obvious to me.) Aside from a surprisingly unsatisfying chapter on structured markup (reluctantly, I would recommend looking elsewhere for an introduction to XML processing with Python) and a few niggling typos, there’s a lot of excellent material in this book, and the author has a knack for presenting some tricky concepts in a deceptively easy-to-understand manner. I think that the chapter on regular expressions is possibly one of the best introductions to the subject that I’ve ever seen; other chapters on encodings and internationalization, advanced parsing, and indexing and searching were also highlights for me (as was the section on Python 3 in the appendix).

Overall I really enjoyed working through the book and felt I learned a lot. I think it’s fair to say that, given the rather ambitious range of techniques presented, in many cases (particularly for the more advanced or specialised topics) the chapters are inevitably more introductory than definitive in nature: the reader is given enough information to grasp the background concepts and get started, with pointers to external resources to learn more. In conclusion, I think this is a great introduction to a wide range of text processing techniques in Python, both for novice Pythonistas (who will undoubtedly also benefit from the more general Python tips and tricks presented in the book) and more experienced programmers who are looking for a place to start learning about text processing.

Disclosure: a free e-copy of this book was received from the publisher for review purposes; this review has also been submitted to Amazon.