Practical Computing for Biologists
By Steven H.D. Haddock and Casey W. Dunn
Sinauer Associates, Sunderland (MA). 2011
During the latter part of his career my father was a computer programmer (he was a Methodist minister before that). This actually didn’t help me learn anything about computers, at first, because in those days neither he nor I had a computer at home. After all, Apple did not build their first fully assembled computer until 1977 (the year after I started university), and IBM did not release their first PC microcomputer until 1981. Hanging around the computer room at the Australian Broadcasting Commission was not really an option, even if my father’s colleagues did make me feel welcome.
So, I actually received my first piece of computer tuition from a fellow postgraduate student (David Killingly), who had seen the bioinformatics revolution coming in the early 1980s and had therefore completed a computing certificate whilst also doing his PhD. Part of this tuition involved getting the Wagner78 program to run on the University’s mainframe computer. Steve Farris had given a copy of this program to one of the other PhD students, reportedly with the comment that she probably wouldn’t be able to get it to run. So, it seems to me that David’s success was likely a first for Australia, and quite possibly for the world.
This taught me something about Fortran programming, which allowed me to successfully get other desirable programs (e.g. Decorana, Twinspan) to run on microcomputers some years later (most of the problems were confined to the input/output code, which was easy to fix). From there, I branched out into doing the same thing for Basic programs, learning by trying to decipher pre-existing programs. My first university lecturing job (in 1986) required me to teach (among other things, such as taxonomy and ecology) Pascal programming, for which I decided to consult some books rather than using guesswork, as I had before. More recently, I successfully learnt some Perl programming using a combination of these two methods.
Note that I am not discussing any ability to create “proper” computer programs, but merely what I call “disposable programming”. By this I mean the modification of existing programs if they don’t do exactly what I want, or the writing of simple little programs to do one-off tasks that are either too complex or repetitive to do manually. I have rarely written a program that anyone else could use! Conversely, no-one is ever going to release a program to do the sorts of short-term tasks that I require. That is why I learned to write my own, so that I could throw them away without regret after they have finished being useful.
However, computer programming is just one part of using a computer in biology. The other important part is dealing with different operating systems, and especially learning to type commands whenever the WIMP interface (windows, icons, mouses [sic], pull-down menus) is too cumbersome. The first microcomputer I ever used was a Cromemco C-10, which most of you will never have heard of. It ran its own CDOS operating system, and was one of the most unstable things you have ever seen — to this day, I still press the Save button every few seconds, which is a habit I developed from using that [expletive deleted] machine. We only used it for word processing (a program called WriteMaster), which is all it was good for. There was almost no other software for it; indeed, one of the postdocs (John Smith) had to write a printer driver in order to connect it to the Diablo printer that the Department had bought. (I still remember having to answer “No” to the question: “Is your name David Morrison?”, because otherwise the program would quit. This went on for years.)
I also learnt to use Unix while working as a research assistant for the late Mary Tindale. The Royal Botanic Gardens in Sydney was in the process of buying their first minicomputer, and I got involved in this process in spite of the fact that I was supposed to be studying acacias for the Flora of Australia. This helps explain why I was one of the few of Mary’s assistants who never published any new species — I was given species that required little taxonomic work, so that I could spend my time computerizing (mainly with Peter Weston and the late Ken Hill). This probably wasn’t what ABRS thought they were spending their money on, but Mary seemed happy to turn a blind eye.
I thus learned a lot about computers, but my knowledge of acacias remains somewhat sketchy, even to this day. I once tried to rectify this anomaly by using the Delta Intkey computer program to help me prepare the Acacia treatment for the Flora of New South Wales (with considerable help from Stuart Davies, and also David Mackay), thus combining computing with acacias, with some success.
The final part of my computer education came with the so-called office that I was given for my first lecturing job, which was actually part of a small disused lab (instead of four walls and a door, I had three walls and a shower curtain). Down the other end of the lab was a brand-new Macintosh Plus computer, which my Department had been loaned as a “seed” computer (this was 1986, remember). Actually, the salesman subsequently left the company he worked for, so that we never returned the computer (which presumably makes it stolen by now), and I got my sweaty hands on it. I knew about these computers because Roger Carolin (ever the innovator) had recently bought one. This computer allowed me to run MacClade (the first program I ever bought), which, in the days when we all used solely phenotypic data, sure beat the hell out of plotting the characters onto the trees by hand. I didn’t actually buy my own home computer until the mid-1990s, but it was a Macintosh, and I have owned them both at work and at home ever since.
So, why have I just given you my life story? Apart from its intrinsic interest to myself, it is actually relevant to the book that I am supposed to be reviewing. You see, the book is intended to prevent other biologists from having to go through the same long-winded process that I went through. The book fast-tracks you through the whole process of learning to seriously use a computer in biology (rather than merely pottering with packaged programs), and is therefore precisely the book that I needed 30 years ago. (The authors note: “Much of what we ourselves use in practice was garnered through self-directed experience, and we have tried to collect this knowledge in one place to make it easier for other scientists” p. 4.) Where were the authors when I needed them?
The idea is that if you know a bit more than how to double-click on icons, then you can treat a computer as a serious research tool in modern biology. I take no great credit for realizing this 30 years ago, and basing my career on the idea, but it is even more true today than it was back then, since the bioinformatics revolution was then merely a gleam in the eye of people like Mike Dallwitz (the author of Delta) and Richard Pankhurst. So, I find it somewhat surprising that there are still so few books on this subject.
The point here is that you often won’t be able to find someone else to do the computing for you — if you can’t do it yourself then it won’t get done at all. For example, I have recently found the need to extract small amounts of information from text files that are tens or hundreds of megabytes in size, which is trivial if you can write a few lines of computer code but pretty much impossible otherwise — this is why I decided to learn some Perl programming. The only alternative was to hire someone else to do it, and the task was a bit too trivial to interest a professional (who would be far too expensive, anyway). Besides, that is a bit like getting someone else to eat your lunch for you every time you feel hungry!
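As a sketch of the sort of disposable script involved (the file name and the accession-like pattern are my own invention, and the book itself would use Python rather than Perl), a few lines suffice to pull matching lines out of an arbitrarily large text file without ever loading it into memory:

```python
import re

def matching_lines(path, pattern):
    """Yield the lines of a (possibly huge) text file that match a regex."""
    regex = re.compile(pattern)
    with open(path) as handle:
        for line in handle:  # one line at a time, so file size is irrelevant
            if regex.search(line):
                yield line.rstrip("\n")

# e.g. print every line mentioning an accession-like identifier:
# for line in matching_lines("records.txt", r"[A-Z]{2}\d{6}"):
#     print(line)
```

The generator (`yield`) keeps memory use constant, which is precisely what matters when the input runs to hundreds of megabytes.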
Another way of looking at this issue is simply productivity. If you spend several weeks learning to use a computer productively, this can save you months if not years of time later on, or even open up possibilities that were previously closed to you. This is not true of any other electronic tool used by biologists, since such tools typically need to be learned properly to be of any use at all. There is also a computer on almost every scientist’s desk, while most of the other tools are housed in laboratories — scientists usually spend more time at their computer than in their lab or in the field. Given this, why do so many people waste their time copying files around, reformatting data files for different programs, repeating tasks manually, and reinventing various wheels (see http://software-carpentry.org/blog)? Computers are no longer glorified word-processors, and biologists are no longer confined to using a few simple “canned” computer programs with their “one size fits all” functions.
The book by Haddock & Dunn does not try to teach you much computing, but instead it provides a “problem-centric” self-study guide to give you a taste of what computing is and what it can do for you. You’re not going to become a computer scientist, but if you follow the book and try the practical exercises then you will become a seriously competent user of scientific computing. In that sense it is precisely the right sort of book for a biologist, not surprisingly since the authors consider themselves to be “biologists who also happen to have backgrounds in computing” (p. 2). The book eminently succeeds at the authors’ stated goals: “We expect that many biologists will use this book to improve the efficiency of their research, help scale up existing projects, or develop the skills needed for new types of studies” (p. 4).
The authors have chosen the Apple Macintosh as the computer of choice for their book, at least partly because you can run both Unix and Macintosh programs on the same machine, and this is a seriously helpful situation when you don’t know what type of program you will need for your next piece of research. Indeed, it is very likely to be a program that will run under Unix rather than under Windows, for example. This does not mean that Windows machines are excluded from the book, but much of the necessary information is relegated to an Appendix.
The programming language of choice for the book is Python, a language that I have had nothing to do with so far, but one that, as a result of the book, I am now well qualified to deal with, should I ever require it for my disposable programming. Python’s recognized advantages for biological computing include: it is easy to learn and easy to read; it is interpreted and thus multiplatform (Python programs run on most operating systems); it offers free access to source code; and there are internal and external libraries of pre-existing code already available (notably the biology-centric Biopython).
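To illustrate the readability claim, here is a toy function of my own devising (not an example from the book) of the kind a biologist might dash off; even a non-programmer can follow what it does:

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()  # accept upper- or lower-case input
    gc = sequence.count("G") + sequence.count("C")
    return gc / len(sequence)

gc_content("ATGCGC")  # 4 of the 6 bases are G or C
```

No declarations, no boilerplate: the program reads almost like the sentence describing it, which is much of Python’s appeal for disposable programming.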
The book also provides examples for use in other situations, notably when using mathematical toolkits such as Matlab or R. I have some familiarity with the latter but not the former. Indeed, R is fast becoming the tool (and language) of choice for biological mathematics, which is going to be a bit of a cultural shock for most biologists, unless they are used to working out their own computer commands in an apparently cryptic language.
Languages like Python and R take longer to learn than do most canned packages, but their modular nature allows the user to mix-and-match a wide range of methods that have been developed for data analysis (or to develop their own methods). The main question is whether the advantages of doing, say, statistics or phylogenetics in an environment like R or Python outweigh the extra learning cost. An increasing number of people seem to think that they do (or, at least, that their research assistant should learn).
In addition to getting the reader to understand operating systems and computer programming, the authors have tackled the topics of: searching and modifying text files using regular expressions; writing shell scripts; combining tools using pipes; relational databases; working with graphics programs; and interfacing with electronic equipment. You will even learn how best to organize data in spreadsheets to simplify subsequent processing and analysis, as well as learning more than you expected about preparing figures on computers. In all cases the focus is on flexible tools that can be adapted for many purposes, rather than on pre-packaged programs (no matter how popular they may be).
The book is full of practical advice, as well as sound teaching. The sections vary in their usefulness, depending on what you might be trying to use a computer for; and there are missing sections that could have been added, such as web page development, and techniques for using the web to find out why your computer program has recently stopped working. Oddly, the sections on “Working on Remote Computers” and “Installing Software” are placed near the end of the book, rather than near the beginning, which is where I would have placed them based on my own learning experience. Much as I like the book, I would also have preferred some more emphasis on the evils of “black-box bioinformatics”, in which data analyses are performed on large amounts of data without any regard for the quality (or even logic) of the underlying biology. There are many people who genuinely believe that sheer quantity of data will swamp any possible inadequacies in quality, but processing data in bulk is a classic way of making uncheckable mistakes.
There are examples throughout the book, illustrating in a practical way the points being made. However, the suggestion (p. 25) that one might take the zoological name “Physalia physalis (Linnaeus)” and want to delete the brackets might seem inappropriate to a taxonomist! I would have been more impressed, also, if the word “Göteberg” (p. 193) had been spelled with an “o” rather than the second “e” (although a web search has just revealed to me that there are a number of U.S. websites that use this erroneous spelling of the city you know as Gothenburg).
Importantly, not all of the examples used in the book are about molecular biology, which is a major plus. There is a depressing tendency in the modern world to see all of biology as molecular biology, and therefore to assume that everyone is a molecular biologist who wants only to process DNA or amino acid sequences. This assumption is far from the truth, and it is commendable of the authors to recognize this fact explicitly. It will be interesting to see whether the Biopython project, which gathers together Python-written programs specifically for “biology”, ever breaks itself out of its biology = molecular viewpoint. I have seen it suggested that R is actually more useful than Python for ecologists, which is likely to be true for most statistical analyses. However, it might be more efficient to access the R statistical features from within a Python program (e.g. using the package PypeR), because R is not really a good language for novices (i.e. it is idiosyncratic and less consistent than other languages, so that there is a steep learning curve).
There is an associated web page for the book (http://practicalcomputing.org/), which has updated instructions (e.g. for recent versions of software), download files (including the authors’ own data, as well as programs to be installed, and the computer code for all of the practical exercises), and errata for the book’s text.
All in all, this is an extremely useful book, which covers a lot of material not covered elsewhere. It is not a book for casual readers, nor is it a reference manual, but instead it is intended for those who really want to make some progress with their mastery of computers as tools for scientific use. The price seems to be pretty outrageous, but you will get your money back pretty quickly with your increased productivity. (This may seem a bit like spending your own money to provide your employer with some benefit!)
Biologists become biologists in order to do biology, this much is abundantly clear. If they wanted to be computer scientists then they would have studied computing instead. Nevertheless, modern biologists spend much of their time sitting in front of a computer screen, so it is perhaps best if they learn to sit there as productively as possible. This book will help them (although it won’t help deal with the endless stream of bureaucratic emails).