Software developers can no longer rely on increasing clock speeds alone to speed up single-threaded applications... developers must learn how to properly design their applications to run in a threaded environment. - Multi-Core Programming: Increasing Performance through Software Multi-threading. Shameem Akhter and Jason Roberts
If, like me, you're programming on a twenty-year-old, single-threaded, performance-critical legacy code base, this book is for you. It assumes only programming competence, begins with the motivation for multi-threading and the context and history of the hard and the soft, then works methodically from the high-level concepts of concurrency and parallelism down to the primitive building blocks and their machine implementation.
After a month of pestering Smart Friends and trolling the net, reading platform-specific and API-specific documentation, Wikipedia, and academic papers trying to wrap my head around multi-threading, I was quite happy to find everything I needed to learn nicely packaged, well written, and organized under one cover. Although threading Graphing Calculator to run in parallel on multi-core machines is conceptually straightforward, there are a large number of design choices I must make before I begin writing code, choices which will have a long-term effect on the implementation and debugging in the coming months. It will be difficult to go back and correct poor choices, and I have no experience with multi-threaded applications to draw on in constructing a parallel architecture.
Akhter and Roberts discuss the common problems, and the tradeoffs among the common solutions, both at a conceptual level and in a detailed discussion of code using the common APIs. All of it is framed from every level: how the programmer sees things, how the OS sees them, and how the hardware sees them.
My first question as a programmer is how to subdivide the problem: task decomposition, data decomposition, or data flow decomposition. In Graphing Calculator, this is easy. Each equation, representing a different graph, already computes its result separately before the software composes the results from all the graph objects into a unified 2D, 3D, or 4D view. This natural task decomposition is easily parallelized. Within a single graph object, some types of equations also lend themselves to a data decomposition. GC represents a function of a complex variable by computing the complex function at each point of the complex plane. This can be very slow. Fortunately, it is embarrassingly parallelizable, as the calculation at each point is independent.
Akhter and Roberts dedicate one chapter to the Win32/MFC, .NET, and POSIX threading APIs, and another to OpenMP. No mention is made of the Carbon MP, Cocoa, and Mach threading APIs on Mac OS X, nor of Boost threads, nor of the current proposals to create a standard threading API within C++0x. It would be nice for me as a programmer if the high-level application architecture and the low-level API choice were unrelated concerns. On Mac OS X, for example, all threading APIs are ultimately implemented (through some number of abstraction layers) in terms of Mach threads. However, each API has its own terminology, quirks, and concepts which leak through and affect the high-level abstraction, making the implementation API choice more important than it ought to be.
The book does an excellent job of presenting solutions to common parallel programming problems and comparing the low-level tools one uses: lock-free algorithms using atomic operations; synchronization with semaphores, locks, and condition variables; memory fences and barriers; and message-passing approaches. Their chapters on debugging techniques and on Intel's tools were alone worth the price of the book. The authors have already faced all the problems I am likely to face in the coming months, and wrote this book so that folks like me, just starting out, can benefit from their experience.
I write this now on a single-processor 1 GHz G4 laptop, excited about my work for the coming year and the possibilities enabled by four- and eight-core machines running at upwards of 3 GHz. That speed will enable entirely new classes of interactive visualization.