After a month of reading about threading and a week of coding infrastructure for parallel execution, yesterday I parallelized the Graphing Calculator code for 2D inequalities. This morning I tried it on a dual-core machine while visiting a colleague and was gratified to see visual confirmation of the parallelism as updates from a slow computation marched across the screen in two parallel bands.
Graphing Calculator spends most of its time doing calculations affectionately refered to as embarrassingly parallel. Evaluating 2D inequalities, for example, might be thought of in terms of evaluating a function separately at every point on the screen. Divide the screen geometry into pieces for each thread and you're done. It's not quite that simple, though.
You could divided the screen into even and odd pixels, for example. That might make sense in terms of load-balancing, since it might be easier to do computations in different places. But, though each pixel is conceptually independent, adjacent pixels are in nearby memory and share cache lines resulting in memory contention, cache-line ping-pong, and much slower parallel execution from the communication overhead of maintaining cache coherency. Similarly, dividing the screen into vertical bands may result in unintended cache-line sharing depending on the widths of the bands and the width of the cache lines. Dividing the screen along horizontal scan lines limits the cache-line sharing to where the end of one block and the beginning of the next block. This is a common issue in designing for performance. Seemingly symmetric choices (like the order of nested loops) can affect performance orders of magnitude due to undlerlying low-level details of a machine's memory cache hierarchy.
Graphing Calculator uses interval arithmetic to speed up drawing inequalities so that it doesn't always have to evaluate the function at every point on screen. GC starts by evaluating the interval extension of the function once over a rectangular interval spanning the domain visible. The interval evaluation can act as a proof that the function is true or false everywhere on that interval. Otherwise, GC subdivides the rectangle into two pieces and recurses. For simple functions, large areas of the screen are evaluated and drawn quickly. This introduces a challenge for load balancing given the simple decomposition into horizontal strips. The top half of the screen evaluated in one thread may take only a single interval function evaluation and complete quickly, while the bottom half of the screen may take much much longer to computer. One could attempt to share the work between threads at the level of the recursive subdivision step in this divide-and-conquer algorithm. I suspect that memory sharing or lock contention for the list of rectangles to-be-computed could become an issue, but I would have to experiment and measure to be sure.