" /> A Programmer's Apology: September 2006 Archives

« August 2006 | Main | October 2006 »

September 27, 2006

Software Archaeology

Pham Nuwen spent years learning to program/explore. Programming went back to the beginning of time. It was a little like the midden out back of his father's castle. Where the creek had worn that away, ten meters down, there were the crumpled hulks of machines -- flying machines, the peasants said -- from the great days of Canberra's original colonial era. But the castle midden was clean and fresh compared to what lay within the Reprise's local net. There were programs that had been written five thousand years ago, before Humankind ever left Earth. The wonder of it -- the horror of it, Sura said -- was that unlike the useless wrecks of Canberra's past, these programs still worked! And via a million million circuitous threads of inheritance, many of the oldest programs still ran in the bowels of the Qeng Ho system. Take the Traders' method of timekeeping. The frame corrections were incredibly complex -- and down at the very bottom of it was a little program that ran a counter. Second by second, the Qeng Ho counted from the instant that a human had first set foot on Old Earth's moon. But if you looked at it still more closely ... the starting instant was actually some hundred million seconds later, the 0-second of one of Humankind's first computer operating systems. - A Deepness in the Sky, Vernor Vinge

They tend to make their changes as small surgical changes that affect the component as little as possible. And the more these surgical changes add up, the more fragile the code gets. As these changes build up, the code in the component starts to resemble a piece of conglomerate rock. The more people add onto the base, the more blobs of other rocks get added to the conglomerate. When I work on the code like this, it sometimes feels like I’m an archaeologist digging through the layers of an ancient city – if you pay really close attention, you can see which developers wrote which portions of the code. - Larry Osterman

At first glance, business software developers have little in common with Indiana Jones. But the emerging field of software archaeology applies some of the same skills, if not the dashing adventure. - Simon Sharwood

"Software archaeology" conjures up images of people poring over documents in a forgotten language, looking at broken artifacts whose purpose is unknown. - Ralph Johnson

It's been a long couple of days deep in the bit mines.

September 26, 2006

Intel pledges 80 cores in five years

Intel has built a prototype of a processor with 80 cores that can perform a trillion floating-point operations per second.

Exciting times ahead.

September 25, 2006

Memory barrier

The question on implementing a memory barrier generated a lot of e-mail discussion. The short answer is:

  • On OS X you can use OSSynchronizeIO in OSAtomic.h.
  • VC++ has intrinsics: _WriteBarrier, _ReadBarrier, _ReadWriteBarrier. For x86, these merely instruct the compiler not to reorder operations past them in certain ways. Later x86 chips have the mfence, lfence, and sfence instructions, which IIRC apply only to some of the newer SSE2 memory operations.
See below the cut for more gruesome details...

Unless you implement all of your data within the atomic variables themselves, you need a memory barrier. Otherwise, consider the following:

store result to gResult              -- A
atomic set kDone flag in gFlags      -- B
On PowerPC, there's no guarantee that A is seen by other processors before B.
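On current toolchains, that publish pattern can be sketched with C11 atomics, where a release store on the flag plays the role of the explicit barrier. This is a sketch, not the original code: the names publish, try_read, and the use of a single flag bit are assumptions echoing the A/B example above.

```c
#include <stdatomic.h>

static int gResult;              /* ordinary shared data                */
static atomic_int gFlags;        /* flag word, set atomically           */

#define kDone 1

/* Writer: the release store guarantees the store to gResult (A) is
   visible to any thread that observes kDone with an acquire load (B). */
static void publish(int value) {
    gResult = value;                                      /* A */
    atomic_fetch_or_explicit(&gFlags, kDone,
                             memory_order_release);       /* B */
}

/* Reader: returns nonzero once the result is safely visible. */
static int try_read(int *out) {
    if (atomic_load_explicit(&gFlags, memory_order_acquire) & kDone) {
        *out = gResult;  /* guaranteed to see the value stored in A */
        return 1;
    }
    return 0;
}
```

The release/acquire pair is exactly the ordering guarantee that PowerPC does not give you for plain stores.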

On CW OSSynchronizeIO is implemented as:

static __inline__ void OSSynchronizeIO(void)
{
  #if defined(__ppc__)
    __asm__ ("eieio");
  #endif
}

So on CW it boils down to the question: CW won't re-order instructions around inline asm, right?

Right. Even though it would be inlined, the compiler can't make assumptions about what the function call is doing. That inline asm could be doing something like a system call with all sorts of potential side effects.

On CW OSSynchronizeIO is implemented as __asm__ ("eieio"). Is that insufficient in an MP system to ensure that write re-ordering won't break the use of a "good" flag?

It is sufficient, as it enforces the order of accesses to the cache for cacheable address space, and cache coherency between processors is handled by the hardware.

I was under the impression that it was not sufficient in cases where the flag is protecting shared data.

The shared data will be in the cache (or main RAM if it got flushed out of the cache), and the cache is coherent across processors. The memory barrier makes sure any accesses to the shared data by the current thread and processor before the barrier are completed before any accesses after the barrier, so the shared data must be in the cache (or memory) before the flag change is.

True. However, you have to be very careful when using this optimisation (that is, using eieio instead of lwsync). Check out the following note from B.2.2.2 of "PowerPC Virtual Environment Architecture" book II...

However, for storage that is neither Write Through Required nor Caching Inhibited, eieio orders only stores and has no effect on loads. If the portion of the program preceding the eieio contains loads from the shared data structure and the stores to the shared data structure do not depend on the values returned by those loads, the store that releases the lock could be performed before those loads. If it is necessary to ensure that those loads are performed before the store that releases the lock, lwsync should be used instead of eieio. Alternatively, the technique described in Section B.2.3 can be used.
This makes eieio a "write barrier." For read or read/write, use lwsync.
Note that locking/unlocking a pthread mutex will issue the correct barriers (I checked).

Consider the case where you do something like:

struct {
   int m1;
   int m2;
} gFoo;

int gLock;

static void MyFunc(void)
{
   int local1;

   gFoo.m1 = 1;
   local1 = gFoo.m2;
   gLock = false;

   // At this point, there's no guarantee that the load for local1 has
   // been done.  If clearing gLock allows another thread to modify
   // gFoo, the load of local1 is in a race with that other thread's
   // modifications.
}
Ain't weak consistency grand!

Am I correct in understanding that IA32 does no visible write re-ordering?

It does for some of the SSE2 memory operations, but not for regular ones. Appendix B of the "PowerPC Virtual Environment Architecture" book II makes for some fascinating reading.

If you want to read more, here are pages on strict consistency, sequential consistency, and relaxed consistency.

Thank you for tuning in to today's episode of Adventures in Parallel Computing.

September 24, 2006

Failure modes

After several weeks of reading books, source code, and API and system documentation, I'm ready to start coding. But first, time for a blogging break: an attempt to enumerate all the brand-new kinds of failure modes that appear when programs run in parallel. Most of these occur when time-ordering assumptions are broken by parallel threads of code contending for some shared resource, often objects stored in shared memory accessed from multiple threads.

Deadlock: Thread #1 blocks waiting for resource-A held by Thread #2, which is blocked waiting for resource-B held by Thread #1. Both threads wait forever. Deadlock is a symptom of poor design.

Self-deadlock: Thread #1 blocks waiting for a resource, not realizing it already holds that resource. When the resource is a mutex, a recursive mutex solves this problem.

Livelock is a variation of deadlock in which each thread releases the resource it owns in an attempt to break the deadlock, but then each manages to acquire only one of the necessary resources, cannot continue, releases it, and loops uselessly in that fashion.

Race conditions are subtle bugs in which the result depends on timing: different results occur as the ordering of instructions running on different processors changes from run to run.

Word-tearing can occur when a read or write to memory is not an atomic operation. For example, a floating-point double value occupies 8 bytes. Usually, those eight bytes are read or written all at once. However, if they are misaligned and cross a cache-line or page boundary, those 8 bytes may be written or read in pieces. One thread reading a double from memory may see a nonsense value made up of some of the bytes of an old value and some of the bytes of a new value written by a different thread.
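A quick way to reason about this: an 8-byte value can only be torn if it is not naturally aligned. A small sketch (the may_tear name is hypothetical; note that actually dereferencing a misaligned double is undefined behavior in C, so this only checks the address):

```c
#include <stdint.h>

/* Returns nonzero if an 8-byte value at p could be torn: a value that
   is not naturally aligned may straddle a cache-line or page boundary,
   so its bytes may be read or written in two separate operations. */
static int may_tear(const void *p) {
    return ((uintptr_t)p % sizeof(double)) != 0;
}
```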

Cache ping-pong: Locality of reference is crucial for high performance on a single processor now that CPUs run so much faster than they can access main memory. However, multiple threads, even when doing independent calculations on independent, unrelated values, may still have an invisible connection if the memory those values occupy happens to live in the same cache line. In that case, the cache-coherency mechanism which enables using shared memory for communication in a multiprocessor system will constantly refresh that cache line on each core to keep them in sync, slowing each thread enormously while this occurs (even though each core uses a disjoint subset of the memory on that cache line). This may be difficult to detect, since the results will, eventually, be correct.
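The standard remedy is to pad per-thread data out to cache-line size so no two threads share a line. A sketch, assuming a 64-byte line (the actual line size is hardware-specific, and the struct name is hypothetical):

```c
/* Assumed cache-line size; 64 bytes is typical but hardware-specific. */
#define CACHE_LINE 64

/* Each thread's counter gets its own cache line, so updates by one
   thread never force a coherency refresh of another thread's line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct padded_counter counters[4];  /* one per worker thread */
```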

Too many threads can swamp the system in the overhead of managing them. One needs to consider the actual number of CPU cores (or virtual cores in the presence of hyperthreading) and experiment to determine the optimal number of threads for the software design. Typically, one submits jobs to a work queue serviced by persistent threads watching the queue.
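The work-queue idea can be sketched with POSIX threads: persistent workers block on a condition variable and run jobs as they arrive. This is a minimal illustration, not production code (no shutdown path, no error handling; all names are hypothetical):

```c
#include <pthread.h>
#include <stdlib.h>
#include <stdatomic.h>

typedef struct job {
    void (*run)(void *);
    void *arg;
    struct job *next;
} job_t;

static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;
static job_t *qhead = NULL;

/* Submit a job; any waiting worker is woken to run it. */
void submit(void (*run)(void *), void *arg) {
    job_t *j = malloc(sizeof *j);
    j->run = run;
    j->arg = arg;
    pthread_mutex_lock(&qlock);
    j->next = qhead;               /* LIFO for simplicity */
    qhead = j;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* Persistent worker: sleeps until work appears, then runs it. */
static void *worker(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (qhead == NULL)
            pthread_cond_wait(&qcond, &qlock);
        job_t *j = qhead;
        qhead = j->next;
        pthread_mutex_unlock(&qlock);
        j->run(j->arg);
        free(j);
    }
    return NULL;
}

void start_workers(int n) {
    pthread_t tid;
    for (int i = 0; i < n; i++)
        pthread_create(&tid, NULL, worker, NULL);
}

/* Example job: atomically count completions. */
atomic_int gJobsRun;
void count_job(void *arg) { (void)arg; atomic_fetch_add(&gJobsRun, 1); }
```

The number of workers passed to start_workers is where the core-count experimentation happens.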

Priority inversion occurs when a high priority task is slowed by a lower priority task holding a resource it needs.

False sharing occurs when the work done by multiple threads is not truly independent and may run slower than if it were done by a single thread. Cache-line ping-pong is an example, where the memory layout introduces a dependency between threads that may not be apparent from the data flow or the source code.

Non-thread-safe APIs: Any function you call from a thread, whether in your own code, a library, or the operating system, must also be thread-safe. The C++ standard, for example, does not require this (as it imposes a performance cost that not all users will want to pay), so you must refer to the documentation of the specific library implementation you are using.
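A classic illustration from the C library: strtok keeps its position in hidden static state, so two threads tokenizing at once corrupt each other. POSIX provides strtok_r, which moves that state into a caller-owned pointer. A sketch (the count_words helper is hypothetical):

```c
#include <string.h>

/* Thread-safe word count: strtok_r keeps its cursor in the caller's
   'save' pointer rather than in hidden static state, so concurrent
   calls from different threads cannot interfere with each other. */
static int count_words(const char *text) {
    char buf[256];
    char *save, *tok;
    int n = 0;
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (tok = strtok_r(buf, " ", &save); tok != NULL;
         tok = strtok_r(NULL, " ", &save))
        n++;
    return n;
}
```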

September 21, 2006

Audience Participation Friday

<tap><tap> Is this on?

Blogging's an odd form of informal public speaking where it's a surprise to discover that anyone's listening. So, if you're reading this, please take a moment to introduce yourself in the comments.

September 20, 2006


Graphing Calculator spends most of its time computing functions typed by the user. Occasionally it will draw results as calculations progress. From a high altitude, re-architecting the code to work across multiple CPU cores is straightforward: all computations will occur in preemptive threads. Occasionally, the main thread wakes up to draw a snapshot of the results either in progress or when they are all completed. As always, the devil is in the details. Graphing Calculator currently has thirty distinct graph classes, each representing a different algorithm to compute and visualize some mathematical object inferred from the form of the equation the user types.

In many of these classes, the size of the result is known in advance and allocated as an array of structures. The structure contains a .good flag initialized to false to indicate when the other data in that element has been computed. The compute thread evaluates elements in that structure (not necessarily sequentially) and sets .good to true after each element is finished. When the main thread is called on to update the screen, those elements in the array that are good are drawn. (For context, this particular drawing step is part of a larger one which draws all equations, composing them with translucency whenever anything changes. To get transparency correct, it is necessary to redraw everything in back-to-front order. So it is frequently necessary to redraw elements rather than merely draw them once after computing them.)
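Using today's C11 atomics, that construct might look like the following sketch. The names (ResultElement, set_element) and the double fields are hypothetical stand-ins; the point is the release store on .good paired with an acquire load in the reader.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* One element per computed point; .good flips to true only after the
   rest of the element is filled in. */
typedef struct {
    double x, y;          /* computed data */
    atomic_bool good;     /* true once x, y are valid */
} ResultElement;

/* calloc zero-fills, so every .good starts out false. */
ResultElement *alloc_results(size_t n) {
    return calloc(n, sizeof(ResultElement));
}

/* Compute thread: fill the element, then publish it. The release
   store guarantees x and y are visible before .good reads true. */
void set_element(ResultElement *e, double x, double y) {
    e->x = x;
    e->y = y;
    atomic_store_explicit(&e->good, true, memory_order_release);
}

/* Main thread: draw only elements whose data is safely visible. */
int element_ready(ResultElement *e) {
    return atomic_load_explicit(&e->good, memory_order_acquire);
}
```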

The new problem in a multi-threaded world, with processors that commonly use out-of-order execution to increase instruction-level parallelism, is that the hardware order of memory operations no longer matches the software order, and different CPU cores may see memory locations change in different orders. A "fence instruction" or "memory barrier" guarantees completeness of all pre-fence memory operations and halts all post-fence memory operations until completion of the fence instruction. - Akhter and Roberts.

So my question now is: Do I want to try to keep most of the existing code unchanged and use an explicit fence to make it MT-safe? (And if so, what can I use to implement the memory barrier?) Or do I want to use a higher-level thread-safe data structure, which will involve (probably a lot) more rewriting?

A few points to remember:

  • I'm transitioning performance-critical, twenty-year-old, numerically intensive legacy code
  • the basic structure (array of elements, ->good flag, drawing partial results) is repeated across several dozen distinct classes
  • the size of the array may be known a priori, and the entire array allocated up front. It can be rather large. Preallocating the entire array is far more efficient than storing separate results in pieces scattered across memory
  • Transitioning a large code base will be safer, less buggy, and easier to test the closer the MT-safe constructs I use are to the current single-threaded implementation.

In conversation, here are some suggestions folks made towards making this construct MT-safe:

Use a threadsafe queue. Threadsafe queues are easy to write. Have a threadsafe set of done elements.

It will also work to just have a lock on the array and scan it when locked if that's what you really want to do. And then lock it before setting doneFlag and unlock it afterwards.

Rather than having a doneFlag, why not a pointer to the data? So it's NULL or non-NULL, and if non-NULL then you know the data has been allocated, instantiated, and is ready.

I don't think there is a portable way to guarantee ordering of memory operations. However, if you just need Mac & Windows, there are probably ways of doing it using inline assembly (e.g., eieio).

If you don't want to change things and are only worried about the serialization of the .good = 1 setting, you'll have to use something from the POSIX threads library to absolutely guarantee it. However, there are things you can do that guarantee it in practice. For example, putting code in separate compilation modules so the compiler can't optimize across the boundary guarantees the order of the writes the compiler emits.

You have to use POSIX threads or compiler extensions. They are your only options.
On Windows boxes, you see a lot of code using a single global mutex, which works but isn't optimal.

I think you have better data structure alternatives, but only you can assess the cost & risk of changing your legacy codebase. I would use a library-provided data structure: so it's someone else's bug to think about this.

Putting one big mutex over the entire array might actually be your best bet. Simple. Works. How many threads are you going to have competing for it? One writer thread. One reader (main thread). Well, the problem with it then would be that the writer will block while you iterate over the array. So it can't go off and continue calculation if the reader is slow.

Why don't you just have an array of pointers to objects that are done, initialized to null? Have it mutex-protected. When done, grab the mutex, write the pointer, release the mutex. I don't think the reader has to look at the mutex for this to work and be safe.

You have to get both in-order execution from the processor and the expected instruction ordering out of your compiler, don't forget. A fence instruction guarantees completeness of all pre-fence memory operations and halts all post-fence memory operations until completion of the fence instruction. This fence mechanism ensures proper memory mapping from software to hardware memory models.

Sounds like eieio. It is there to guarantee the (appearance at least) of in-order execution, as far as memory IO is concerned.

If the writer thread grabs the mutex, writes .good, releases the mutex - is it safe for the reader thread to see .good == true without bothering with the mutex?

September 18, 2006

Multi-core Programming

Software developers can no longer rely on increasing clock speeds alone to speed up single-threaded applications... developers must learn how to properly design their applications to run in a threaded environment. - Multi-Core Programming: Increasing Performance through Software Multi-threading. Shameem Akhter and Jason Roberts

If, like me, you're programming on a twenty-year-old, single-threaded, performance-critical legacy code base, this book is for you. It assumes only programming competence, begins with the motivation for multi-threading and the context and history of the hardware and the software, then methodically works from the high-level concepts of concurrency and parallelism down to the primitive building blocks and machine implementation.

After pestering Smart Friends and trolling the net for the last month reading platform-specific and API-specific documentation, Wikipedia, and academic papers trying to wrap my head around multi-threading, I was quite happy to find everything I needed to learn nicely packaged, well-written, and organized under one cover. Although, conceptually, threading Graphing Calculator to run in parallel on multi-core machines is straightforward, there are a large number of design choices I must make before I begin writing code which will have a long-term effect on the implementation and debugging in the coming months. It will be difficult to go back and correct poor choices, but I have no experience working on multi-threaded applications to draw from in constructing a parallel architecture.

Akhter and Roberts discuss the common problems and the tradeoffs involved in the common solutions at both a conceptual level and with a detailed discussion of code using the common API choices. All of this is framed at every level: how the programmer sees it, how the OS sees it, and how the hardware sees it.

My first question as a programmer is how to subdivide the problem: task decomposition, data decomposition, and data flow decomposition. In Graphing Calculator, this is easy. Already, each equation representing a different graph separately computes its result before the software draws all the results from each separate graph object, composing them in a unified 2D, 3D, or 4D view. This is a natural task decomposition, easily parallelized. Within a single graph object, some types of equations also lend themselves to a data decomposition. GC represents a function of a complex variable in the complex plane by computing the complex function at each point. This can be very slow. Fortunately, it is embarrassingly parallel, as the calculation at each point is independent.

Akhter and Roberts dedicate one chapter to the different Win32/MFC, .NET, and POSIX threading APIs, and another to OpenMP. No mention is made of the Carbon MP, Cocoa, and Mach threading APIs on Mac OS X, nor of Boost threads or the current proposals being worked on to create a standard threading API within C++0x. It would be nice for me as a programmer if the high-level application architecture and the low-level API choice were unrelated concerns. On Mac OS X, for example, ultimately all threading APIs are implemented (through some number of abstraction layers) in terms of Mach threads. However, each API has its own terminology, quirks, and concepts which leak through and affect the high-level abstraction, making the implementation API choice more important than it ought to be.

The book does an excellent job of presenting solutions to common parallel programming problems and comparing the low-level tools one uses, from lock-free algorithms using atomic operations, to synchronization with semaphores, locks, and condition variables, memory fences and barriers, and messaging approaches. Their chapters on debugging techniques and Intel tools were alone worth the price of the book. The authors have already faced all the problems I am likely to face in the coming months, and wrote this book so that folks like me just starting out can benefit from their experience.

I write this now on a 1 GHz G4 laptop with a single processor. Looking forward to my work for the next year, I am excited by the possibilities enabled by four- and eight-core machines running at upwards of 3 GHz. The speed will enable entirely new classes of interactive visualization.

September 14, 2006


On the occasion of the 50th anniversary of the hard disk drive I've been pondering the evolution of the hardware Graphing Calculator grew up on.

The original hard disk drive (IBM 305 RAMAC) in 1956 stored only 5MB of data on fifty 24-inch disks and weighed over 250 kg (over 550 pounds). About the size of two large refrigerators and as tall as a man, you could lease the whole unit for about $250,000/year in today's dollars - John Cole

I began coding Graphing Calculator's equation editing in 1985 for the original Macintosh with an 8 MHz 68000 processor and 400K floppy disks. The challenge then was to make it responsive enough to keep up with typing equations. Since editing an equation changes its two-dimensional structure, the entire equation layout was redrawn with each keystroke. One trick was to first draw the equation with the selection so that where your eye was focusing was responsive. Peripheral vision didn't notice the lag elsewhere. Milo did not give the user any choice of equation font or size in order to simplify and speed up the update calculations.

We wrote Graphing Calculator 1.0 in 1993 for the original Power Mac 6100 with a 60 MHz PowerPC 601. At that time, we wrote:
In our tests, calculations using the PowerPC processor's single-precision floating-point multiply-add instruction were 20,000 times faster. This means that if we had started a lengthy floating-point calculation in 1984 at the release of the Macintosh, and that calculation were still being worked on by the computer, it would take a Power Macintosh starting now just four hours to catch up.

The equation editing code was the same written for Milo in the 1980's. On the Power Mac, updates during typing were no longer a problem, so we used that code unchanged and focused our attention on visualizing graphs and using the speed of the machine to improve the user interface. The lessons we learned then remain relevant today.

Taking full advantage of any machine requires understanding the hardware. On an 8 MHz Motorola 68000, performance was limited by the CPU, which took many 125-nanosecond cycles to complete each instruction. With no floating-point hardware, individual arithmetic operations took hundreds of instructions. On a 60 MHz PowerPC 601, many instructions took only one 16.7-nanosecond cycle. Furthermore, the PowerPC was one of the first superscalar machines, meaning that it had multiple execution units (integer, floating point, and branch) and could do multiple instructions per cycle. However, accessing memory was much slower, so the CPU kept an on-chip cache of recently accessed memory. In many cases, performance was entirely determined by memory usage patterns. Calculations which could fit their data entirely in the on-chip cache ran 100x faster than the same calculation on a larger data set which exhausted the cache and needed to access main memory. Programming for performance now meant choosing algorithms and data structures with locality of reference so they worked on data in pieces small enough to fit in the cache at one time.

Today, the world is again different. Now machines have multiple cores, each with their own on-chip cache (perhaps shared with other cores on that chip, perhaps not). Programming for performance now means choosing algorithms that can run in parallel and minimize communication between parallel threads.

September 13, 2006


I don't multitask well. I am most productive when I have twelve to eighteen hours completely uninterrupted in silence to focus on a problem. Even minor distractions can destroy my train of thought. When I am able to stay "in the zone" for long periods I can be very productive. Many folks have written about this phenomenon. It explains some of the reclusive or anti-social nature of hackers and their nocturnal habits. Joel remarks that it is surprising that any programming can be accomplished in the cube farms typical of software companies.

Computers, on the other hand, have no trouble multitasking. They switch tasks many times a second. Swap out the registers, stack, and other CPU state, and a processor will continue merrily along on a different problem with no complaints. We programmers have the difficult job of architecting our software to use this multitasking to improve our users' experience of our software.

Once upon a time, computers were simpler. You could understand what they were doing following a single thread of execution. Decades of Moore's Law have accustomed us to ever larger disks, more memory, and faster processors. Working on the same software for twenty years, I've watched features that were once unthinkable become interactive and taken for granted. When Graphing Calculator 1.0 shipped, the feature that really demonstrated the speed of the original Power Macs with 60 MHz 601 PowerPCs, even more so than the visually impressive 3D graphing, was the ability to graph inequalities such as x² + y² + sin²(4x) + sin²(4y) < n². What took several minutes a frame in 1993 now animates many frames per second.

As a designer, I continually ask myself How can the latest technology benefit my users? Performance is one of the fundamental usability issues. When something can happen many times a second, it is interactive and playful. When something takes seconds or minutes, users are forced into a batch or query-response mode of interaction where they must spend more of their attention thinking about formulating the problem for the computer, rather than thinking about the problem itself.

What does this all have to do with multi-tasking? The Free Lunch Is Over. Programmers can no longer rely on ever-faster clock speeds. To take advantage of the power of new machines, we now have to design for parallelism. Apple now ships quad-core machines, and is expected to soon ship eight-core machines. (Let's not forget the GPU.)

I've spent the last month reading about threads, semaphores, mutexes, recursive mutexes, read-write locks, condition variables, work queues, critical sections, deadlocks, race conditions, word tearing, fences, barriers, atomic operations, and all manner of synchronization and communication tools and issues. Threading APIs are a maze of twisty passages, all different: POSIX threads, Carbon MP threads, Cocoa threads, OpenMP, Win32/MFC, .NET, Mach threads, and Boost threads.

The real challenge is the software archaeology of reviewing twenty years of source code line by line to expose and make explicit any hidden assumptions of synchronization and communication which could be ignored when there is a single thread of execution, but which require careful thought when different parts of the code execute in parallel.

This is already a long post, so I'll leave it at that for now.

September 09, 2006

Staunton Chess Function

pawn.gif rook.gif

Eric Eason constructed these equations to model Staunton chess pieces. (Click on the images above for the Graphing Calculator file.)

September 08, 2006

Square Wheels

Square Wheels

James Taylor posted on the Graphing Calculator users list showing square and octagonal wheels moving smoothly along a road made of catenary arcs. I rewrote the equations in terms of a complex variable z=x+iy. Download Graphing Calculator file.

Check out Stan Wagon's square wheeled bicycle here.