After several weeks of reading books, source code, and API and system documentation, I'm ready to start coding. But first, time for a blogging break: an attempt to enumerate all the brand-new kinds of failure modes that appear when programs run in parallel. Most of these occur when time-ordering assumptions are broken by parallel threads of code in contention for some shared resource, often objects stored in shared memory accessed from multiple threads.
Deadlock: Thread #1 blocks waiting for resource-A held by Thread #2, which is blocked waiting for resource-B held by Thread #1. Both threads wait forever. Deadlock is a symptom of poor design.
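A minimal sketch of the classic case, assuming a compiler with C++11's std::mutex (the function names are just for illustration): two threads acquire the same pair of mutexes in opposite orders, and std::lock shows one way out.

```cpp
#include <mutex>

std::mutex resource_a, resource_b;

void thread1() {
    std::lock_guard<std::mutex> la(resource_a);
    // ... window in which thread2 can grab resource_b ...
    std::lock_guard<std::mutex> lb(resource_b);  // waits forever if thread2 holds B
}

void thread2() {
    std::lock_guard<std::mutex> lb(resource_b);  // opposite order: deadlock
    std::lock_guard<std::mutex> la(resource_a);
}

// One fix: let std::lock acquire both mutexes as a group, which it does
// without deadlocking; another is to impose a global lock ordering.
void thread2_fixed() {
    std::lock(resource_a, resource_b);
    std::lock_guard<std::mutex> la(resource_a, std::adopt_lock);
    std::lock_guard<std::mutex> lb(resource_b, std::adopt_lock);
}
```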
Self-deadlock: Thread #1 blocks waiting for a resource, not realizing it already holds that resource. When the resource is a mutex, a recursive mutex solves this problem.
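A sketch of the mutex case, assuming C++11's std::recursive_mutex (the function names are illustrative):

```cpp
#include <mutex>

// With a plain std::mutex, api_call -> helper would lock m twice from the
// same thread and self-deadlock. std::recursive_mutex permits the second
// acquisition and releases only when the outermost guard unwinds.
std::recursive_mutex m;

void helper() {
    std::lock_guard<std::recursive_mutex> lock(m);  // same thread: OK
    // ... work that also needs the lock ...
}

void api_call() {
    std::lock_guard<std::recursive_mutex> lock(m);
    helper();  // calls back into code that locks m again
}
```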
Livelock is a variation of deadlock in which each thread releases the resource it holds in an attempt to break the deadlock, but then each manages to acquire only one of the necessary resources, cannot continue, releases it, and loops uselessly in that fashion.
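A sketch of that polite dance, assuming C++11 threads (the function names are mine):

```cpp
#include <mutex>
#include <thread>

std::mutex a, b;

// Each thread grabs one lock, gives it up when it cannot get the other
// ("to avoid deadlock"), and retries -- both can circle like this forever.
void worker(std::mutex& first, std::mutex& second) {
    for (;;) {
        first.lock();
        if (second.try_lock()) {
            // got both: do the work, then release and finish
            second.unlock();
            first.unlock();
            return;
        }
        first.unlock();             // back off...
        std::this_thread::yield();  // ...and the dance repeats
    }
}

int main() {
    std::thread t1(worker, std::ref(a), std::ref(b));
    std::thread t2(worker, std::ref(b), std::ref(a));
    t1.join();
    t2.join();
}
```

In practice this loop almost always completes quickly; genuine livelock needs sustained unlucky scheduling, which is exactly what makes it so hard to reproduce.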
Race conditions can be subtle bugs in which the result depends on timing; different results occur as the ordering of instructions running on different processors changes from run to run.
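The canonical example, assuming C++11 threads: two threads bump an unsynchronized counter, and the total varies from run to run.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared, unsynchronized

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // read-modify-write: not atomic
}

int main() {
    std::thread t1(bump), t2(bump);
    t1.join();
    t2.join();
    // Expected 200000; the actual value varies run to run as the two
    // threads' increments interleave and overwrite each other.
    std::cout << counter << '\n';
}
```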
Word-tearing can occur when a read or write to memory is not an atomic operation. For example, a floating-point double value occupies 8 bytes. Usually, those eight bytes are read or written all at once. However, if they are misaligned and cross a cache-line or page boundary, they may be written or read in pieces. One thread reading a double from memory may see a nonsense value made up of some of the bytes of an old value and some of the bytes of a new value written by a different thread.
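If your toolchain has C++11 atomics, std::atomic<double> guarantees tear-free loads and stores; a sketch:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A plain double shared between threads may tear if misaligned;
// std::atomic<double> guarantees each load and store is indivisible.
std::atomic<double> shared{0.0};

void writer() {
    for (int i = 0; i < 1000000; ++i)
        shared.store(i % 2 ? 1.0 : -1.0);
}

void reader() {
    for (int i = 0; i < 1000000; ++i) {
        double d = shared.load();
        // Never a half-written mixture of two values:
        assert(d == 0.0 || d == 1.0 || d == -1.0);
    }
}

int main() {
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}
```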
Cache-ping-pong: Locality of reference is crucial for high performance on a single processor now that CPUs run so much faster than they can access main memory. However, multiple threads, even though they may be doing independent calculations on independent, unrelated values, may still have an invisible connection if the memory those values occupy happens to live in the same cache line. In that case, the cache-coherence mechanism which enables using shared memory for communication in a multiprocessor system will constantly refresh that cache line on each core to keep them in sync, slowing down each thread enormously (even though each core uses a disjoint subset of the memory on that cache line). This may be difficult to detect, since the results will, eventually, be correct.
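A sketch of the effect, assuming C++11 threads; the two loops share no data, only a cache line:

```cpp
#include <thread>

// Two logically independent counters that are adjacent in memory and so
// almost certainly share one cache line. Each increment on one core
// invalidates the line in the other core's cache, forcing a refetch.
struct Counters { long a; long b; } c = {0, 0};

void demo() {
    std::thread t1([] { for (long i = 0; i < 100000000; ++i) ++c.a; });
    std::thread t2([] { for (long i = 0; i < 100000000; ++i) ++c.b; });
    t1.join();
    t2.join();
    // The results are correct, but this often runs several times slower
    // than executing the two loops back-to-back on a single thread.
}
```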
Too many threads can swamp the system in the overhead of managing them. One needs to consider the actual number of CPU cores (or virtual cores in the presence of hyperthreading) and experiment to determine the optimal number of threads for the software design. Typically, one submits jobs to a work queue, which are run by persistent threads watching the queue.
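A minimal sketch of that pattern, assuming C++11 threading primitives; the class and member names are my own invention:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A fixed-size pool: N persistent threads service a shared queue, so the
// thread count can be tuned to the core count instead of growing with the
// number of jobs.
class WorkQueue {
public:
    explicit WorkQueue(unsigned n = std::thread::hardware_concurrency()) {
        if (n == 0) n = 2;  // hardware_concurrency may report 0
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkQueue() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;  // drain, then exit
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```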
Priority inversion occurs when a high-priority task is delayed by a lower-priority task holding a resource the high-priority task needs; classically, a medium-priority task then preempts the low-priority holder, stalling everyone above it.
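Standard C++ has no notion of thread priority, but POSIX does, and it also offers the usual mitigation: a priority-inheritance mutex, which temporarily boosts whoever holds the lock to the priority of the highest waiter. A sketch (the helper name is mine):

```cpp
#include <pthread.h>

// Initialize *m as a priority-inheritance mutex. While a higher-priority
// thread is blocked on it, the current holder inherits that priority, so
// a medium-priority thread cannot starve the holder indefinitely.
int init_pi_mutex(pthread_mutex_t* m) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    int rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```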
False sharing occurs when the work done by multiple threads is not really independent and may run slower than it would on a single thread. Cache-ping-pong is an example of this, where the memory layout introduces a dependency between threads which may not be apparent from the data flow or the source code.
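The usual cure for the cache-line flavor is to pad the hot values onto separate lines. A sketch assuming a 64-byte line size, which is common but hardware-specific:

```cpp
// Force each counter onto its own cache line so updates by different
// threads no longer invalidate each other's caches. (alignas is C++11;
// older compilers have vendor-specific equivalents.)
struct PaddedCounters {
    alignas(64) long a;
    alignas(64) long b;
};
```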
Non-thread-safe APIs: Any function you call from a thread, whether in your own code, a library, or the operating system, must also be thread-safe. The C++ standard, for example, does not require this (as it imposes a performance cost that not all users will want to pay), so you must consult the documentation of the specific library implementation you are using.
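A classic offender is std::localtime, which returns a pointer into shared static storage, so two threads calling it concurrently can clobber each other's result. One sketch of a remedy, serializing the calls (POSIX users can reach for localtime_r instead):

```cpp
#include <ctime>
#include <mutex>

std::mutex time_mutex;

// Copy the result out of the shared static buffer while holding the lock.
std::tm safe_localtime(std::time_t t) {
    std::lock_guard<std::mutex> lock(time_mutex);
    return *std::localtime(&t);
}
```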