A Programmer's Apology: Fences

Graphing Calculator spends most of its time computing functions typed by the user. Occasionally it will draw results as calculations progress. From a high altitude, re-architecting the code to work across multiple CPU cores is straightforward: all computations will occur in preemptive threads. Occasionally, the main thread wakes up to draw a snapshot of the results either in progress or when they are all completed. As always, the devil is in the details. Graphing Calculator currently has thirty distinct graph classes, each representing a different algortihm to compute and visualize some mathematical object inferred from the form of the equation the user types.

In many of these classes, the size of the result is known in advance and allocated as an array of structures. The structure contains a .good flag initialized to false to indicate when the other data in that element has been computed. The compute thread evaluates elements in that structure (not necessarily sequentially) and sets .good to true after each element is finished. When the main thread is called on to update the screen, those elements in the array that are good are drawn. (For context, this particular drawing step, is part of a larger one which draw's all equations composing them with translucency whenever anything changes. To get transparency correct, it is necessary to redraw everything in back to front order. So it is frequently necessary to redraw elements rather than merely draw them once after computing them.)

The new problem in a multi-threaded world with processors that commonly use out-of-order exectution to increase instruction-level parallelism, is that the hardware order of memory operations no longer matches the software order, and different CPU cores may see memory locations change in different orders. A "fence instruction" or "memory barrier" guarantees completeness of all pre-fence memory operations and halts all post-fence memory operations until completion of the fence instruction. - Akhter and Roberts.

So my question now is: Do I want to try to keep most of the existing code unchanged and use a explicit fence to make it MT-safe? (And if so, what can I use to implement the memory barrier.) Or do I want to use a higher-level thread-safe data structure, which will involve (probably a lot) more re-writing.

A few points to remember:

I'm transitioning performance-critical twenty year old numeric intensive legacy code
the basic structure, (array of elements, ->good flag, drawing partial results) is repeated across several dozen distinct classes
the size of the array may be known a-priori before hand, and the entire array allocated. It can be rather large. Preallocating the entire array is far more efficient than storing separate results in pieces scattered across memory
Transitioning a large code base will be safer, less buggy, and easier to test, the closer the MT-safe constructs I use are to the current single-threaded implementation.

In conversation, here are some suggestions folks made towards making this construct MT-safe:

Use a threadsafe queue. Threadsafe queues are easy to write. Have a threadsafe set of done elements.
It will also work to just have a lock on the array and scan it when locked if that's what you really want to do. And then lock it before setting doneFlag and unlock it afterwards.
Rather than having a doneFlag, why not a pointer to the data? So it's NULL or not-NULl and if not-NULL then you know the data has been allocated, instantiated, and is ready.
I don't think there is a portable way to guarantee ordering of memory operations. However, if you just need Mac & Windows, there are probably ways of doing it using inline assembly. (ie, eieio)
If you don't want to change things and are only worrying about the serialization of the .good = 1 setting. You'll have to use something from the POSIX threads library to absolutely guarantee it. However, there are things you can do to guarantee it in practice. For example, putting code in separate compilation modules so the compiler can't optimize you. That is guaranteeing the order of writes.
You have to use POSIX threads or compiler extensions. They are your only options.
On Windows boxes, you see a lot of code using a single global mutex, which works but isn't optimal.
I think you have better data structure alternatives, but only you can assess the cost & risk of changing your legacy codebase. I would use a library-provided data structure: so it's someone else's bug to think about this.
Putting one big mutex over the entire array might actually be your best bet. Simple. Works. How many threads are you going to have competing for it? One writer thread. One reader (main thread). Well, the problem with it then would be that the writer will block while you iterate over the array. So it can't go off and continue calculation if the reader is slow.
Why don't you just have an array of pointers to objects that are done, initialized to null? Have it mutex-protected. When done, grab the mutex, write the pointer, release the mutex. I don't think the reader has to look at the mutex for this to work and be safe.
You have to get both in-order execution in the processor and the expected instruction ordering out of your compiler, don't forget. A fence instruction guarantees completeness of all pre-fence memory operations and halts all post-fence memory operations until completion of the fence instruction.This fence mechanism ensures proper memory mapping from software to hardware memory models.
Sounds like eieio. It is there to guarantee the (appearance at least) of in-order execution, as far as memory IO is concerned.
If the writer thread grabs the mutex, writes .good, releases the mutex - is it safe for the reader thread to see .good == true without bothering with the mutex?

A Programmer's Apology

Fences

TrackBack

Post a comment