
Memory barrier

The question of how to implement a memory barrier generated a lot of e-mail discussion. The short answer is:

On OS X you can use OSSynchronizeIO, declared in OSAtomic.h.
VC++ has the intrinsics _WriteBarrier, _ReadBarrier, and _ReadWriteBarrier. On x86 these merely tell the compiler not to reorder operations across them in certain ways. Later x86 chips also have the mfence, lfence, and sfence instructions, which IIRC only matter for some of the newer SSE2 memory operations.
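For reference, here's a minimal sketch of how those primitives might be wrapped behind one name (MyMemoryBarrier is an invented name, and the header locations are assumptions about the SDKs involved, not quotes from them):

#if defined(__APPLE__)
  #include <libkern/OSAtomic.h>
  /* eieio on PowerPC; see below */
  #define MyMemoryBarrier()  OSSynchronizeIO()
#elif defined(_MSC_VER)
  #include <intrin.h>
  /* compiler-only barrier: stops VC++ from reordering loads and stores
     across it, but emits no fence instruction on x86 */
  #define MyMemoryBarrier()  _ReadWriteBarrier()
#endif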
See below the cut for more gruesome details...

Unless all of your shared data lives inside the atomic variables themselves, you need a memory barrier. To see why, consider the following:

store result to gResult              -- A
atomic set kDone flag in gFlags      -- B
On PowerPC, there's no guarantee that A is seen by other processors before B.
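A sketch of that pattern with the barrier in place (the names follow the pseudo-code above; OSAtomicOr32 and the header location are my assumptions about the OS X API rather than a quote from it):

#include <libkern/OSAtomic.h>   /* assumed home of OSSynchronizeIO, OSAtomicOr32 */

enum { kDone = 1 };

volatile int32_t  gResult;
volatile uint32_t gFlags;

static void PublishResult(int32_t result)
{
    gResult = result;               /* A: store the result               */
    OSSynchronizeIO();              /* barrier: make A visible before B  */
    OSAtomicOr32(kDone, &gFlags);   /* B: atomically set the done flag   */
}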

On CW OSSynchronizeIO is implemented as:

static __inline__ void OSSynchronizeIO(void)
{
#if defined(__ppc__)
    __asm__ ("eieio");
#endif
}

So on CW it boils down to the question: CW won't re-order instructions around inline asm, right?

Right. Even though the call gets inlined, the compiler can't make assumptions about what the inline asm is doing. It could be something like a system call, with all sorts of potential side effects.
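If you want to state that constraint explicitly rather than rely on it, GCC-style compilers spell it as an empty asm with a memory clobber (an assumption about your toolchain, not something CW requires):

/* No instructions are emitted; the "memory" clobber just tells the
   compiler not to reorder or cache memory accesses across this point. */
#define CompilerBarrier()  __asm__ __volatile__("" : : : "memory")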

On CW OSSynchronizeIO is implemented as __asm__ ("eieio"). Is that insufficient in an MP system to ensure that write re-ordering won't break the use of a "good" flag?

It is sufficient, as it enforces the order of accesses to the cache for cacheable address space, and cache coherency between processors is handled by the hardware.

I was under the impression that it was not sufficient in cases where the flag is protecting shared data.

The shared data will be in the cache (or in main RAM, if it got flushed out of the cache), and the cache is coherent across processors. The memory barrier ensures that any accesses to the shared data made by the current thread and processor before the barrier complete before any accesses after it, so the shared data must be in the cache (or memory) before the flag change is.

True. However, you have to be very careful when using this optimisation (that is, using eieio instead of lwsync). Check out the following note from B.2.2.2 of "PowerPC Virtual Environment Architecture" book II...

However, for storage that is neither Write Through Required nor Caching Inhibited, eieio orders only stores and has no effect on loads. If the portion of the program preceding the eieio contains loads from the shared data structure and the stores to the shared data structure do not depend on the values returned by those loads, the store that releases the lock could be performed before those loads. If it is necessary to ensure that those loads are performed before the store that releases the lock, lwsync should be used instead of eieio. Alternatively, the technique described in Section B.2.3 can be used.
This makes eieio a "write barrier." For read or read/write, use lwsync.
Note that locking/unlocking a pthread mutex will issue the correct barriers (I checked).
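So if you're already using a mutex, you don't need to sprinkle barriers yourself; a sketch (standard pthreads, with invented names for the shared data):

#include <pthread.h>

static pthread_mutex_t  gMutex = PTHREAD_MUTEX_INITIALIZER;
static int              gShared;     /* protected by gMutex */

static void SetShared(int value)
{
    pthread_mutex_lock(&gMutex);     /* lock issues the acquire barrier   */
    gShared = value;
    pthread_mutex_unlock(&gMutex);   /* unlock issues the release barrier */
}

static int GetShared(void)
{
    int value;

    pthread_mutex_lock(&gMutex);
    value = gShared;
    pthread_mutex_unlock(&gMutex);
    return value;
}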

Consider the case where you do something like:

struct {
   int m1;
   int m2;
} gFoo;

int gLock;

static void MyFunc(void)
{
   int local1;

   gFoo.m1 = 1;
   local1 = gFoo.m2;
   eieio();
   gLock = false;

   // At this point, there's no guarantee that the load for local1 has
   // been done.  If clearing gLock allows another thread to modify
   // gFoo, the load of local1 is in a race with that other thread's
   // modifications.
}
Ain't weak consistency grand!
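A sketch of the fix suggested by the note above: use lwsync so the load into local1 completes before the store that releases the lock (the inline-asm spelling here is a GCC/CW-style assumption):

static __inline__ void LwSync(void)
{
#if defined(__ppc__)
    __asm__ __volatile__ ("lwsync" : : : "memory");
#endif
}

static void MyFuncFixed(void)
{
   int local1;

   gFoo.m1 = 1;
   local1 = gFoo.m2;
   LwSync();         // orders the load of gFoo.m2 (and the store to gFoo.m1)
                     // ahead of the store that releases the lock
   gLock = false;

   // local1 now holds a value that was read while the lock was still held.
}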

Am I correct in understanding that IA32 does no visible write re-ordering?

It does for some of the SSE2 memory operations, but not for regular ones. Appendix B of the "PowerPC Virtual Environment Architecture" book II makes for some fascinating reading.
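The SSE2 operations in question are the non-temporal ("streaming") stores; here's a sketch using the Intel intrinsics (assuming an SSE2-capable compiler and target):

#include <emmintrin.h>    /* SSE2 intrinsics: _mm_stream_si32, _mm_sfence */

static void StreamResult(int *dst, int value, volatile int *doneFlag)
{
    _mm_stream_si32(dst, value);   /* non-temporal store: weakly ordered,
                                      can drift past later ordinary stores */
    _mm_sfence();                  /* drain write-combining buffers so the
                                      streamed store is globally visible   */
    *doneFlag = 1;                 /* ordinary IA32 stores aren't visibly
                                      reordered with each other            */
}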

If you want to read more, here are pages on strict consistency, sequential consistency, and relaxed consistency.

Thank you for tuning in to today's episode of Adventures in Parallel Computing.

Comments

If you look at osfmk/ppc/commpage/spinlocks.s from the Darwin kernel, you can find these two routines:

spinlock_32_unlock_mp:
        li    r4,0
        isync       // complete prior stores before unlock
        eieio       // (using isync/eieio is faster than a sync)
        stw   r4,0(r3)
        blr
spinlock_64_unlock_mp:
        lwsync      // complete prior stores before unlock
        li    r4,0
        stw   r4,0(r3)
        blr

It seems that you have to use both isync and eieio for a write barrier, possibly due to some errata on certain ppc models. Also, IBM's documentation recommends against using eieio for this purpose, though they don't explain why (see note 6 under the table after scrolling down a bit).

Further, they use lwsync rather than isync+eieio on the G5, even for 32-bit programs (the appropriate implementation is put on the commpage at startup based on the detected CPU model), but I don't know whether that's because lwsync is faster or because isync+eieio is not always safe on a G5.
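In C terms, the two unlock sequences amount to something like the following (GCC-style inline asm is my assumption here; the real routines are the commpage assembly above):

/* 32-bit path: isync + eieio ahead of the releasing store */
static void SpinUnlock32(volatile int *lock)
{
    __asm__ __volatile__ ("isync \n\t eieio" : : : "memory");
    *lock = 0;
}

/* 64-bit (G5) path: a single lwsync ahead of the releasing store */
static void SpinUnlock64(volatile int *lock)
{
    __asm__ __volatile__ ("lwsync" : : : "memory");
    *lock = 0;
}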
