Main

September 12, 2007

Virtual math spaces

CyberMath is a shared virtual environment for exploring mathematics. I've long wanted to make Graphing Calculator into an authoring tool for such an interactive immersive space. The popularity and success of World of Warcraft hints at the possibilities in coming years.

Does anyone here use Second Life or have any knowledge of Croquet? The technology for building these spaces is maturing. I'm now wondering how to make it accessible to teachers and curriculum authors so that they can focus on the mathematical content and pedagogy while constructing mathematical landscapes.

I would welcome any advice.

August 30, 2007

Lunar Eclipse

Enjoy two rather different views of Monday night's total lunar eclipse. I watched from Chabot with a huge crowd: frenetic activity; many telescopes big and small; television news crews periodically shining floodlights on people in sleeping bags and people in line for the big telescope; the Castilleja High School astronomy class - the TV crews loved interviewing them. One Chabot astronomer in a pointy hat and cape told amusing and educational stories all night long. Quite a few shooting stars. One particularly large one, just at the start of totality after the diamond in the diamond ring faded away, was a big crowd pleaser.

Kudos to Astropixie for pointing out the images!

July 04, 2007

Cross-platform development

I clicked "Restart" after Apple Software Update applied what I thought were minor system patches, and my computer rebooted into Windows....

Continue reading "Cross-platform development" »

June 19, 2007

Adventures in Optimization: OpenGL

Over the past few years, the hardware-accelerated rendering pipeline has rapidly increased in complexity, bringing with it increasingly intricate and potentially confusing performance characteristics. Improving performance used to mean simply reducing the CPU cycles of the inner loops in your renderer; now it has become a cycle of determining bottlenecks and systematically attacking them. This loop of identification and optimization is fundamental to tuning a heterogeneous multiprocessor system; the driving idea is that a pipeline, by definition, is only as fast as its slowest stage. Thus, while premature and unfocused optimization in a single-processor system can lead to only minimal performance gains, in a multiprocessor system such optimization very often leads to zero gains. - Cem Cebenoyan, NVIDIA, in GPU Gems

Continue reading "Adventures in Optimization: OpenGL" »

June 10, 2007

Adventures in Optimization: Threading

After identifying the superficial bugs slowing GC4, I set out to measure the degree of parallelism I was able to achieve with the new multi-threaded calculation code. Are there any other calls like GetWRefCon unintentionally introducing dependencies that stall threads? Was GC able to keep multiple CPU cores fed with independent work? In addition to the traditional questions of "Where is the software spending its time?" and "What is it doing there?", I now have to ask those questions separately of multiple threads of execution on multiple CPU cores and understand the answers in context, asking "When is one thread of execution stalled waiting for results from another?" With the right tool, the answer can be obvious.

[Image: ThreadView.png]

Continue reading "Adventures in Optimization: Threading" »

Adventures in Optimization: Benchmarking

My first job at Apple in 1992 was benchmarking the quality of a dozen different handwriting recognition software libraries from different companies. We like to pretend that everything from health to cars to wars can be measured by a single number, but real systems are multi-dimensional. Benchmarks are like standards: everyone likes to make their own. I approach software measurements with my old Stanford physics lab training. I treat the software as a tangible physical system, experiment to determine what I can measure reproducibly, look for mathematical relationships between variables, and attempt to construct a useful basis set of axes upon which to characterize its phase space. This did not make me any friends while benchmarking for Apple. Each company or internal group with a solution to sell had strong ideas about comparing systems to put their own solution in the best light. Furthermore, handwriting recognition was and still is an unsolved research problem. Comparing different research projects is quite different from comparing products of a mature technology. Ultimately, none of the choices proved adequate.

For Graphing Calculator, I use performance measurements to make programming choices when comparing high-level algorithms, low-level implementation details, different compilers, and compilation options. Holding the Command key while pressing the slider Play button starts a stopwatch to time the slider animation. When I'm focusing on a particular routine, this is useful for comparing small code changes from one build to the next. The Graphing Calculator > Benchmark menu command runs through thirty-two tests timing calculation and drawing. These help me compare how GC behaves on different hardware and how its behavior changes across versions years apart. The report looks like this:
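A stopwatch of this sort is simple to sketch. The snippet below is a hypothetical illustration using std::chrono, which postdates this post; the Stopwatch class and timeIt helper are my own names, not GC's actual code.

```cpp
#include <chrono>

// A minimal stopwatch of the kind used to time one slider animation
// pass from one build to the next. std::chrono is used here purely
// for illustration; the original code predates C++11.
class Stopwatch {
    std::chrono::steady_clock::time_point start_;
public:
    Stopwatch() : start_(std::chrono::steady_clock::now()) {}
    double elapsedSeconds() const {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(now - start_).count();
    }
};

// Hypothetical harness: run a routine once and report seconds elapsed.
template <typename F>
double timeIt(F routine) {
    Stopwatch sw;
    routine();
    return sw.elapsedSeconds();
}
```

For a per-routine comparison, one would run the same timeIt call before and after a code change and compare the two numbers.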

Continue reading "Adventures in Optimization: Benchmarking" »

Adventures in Optimization: Prelude

Two weeks ago, I ran the Intel-native GC4 on an 8-core machine. It revved all 8 processing cores yet ran no faster than on a four-year-old single-core G4 laptop at one-third the clock speed. A few minutes with Shark identified the culprit: a single innocuous call to GetWRefCon. In the single-user, single-core, single-threaded Macintosh toolbox of 1984, that call would have taken a few instructions. Mac OS X must share system resources across multiple users, processor cores, applications, and threads of execution all making demands at once. GetWRefCon is labelled Not Thread Safe: it is not guaranteed to work at all from anywhere but the main thread. Rather than crash or return a bogus result, it was causing all eight otherwise independent threads of execution to serialize on that one shared resource, passing the baton from one thread to the next, unable to execute in parallel. Fixing this was easy. Although the window structure is a shared resource protected by a lock, the RefCon itself that I actually needed contained read-only information all threads could read in parallel safely without locking. No need to go through GetWRefCon at all. The parallel threads should have been passed a pointer to the read-only document record directly.
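A minimal sketch of the shape of that fix, with hypothetical names (DocumentRecord and the chunk arithmetic are illustrative, not GC's actual structures): rather than having every worker fetch the document through a serialized accessor like GetWRefCon, each thread is handed a pointer to the read-only record up front.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for the read-only document record the RefCon
// pointed at: once computation starts, no thread mutates it, so every
// worker may read it in parallel without locking.
struct DocumentRecord {
    double xmin, xmax;   // plot bounds, fixed for the whole computation
};

// Each worker receives the pointer directly instead of fetching it
// through a serialized accessor on every evaluation.
double computeChunk(const DocumentRecord* doc, int chunk, int nChunks) {
    double width = (doc->xmax - doc->xmin) / nChunks;
    return doc->xmin + chunk * width;   // left edge of this chunk
}

std::vector<double> computeAll(const DocumentRecord& doc, int nChunks) {
    std::vector<double> results(nChunks);
    std::vector<std::thread> workers;
    for (int i = 0; i < nChunks; ++i)
        workers.emplace_back([&, i] {
            results[i] = computeChunk(&doc, i, nChunks);
        });
    for (auto& t : workers) t.join();
    return results;
}
```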

After fixing that, some tests ran twenty times faster. Other tests however, still ran at the same speed as on the older, slower, single-core laptop. I devoted the last two weeks to figuring out why.

[Image: Dock.png]

Continue reading "Adventures in Optimization: Prelude" »

May 31, 2007

glBufferSubData

Yesterday was spent debugging until two in the morning, trying to diagnose a drawing artifact which occurred only on Intel-native builds of GC4 when animating colors on a static 3D model, defined by an implicit surface equation and drawn using OpenGL vertex buffer objects. The image above started out, correctly, as a solid red sphere. When animating the colors using the GC slider, the top of the sphere did not change, remaining red. The middle of the sphere animated correctly. The bottom of the sphere animated, but RGB color values (a,b,c) would display shifted as (b,c,a).

Continue reading "glBufferSubData" »

May 28, 2007

GetWRefCon

Graphing Calculator's calculations are for the most part embarrassingly parallelizable. I spent several months last year expressing that parallelism in code to take advantage of multi-core systems. Yesterday I compared an Intel-native build of GC4 on an 8-core system side by side with GC3.5 running under Rosetta. As expected, GC4 pegged the meter on all 8 cores:

[Image: Activity.png]

Imagine my surprise upon discovering that GC4, running native and parallel, was no faster than GC3.5 running emulated on a single core. Shark long ago earned a permanent spot in my dock as one of the best development visualization tools. It quickly informed me that 85% of the time was spent in GetWRefCon, which calls GetWindowData, which calls HIObject::IsRefValid and HLTBSearchRefTable, which was serializing the parallel threads of execution. I had mistakenly thought of GetWRefCon as "free", misled by my 1980s-era Macintosh training, when it was no more than a simple wrapper to dereference a WindowRecord structure field. Just another reminder that the only way to have any idea where your program is spending its time is to actually measure it. Stay tuned for a benchmark report after I fix this.

December 06, 2006

The Problem with Programming

Tech Review asks Bjarne Stroustrup, "Why is most software so bad?"

I think the real problem is that "we" (that is, we software developers) are in a permanent state of emergency, grasping at straws to get our work done. We perform many minor miracles through trial and error, excessive use of brute force, and lots and lots of testing, but--so often--it's not enough.

Software developers have become adept at the difficult art of building reasonably reliable systems out of unreliable parts. The snag is that often we do not know exactly how we did it: a system just "sort of evolved" into something minimally acceptable.

Read the rest of the interview to discover Kierkegaard's influence on the design of C++.

November 14, 2006

Intel launches quad-core Kentsfield CPU

Intel takes Quad Core to the desktop

Intel Corp has officially launched its first desktop quad-core processor, codenamed Kentsfield.... a 2.66GHz chip with a 1066MHz front-side bus.

November 08, 2006

Efficient thread safe static initialization

Dave MacLachlan, a Mac programmer at Google, discusses thread-safe initialization of statics here and here on the Official Google Mac blog, offering an interesting peek under the hood of Objective-C and avoiding the pitfalls of the dreaded DCLP.

Continue reading "Efficient thread safe static initialization" »

November 07, 2006

Discontinuous functions and C++0x

On the Graphing Calculator Users List, David Craig pointed out an interesting bug.

"A formula for a conic section in polar coordinates is r=A/(1+Ecos&theta) where A and E are constants. E is the eccentricity. When E is increased past one so that the section is hyperbolic, GC 'crosses' the hyperbola. This is distracting."

[Image: hyperbola_eq.png]

[Image: hyperbola.png]

The bug occurs because GC samples the function at discrete points and connects them with line segments. Where the denominator goes to zero, the function is not continuous, but GC does not notice and blithely connects points sampled on separate branches of the hyperbola.
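One plausible heuristic for such a bug, sketched below under my own assumptions (this is not GC's actual fix): sample the conic and start a new polyline whenever the denominator changes sign between adjacent samples, so points on separate branches are never connected.

```cpp
#include <cmath>
#include <vector>

// Sample r = A / (1 + E*cos(theta)) over one full revolution and split
// the curve into separate polylines wherever the denominator changes
// sign between adjacent samples -- i.e., wherever a pole was crossed.
std::vector<std::vector<double>> sampleConic(double A, double E, int n) {
    std::vector<std::vector<double>> branches(1);
    const double pi = 3.14159265358979323846;
    double prevDenom = 1.0 + E * std::cos(0.0);
    for (int i = 0; i <= n; ++i) {
        double theta = 2.0 * pi * i / n;
        double denom = 1.0 + E * std::cos(theta);
        if (denom * prevDenom < 0.0)     // sign change: discontinuity
            branches.emplace_back();     // start a new branch
        if (denom != 0.0)
            branches.back().push_back(A / denom);
        prevDenom = denom;
    }
    return branches;
}
```

An ellipse (E < 1) yields a single branch; a hyperbola (E > 1) crosses the pole twice per revolution and yields three sampled segments, which the renderer would draw as disconnected polylines.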

Continue reading "Discontinuous functions and C++0x" »

October 28, 2006

Software and the Concurrency Revolution

Herb Sutter gave a talk (wmv video, pdf slides) here in Berkeley at Intel Research last month. Related publications:

October 17, 2006

C++0x

Colleagues are now meeting in Portland to discuss the future of the C++ standard. Later this week also in Portland, Library-Centric Software Design '06 at OOPSLA '06. A number of C++0x proposals address the problem of threading, allowing the programmer to specify atomicity and ordering constraints within the language and providing high-level functionality for managing synchronization. In particular:

Continue reading "C++0x" »

October 14, 2006

Dependencies

[Image: graph]

Consider a dipole field as a gravitational wave passes, distorting space. On the left are positive and negative charges drawn as spheres, a vector field drawn as arrows, and field lines drawn as tubes. The right shows the same objects under the coordinate transformation of a compression wave travelling along the x-axis.

Here is an animated movie.

Continue reading "Dependencies" »

October 11, 2006

Cooperative Multitasking

I've been deep in the bit mines all week, wired on Mountain Dew, reviewing the 35,000 lines of code which implement the actual graphing classes of Graphing Calculator, making the Compute method of each different graphing algorithm thread-safe and able to run preemptively. It's been fun to revisit a decade worth of code evolution and simplify it with the benefit of hindsight. Much of the complication was due to the necessity of preserving state across multiple calls to the Compute method of each class so that it could be time-sliced to keep the UI responsive in a single-threaded environment.
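The cooperative pattern being retired looked roughly like this sketch (the class name and the stand-in computation are mine, not GC's code): Compute saves its progress in member variables so the main thread can call it repeatedly in small time slices, keeping the UI responsive without preemptive threads.

```cpp
// A hypothetical resumable computation: progress lives in member
// variables so each call does a bounded amount of work, then yields
// control back to the single-threaded event loop.
class GraphJob {
    int next_ = 0;           // saved position, preserved across calls
    long long sum_ = 0;      // accumulated result
    int total_;
public:
    explicit GraphJob(int total) : total_(total) {}

    // Do at most `budget` units of work, then return to the caller.
    // Returns true once the whole computation is finished.
    bool compute(int budget) {
        int stop = next_ + budget;
        if (stop > total_) stop = total_;
        for (; next_ < stop; ++next_)
            sum_ += next_;   // stand-in for real per-sample work
        return next_ == total_;
    }

    long long result() const { return sum_; }
};
```

It is exactly this bookkeeping, threaded through every graph class, that a preemptive Compute running on its own thread makes unnecessary.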

As I sat down to describe this, I remembered that I had already written what I wanted to say four years ago, to a colleague who asked: "I've always wanted to know how to do threading. Under OS X that will become more and more valuable as multi-processor machines become increasingly common. Where is there a good discussion of using threads?"

At 3:05 PM -0800 10/25/02, Ron Avitzur wrote:

Continue reading "Cooperative Multitasking" »

October 06, 2006

Embarrassingly parallelizable

After a month of reading about threading and a week of coding infrastructure for parallel execution, yesterday I parallelized the Graphing Calculator code for 2D inequalities. This morning I tried it on a dual-core machine while visiting a colleague and was gratified to see visual confirmation of the parallelism as updates from a slow computation marched across the screen in two parallel bands.

Graphing Calculator spends most of its time doing calculations affectionately referred to as embarrassingly parallel. Evaluating 2D inequalities, for example, might be thought of in terms of evaluating a function separately at every point on the screen. Divide the screen geometry into pieces, one for each thread, and you're done. It's not quite that simple, though.
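A sketch of that decomposition under my own assumptions (the inequality and all names here are illustrative, not GC's code): split the rows of the pixel grid into one horizontal band per thread, so each worker writes only its own rows and the bands fill in independently.

```cpp
#include <thread>
#include <vector>

// Evaluate a stand-in inequality (inside a centered disc?) over a
// pixel grid, dividing the rows into one horizontal band per thread.
// Threads never touch the same rows, so no locking is needed.
std::vector<char> evalInequality(int width, int height, int nThreads) {
    std::vector<char> pixels(width * height, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        int y0 = height * t / nThreads;         // band start (inclusive)
        int y1 = height * (t + 1) / nThreads;   // band end (exclusive)
        workers.emplace_back([&, y0, y1] {
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x)
                    pixels[y * width + x] =
                        (x - width / 2) * (x - width / 2) +
                        (y - height / 2) * (y - height / 2) <
                        (width / 4) * (width / 4);
        });
    }
    for (auto& w : workers) w.join();
    return pixels;
}
```

With two threads, a slow computation fills in as exactly the two parallel bands described above.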

Continue reading "Embarrassingly parallelizable" »

October 04, 2006

Double Checked Locking Pattern

Back in the day, studying physics, the topics of our lab classes and problem sets roughly followed the history of physics from Galileo to Newton to Einstein, from Faraday to Maxwell, from classical to quantum mechanics. I remember being quite excited when we started redoing experiments from the twentieth century.

While programming doesn't have the same depth of history, nearly everything I do day to day is straightforward, using techniques fully understood decades ago. How to implement parallelism in C++, however, is still actively studied. C++ inherits from C the notion of an abstract machine. C++ compilers are allowed by the standard to generate any code which produces the same visible effects as that of the abstract C++ machine. Compilers go to enormous effort to optimize their code generation and will reorder instructions and eliminate code in ways that make the code run faster without changing the results. The compilers, however (following the standard), consider visible effects only from the perspective of a single thread of execution. (This is necessary: were compilers forced to consider multiple threads of execution, too many powerful optimizations would be disallowed.) The problem with C++ in this context is that the C++ abstract machine considers only single-threaded execution, and the language itself has no way to express the synchronization constraints needed for parallel programming.
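For reference, here is how the pattern looks written with the atomics that C++0x eventually standardized as C++11, so this spelling postdates the post: acquire/release ordering on the published pointer is what keeps the compiler and CPU from reordering the object's construction past its publication, the reordering that breaks naive DCLP.

```cpp
#include <atomic>
#include <mutex>

// Double-checked locking with explicit atomics. With a plain pointer,
// the store of the pointer may become visible before the stores that
// construct the object; acquire/release ordering forbids that.
struct Config { int value; };

std::atomic<Config*> gConfig{nullptr};
std::mutex gConfigMutex;

Config* getConfig() {
    Config* p = gConfig.load(std::memory_order_acquire);  // fast path
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(gConfigMutex);
        p = gConfig.load(std::memory_order_relaxed);      // re-check
        if (p == nullptr) {
            p = new Config{7};                            // construct
            gConfig.store(p, std::memory_order_release);  // then publish
        }
    }
    return p;
}
```

The Config type and value here are placeholders; the point is the ordering of construction before publication.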

Continue reading "Double Checked Locking Pattern" »

September 27, 2006

Software Archaeology

Pham Nuwen spent years learning to program/explore. Programming went back to the beginning of time. It was a little like the midden out back of his father's castle. Where the creek had worn that away, ten meters down, there were the crumpled hulks of machines -- flying machines, the peasants said -- from the great days of Canberra's original colonial era. But the castle midden was clean and fresh compared to what lay within the Reprise's local net. There were programs that had been written five thousand years ago, before Humankind ever left Earth. The wonder of it -- the horror of it, Sura said -- was that unlike the useless wrecks of Canberra's past, these programs still worked! And via a million million circuitous threads of inheritance, many of the oldest programs still ran in the bowels of the Qeng Ho system. Take the Traders' method of timekeeping. The frame corrections were incredibly complex -- and down at the very bottom of it was a little program that ran a counter. Second by second, the Qeng Ho counted from the instant that a human had first set foot on Old Earth's moon. But if you looked at it still more closely ... the starting instant was actually some hundred million seconds later, the 0-second of one of Humankind's first computer operating systems. - A Deepness in the Sky, Vernor Vinge

They tend to make their changes as small surgical changes that affect the component as little as possible. And the more these surgical changes add up, the more fragile the code gets. As these changes build up, the code in the component starts to resemble a piece of conglomerate rock. The more people add onto the base, the more blobs of other rocks get added to the conglomerate. When I work on the code like this, it sometimes feels like I’m an archaeologist digging through the layers of an ancient city – if you pay really close attention, you can see which developers wrote which portions of the code. - Larry Osterman

At first glance, business software developers have little in common with Indiana Jones. But the emerging field of software archaeology applies some of the same skills, if not the dashing adventure. - Simon Sharwood

"Software archaeology" conjures up images of people poring over documents in a forgotten language, looking at broken artifacts whose purpose is unknown. - Ralph Johnson

It's been a long couple of days deep in the bit mines.

September 26, 2006

Intel pledges 80 cores in five years

Intel pledges 80 cores in five years

Intel has built a prototype of a processor with 80 cores that can perform a trillion floating-point operations per second.

Exciting times ahead.

September 25, 2006

Memory barrier

The question on implementing a memory barrier generated a lot of e-mail discussion. The short answer is:

- On OS X you can use OSSynchronizeIO in OSAtomic.h.
- VC++ has compiler intrinsics: _WriteBarrier, _ReadBarrier, _ReadWriteBarrier. For x86, these merely instruct the compiler not to reorder operations past them in certain ways. Later x86 chips have the mfence, lfence, and sfence instructions, which IIRC apply only to some of the newer SSE2 memory operations.

See below the cut for more gruesome details...

Continue reading "Memory barrier" »

September 24, 2006

Failure modes

After several weeks of reading books, source code, and API and system documentation, I'm ready to start coding. But first, a break to blog: an attempt to enumerate all the brand-new kinds of failure modes that appear when programs run in parallel. Most of these occur when time-ordering assumptions are broken by parallel threads of code in contention for some shared resource, often objects stored in shared memory accessed from multiple threads.

Continue reading "Failures modes" »

September 20, 2006

Fences

Graphing Calculator spends most of its time computing functions typed by the user. Occasionally it will draw results as calculations progress. From a high altitude, re-architecting the code to work across multiple CPU cores is straightforward: all computations will occur in preemptive threads. Occasionally, the main thread wakes up to draw a snapshot of the results, either in progress or when they are all completed. As always, the devil is in the details. Graphing Calculator currently has thirty distinct graph classes, each representing a different algorithm to compute and visualize some mathematical object inferred from the form of the equation the user types.

Continue reading "Fences" »

September 18, 2006

Multi-core Programming

Software developers can no longer rely on increasing clock speeds alone to speed up single-threaded applications... developers must learn how to properly design their applications to run in a threaded environment. - Multi-Core Programming: Increasing Performance through Software Multi-threading. Shameem Akhter and Jason Roberts

If, like me, you're programming on a twenty-year-old single-threaded performance-critical legacy code base, this book is for you. It assumes only programming competence, begins with the motivation for multi-threading and the context and history of the hardware and the software, then methodically works from the high-level concepts of concurrency and parallelism down to the primitive building blocks and machine implementation.

Continue reading "Multi-core Programming" »

September 14, 2006

Hardware

On the occasion of the 50th anniversary of the hard disk drive I've been pondering the evolution of the hardware Graphing Calculator grew up on.

The original hard disk drive (IBM 305 RAMAC) in 1956 stored only 5MB of data on fifty 24-inch disks and weighed over 250 kg (over 550 pounds). About the size of two large refrigerators and as tall as a man, you could lease the whole unit for about $250,000/year in today's dollars - John Cole

Continue reading "Hardware" »

September 13, 2006

Multitasking

I don't multitask well. I am most productive when I have twelve to eighteen hours completely uninterrupted in silence to focus on a problem. Even minor distractions can destroy my train of thought. When I am able to stay "in the zone" for long periods I can be very productive. Many folks have written about this phenomenon. It explains some of the reclusive or anti-social nature of hackers and their nocturnal habits. Joel remarks that it is surprising that any programming can be accomplished in the cube farms typical of software companies.

Computers, on the other hand, have no trouble multitasking. They switch tasks many times a second. Swap out the registers, stack, and other CPU state, and a processor will continue merrily along on a different problem with no complaints. We programmers have the difficult job of architecting our software to use this multitasking to improve our users' experience of our software.

Continue reading "Multitasking" »