Adventures in Optimization: Benchmarking
My first job at Apple in 1992 was benchmarking the quality of a dozen different handwriting recognition software libraries from different companies. We like to pretend that everything from health to cars to wars can be measured by a single number, but real systems are multi-dimensional. Benchmarks are like standards: everyone likes to make their own. I approach software measurement with my old Stanford physics lab training: I treat the software as a tangible physical system, experiment to determine what I can measure reproducibly, look for mathematical relationships between variables, and attempt to construct a useful basis set of axes on which to characterize its phase space. This did not make me any friends while benchmarking for Apple. Each company or internal group with a solution to sell had strong ideas about how to compare systems so as to put its own solution in the best light. Furthermore, handwriting recognition was, and still is, an unsolved research problem, and comparing competing research projects is quite different from comparing products of a mature technology. Ultimately, none of the choices proved adequate.
For Graphing Calculator, I use performance measurements to make programming choices when comparing high-level algorithms, low-level implementation details, different compilers, and compilation options. Holding the Command key while pressing the slider Play button starts a stopwatch that times the slider animation. When I'm focusing on a particular routine, this is useful for comparing small code changes from one build to the next. The Graphing Calculator > Benchmark menu command runs through thirty-two tests timing calculation and drawing. These help me compare how GC behaves on different hardware and how its behavior changes across versions years apart. The report looks like this:
Graphing Calculator 4.0d6 benchmark test at Wed Jun 6 11:50:33 2007
Graph pane is 500 x 500 pixels.
Timer resolution: 6.5703e-05 seconds.

Point                            200 steps    7.63 seconds
Arrow                            200 steps    7.60 seconds
Function                         200 steps    9.36 seconds
Complex-valued function          200 steps   13.10 seconds
Parametric curve                 200 steps   13.47 seconds
Complex-valued parametric curve  200 steps   13.74 seconds
Contour plot                     200 steps   14.69 seconds
Inequality                        20 steps    1.84 seconds
Complex-valued inequality         20 steps    2.18 seconds
Differential equation             20 steps    3.73 seconds
Color plot                        20 steps    2.49 seconds
Density plot                      20 steps    2.19 seconds
Complex-valued density plot       20 steps    2.54 seconds
Coordinate transformation         20 steps    4.57 seconds
Inverse coordinate transform      20 steps    5.36 seconds
Point                            200 steps    5.5  seconds   97.4 frames/s
Vector                           200 steps    6.4  seconds   98.9 frames/s
Parametric curve                 200 steps    5.97 seconds   99.3 frames/s
Surface                          200 steps    6.11 seconds   99.1 frames/s   1206640 triangles/s
Surface (checkerboard)           200 steps    6.47 seconds   97.2 frames/s   1182449 triangles/s
Surface (texture)                200 steps    6.34 seconds   99.3 frames/s   1207881 triangles/s
Surface (lo-res)                 200 steps    5.54 seconds   98.8 frames/s      3359 triangles/s
Surface (lo-res, checkerboard)   200 steps    5.12 seconds   99.5 frames/s      3382 triangles/s
Color animation                  200 steps    6.38 seconds   99.2 frames/s   4222201 triangles/s
Concentric spheres               200 steps   51.85 seconds    2.7 frames/s   1974451 triangles/s
Landscape                         50 steps    1.74 seconds   99.2 frames/s   1206798 triangles/s
Implicit                          50 steps    2.64 seconds   98.7 frames/s   5018935 triangles/s
Surface (transparent)             50 steps    3.17 seconds   33.5 frames/s    407963 triangles/s
Surface (transparent, lo-res)     50 steps    1.53 seconds   30.1 frames/s      1021 triangles/s
Landscape (transparent)           50 steps    3.21 seconds   30.1 frames/s    366715 triangles/s
Implicit (transparent)            50 steps    5.62 seconds   19.8 frames/s   1008924 triangles/s
Concentric transparent spheres     1 step     0.45 seconds    9.9 frames/s   1191667 triangles/s
This data from last week holds two important clues to why some tests were no faster on a 3 GHz eight-core machine than on a 1 GHz single-core machine. Some of the tests using OpenGL rendering hovered just under 100 frames per second on both machines. Some of the tests using software 3D rendering hovered at around 30 frames per second. I'm embarrassed to admit I stared at these clues for several days and followed many false leads before their significance became obvious. Three different throttling mechanisms are at work here.
Many of these tests use the Graphing Calculator animation slider to run through multiple frames. To keep the slider's behavior roughly consistent, the slider animation by default steps at most 30 values per second, so that its rate does not vary too wildly with the complexity of the equation the user types. (The user can disable this throttle by pressing Command-Option-T.) When I wrote the benchmarking code on older, slower computers, the benchmark examples mostly ran under the 30-steps-per-second limit, so the throttle never came into play. Predictably, computers are much faster today. This was a simple oversight, with an equally simple fix: the slider throttle is now disabled while the benchmark is running.
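The slider throttle amounts to a small rate limiter. Here is a minimal sketch of that idea in Python; the class name and interface are hypothetical illustrations, not GC's actual code. The caller passes in the current time, which also makes the logic easy to test deterministically.

```python
class SliderThrottle:
    """Caps slider animation steps at a maximum rate. The throttle can be
    disabled (as with Command-Option-T, or while benchmarking).
    Hypothetical sketch, not Graphing Calculator's actual implementation."""

    def __init__(self, max_steps_per_second=30.0, enabled=True):
        self.min_interval = 1.0 / max_steps_per_second
        self.enabled = enabled
        self.last_step = float("-inf")

    def should_step(self, now):
        """Return True if the animation may advance at time `now` (seconds)."""
        if not self.enabled:
            return True           # benchmark mode: never hold a frame back
        if now - self.last_step >= self.min_interval:
            self.last_step = now
            return True
        return False              # too soon since the last step; skip
```

With the throttle enabled, calls closer together than 1/30 of a second are refused; with it disabled, every call is allowed through, which is what the benchmark needs.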
Graphing Calculator 1.0 of 1993 used cooperative multitasking. A single thread of execution processed all user events, all drawing, and all computation. To keep the system responsive, computational tasks were done a little bit at a time during what were then known as "idle events," when no user keyboard or mouse events required attention. Though all of the heavy-lifting calculations now run in separate, independent, preemptive threads of their own, a little of that legacy "idle event" architecture remains: GC4 uses a periodic system timer to call what was once the GC idle event handler every 0.5, 0.01, or 0.001 seconds to see whether it is time to advance the animation slider or blink the cursor. How and when it decides on the frequency of that one timer is complicated, and was buggy. That introduced the 100-frame-per-second upper limit in the benchmark.
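The connection between the timer period and the frame-rate ceiling can be made concrete with a toy simulation, a hypothetical sketch assuming the timer advances the animation at most once per firing: with a 0.01-second period, no more than about 100 steps can happen per second, no matter how fast each frame renders.

```python
def simulated_steps_per_second(poll_interval, work_time=0.0):
    """Simulate one second of a periodic 'idle' timer that advances the
    animation at most once per firing. Illustrates why a 0.01 s timer
    caps animation at ~100 steps/s; `work_time` models per-frame render
    cost, which can only lower the rate further."""
    now, steps = 0.0, 0
    while now < 1.0:
        steps += 1                       # timer fires: advance one step
        now += poll_interval + work_time  # wait for the next firing
    return steps
```

A 0.01-second poll yields at most 100 steps per second, exactly the ceiling the benchmark numbers hover just under.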
The architecture behind how GC uses that one timer is wrong on many levels. Separate tasks should have separate timers. There is no need to conflate blinking the cursor with polling threaded computations. When there is a highlighted selection rather than a blinking cursor, there is no need for that periodic task at all. Polling is a sign of poor design. Polling too frequently wastes CPU cycles; polling infrequently introduces latency. Either way, it is cleaner to have asynchronous jobs send messages when they finish, awakening whatever process is awaiting their results.
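The "send a message when finished" design can be sketched with a worker thread signaling an event, so the waiter wakes immediately with neither wasted polls nor polling latency. This is a minimal hypothetical illustration in Python, not GC's code; the function name and structure are assumptions for the example.

```python
import threading

def run_job_with_notification(compute):
    """Run `compute` on a worker thread and block until it signals
    completion. Contrast with a timer that polls a 'done' flag:
    the Event wakes the waiter the moment the result is ready."""
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = compute()
        done.set()          # notify: no polling, no latency

    threading.Thread(target=worker).start()
    done.wait()             # sleeps until the worker signals
    return result["value"]
```

In a real application the waiter would be an event loop receiving the message rather than a blocked thread, but the principle is the same: completion is pushed to the consumer instead of being polled for.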
Lastly, it never benefits the user when software attempts to update the display faster than the physical display itself can show those updates. A 100 Hz display can physically show no more than 100 different images per second. This limit arises from the physical mechanism of an electron beam scanning left to right, top to bottom, onto the phosphors of a cathode ray tube. Though LCDs do not use this mechanism, Mac OS X treats their electronics as a 60 Hz display. Mac OS X implements coalesced updates in the Quartz framework: it blocks any application that tries to update more frequently than sixty times per second, saving CPU cycles on drawing operations the user could not possibly see anyway. By trying to redraw its window too frequently, Graphing Calculator was merely stalling, waiting for beam synchronization. Normally GC throttles its own drawing, but the benchmark disables GC's internal redraw throttle for measurement purposes. I have not yet fully analyzed how coalesced-update stalls affect the GC benchmark. Disabling Beam Sync in Quartz Debug speeds up the benchmark measurably, but I need to determine where the stalls are occurring.
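An application can avoid stalling inside the window server by pacing its own redraws to the display refresh rate, sleeping between frames instead of submitting updates the display cannot show. This is a hypothetical sketch of such frame pacing; `draw_frame` and the function shape are assumptions for illustration, not GC's actual rendering loop.

```python
import time

def animate(draw_frame, frames, refresh_hz=60.0):
    """Draw `frames` frames, pacing submissions to the display refresh
    rate. Updates faster than `refresh_hz` would be coalesced by the
    window server anyway, so we sleep rather than stall in a blocked
    drawing call."""
    frame_time = 1.0 / refresh_hz
    next_deadline = time.monotonic()
    for i in range(frames):
        draw_frame(i)                    # submit this frame's drawing
        next_deadline += frame_time
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)            # idle until the display can show more
```

The benchmark complicates this: for measurement it wants drawing unthrottled, yet the system's beam synchronization reimposes a ceiling from outside the application, which is why the OpenGL tests hover just under the refresh-driven limit.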
Take-home lessons: I frequently have no idea where my own software is spending its time or why. The current set of benchmark tests is far too easy; I need to redesign them to be taxing on newer machines, or the measurements will be nearly meaningless. The Mac OS X 10.4 of 2007 is an utterly different beast from the 128K Macs of my youth.