<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>A Programmer&apos;s Apology</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/" />
    <link rel="self" type="application/atom+xml" href="http://avitzur.hax.com/atom.xml" />
   <id>tag:avitzur.hax.com,2008://3</id>
    <link rel="service.post" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3" title="A Programmer's Apology" />
    <updated>2008-03-07T21:32:38Z</updated>
    
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type 3.2</generator>
 
<entry>
    <title>Art Project #11</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2008/03/art_project_11_1.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=206" title="Art Project #11" />
    <id>tag:avitzur.hax.com,2008://3.206</id>
    
    <published>2008-03-07T21:24:33Z</published>
    <updated>2008-03-07T21:32:38Z</updated>
    
    <summary> (Click the image to see the equations.)...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="Gallery" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p><center><a href="http://www.PacificT.com/Examples/CarFinal/"><img src="http://www.PacificT.com/Examples/CarFinal/graph.png" WIDTH=50% HEIGHT=50%></a></center></p>
<p>(Click the image to see the equations.)</p>]]>
        
    </content>
</entry>
<entry>
    <title>Virtual math spaces</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/09/virtual_math_spaces.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=142" title="Virtual math spaces" />
    <id>tag:avitzur.hax.com,2007://3.142</id>
    
    <published>2007-09-13T02:51:42Z</published>
    <updated>2007-09-13T03:00:37Z</updated>
    
    <summary> CyberMath is a shared virtual environment for exploring mathematics. I&apos;ve long wanted to make Graphing Calculator into an authoring tool for such an interactive immersive space. The popularity and success of World of Warcraft hints at the possibilities in...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<a href="http://www.nada.kth.se/~gustavt/cybermath/"><img src="http://www.nada.kth.se/~gustavt/cybermath/dh4s.jpg"></a>
<p><a href="http://www.nada.kth.se/~gustavt/cybermath/">CyberMath</a> is a  shared virtual environment for exploring mathematics. I've long wanted to make Graphing Calculator into an authoring tool for such an interactive immersive space. The popularity and success of World of Warcraft hints at the possibilities in coming years.</p>

<p>Does anyone here use <a href="http://secondlife.com/">Second Life</a> or have any knowledge of <a href="http://www.croquetconsortium.org/index.php/Main_Page">Croquet</a>? The technology for building these spaces is maturing. I'm now wondering how to make it accessible to teachers and curriculum authors so that they can focus on the mathematical content and pedagogy while constructing mathematical landscapes.</p>

<p>I would welcome any advice.</p>]]>
        
    </content>
</entry>
<entry>
    <title>Lunar Eclipse</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/08/lunar_eclipse.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=141" title="Lunar Eclipse" />
    <id>tag:avitzur.hax.com,2007://3.141</id>
    
    <published>2007-08-30T21:48:57Z</published>
    <updated>2007-08-30T21:57:04Z</updated>
    
    <summary>Enjoy two rather different views of Monday night&apos;s total lunar eclipse. I watched from Chabot with a huge crowd, frenetic activity, many telescopes big and small, television news crews shining floodlights periodically on people in sleeping bags and people in...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>Enjoy two rather different views of Monday night's total lunar eclipse. I watched from <a href="http://chabotspace.org/">Chabot</a> with a huge crowd, frenetic activity, many telescopes big and small, television news crews shining floodlights periodically on people in sleeping bags and people in line for the big telescope, and the Castilleja High School astronomy class, whom the TV crews loved interviewing. One Chabot astronomer in a pointy hat and cape told amusing and educational stories all night long. Quite a few shooting stars. One particularly large one, just at the start of totality after the diamond in the diamond ring faded away, was a big crowd pleaser.</p>
<p>
<a href="http://apod.nasa.gov/apod/image/0708/EclipsedMoonPugh.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://apod.nasa.gov/apod/image/0708/EclipsedMoonPugh.jpg" alt="" border="0"></a><a href="http://apod.nasa.gov/apod/image/0703/tsemoon_Gartstein_720cropped.jpg"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px;" src="http://apod.nasa.gov/apod/image/0703/tsemoon_Gartstein_720cropped.jpg" alt="" border="0">
</a></p><p>Kudos to <a href="http://amandabauer.blogspot.com/2007/08/dark-lunar-eclipse.html">Astropixie</a> for pointing out the images!</p>]]>
        
    </content>
</entry>
<entry>
    <title>Cubes</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/08/cubes.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=140" title="Cubes" />
    <id>tag:avitzur.hax.com,2007://3.140</id>
    
    <published>2007-08-20T19:48:48Z</published>
    <updated>2007-08-20T20:04:55Z</updated>
    
    <summary>Nico Bakker sent in this document saying Here is another example of the beauty of Graphing Calculator. Thank you, Nico! Click the image to see the equations. Click here for the movie....</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="Gallery" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>Nico Bakker sent in this <a href="http://www.PacificT.com/Examples/Cubes/">document</a>  saying <i>Here is another example of the beauty of Graphing Calculator.</i></p> 

<p>Thank you, Nico!</p>

<p><center><a href="http://www.PacificT.com/Examples/Cubes/"><img src="http://www.PacificT.com/Examples/Cubes/graph.png" height="331" width="414"></a></center></p>
<p>Click the image to see the equations.</p><p><a href="http://www.pacifict.com/Examples/Cubes.mov">Click here for the movie.</a></p>]]>
        
    </content>
</entry>
<entry>
    <title>&quot;Math is hard&quot;</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/07/math_is_hard.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=139" title="&quot;Math is hard&quot;" />
    <id>tag:avitzur.hax.com,2007://3.139</id>
    
    <published>2007-07-25T19:16:20Z</published>
    <updated>2007-07-25T19:40:40Z</updated>
    
    <summary>Mattel got a lot of flak for its talking Barbie doll which said &quot;Math is hard.&quot; I wanted to reprogram the voice chips to say &quot;Partial differential equations with Neumann boundary conditions are hard.&quot; It&apos;s not that I completely disagreed...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="books" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>Mattel got a lot of flak for its talking Barbie doll which said "Math is hard." I wanted to reprogram the voice chips to say "Partial differential equations with Neumann boundary conditions are hard." It's not that I completely disagreed with Mattel; I just thought Barbie should have been more specific. Imagine the conversations: "Mommy, what's a Neumann boundary condition?" "Well you see dear, that's when you fix the value of the derivative on the boundary curve." But then, I've been working on nerd propaganda for decades.</p>
]]>
        <![CDATA[<p>In a similar vein, <a href="http://www.amazon.com/gp/cdp/member-reviews/AH88WGWK9PMDL?ie=UTF8&display=public&sort_by=MostRecentReview&page=5">this reviewer</a> on Amazon assumes the intelligence of the reader: <blockquote>Anyone who's been around children (or been a child themselves) knows about the "why?" game. It starts out with something like this: "Daddy (or Mommy), why is the sky blue?" So you explain about Rayleigh scattering and the fact that molecules in the atmosphere scatter photons with an efficiency that's inversely proportional to the fourth power of the wavelength. You are hardly finished when the next question shoots across your bow: "Daddy (or Mommy) why is there an atmosphere?" So you dutifully explain planetary evolution, the expulsion of vast quantities of carbon dioxide that facilitated the evolution of life forms that exploit photosynthesis, producing oxygen, etc. Then the third question comes "Daddy (or Mommy) why do planets form?" You follow this question with a short lecture on the planetary nebular hypothesis. But the questions don't stop; they just keep coming and coming and coming.</blockquote>
I want to live in that world: where one just presumes that, of course when any child asks about the blue sky, their parent will dutifully explain Rayleigh scattering and planetary evolution!</p>
<p>Props to <a href="http://scienceblogs.com/aetiology/">Tara C. Smith</a> for reviewing Danica McKellar's new book, <a href="http://scienceblogs.com/aetiology/2007/07/danica_mckelllars_math_doesnt.php">"Math Doesn't Suck"</a> and <a href="http://scienceblogs.com/aetiology/2007/07/interview_with_math_whiz_autho.php">interviewing</a> the author. <blockquote>I'd like to show girls that math is accessible and relevant, and even a little glamorous! This society constantly bombards us with damaging social messages telling young girls that math and science aren't for them. I want to show them that yes, math is for them, and my goal was to write an entertaining book that presents math in a fun teen-magazine style, to keep this subject in as non-intimidating and non-stuffy an environment as possible.<br>

I want to see girls embrace math who never thought they could, and for them to understand the importance of developing a strong mind. Math is a fabulous mind strengthener - it's like going to the gym, for your brain! Most of all, I'm hoping to help girls strengthen their fortitude and feelings of self-esteem through finding the courage to tackle the often-challenging subject of mathematics. I want them to feel empowered; if they can do math, they can do anything!</blockquote>

While on the theme, check out an old Rebecca Eisenberg column, <a href="http://www.pacifict.com/GirlsNeedMath.html">Girls Need Math</a>. <blockquote>That's why it comes back to math. Math has no bias. It doesn't come from TV. It doesn't know what you're wearing. Math treats all people equally. Especially when you're in a hard class with all boys, when nobody's cheering you on from the sidelines, when it's not "cool" to be smart, math is a nice thing to have. When nothing else makes sense, math reaches an answer.</blockquote></p>]]>
    </content>
</entry>
<entry>
    <title>Cross-platform development</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/07/crossplatform_development.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=138" title="Cross-platform development" />
    <id>tag:avitzur.hax.com,2007://3.138</id>
    
    <published>2007-07-04T18:50:58Z</published>
    <updated>2007-07-04T22:23:58Z</updated>
    
    <summary>I clicked &quot;Restart&quot; after Apple Software Update applied what I thought were minor system patches, and my computer rebooted into Windows.......</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>I clicked "Restart" after Apple Software Update applied what I thought were minor system patches, and my computer rebooted into Windows....</p>]]>
        <![CDATA[<p>No, Apple has not thrown in the towel in the OS wars. I installed Boot Camp, Windows XP, and VMware Fusion yesterday to work on Graphing Calculator 4 on Windows. I was concerned when the Windows Startup Disk control panel showed no Mac OS X option, but that was due to Windows not recognizing the Mirrored RAID set acting as my Mac OS X startup disk. I was more concerned when holding the option key down did not, as promised, provide a choice of OSes during the boot sequence. That was due to the boot loader running before the Bluetooth keyboard was recognized. I was getting a trifle worried when, after plugging in a USB keyboard, holding the option key down, and selecting the Mac OS X startup disk, the machine <i>still</i> booted into Windows. That seems to have been a bug, again related to the mirrored RAID set, which shows up during boot as two disks, only one of which actually works.</p>
<p>Now that I have my Mac back, I can launch Windows in the background under VMware and mix Windows XP windows on my Mac OS X desktop. It's peculiar seeing the Microsoft Visual Studio project window next to the Xcode project window next to the CodeWarrior project window. I've resisted the temptation to install CodeWarrior for Windows just to complete the set. I can now build Graphing Calculator on Windows and run the Windows release in a window on my Macintosh next to the Mac OS X release to compare bugs.</p>
<p>It's not nearly as nice as during Graphing Calculator 3.2 development, however, when CodeWarrior on Mac OS Classic could "Build All" to compile both the Mac OS and Windows binaries under the same IDE. I expect debugging to be easier, however. Back then, I ran the Windows build under Virtual PC, but VPC was too slow to run the CW IDE and debugger. On a single-processor machine, I couldn't convince CodeWarrior's two-machine debugging facility to work on a single computer with the other "machine" being faked by Virtual PC. It hit a deadlock condition, starving one process of time while also waiting for it to respond.</p>]]>
    </content>
</entry>
<entry>
    <title>Adventures in Optimization: OpenGL</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/06/adventures_in_optimization_ope.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=137" title="Adventures in Optimization: OpenGL" />
    <id>tag:avitzur.hax.com,2007://3.137</id>
    
    <published>2007-06-19T19:53:53Z</published>
    <updated>2007-11-26T18:30:17Z</updated>
    
    <summary>Over the past few years, the hardware-accelerated rendering pipeline has rapidly increased in complexity, bringing with it increasingly intricate and potentially confusing performance characteristics. Improving performance used to mean simply reducing the CPU cycles of the inner loops in your...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<blockquote><i>Over the past few years, the hardware-accelerated rendering pipeline has rapidly increased in complexity, bringing with it increasingly intricate and potentially confusing performance characteristics. Improving performance used to mean simply reducing the CPU cycles of the inner loops in your renderer; now it has become a cycle of determining bottlenecks and systematically attacking them. This loop of identification and optimization is fundamental to tuning a heterogeneous multiprocessor system; the driving idea is that a pipeline, by definition, is only as fast as its slowest stage. Thus, while premature and unfocused optimization in a single-processor system can lead to only minimal performance gains, in a multiprocessor system such optimization very often leads to zero gains.</i> - Cem Cebenoyan, NVIDIA, in <a href="http://download.nvidia.com/developer/GPU_Gems/Sample_Chapters/Graphics_Pipeline_Performance.pdf">GPU Gems</a></blockquote>]]>
        <![CDATA[<p>I spent the last several weeks experimenting, instrumenting my code, profiling performance, and optimizing. By way of summary, here is a sequence of timing and Shark profiles at various stages examining  animated graphs of two implicit equations which represent different balances between numeric calculations and rendering complexity. Last night I wired up a Debug menu into GC4 to make comparisons easier.</p>

<p><img src="http://avitzur.hax.com/images/DebugMenu.png"></p>

<h3>Test Cases</h3>
<p><a href="http://avitzur.hax.com/images/ImplicitAnimation.gcf">ImplicitAnimation.gcf</a> animates <i>x</i><sup>2</sup>&thinsp;+&thinsp;<i>y</i><sup>2</sup>&thinsp;+&thinsp;<i>z</i><sup>2</sup>&thinsp;+&thinsp;sin(<i>nx</i>)&thinsp;+&thinsp;sin(<i>ny</i>)&thinsp;+&thinsp;sin(<i>nz</i>)&thinsp;=&thinsp;1.
</p><p>
<a href="http://avitzur.hax.com/images/ConcentricSpheresTest.gcf">ConcentricSpheresTest.gcf</a> animates cos(<i>n</i>&thinsp;sqrt(<i>x</i><sup>2</sup>&thinsp;+&thinsp;<i>y</i><sup>2</sup>&thinsp;+&thinsp;<i>z</i><sup>2</sup>))&thinsp;=&thinsp;0.</p>

<h3>Baseline: Software Rendering. Single Compute Thread (8.4s, 19s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile1a.png">
<img src="http://avitzur.hax.com/images/Profile1b.png"></pre>
<p>For a baseline, timing each animation with Graphing Calculator's software renderer on a single CPU core gives 8.4s and 19s, respectively. The Shark profiles show most of the time spent calculating the equation, with more time spent rendering in ConcentricSpheresTest.</p>

<h3>OpenGL Hardware Rendering. Single Compute Thread (7.7s, 25s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile2a.png">
<img src="http://avitzur.hax.com/images/Profile2b.png"></pre>
<p>Switching to (single-threaded) OpenGL rendering, the timings become 7.7s and 25s. ConcentricSpheresTest actually slows down. The profiles show that two cores are now active in parallel as OpenGL renders the scene in the main thread while the WorkQueueConsumer prepares the next frame. Zooming into those pictures shows the main thread spending all of its time in OurSubmitTriangles/glDrawElements. In the second case, the overhead of submitting a huge number of tiny triangles to the GPU (most of them occluded anyway) makes the animation slower despite the hardware-accelerated rendering. However, this is already faster than GC 3.5, in which all computations were done in the main thread along with all rendering, cooperatively multi-tasked. Using even a single preemptive thread for calculations allows them to run independently in parallel with drawing and processing user events, as seen above.</p>

<h3>Single-threaded OpenGL. Multiple Compute Threads (2.2s, 12s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile3a.png">
<img src="http://avitzur.hax.com/images/Profile3b.png"></pre>
<p>After breaking the numeric part of the work into jobs which can be run in parallel, the timings become 2.2s and 12s. The profiles now show a pronounced difference in character. The rendering time dominates the second case. In the first case, the compute and rendering loads look similar. (Important note: the scale on each profile image is different. Comparisons are meaningful only within a single image. Shark, unfortunately, does not provide a time "Scale Bar" on these views. Is anyone on the Shark team reading?)</p>
<p>In both cases, still, drawing serializes the work. While one CPU core executing the main thread draws, all the other cores go idle waiting for the next thing to start.</p> 

<h3>Vertex Buffer Objects (2.1s, 12s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile4a.png">
<img src="http://avitzur.hax.com/images/Profile4b.png"></pre>
<p>Using <a href="http://www.spec.org/gpc/opc.static/vbo_whitepaper.html">vertex buffer objects</a> changes the animation timing to 2.1s and 12s. With VBOs enabled in Graphing Calculator, rendering is a two-step process: first, the arrays of vertex and normal vectors, triangles, and colors are submitted with a hint to cache them in VRAM. The command to draw then needs to submit only a reference to these buffers. When spinning a static model, this can be an enormous win, as all the data can stay in VRAM, to which the GPU has fast access, and only a new rotation matrix need be sent each frame over the relatively slow connection between GPU and CPU. In these tests, however, we are animating a model whose vertex data changes each frame, so the data must be resubmitted every frame and the timings barely move.</p>

<h3>Multi-threaded OpenGL (1.9s, 11s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile5a.png">
<img src="http://avitzur.hax.com/images/Profile5b.png"></pre>
<p>Enabling multi-threaded OpenGL is easy, but that is just the first step. Any glGet calls drain the MTGL pipeline and synchronize execution, eliminating the benefits of parallelism. Communication between parallel threads of execution is very difficult to get right. What it means for one thread to ask another thread about the (changing) OpenGL state when the relative order of operations between threads can vary is tricky, so the simplest approach to ensure correctness is to make all the communication one-way. After avoiding calls to glGetError, glGenBuffers, and other state queries, Shark still showed the GC main thread stalling at gleFinishCommandBuffer waiting for MTGL to drain in numerous circumstances. glDeleteBuffers was a spike in the profile. On mac-opengl, Richard Schreyer advised:</p> <blockquote><p>Unfortunately, the implementation of glDeleteBuffers (and all other glDelete*) still blocks in all cases right now. Your best workaround is to not delete the buffer object, but instead call glBufferData(size=0) to free the storage.  You'll also need keep track of the used/free buffer object names yourself (I assume you're already doing this to avoid paying the same cost when calling glGen*).  Textures can be handled in the same way (width=0, height=0, depth=0).</p></blockquote>
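<p>The bookkeeping described in that advice (tracking used and free buffer object names instead of deleting them) might look roughly like the sketch below. It is written in Python for brevity and is not GC's code: <code>BufferNamePool</code>, <code>acquire</code>, and <code>release</code> are hypothetical names, and the <code>generate</code> and <code>orphan</code> callbacks only stand in for glGenBuffers and for glBufferData with size 0. No real OpenGL calls are made.</p>

```python
class BufferNamePool:
    """Recycle buffer object names instead of deleting them.

    Hypothetical sketch: `generate` stands in for glGenBuffers and
    `orphan` for glBufferData with size 0. Neither touches OpenGL here.
    """

    def __init__(self, generate, orphan):
        self._generate = generate   # called only when no free name exists
        self._orphan = orphan       # frees storage but keeps the name
        self._free = []             # names available for reuse
        self._used = set()          # names currently holding data

    def acquire(self):
        # Prefer a recycled name so the generate call is rare.
        name = self._free.pop() if self._free else self._generate()
        self._used.add(name)
        return name

    def release(self, name):
        # Give back the storage without calling a blocking delete.
        if name not in self._used:
            raise KeyError(name)
        self._used.remove(name)
        self._orphan(name)
        self._free.append(name)


# Stand-ins that only record calls, so the sketch runs without a GL context.
generated = []
orphaned = []

def fake_generate():
    generated.append(len(generated) + 1)
    return generated[-1]

pool = BufferNamePool(fake_generate, orphaned.append)
a = pool.acquire()      # generates name 1
pool.release(a)         # storage orphaned, name kept on the free list
b = pool.acquire()      # reuses name 1 without a second generate call
```

<p>The point of the design is that the only calls left on the hot path are one-way submissions; whether that fully avoids the stall depends on the driver.</p>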
<p>These profiles now show a new thread, gleCmdProcessor. Graphing Calculator makes all of its OpenGL calls from the main thread. With single-threaded OpenGL, all the CPU work done by the OpenGL driver also occurs in the application main thread. With multi-threaded OpenGL, GC's OpenGL calls made in main are queued into a command buffer which the gleCmdProcessor thread works through talking to the GPU.  In the first case above, gleCmdProcessor has relatively little work to do. In the second case, it dominates the profile, and still blocks the WorkQueueConsumer threads from continuing on to the next step.</p>
<p>The problem is still that although GC can submit jobs to distribute the numeric work across all CPU cores, GC tries to render the last frame before submitting the jobs for the next frame. This serializes the work leaving cores idle while one is working alone on the rendering. Although MTGL can work in parallel, GC still must call glBufferData from the main thread to submit the model data, and that takes time.</p>

<h3>Double Buffered Model Data (1.8s, 6.7s)</h3>
<pre><img src="http://avitzur.hax.com/images/Profile6a.png">
<img src="http://avitzur.hax.com/images/Profile6b.png"></pre>
<p>When GC finishes the numeric computation for one step, it saves those results away to prepare for starting the next step. Only after the slider has been advanced and the next calculations begun does the main thread then submit those saved results to OpenGL to render. The animation times are now 1.8s and 6.7s. Now, the work done in main and gleCmdProcessor overlaps the numeric work done on jobs in the WorkQueue. The remaining areas in the first profile where the Work Queue threads are starved are due to the uneven complexity of the jobs that results from the way GC chooses its data decomposition to parallelize the calculation. While creating more, smaller jobs would even out the work load, that also increases overhead and duplication of effort. In the second profile, the stall occurs because we only double-buffer the model data. The numeric computation gets one frame ahead, then waits while the main thread uses glMapBuffer to submit the last frame's model before it can save the newly computed model so that it can begin the next frame's calculation. We could use a circular buffer to get several frames ahead, but as we shall see in a moment, that ultimately won't help.</p>
<p>With this last step the GC user interface becomes unresponsive, as MTGL can have commands for several seconds' worth of frames buffered, and there is no way to kill them, or even to measure how many commands are buffered. This behavior occurs in the second case, where the workload is dominated by rendering, but not in the first, which is compute-bound evaluating functions. When moving a graph with the mouse, or even stopping an animation, one first sees up to several seconds of backlogged frames rendered before the 3D graph registers any feedback. To make GC responsive, glFlush is inserted into the draw loop each frame to intentionally block and drain the MTGL command buffer so it doesn't get more than a frame ahead. After this, the final timings for today are 1.8s and 8.9s.</p>
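<p>The double-buffering idea can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not GC's implementation: <code>compute_model</code> and <code>run_animation</code> are hypothetical names, a bounded queue stands in for the saved model, and appending to a list stands in for submitting to OpenGL.</p>

```python
import queue
import threading

def compute_model(frame):
    # Stand-in for the numeric work that builds one frame's model.
    return [frame * 10 + i for i in range(3)]

def run_animation(frame_count):
    # maxsize=1: at most one finished model sits in the queue; the
    # producer blocks in put() until the renderer has taken it.
    ready = queue.Queue(maxsize=1)
    rendered = []

    def producer():
        for frame in range(frame_count):
            ready.put(compute_model(frame))
        ready.put(None)                      # sentinel: no more frames

    worker = threading.Thread(target=producer)
    worker.start()

    # The "main thread" submits each saved model while the worker is
    # already computing the following one.
    while True:
        model = ready.get()
        if model is None:
            break
        rendered.append(model)               # stand-in for submitting to OpenGL

    worker.join()
    return rendered

frames = run_animation(4)
```

<p>Raising <code>maxsize</code> would correspond to the circular buffer mentioned above: the compute side could run further ahead, at the cost of more frames queued behind any user input.</p>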

<h3>Threads!</h3>
<pre><img src="http://avitzur.hax.com/images/Threads.png"></pre>
<p>This profile shows GC running with three windows open, one of which has two OpenGL contexts (to illustrate a coordinate transformation with side by side views). Each context gets its very own gleCmdProcessor thread in multi-threaded OpenGL. However, as I have only one video card, they all must take turns talking to the same GPU to drain their command buffers. We also see here that regardless of how many windows or equations GC is drawing, it creates (on this machine) 8 WorkQueueConsumer threads to process the jobs constructed by all windows. While one can create as many threads in software as one likes, the threads run on physical CPU cores. When there are more threads running than cores to run them, the scheduler must swap threads in and out to give them all time, increasing overhead. GC uses threads for two categories of work: to parallelize expensive operations across multiple CPU cores, and to improve user interface response time by avoiding any potentially lengthy tasks in the main thread which processes all user events. Multi-threaded OpenGL aids enormously in the latter respect, keeping the main UI thread responsive, even in cases where MTGL does not otherwise speed up the actual rendering time.</p>
<p>Next installment: Computed textures, color arrays and fragment programs</p>]]>
    </content>
</entry>
<entry>
    <title>Adventures in Optimization: Threading</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/06/adventures_in_optimization_threading.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=136" title="Adventures in Optimization: Threading" />
    <id>tag:avitzur.hax.com,2007://3.136</id>
    
    <published>2007-06-10T23:11:09Z</published>
    <updated>2007-11-05T18:20:44Z</updated>
    
    <summary>After identifying the superficial bugs slowing GC4, I set out to measure the degree of parallelism I was able to achieve with the new multi-threaded calculation code. Are there any other calls like GetWRefCon unintentionally introducing dependencies stalling threads? Was...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>After identifying the superficial bugs slowing GC4, I set out to measure the degree of parallelism I was able to achieve with <a href="http://avitzur.hax.com/2006/10/embarrassingly_parallelizable.html">the new multi-threaded calculation code</a>. Are there any other calls like <a href="http://avitzur.hax.com/2007/05/getwrefcon.html">GetWRefCon</a> unintentionally introducing dependencies stalling threads? Was GC able to keep multiple CPU cores fed with independent work? In addition to the traditional questions of <i>Where is the software spending its time? What is it doing there?</i> I now have to ask those questions separately of multiple threads of execution on multiple CPU cores and understand the answers in context, asking <i>When is one thread of execution stalled waiting for results from another?</i> With the right tool, the answer can be obvious. </p>
<p><center><img alt="ThreadView.png" src="http://avitzur.hax.com/images/ThreadView.png" width="378" height="147" /></center></p>]]>
        <![CDATA[<p>Here in Apple Thread Viewer, the top eight rows represent threads running preemptively evaluating functions to plot a graph. The bottom row represents the main application thread which processes user events, creates the preemptive threads, asks periodically if they are done yet, collects their results, draws them, and tells them when to start on the next frame. This illustrates a simple flaw in the architecture. The main thread does not initiate parallel computation on the next frame until after it is done drawing the previous frame. The preemptive threads are thus blocked until drawing is complete. There is no need for this. It is simply a holdover from the 1993 design when drawing was very much faster than computation. Drawing is still faster than calculation, but not so much so that it can be ignored.</p>
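<p>A rough way to see what that holdover costs is a two-stage timing model. The Python sketch below is only a back-of-the-envelope model with made-up per-frame costs; <code>serialized_time</code> and <code>overlapped_time</code> are hypothetical helper names, and it assumes drawing one frame can fully overlap computing the next.</p>

```python
def serialized_time(frames, compute, draw):
    # Each frame's drawing must finish before the next computation starts.
    return frames * (compute + draw)

def overlapped_time(frames, compute, draw):
    # Drawing frame k overlaps computing frame k+1; the first compute and
    # the last draw have nothing to overlap with.
    return compute + (frames - 1) * max(compute, draw) + draw

# Made-up per-frame costs in milliseconds, for illustration only.
t_serial = serialized_time(frames=5, compute=30, draw=10)   # 200
t_overlap = overlapped_time(frames=5, compute=30, draw=10)  # 160
```

<p>In this model the saving per frame is the smaller of the two costs, so it grows as drawing time approaches computation time.</p>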

<p>The Thread Viewer image above shows GC4 graphing an animated 2D inequality: 0.5=(cos((x+n))+cos((y*sin(&pi;/5)+x*cos(&pi;/5)))+cos((y*sin((2*&pi;/5))+x*cos((2*&pi;/5))))+cos((y*sin((3*&pi;/5))+x*cos((3*&pi;/5))))+cos((y*sin((4*&pi;/5))+x*cos((4*&pi;/5))))). That calculation is <a href="http://avitzur.hax.com/2006/10/embarrassingly_parallelizable.html">embarrassingly parallelizable</a>. Data decomposition into horizontal strips of the graph paper gives each core a separate problem to work on, with no dependencies to add communication overhead.</p>
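<p>Decomposition into horizontal strips can be sketched as follows. This Python example is an illustration under assumptions, not GC's code: <code>strips</code>, <code>evaluate_strip</code>, and <code>evaluate_grid</code> are hypothetical names, and <code>f</code> is a simplified stand-in for the plotted function. In standard CPython the threads do not run Python code on several cores at once, so the sketch shows the shape of the decomposition rather than a real speedup.</p>

```python
from concurrent.futures import ThreadPoolExecutor
import math

def strips(height, count):
    # Split rows 0..height-1 into `count` contiguous, nearly equal strips.
    bounds = [height * k // count for k in range(count + 1)]
    return [(bounds[k], bounds[k + 1]) for k in range(count)]

def f(x, y, n):
    # Hypothetical stand-in for the plotted function.
    return math.cos(x + n) + math.cos(y)

def evaluate_strip(rows, width, n):
    start, stop = rows
    # Each strip reads only its own rows, so strips share no state.
    return [[f(col, row, n) > 0.5 for col in range(width)]
            for row in range(start, stop)]

def evaluate_grid(width, height, n, workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda rows: evaluate_strip(rows, width, n),
                         strips(height, workers))
        # map preserves order, so concatenating rebuilds the full grid.
        return [row for part in parts for row in part]

grid = evaluate_grid(width=16, height=10, n=0.0, workers=4)
```

<p>Because no strip depends on another, the result is the same for any number of workers; only the wall-clock time would change.</p>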
<p>The thread profile below illustrates GC4 animating the implicit 3D surface: <i>x</i><sup>2</sup>&thinsp;+&thinsp;<i>y</i><sup>2</sup>&thinsp;+&thinsp;<i>z</i><sup>2</sup>&thinsp;+&thinsp;sin(<i>nx</i>)&thinsp;+&thinsp;sin(<i>ny</i>)&thinsp;+&thinsp;sin(<i>nz</i>)&thinsp;=&thinsp;1. It tells a different story.</p>
<pre><img alt="8Jobs.png" src="http://avitzur.hax.com/images/8Jobs.png" width="1864" height="172" /></pre>
<p>The top level periodic structure is clear, showing five frames of the animation. The top eight rows, again, are preemptive threads performing the numeric computations constructing the 3D model. The second to bottom row is the main application thread which processes user events, creates the preemptive threads, asks periodically if they are done yet, collects their results, submits them to OpenGL to draw them, and tells them when to start on the next frame. The bottom row is the thread which Mac OS X dedicates to <a href="http://developer.apple.com/technotes/tn2006/tn2085.html">multi-threaded OpenGL</a>.</p>

<p>Again, we see that computing the next frame is blocked until drawing is done. The fine lines in the main thread show it polling, asking periodically if the calculations are done yet. When they are all done it begins collecting the results and submitting them to the OpenGL pipeline. When that is done, OpenGL begins drawing. Calculation on the next frame does not begin until the previous frame is fully drawn and all its data structures released.</p>
<p>Furthermore, the CPU cores are not at all fully utilized. Although there are eight cores and eight compute threads, the data decomposition for implicit 3D surface equations does not create an equal workload for each thread. Some threads finish much sooner and are then idle until the next frame. Upon closer examination, it is clear that in each animation frame, the compute thread with the most work, which finished last, is the bottleneck holding everything else idle until it is done.</p>
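<p>The first problem, the next frame waiting on drawing, can be sketched in a few lines. This is only an illustration under assumptions, not GC's code: <code>Frame</code>, <code>compute_frame</code>, <code>draw_frame</code>, and <code>animate</code> are hypothetical names, and "drawing" here just sums the values. The point is the ordering: the computation of frame n+1 is launched before frame n is drawn.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

typedef std::vector<double> Frame;

// Hypothetical stand-in for the numeric work of one animation frame.
Frame compute_frame(int n) {
    Frame f(4);
    for (std::size_t i = 0; i < f.size(); ++i) f[i] = n + 0.25 * i;
    return f;
}

// Hypothetical stand-in for drawing: returns the sum of the frame's values.
double draw_frame(const Frame& f) {
    double total = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) total += f[i];
    return total;
}

// Overlaps computing frame n+1 with drawing frame n.
double animate(int frame_count) {
    assert(frame_count >= 0);
    double drawn = 0.0;
    if (frame_count == 0) return drawn;
    std::future<Frame> pending =
        std::async(std::launch::async, compute_frame, 0);
    for (int n = 0; n < frame_count; ++n) {
        Frame current = pending.get();  // wait for frame n
        if (n + 1 < frame_count) {
            // Start frame n+1 now, before drawing frame n.
            pending = std::async(std::launch::async, compute_frame, n + 1);
        }
        drawn += draw_frame(current);   // runs while frame n+1 computes
    }
    return drawn;
}
```

<p>The cost of this ordering is that two frames' worth of data are alive at once, so the previous frame's buffers cannot be reused for the next one.</p>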

<p>The next image shows the same equation, but with the data decomposed into 16 jobs rather than 8. There are still only 8 compute threads and only 8 CPU cores. Now, however, there are more jobs in the work queue which feeds those threads, so the first 8 jobs which finish will grab another job from the queue and keep working.</p>
<pre><img alt="16Jobs.png" src="http://avitzur.hax.com/images/16Jobs.png" width="1893" height="176" /></pre>
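<p>A minimal sketch of that work queue idea, assuming C++11 threads rather than whatever threading API GC actually uses. The name <code>run_jobs</code> and the atomic counter are my own illustration: each worker claims the next unclaimed job index, so a worker that draws a cheap job comes back for another instead of sitting idle.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs num_jobs independent jobs on num_threads worker threads.
// The shared counter acts as the work queue: fetch_add hands out
// each job index exactly once.
void run_jobs(std::size_t num_jobs, std::size_t num_threads,
              const std::function<void(std::size_t)>& job) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next_job(0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next_job, &job, num_jobs]() {
            for (;;) {
                std::size_t i = next_job.fetch_add(1);
                if (i >= num_jobs) break;  // queue is empty
                job(i);                    // e.g. evaluate one strip of the graph
            }
        });
    }
    for (std::thread& w : workers) w.join();
}
```

<p>Calling <code>run_jobs(16, 8, job)</code> corresponds to the 16-job, 8-thread configuration in the trace. Smaller jobs balance the load better but each one carries its own setup and merge cost.</p>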

<p>Here is a trace showing the same work done by 32 jobs. The 3D implicit surface solver is not quite as embarrassingly parallel as the 2D inequality grapher. It takes more work to set up each parallel compute thread, and takes more effort to combine the results at the end. While the cores are less frequently starved for work with more jobs, the total amount of work done is higher.</p>
<pre><img alt="32Jobs.png" src="http://avitzur.hax.com/images/32Jobs.png" width="1884" height="167" /></pre>
<p>The rows in these pictures represent threads of execution which are things the software is doing in parallel. The color coding identifies the particular CPU core running each thread. This machine has 8 cores running, in this picture, ten threads. Only 8 can be active at any one time (and sometimes fewer, as there are other things happening on this computer not illustrated here). The Mac OS X scheduler decides which threads run on which cores. While it tries to keep each thread on the same core, sometimes a thread of execution will hop from one core to another. That can be expensive. The more threads the scheduler is running, the more difficult it is to keep threads from hopping around.</p>

<p>Zooming in on these pictures provides additional clues. They show the specific calls in the source code where one thread is paused awaiting use of a shared resource. (new and delete are thread-safe, but rely on shared state. If called at the same time from multiple threads, all but one thread will pause so that memory management is handled safely, sequentially. As long as they are not called too frequently, this is not a problem. I was calling them too frequently, but increasing the block size on my collection class fixed that.) They show which OpenGL calls drain the OpenGL pipeline, (unintentionally) synchronizing the multi-threaded OpenGL rendering to the application thread OpenGL command buffering. (glGenBuffers, glDeleteBuffers, aglEnable, and aglError all drain the pipeline before proceeding.) They show where the time is being spent after the drawing but before the new calculations begin. (Releasing OpenGL vertex buffer objects from the last frame showed up as expensive memory management delaying the next frame. There's no fundamental reason the parallel computation threads can't be started before that tear down happens in the main thread.)</p>]]>
    </content>
</entry>
<entry>
    <title>Adventures in Optimization: Benchmarking</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/06/adventures_in_optimization_benchmarking.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=135" title="Adventures in Optimization: Benchmarking" />
    <id>tag:avitzur.hax.com,2007://3.135</id>
    
    <published>2007-06-10T22:16:23Z</published>
    <updated>2007-06-10T22:34:38Z</updated>
    
    <summary>My first job at Apple in 1992 was benchmarking the quality of a dozen different handwriting recognition software libraries from different companies. We like to pretend that everything from health to cars to wars can be measured by a single...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
            <category term="reminiscing" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>My first job at Apple in 1992 was benchmarking the quality of a dozen different handwriting recognition software libraries from different companies. We like to pretend that everything from health to cars to wars can be measured by a single number but real systems are multi-dimensional. <i>Benchmarks</i> are like <i>standards</i>: everyone likes to make their own. I approach software measurements with my old Stanford Physics lab training. I treat the software as a tangible physical system, experiment to determine what I can measure reproducibly, look for mathematical relationships between variables, and attempt to construct a useful basis set of axes upon which to characterize their phase space. This did not make me any friends while benchmarking for Apple. Each company or internal group with a solution to sell had strong ideas about comparing systems to put their own solution in the best light. Furthermore, handwriting recognition was and is still an unsolved research problem. Comparing different research projects is quite different from comparing products of a mature technology. Ultimately, none of the choices proved adequate.</p>
<p>For Graphing Calculator, I use performance measurements to make programming choices when comparing high-level algorithms, low-level implementation details, different compilers, and compilation options. Holding the Command key while pressing the slider Play button will initiate a stopwatch to time the slider. When I'm focusing on a particular routine, this is useful to compare small code changes from one build to the next. In the Graphing Calculator menu, the Graphing Calculator > Benchmark command runs through thirty-two tests timing calculation and drawing. These help me compare how GC behaves on different hardware and how its behavior changes across versions years apart. The report looks like this:</p>]]>
        <![CDATA[<pre>
Graphing Calculator 4.0d6 benchmark test at Wed Jun  6 11:50:33 2007

Graph pane is 500 x 500 pixels.
Timer resolution: 6.5703e-05 seconds.

Point                            200 steps   7.63 seconds 
Arrow                            200 steps   7.60 seconds 
Function                         200 steps   9.36 seconds 
Complex-valued function          200 steps  13.10 seconds 
Parametric curve                 200 steps  13.47 seconds 
Complex-valued parametric curve  200 steps  13.74 seconds 
Contour plot                     200 steps  14.69 seconds 
Inequality                        20 steps   1.84 seconds 
Complex-valued inequality         20 steps   2.18 seconds 
Differential equation             20 steps   3.73 seconds 
Color plot                        20 steps   2.49 seconds 
Density plot                      20 steps   2.19 seconds 
Complex-valued density plot       20 steps   2.54 seconds 
Coordinate transformation         20 steps   4.57 seconds 
Inverse coordinate transform      20 steps   5.36 seconds 
Point                            200 steps   5.5  seconds   97.4 frames/s 
Vector                           200 steps   6.4  seconds   98.9 frames/s 
Parametric curve                 200 steps   5.97 seconds   99.3 frames/s 
Surface                          200 steps   6.11 seconds   99.1 frames/s  1206640 triangles/s 
Surface (checkerboard)           200 steps   6.47 seconds   97.2 frames/s  1182449 triangles/s 
Surface (texture)                200 steps   6.34 seconds   99.3 frames/s  1207881 triangles/s 
Surface (lo-res)                 200 steps   5.54 seconds   98.8 frames/s     3359 triangles/s 
Surface (lo-res, checkerboard)   200 steps   5.12 seconds   99.5 frames/s     3382 triangles/s 
Color animation                  200 steps   6.38 seconds   99.2 frames/s  4222201 triangles/s 
Concentric spheres               200 steps  51.85 seconds    2.7 frames/s  1974451 triangles/s 
Landscape                         50 steps   1.74 seconds   99.2 frames/s  1206798 triangles/s 
Implicit                          50 steps   2.64 seconds   98.7 frames/s  5018935 triangles/s 
Surface (transparent)             50 steps   3.17 seconds   33.5 frames/s   407963 triangles/s 
Surface (transparent,lo-res)      50 steps   1.53 seconds   30.1 frames/s     1021 triangles/s 
Landscape (transparent)           50 steps   3.21 seconds   30.1 frames/s   366715 triangles/s 
Implicit (transparent)            50 steps   5.62 seconds   19.8 frames/s  1008924 triangles/s 
Concentric transparent spheres     1 steps   0.45 seconds    9.9 frames/s  1191667 triangles/s 
</pre>
<p>This data from last week has two important clues answering why some tests were no faster on a 3GHz 8-core machine than on a 1GHz single-core machine. Some of the tests which use OpenGL rendering hovered just under 100 frames per second on both machines. Some of the tests which used software 3D rendering hovered at around 30 frames per second. I'm embarrassed to admit I stared at these clues for several days and followed many false leads before their significance became obvious. Three different throttling mechanisms are at work here.</p><p>Many of these tests use the Graphing Calculator animation slider to run through multiple frames. To keep the slider user interface closer to consistent behavior, the slider animation, by default, will step at most 30 values per second so that the rate of the slider does not vary too wildly depending on the complexity of the equation the user types. (The user can disable this throttle by pressing Command-Option-T.) When I wrote the benchmarking code on older slower computers, the benchmark examples were mostly under the 30 steps per second limit. Predictably, computers are much faster today. I classify this problem as a single oversight bug: the slider is now not throttled when the benchmark is running.</p>
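<p>For illustration only, a throttle of this kind might look like the sketch below. The class <code>StepThrottle</code> and its methods are hypothetical, not GC's implementation; time is passed in explicitly (in seconds) so the behavior is easy to reason about. Disabling it, as the benchmark now does, lets every step through.</p>

```cpp
#include <cassert>

// Hypothetical slider throttle: allows at most max_steps_per_second
// steps per second while enabled, and every step when disabled.
class StepThrottle {
public:
    explicit StepThrottle(double max_steps_per_second)
        : min_interval_(1.0 / max_steps_per_second),
          last_step_(0.0),
          has_stepped_(false),
          enabled_(true) {
        assert(max_steps_per_second > 0.0);
    }

    void set_enabled(bool enabled) { enabled_ = enabled; }

    // Returns true, and records the step, if a step may be taken at
    // time `now`. Returns false if the previous step was too recent.
    bool try_step(double now) {
        if (enabled_ && has_stepped_ && now - last_step_ < min_interval_) {
            return false;
        }
        last_step_ = now;
        has_stepped_ = true;
        return true;
    }

private:
    double min_interval_;  // seconds between steps while throttled
    double last_step_;     // time of the most recent accepted step
    bool has_stepped_;
    bool enabled_;
};
```

<p>With <code>StepThrottle slider(30.0)</code>, a step requested 10 ms after the previous one is refused, while one requested 50 ms later is accepted. A benchmark that leaves such a throttle enabled measures the throttle, not the computation.</p>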
<p>Graphing Calculator 1.0 of 1993 used <a href="http://avitzur.hax.com/2006/10/cooperative_multitasking.html">cooperative multi-tasking</a>. A single thread of execution processed all user events, all drawing, and all computation. To keep the system responsive, computational tasks were done a little bit at a time during what was then known as "idle events" when no user keyboard or mouse events required attention. Though all of the heavy-lifting calculations now run in separate, independent, preemptive threads of their own, a little bit of that legacy "idle event" architecture remains: GC4 uses a periodic system timer to call what was once the GC Idle Event handler every 0.5, 0.01, or 0.001 seconds to see if it is time to advance the animation slider or blink the cursor. How and when it decides on the frequency of that one timer is complicated, and was buggy. That introduced the 100 frames per second upper limit in the benchmark.</p>
<p>The architecture behind how GC is using that one timer is wrong on many levels. Separate tasks should have separate timers. There is no need to conflate blinking the cursor with polling threaded computations. When there is a highlighted selection rather than a blinking cursor there is no need for that periodic task at all. Polling is a sign of poor design. Polling too frequently wastes CPU cycles. Polling infrequently introduces latency. Either way, it is cleaner to have asynchronous jobs send messages when they finish to awaken whatever process is awaiting their results.</p>
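<p>As a rough sketch of the alternative to polling, here is one way jobs can announce completion, assuming C++11 and a condition variable. <code>JobGroup</code> and <code>run_and_collect</code> are hypothetical names for this example. The waiting thread sleeps until the last job reports in, with no timer and no polling interval to tune.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical completion counter: workers report in, one waiter sleeps.
class JobGroup {
public:
    explicit JobGroup(int count) : remaining_(count) { assert(count >= 0); }

    // Called by a worker when its job finishes.
    void job_done() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--remaining_ == 0) all_done_.notify_all();
    }

    // Blocks without spinning until every job has reported completion.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return remaining_ <= 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable all_done_;
    int remaining_;
};

// Runs `count` trivial jobs on their own threads and sums their results.
int run_and_collect(int count) {
    JobGroup group(count);
    std::vector<int> results(count, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < count; ++i) {
        workers.emplace_back([&group, &results, i] {
            results[i] = i * i;  // stand-in for a real computation
            group.job_done();
        });
    }
    group.wait();  // awakened by the last worker to finish
    for (std::thread& w : workers) w.join();
    int sum = 0;
    for (int r : results) sum += r;
    return sum;
}
```

<p>In a GUI application the main thread should not block like this. The same idea applies if the last worker instead posts an event to the main event loop, which is closer to "send a message when finished".</p>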
<p>Lastly, it never benefits the user when software attempts to update the display faster than the physical display itself can show those updates. A 100 Hz display can physically show no more than 100 different images per second. This limit arises from the physical mechanism of an electron beam scanning left to right, top to bottom, onto the phosphors of a cathode ray tube. Though LCDs do not use this physical mechanism, Mac OS X treats their electronics as a 60 Hz display. Mac OS X implements <a href="http://developer.apple.com/technotes/tn2005/tn2133.html#TNTAG4">coalesced updates</a> in the Quartz framework. This blocks any application which tries to update more frequently than sixty times per second, which saves CPU cycles from drawing operations the user could not possibly see anyway. By trying to redraw its window too frequently, Graphing Calculator was merely stalling while waiting for beam synchronization. Typically, GC throttles its drawing itself, but the Benchmark command disables GC's internal redraw throttle for measurement purposes. I have not yet fully analyzed how coalesced update stalls affect the GC benchmark. Disabling Beam Sync in Quartz Debug speeds up the benchmark measurably, but I need to analyze where the stalls are occurring.</p>
<p>Take-home lessons: I frequently have no idea where my own software is spending its time or why. The current set of benchmark tests is far too easy. I need to redesign them to be taxing on newer machines or the measurements will be nearly meaningless. The Mac OS X 10.4 of 2007 is an utterly different beast from the 128K Macs of my youth.</p>]]>
    </content>
</entry>
<entry>
    <title>Adventures in Optimization: Prelude</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/06/adventures_in_optimization_pre.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=134" title="Adventures in Optimization: Prelude" />
    <id>tag:avitzur.hax.com,2007://3.134</id>
    
    <published>2007-06-10T18:55:10Z</published>
    <updated>2007-06-10T19:21:19Z</updated>
    
    <summary>Two weeks ago, I ran the Intel-native GC4 on an 8-core machine. It revved all 8 processing cores yet ran no faster than on a four-year old single-core G4 laptop at one-third the clock speed. A few minutes with Shark...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p><a href="http://avitzur.hax.com/2007/05/getwrefcon.html">Two weeks ago</a>, I ran the Intel-native GC4 on an 8-core machine. It revved all 8 processing cores yet ran no faster than on a four-year-old single-core G4 laptop at one-third the clock speed. A few minutes with Shark identified the culprit: a single innocuous call to GetWRefCon. In the single-user, single-core, single-threaded Macintosh toolbox of 1984, that call would have taken a few instructions. Mac OS X must share system resources across multiple users, processor cores, applications, and threads of execution all making demands at once. GetWRefCon is labelled <i>Not Thread Safe</i>: it is not guaranteed to work at all from anywhere but the main thread. Rather than crash or return a bogus result, it was causing all eight otherwise independent threads of execution to serialize on that one shared resource, passing the baton from one thread to the other, unable to execute in parallel. Fixing this was easy. Although the window structure is a shared resource protected by a lock, the RefCon itself that I actually needed contained read-only information all threads could read in parallel safely without locking. No need to go through GetWRefCon at all. The parallel threads should have been passed a pointer to the read-only document record directly.</p><p>

After fixing that, some tests ran twenty times faster. Other tests, however, still ran at the same speed as on the older, slower, single-core laptop. I devoted the last two weeks to figuring out why. </p>
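<p>The shape of the GetWRefCon fix can be sketched without any Mac toolbox calls. Everything below is a hypothetical stand-in: <code>Window::ref_con</code> plays the role of a lock-protected accessor, and <code>DocumentRecord</code> the read-only data behind it. The lookup happens once on the calling thread, and the workers receive the pointer directly instead of each going through the lock.</p>

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical read-only data that every compute thread needs.
struct DocumentRecord {
    double scale;
};

// Hypothetical shared structure whose accessor takes a lock.
class Window {
public:
    explicit Window(const DocumentRecord* doc) : doc_(doc) {}
    const DocumentRecord* ref_con() {
        std::lock_guard<std::mutex> lock(mutex_);
        return doc_;
    }
private:
    std::mutex mutex_;
    const DocumentRecord* doc_;
};

// Pure computation on read-only data: safe to run in parallel, no locking.
double sum_range(const DocumentRecord* doc, int begin, int end) {
    double total = 0.0;
    for (int i = begin; i < end; ++i) total += i * doc->scale;
    return total;
}

double parallel_sum(Window& window, int n, int num_threads) {
    assert(num_threads > 0);
    // One locked lookup on the calling thread, not one per evaluation.
    const DocumentRecord* doc = window.ref_con();
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        int begin = n * t / num_threads;
        int end = n * (t + 1) / num_threads;
        workers.emplace_back([doc, begin, end, t, &partial] {
            partial[t] = sum_range(doc, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

<p>Had <code>sum_range</code> called <code>window.ref_con()</code> inside its loop, every iteration on every thread would contend for the same mutex, which is the serialization Shark revealed.</p>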

<p><center><img alt="Dock.png" src="http://avitzur.hax.com/images/Dock.png" width="473" height="75" /></center></p>]]>
        <![CDATA[<p>Optimizing code is an investigative engineering interactive art form. Computer hardware, operating systems, and application software have become so unimaginably complex with myriad components interacting across multiple layers of abstraction that expectations, intuition and guesswork are most often misleading and distracting. Understanding what a program actually does while running requires instrumentation and tools.</p>

<p>Mac OS X provides a suite of visualization and analysis <a href="http://developer.apple.com/documentation/Performance/Conceptual/PerformanceOverview/PerformanceTools/chapter_4_section_4.html">tools</a> to help developers understand their software to get the best performance. I use these so frequently, they occupy a permanent presence on my Mac OS X Dock: <a href="http://developer.apple.com/tools/sharkoptimize.html">Shark</a>, <a href="http://developer.apple.com/graphicsimaging/opengl/opengl_serious.html">OpenGL Profiler, OpenGL Driver Monitor</a>, <a href="http://developer.apple.com/documentation/Performance/Conceptual/PerformanceOverview/InitialEvaluation/chapter_5_section_3.html">Quartz Debug</a>, Thread Viewer, Activity Monitor, and Spin Control. These tools allow programmers to peek under the hood of their applications while they are running, each giving a different perspective on what is going on. Fitting all the sensory overload of data together, determining which parts are significant and interpreting it, however, is another problem altogether.</p>]]>
    </content>
</entry>
<entry>
    <title>Spike!</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/06/spike.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=133" title="Spike!" />
    <id>tag:avitzur.hax.com,2007://3.133</id>
    
    <published>2007-06-05T05:54:22Z</published>
    <updated>2007-06-05T06:05:25Z</updated>
    
    <summary> Spike returns triumphant from the garden. After inhaling half an avocado, he held on greedily to the rest until he was ready for it. His wet food contains a fair bit of avocado. He seemed quite pleased with himself...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p><img alt="Spike on top of stairs" src="http://avitzur.hax.com/images/AvoSpikeStairs.jpg" width="400" height="300" /></p><p>
Spike returns triumphant from the garden. After inhaling half an avocado, he held on greedily to the rest until he was ready for it.   His wet food contains a fair bit of avocado. He seemed quite pleased with himself to discover that food really does fall from trees. I wouldn't let him back in until he finished. Close-up below the fold....</p>]]>
        <![CDATA[<img alt="Foaming green" src="http://avitzur.hax.com/images/AvoSpike.jpg" width="400" height="300" />]]>
    </content>
</entry>
<entry>
    <title>glBufferSubData</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/05/glbuffersubdata.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=132" title="glBufferSubData" />
    <id>tag:avitzur.hax.com,2007://3.132</id>
    
    <published>2007-06-01T02:27:11Z</published>
    <updated>2007-06-01T03:41:01Z</updated>
    
    <summary> Yesterday was spent debugging until two in the morning trying to diagnose a drawing artifact which occurred only on Intel-native builds of GC4 when animating colors on a static 3D model defined using an implicit surface equation using OpenGL...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<center><a href="http://www.pacifict.com/glBufferSubDataArtifact/"><img src="http://www.pacifict.com/glBufferSubDataArtifact/graph.png" height="170" width="190"></a></center>
<p>Yesterday was spent debugging until two in the morning trying to diagnose a drawing artifact which occurred only on Intel-native builds of GC4 when animating colors on a static 3D model defined using an implicit surface equation using OpenGL <a href="http://www.opengl.org/registry/specs/ARB/vertex_buffer_object.txt">vertex buffer objects</a>. The image above started out, correctly, as a solid red sphere. When animating the colors using the GC slider, the top of the sphere did not change, remaining red. The middle of the sphere animated correctly. The bottom of the sphere animated but rgb color values (a,b,c) would display shifted as (b,c,a).</p>]]>
        <![CDATA[<p>At first I suspected a synchronization bug in the new code which implements the data decomposition breaking apart function evaluation across multiple  cores for parallel execution. I did find bugs there, but not this one.  Since the problem occurs only on Intel-native builds, I suspected a byte-swapping, or endian error. Reviewing hundreds of lines of code did find a few endian bugs, but again, none related to this problem. A friend with far more OpenGL experience stayed up with me over instant messaging until one in the morning with diagnostic advice to instrument the code looking for hints, and helped me interpret the clues to narrow the search. Eventually we focussed on the one line of code which implements the instruction to update the <a href="http://www.songho.ca/opengl/gl_vbo.html">VBO</a> color data each frame of the animation:</p>
<pre> <a href="http://www.opengl.org/sdk/docs/man/xhtml/glBufferSubData.xml">glBufferSubData</a>(GL_ARRAY_BUFFER, offset, size, data);
</pre>
<p>Several hours then ensued verifying that the offset, size, and color data being passed were reasonable and correct, and that the VBO itself was correctly configured. Everything checked out, however, and no further hints presented themselves. I knew it was time to go to sleep, when upon <a href="http://www.google.com/search?q=vertex-buffer-object+color+arrays+GL_FLOAT">googling</a> relevant keywords to see if anyone else out there had encountered similar problems, the number one hit for my search criteria was a <a href="http://lists.apple.com/archives/mac-opengl/2004/Sep/msg00124.html">thread</a> I started in 2004 on the Apple Mac-OpenGL mailing list seeking help to diagnose a <i>different</i> bug involving color arrays in VBOs in precisely these <i>same</i> lines of code when I originally wrote them working on GC 3.5!</p>
<p>As has often been the case, after a long frustrating day of unsuccessful approaches and apparently fruitless investigation, I woke up a few hours later eager to try one more test. I replaced the call to glBufferSubData with a sequence of calls that I would have expected to produce identical results:</p>
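<pre> // Map the bound VBO into client memory; glMapBuffer returns a void pointer.
 GLubyte* dest = (GLubyte*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
 dest += colorOffset;

 // Copy the color data explicitly, one element at a time.
 for (int i = 0; i < fNumVertices; i++) {
   *(ColorArrayElement*)dest = colorArray[i];
   dest += sizeof(ColorArrayElement);
 }

 // Unmap before drawing; GL_FALSE means the buffer contents were lost.
 GLboolean success = glUnmapBuffer(GL_ARRAY_BUFFER);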
</pre>
<p>The code above copies the same buffer to the same place, but does so explicitly by mapping the buffer to memory and copying it an element at a time. It, however, works, eliminating the artifact. The difference may be due to a lack of understanding of the OpenGL API on my part or it may be caused by a bug elsewhere in my code. I know there are still many. It could even be a bug in the OpenGL implementation. I <a href="http://lists.apple.com/archives/mac-opengl/2007/May/msg00133.html">asked</a> the Mac-OpenGL list and a <a href="http://lists.apple.com/archives/mac-opengl/2007/May/msg00134.html">response</a> moments later clarified part of it. (If glBufferSubData operated asynchronously, the behavior might be explained as a synchronization error in GC. However, it is synchronous.) </p>
<p>When I was developing Graphing Calculator 3.5 in 2004, its combined use of vertex buffer objects, full screen anti-aliasing, and GPU <a href="http://pacifict.com/Shaders.html">fragment programs</a> generated on the fly from equations put me on the bleeding edge of OpenGL new feature adoption. I spent many months debugging to distinguish between GC bugs due to my own carelessness or lack of understanding, OpenGL driver bugs (in Apple's GL software layer), and driver or hardware bugs in ATI's and nVidia's hardware and software. While each OpenGL feature on its own generally works as advertised, sometimes combining the newer features in ways that have not been tested can lead to surprising results.</p>
<p>Of all the many e-mail lists to which I subscribe,  <a href="http://lists.apple.com/archives/mac-opengl">Mac-OpenGL</a> has, by far, the best signal-to-noise ratio. I'm frequently impressed by the quality, speed, and friendliness of response on deeply technical and tricky issues. Even more amazing, the numerous Apple folks who post there do so on their own time and initiative as a public service to the developer community. Thank you!</p>]]>
    </content>
</entry>
<entry>
    <title>None Shall Pass!</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/05/none_shall_pass.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=131" title="None Shall Pass!" />
    <id>tag:avitzur.hax.com,2007://3.131</id>
    
    <published>2007-05-30T19:06:27Z</published>
    <updated>2007-05-30T19:14:37Z</updated>
    
    <summary>Spike guards the entrance to my office....</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<center><img alt="None shall pass!" src="http://avitzur.hax.com/images/Balrog.jpg" width="400" height="300" /></center><p>Spike guards the entrance to my office.</p>]]>
        
    </content>
</entry>
<entry>
    <title>GetWRefCon</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/05/getwrefcon.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=130" title="GetWRefCon" />
    <id>tag:avitzur.hax.com,2007://3.130</id>
    
    <published>2007-05-28T16:50:05Z</published>
    <updated>2007-05-28T18:22:10Z</updated>
    
    <summary>Graphing Calculator&apos;s calculations are for the most part embarrassingly parallelizable. I spent several months last year expressing that parallelism in code to take advantage of multi-core systems. Yesterday I compared an Intel native build of GC4 on an 8-core system...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="programming" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>Graphing Calculator's calculations are for the most part <a href="http://avitzur.hax.com/2006/10/embarrassingly_parallelizable.html">embarrassingly parallelizable</a>. I spent <a href="http://avitzur.hax.com/programming/">several months</a> last year expressing that parallelism in code to take advantage of multi-core systems. Yesterday I compared an Intel native build of GC4 on an 8-core system side by side with GC3.5 running under Rosetta. As expected, GC4 pegged the meter on all 8 cores:</p><p><center><img alt="Activity.png" src="http://avitzur.hax.com/images/Activity.png" width="265" height="112" /></center></p><p> Imagine my surprise upon discovering that GC4 running native and parallel was no faster than GC3.5 running emulated on a single core. <a href="http://developer.apple.com/tools/sharkoptimize.html">Shark</a> long ago earned a permanent spot in my dock as one of the best development visualization tools. It quickly informed me that 85% of the time was spent in GetWRefCon which calls GetWindowData which calls HIObject::IsRefValid and HLTBSearchRefTable which was serializing the parallel threads of execution. I had mistakenly thought of GetWRefCon as "free", misled by my 1980s-era Macintosh training when it was no more than a simple wrapper to dereference a WindowRecord structure field. Just another reminder that the only way to have any idea where your program is spending its time is to actually measure it. Stay tuned for a benchmarks report after I fix this.</p>]]>
        
    </content>
</entry>
<entry>
    <title>3D Models</title>
    <link rel="alternate" type="text/html" href="http://avitzur.hax.com/2007/05/3d_models.html" />
    <link rel="service.edit" type="application/atom+xml" href="http://voiceofthecoast.com/mt/mt-atom.cgi/weblog/blog_id=3/entry_id=129" title="3D Models" />
    <id>tag:avitzur.hax.com,2007://3.129</id>
    
    <published>2007-05-15T02:27:24Z</published>
    <updated>2007-05-15T17:19:01Z</updated>
    
    <summary>Now these are real 3D models (2004, 2005, 2006 3D projects) . The equation for the surface on the Mac OS X Graphing Calculator application icon above is shown here. Graphing Calculator draws the equation in the image below. Here...</summary>
    <author>
        <name>Ron Avitzur</name>
        
    </author>
            <category term="Gallery" />
    
    <content type="html" xml:lang="en" xml:base="http://avitzur.hax.com/">
        <![CDATA[<p>Now <a href="http://www.ferrismath.com/calc/images/2006projects/index.htm">these</a> are real 3D models (<a href="http://ferrismath.com/calc/2001-4projects/index.htm">2004</a>, <a href="http://ferrismath.com/calc/2005projects/index.htm">2005</a>, <a href="http://www.ferrismath.com/calc/images/2006projects/index.htm">2006</a> <a href="http://ferrismath.com/calc/3Dproject/Index.htm">3D projects</a>).</p>
<center><a href="http://www.PacificT.com"><img src="http://pacifict.com/images/gcIconAgainstGray.png"></a></center>
<p>The equation for the surface on the Mac OS X Graphing Calculator application icon above is shown <a href="http://pacifict.com/Examples/Example19.html">here</a>. Graphing Calculator draws the equation in the image below.</p>
<center><a href="http://pacifict.com/Examples/Example19.html"><img alt="model.png" src="http://avitzur.hax.com/images/model.png" width="354" height="270" /></a></center>
<p>Here the equation is modelled in what looks like plasticine.</p>
<center><a href="http://www.ferrismath.com/calc/images/2006projects/pages/3Dproj06%20046_jpg.htm"><img alt="ClayModel.jpg" src="http://avitzur.hax.com/images/ClayModel.jpg" width="354" height="360" /></a></center>


]]>
        
    </content>
</entry>

</feed> 

