
Why didn't Larrabee fail?



Man - this blog isn't even a day old and I'm already getting shit about the information overload caused by using sophisticated new Wiki technology. One doesn't like that he can't read it on his Lunix box, another moans that there's too much stuff they can click on and can't I just use a text file?

No dammit! I like snazzy new gratuitous tech. This blog is going to be mainly about snazzy new gratuitous tech (with the occasional UphillBothWays nostalgia moan added for good measure). And since this is a thing as totally meaningless and self-centered as a blog, I'm going to use it like I stole it. Er... which I did (thanks [[Jeremy|http://www.tiddlywiki.com/]]). It's not like I'm demanding you use MS tech or anything hentai like that.

Er... news? No, I have none of that. Except that I went to see my accountant today and he said I owed The Man five digits. OK, so as an immigrant Brit living in the US seeking asylum from crappy pay and on a "you can stay if you don't cast nasturtiums on Mr. POTUS" visa making money for US companies and spending all my hard-earned cash on other US companies, would it be slightly impolitic to utter the phrase "No taxation without representation?" Yeah, thought not. Even after I get my green card, I have to tell HRH Brenda to shove it before I get to cast a pointless vote in my local United Franchise of America. Gonna book me a plane ticket to Boston, gonna tip a crate of your shitty bourbon into the harbour. Yes, "harbour". It's got a "u" in it. Blow me. Where's my Laphroaig, [[wench|TheWife]]?

To be fair, it's all because I got a mildly obscene Christmas bonus on account of the SuperSecretProject. You so want my boss. But you can't have him. He's mine. All mine!
Double precision - it's not magic. But people keep invoking it like it is.

''The problem with floats''

Floating-point numbers are brilliant - they have decent precision and a large range. But they still only have 32 bits, so there's still only 4 billion different ones (actually there's fewer than that if you ignore all the varieties of ~NANs, assume infs are basically bugs, and turn off denormals because they're agonisingly slow). So they have tradeoffs and weaknesses just like everything else. You need to know what they are, and what happens for example when you subtract two nearly-equal numbers - you get imprecision.

Ideally every programmer should know the basics of floating-point numerical precision. Any time you do a subtract (or an add, implicitly), consider what happens when the two numbers are very close together, e.g. 1.0000011 - 1.0. The result of this is going to be roughly 0.0000011 of course, but the "roughly" is going to be pretty rough. In general you only get about six-and-a-bit decimal digits of precision from floats (2^23 is 8388608), so the problem is that 1.0000011 isn't very precise - it could be anywhere between 1.0000010 and 1.0000012. So the result of the subtraction is anywhere between 1.0*10^-6 and 1.2*10^-6. That's not very impressive precision, having the second digit be wrong! So you need to refactor your algorithms to fix this.
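
If you want to see the lumpiness for yourself, here's a minimal sketch. The helper names (~GapAboveOne, ~StepsAboveOne) are mine and purely illustrative, and it assumes ordinary IEEE float32 with round-to-nearest.

```cpp
#include <cassert>
#include <cmath>

// Gap between 1.0f and the next representable float32 above it (2^-23).
inline float GapAboveOne()
{
    return std::nextafter ( 1.0f, 2.0f ) - 1.0f;
}

// The difference (a - 1.0f) expressed as a whole number of float32 steps.
// Only meaningful for 1.0f <= a < 2.0f, where every float is 1 + k * 2^-23.
inline int StepsAboveOne ( float a )
{
    assert ( a >= 1.0f && a < 2.0f );
    // Both the subtraction and the divide by a power of two are exact here.
    return (int) std::lround ( ( a - 1.0f ) / GapAboveOne() );
}
```

Feed it 1.0000011f and it lands on step 9, so the subtraction actually gives 9 * 2^-23, which is about 1.07*10^-6 rather than the 1.1*10^-6 you typed. The neighbours 1.0000010f and 1.0000012f land on steps 8 and 10, and there is nothing in between.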

The most obvious place this happens in games is when you're storing world coordinates in standard float32s, and two objects get a decent way from the origin. The first thing you do in rendering is to subtract the camera's position from each object's position, and then send that all the way down the rendering pipeline. The rest all works fine, because everything is relative to the camera - it's that first subtraction that is the problem. For example, getting only six decimal digits of precision, if you're 10km from the origin (London is easily over 20km across), you'll only get about 1cm accuracy. Which doesn't sound that bad in a static screenshot, but as soon as things start moving, you can easily see this horrible jerkiness and quantisation.

''Double trouble''

The solution is "obvious" - if single precision isn't enough, switch to storing your positions in double precision! Unfortunately there's some practical problems. The most obvious is that some machines simply don't do doubles (~PS2, some mobile devices), and on machines that do support doubles, there's usually speed penalties. Even on nice modern ~PCs, actually getting double-precision can be tricky - it's not enough to just write "double" in C - that would be far too simple! Bruce Dawson has a nice round-up of the (extensive) problems - this article is guaranteed to surprise even very experienced coders, so you should read it: https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/

But let's say you went through all that build agony to get them working, and accepted the speed hit. Really you haven't solved the problem, you've just brushed it under the carpet. Doubles have exactly the same weaknesses as any floating-point representation - variable precision and catastrophic cancellation. All you've done is shuffle the problems into a corner, stuck your fingers in your ears and yelled "lalalalalala". They'll still come back to bite you, and because it's double precision, it'll be even rarer and even harder to track down. And in exchange you've hurt your execution speed and caused build problems.

''Fix it!''

So what's the alternative? Good old fixed-point numbers. They're an old-school technique from before the days of 68881 and 80x87 coprocessors. They're simply integers that you treat as if at some defined place in them is a decimal point - usually expressed as something like "24.8" - which means a 32-bit integer that represents a real-world value with 24 bits of whole value and 8 bits of fraction. So for example the real number X would be represented as the integer (int)(X*256) with appropriate rounding.
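
As a rough sketch of that 24.8 format - the type and function names here are made up for illustration, and it assumes the value fits in the 24 whole bits:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// 24.8 fixed point: 24 bits of whole value (including sign), 8 bits of fraction.
typedef int32_t Fixed24_8;
const int kFractionBits = 8;

// Real value -> fixed, rounding to the nearest 1/256.
inline Fixed24_8 ToFixed ( float x )
{
    // Anything bigger than this won't fit in the whole-number bits.
    assert ( std::fabs ( x ) < 8388608.0f );
    return (Fixed24_8) std::lround ( x * (float)( 1 << kFractionBits ) );
}

// Fixed -> real value.
inline float ToFloat ( Fixed24_8 f )
{
    return (float) f / (float)( 1 << kFractionBits );
}
```

So 1.5 is stored as the integer 384, -0.25 as -64, and anything smaller than half of 1/256 rounds to zero. The step size is 1/256 everywhere, which is the whole point.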

For general maths, fixed point sounds like a real pain - and I'm not going to lie, it is. I've written raytracers in fixed-point maths, and the constant management of precision really sucks - even with good macro libraries. Float32 is a huge improvement in usability for 99% of maths code. But for absolute position and absolute time, fixed-point works really well. All you ever do with these is subtract one position from another (to produce a float32 delta), or add a (float32) delta to a position. You almost never need to do maths like multiply two positions together - that doesn't even "mean" anything. 64-bit fixed-point integers are supported on pretty much every platform. Even if they're not natively supported, they're easily emulated with things like add-with-carry instructions (and those platforms will have awful float64 support anyway).

And fixed-point representations have completely reliable precision. They're just integers - there's no scariness there, they're completely portable, easy to debug, and you know exactly what you're getting in terms of precision. This has huge implications for robustness and testability.

''Works on my machine''

If you're like me, when you develop code, you do so in a tiny sandbox game level so you don't have to spend five minutes loading the assets for a full level. And yay - everything works fine. The physics works, the gameplay movement works, there's no falling-out-of-maps stuff, everything is smooth. So you check it in and then several weeks later someone says there's some bad stuff happening in level X - objects falling through the floor, or moving erratically, or the physics is bouncing oddly. So you test level X and it seems fine, and then you have the "works on my machine" argument with the tester, and it takes you several days to figure out that it only happens in one part of level X, and wouldn't you know it - that's the part where you're a long way from the origin and some part of the physics code is subtracting two large numbers and getting an imprecise result.

The nice thing about fixed point is it's consistent. You get the same precision everywhere. If your time step and physics epsilons work at the origin, they work everywhere. And before you moan that fixed-point doesn't have the range - 32 bits of fixed-point gets you anywhere on Earth to about 3mm. Not enough? Well 64 bits of precision gets you to the furthest distance of Pluto from the Sun (7.4 billion km) with sub-micrometer precision. And it's all consistent precision - no lumpy parts or fine parts or places where you suddenly get denormals creeping in and your performance goes through the floor for no well-identifiable reason.
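
If you don't trust my arithmetic, here's a back-of-envelope helper. ~StepSize is a hypothetical name, and I'm assuming "anywhere on Earth" means spanning the Earth's diameter of roughly 12,742km.

```cpp
#include <cassert>
#include <cmath>

// Size of one step when rangeMetres is spread evenly over 2^bits values.
inline double StepSize ( double rangeMetres, int bits )
{
    assert ( bits > 0 && bits <= 64 );
    return std::ldexp ( rangeMetres, -bits );   // rangeMetres / 2^bits
}
```

~StepSize ( 12742.0e3, 32 ) comes out a touch under 3mm, and ~StepSize ( 7.4e12, 64 ) comes out at about 0.4 micrometres. Doubles are only used here to do the sums, not to store anything.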

''About time''

That's position, but let's talk about time. Time is where it can really fall over, and it's the place that will most often bite people. If you are doing any sort of precise timing - physics, animation, sound playback - you need not just good precision, but totally reliable precision, because there tend to be a bunch of epsilons that need tuning. You're almost always taking deltas between absolute times, e.g. the current time and the time an animation started, or when a sound was triggered. Everything works fine in your game for the first five minutes, because absolute time probably started at zero, so you're getting lots of precision. But play it for four hours, and now everything's really jinky and jittery. The reason is that four hours is right about 2^24 milliseconds, so you're running out of float precision for anything measured in milliseconds, which is why physics and sound are particularly susceptible - but almost any motion in a game will show this jittering. For added hilarity, if a tester does run across a problem with timing precision, they save the game, send it to a coder, and the coder loads it and... it doesn't happen - because time got reset to zero! This is a very effective way to drive QA completely insane. Again, fixed-point gives you completely reliable precision, no matter how long the game has been running.
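
Here's the failure in miniature - a sketch assuming a float32 clock counting milliseconds, with a made-up function name:

```cpp
#include <cassert>

// Advances a float32 clock held in milliseconds by one timestep.
inline float AdvanceClockMs ( float nowMs, float stepMs )
{
    assert ( stepMs >= 0.0f );
    return nowMs + stepMs;
}
```

Near the start of the game, ~AdvanceClockMs ( 1000.0f, 0.25f ) gives 1000.25f as you'd hope. Once the clock reaches 16777216.0f (2^24 milliseconds, a bit over four and a half hours), adding 1.0f or 0.25f gives you back exactly the number you started with. The clock has stopped, and every delta you compute from it is zero or a multiple of two milliseconds.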

It's interesting to note that fixed-point time has an easy precedent to follow - [[Network Time Protocol|http://en.wikipedia.org/wiki/Network_Time_Protocol#Timestamps]]. They use a "32.32" format in seconds, meaning there's 32 bits measuring whole seconds, and 32 bits measuring fractions of a second. This has a precision of 233 picoseconds and rolls over every 136 years, both of which should be good enough for most games. In addition, it's very easy for humans to read and use - the top 32 bits is just the time in seconds. (As an aside - they are considering extending it to a 64.64 format to fix the rollover in 2036 - this gives them completely gratuitous range and precision - to quote: "the 64 bit value for the fraction is enough to resolve the amount of time it takes a photon to pass an electron (at the speed of light). The 64 bit second value is enough to provide unambiguous time representation until the universe goes dim.")
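
A sketch of what a 32.32 time might look like in code. The names are mine, and this only borrows the bit layout - it is not an NTP implementation.

```cpp
#include <cassert>
#include <cstdint>

// 32.32 fixed-point time in seconds: top 32 bits are whole seconds,
// bottom 32 bits are the fraction of a second.
typedef uint64_t Time32_32;

inline Time32_32 MakeTime ( uint32_t wholeSeconds, double fractionOfSecond )
{
    assert ( fractionOfSecond >= 0.0 && fractionOfSecond < 1.0 );
    uint32_t fraction = (uint32_t)( fractionOfSecond * 4294967296.0 );   // * 2^32
    return ( (Time32_32) wholeSeconds << 32 ) | fraction;
}

// The human-readable part: just the top 32 bits.
inline uint32_t WholeSeconds ( Time32_32 t )
{
    return (uint32_t)( t >> 32 );
}

// Smallest representable step, in seconds: 2^-32.
inline double ResolutionSeconds()
{
    return 1.0 / 4294967296.0;
}

// How long before the whole-seconds part wraps, in (365.25-day) years.
inline double RolloverYears()
{
    return 4294967296.0 / ( 365.25 * 86400.0 );
}
```

~ResolutionSeconds() is about 2.33*10^-10, i.e. the 233 picoseconds quoted above, and ~RolloverYears() is just over 136.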

''Protect yourself''

Even if you don't believe anything else in this post, even if you absolutely insist that floating point will be just fine for everything - please at the very least understand that setting your time to zero at the start of your app is a terrible thing to do. You will fool yourself. If there is any chance at all that someone might play your game for four hours, or leave it on the main menu for four hours, or leave it paused mid-game for four hours, initialize your clock at much much more than four hours! At least that way you'll find the problems sooner.

''Implementation tips''

I think the easiest way to encapsulate this (if you're an OOP fan) is to make a "position" class that holds your chosen representation, and the only operations you can perform on that class (aside from copies and the usual) are subtracting one from another to give a standard float32 vec3, and adding a float32 vec3 to a position to get another position. So it means you can't accidentally use a position without doing calculations relative to another position. The same goes for time - internally you might use a 32.32 number, but all the outside world needs to know is what ~TimeA-~TimeB is, and that can be a float32 quite happily. Similarly, the only operation you need to do on a time is adjust it forwards or backwards by a timestep, and it's fine to have the timestep as a float32. You should find that those operations are all you actually need. If not, then it's probably time to refactor some code, because you're likely to have problems when doing things at different distances from the origin, or at different times.
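
Something like this, as a minimal sketch. The 48.16 split (units of 1/65536 of a metre), the Vec3 struct and the ~FromWholeMetres helper are all my assumptions for illustration - pick whatever suits your world.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Absolute position: 64-bit fixed point per axis, 16 fraction bits.
class Position
{
public:
    static Position FromWholeMetres ( int64_t mx, int64_t my, int64_t mz )
    {
        const int64_t limit = (int64_t) 1 << 47;   // keep clear of overflow
        assert ( mx > -limit && mx < limit );
        assert ( my > -limit && my < limit );
        assert ( mz > -limit && mz < limit );
        return Position ( mx * 65536, my * 65536, mz * 65536 );
    }

    // position - position -> float32 delta. The subtraction itself is exact.
    Vec3 operator- ( const Position &other ) const
    {
        Vec3 d = { ToMetres ( x - other.x ), ToMetres ( y - other.y ), ToMetres ( z - other.z ) };
        return d;
    }

    // position + float32 delta -> position.
    Position operator+ ( const Vec3 &delta ) const
    {
        return Position ( x + ToFixed ( delta.x ), y + ToFixed ( delta.y ), z + ToFixed ( delta.z ) );
    }

private:
    Position ( int64_t xIn, int64_t yIn, int64_t zIn ) : x ( xIn ), y ( yIn ), z ( zIn ) {}

    static float ToMetres ( int64_t f ) { return (float)( (double) f / 65536.0 ); }
    static int64_t ToFixed ( float m ) { return (int64_t) std::llround ( (double) m * 65536.0 ); }

    int64_t x, y, z;
};
```

There is deliberately no way to get the raw numbers out, and no position * position. Start at 100,000km from the origin, add a delta of (0.25, 0, -1.5), subtract the start again and you get exactly (0.25, 0, -1.5) back. A time class would follow the same pattern with a single 32.32 value.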

The one thing I have seen is people making BSP trees out of the entire world, and then storing the planes as A,B,C,D where: Ax+By+Cz+D>0. Which is fine, except that's also going to have precision problems. And it's going to have precision problems far earlier, because x,y,z are another object's position. And now you're multiplying two positions together and subtracting a bunch of them and asking if they're positive or negative and the problem with that is that you will halve your available precision. So even doubles will only get you 52/2 = 26 bits of precision, which is rubbish. And experience with BSP trees has shown that they're extremely intolerant of precision errors. The solution for this case is to store a point on the plane and the plane's normal. Otherwise even decently-sized levels are going to end up in agony (Michael Abrash has a story exactly like this about Quake 1 - they had tiny tiny levels and they had problems!). Restricting yourself to just taking deltas between two positions will highlight problems like this and allow you to rethink them before they happen.
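
The point-plus-normal version of the plane test might look like this. It's a sketch with hypothetical names, again assuming 64-bit fixed-point positions with 16 fraction bits.

```cpp
#include <cassert>
#include <cstdint>

// Absolute positions as 64-bit fixed point, 16 fraction bits (illustrative split).
struct FixedPos { int64_t x, y, z; };

struct Plane
{
    FixedPos point;      // any point on the plane, in absolute coordinates
    float nx, ny, nz;    // plane normal, plain float32
};

// Positive in front of the plane, negative behind, zero on it.
// The only thing done to absolute positions is an exact integer subtraction;
// everything after that is float32 maths on a small delta.
inline float SideOfPlane ( const Plane &plane, const FixedPos &p )
{
    const float scale = 1.0f / 65536.0f;
    float dx = (float)( p.x - plane.point.x ) * scale;
    float dy = (float)( p.y - plane.point.y ) * scale;
    float dz = (float)( p.z - plane.point.z ) * scale;
    return plane.nx * dx + plane.ny * dy + plane.nz * dz;
}
```

Put the plane a million kilometres from the origin and a point one step (1/65536 of a metre) either side of it, and the sign still comes out right. The Ax+By+Cz+D form has no hope of that.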

''TLDR''
* Don't start your times at zero. Start them at something big. Ideally do the same with position - by default set your origin a long way away.
* float32 has precision problems even in normal circumstances - you only get about seven digits of precision.
* float64 is a tricky beast to use in practice - writing "double" in C is not sufficient.
* Variable precision is a nightmare for reproducibility and testability - even float64.
* Fixed-point may be old, but it works well for absolute time and position.
* Help yourself guard against precision-cancellation problems by not exposing absolute time and position to most parts of your app. Any time you think you need them, you're almost certainly going about the problem wrong.
* New OffendOMatic rule: any time you want to use doubles in a game, you probably haven't understood the algorithm.
I've always hated two things about sprintf. The main annoyance is that you have to have the format string, and then after that comes all the arguments. It really doesn't read well, and it's too easy to get the arguments the wrong way round or add or forget one. Related to this is the fact that there's no type-safety, so when you do screw up, it can lead to silent and/or really obscure memory-scribbler bugs.

One alternative is the cout/iostream style format with a lot of << stuff everywhere. That makes my eyeballs itch with the amount of C++ involved, and it's ugly to use with string destinations.

A third option is smart strings, but the complexity and constant alloc/dealloc behaviour also offends me. I'll be honest - I worry a lot of that magic allocation stuff is just far too complex for mortals to debug or profile. It's far too easy to write a line of simple-looking code and destroy performance.

I was hoping not to reinvent the wheel, and a few people pointed me at [[FastFormat|http://www.fastformat.org]]. It has a mode that seems to sort-of do what I want, but the documentation is a disaster and the scope of the thing is clearly huge and way beyond what I need, so rather than wade through it for two hours, I thought I'd see if I could write something simple in two hours that does what I want. Turns out, the answer is "yes".

So, the goals are:
* No dynamic allocation.
* Easier to read and control than sprintf()
* Type-safe (in as much as C can ever be type-safe - bloody auto-converts).
* Overflow-safe (i.e. it will refuse to scribble, and will assert in debug mode).

The thing I'm happy to give up is extendable-length strings, i.e. I don't mind pre-declaring the max length of any string. That's kinda obvious since I don't want dynamic allocation, but it's worth stating explicitly.

So the very first thing I wrote almost worked. The first bit was easy - it's this:

{{{
struct StaticStr
{
    unsigned int MaxLen;
    char *pStr;

    StaticStr ( char *pStrIn, int MaxLengthIn )
    {
        MaxLen = MaxLengthIn;
        pStr = pStrIn;
    }

    ~StaticStr()
    {
        // Nothing!
    }

    // Cast to a normal string. It's deliberate that it's not a const (but be careful with it).
    char *Cstr()
    {
        return pStr;
    }

    // Assignment.
    StaticStr & operator= ( const StaticStr &other )
    {
        ASSERT ( strlen ( other.pStr ) < this->MaxLen );
        strncpy ( this->pStr, other.pStr, this->MaxLen );
        this->pStr[this->MaxLen-1] = '\0';
        return *this;
    }

    // Assignment from a C string.
    StaticStr & operator= ( const char *pOther )
    {
        ASSERT ( strlen ( pOther ) < this->MaxLen );
        strncpy ( this->pStr, pOther, this->MaxLen );
        this->pStr[this->MaxLen-1] = '\0';
        return *this;
    }

    StaticStr & operator+= ( const StaticStr &other )
    {
        int ThisLength = strlen ( this->pStr );
        ASSERT ( ( ThisLength + strlen ( other.pStr ) ) < this->MaxLen );
        strncat ( this->pStr, other.pStr, this->MaxLen - ThisLength - 1 );
        return *this;
    }

    // This is actually an append - it's really += rather than + but the typing gets tedious.
    StaticStr & operator+ ( const StaticStr &other )
    {
        return *this += other;
    }

    // Append of a C string.
    StaticStr & operator+= ( const char *pOther )
    {
        int ThisLength = strlen ( this->pStr );
        ASSERT ( ( ThisLength + strlen ( pOther ) ) < this->MaxLen );
        strncat ( this->pStr, pOther, this->MaxLen - ThisLength - 1 );
        return *this;
    }

    StaticStr & operator+ ( const char *pOther )
    {
        return *this += pOther;
    }
};

// Slightly ugly...
// Used like:
//      StaticStrDecl ( sstemp1, 1024 );
#define StaticStrDecl(name,length) char StaticStringCalled ## name[length]; StaticStr name ( StaticStringCalled ## name, length )
//      char pTemp[1024];
//      StaticStrWrap ( sstemp1, pTemp );
#define StaticStrWrap(name,strname) StaticStr name ( strname, sizeof(strname)/sizeof(strname[0]) )
}}}

And that works fine for sticking strings together, so you can do code like this:

{{{
    StaticStrDecl(string1, 1024);
    char TempString[100];
    StaticStrWrap(string2, TempString);
    string1 = "Hello";
    string2 = " world";
    string1 += string2;
    string1 += " again";
}}}

...or more obviously usefully, code like this:

{{{
    string2 = " world";
    (string1 = "Hello") + string2 + " again";
}}}

Note the annoying braces. If you just do this:

{{{
    string1 = "Hello" + string2 + " again";
}}}

...the way you'd like to, it fails to compile because there's no + operator that takes a left hand of a char*, and it doesn't figure out that a different precedence would fix things. It's a pretty minor annoyance. I'm sure somebody with more C++ operator overloading experience can tell me how to fix it, but honestly I'm not sure I want to know.

Technically I shouldn't have allowed the + operator, because it's an append, and you should have to write this:

{{{
    (string1 = "Hello") += string2 += " again";
}}}

...but I hate the typing. Part of the impetus is to make things easier to read, not harder. Also, because += groups right-to-left where + groups left-to-right (and that's as much as I want to pollute my brain with, ''so don't email me''), you actually get:

{{{
    (string1 = "Hello") += (string2 += " again");
}}}

...which means although string1 is correct, string2 now contains " world again", which is not really what I wanted. Of course you can still get strange things happening in some cases if you forget the braces:

{{{
    string2 = "Hello";
    string1 = string2 + " world";              // I forgot the braces, but this still compiles & runs.
}}}

Now both of them contain "Hello world". Maybe with some sort of gratuitous const gymnastics I could fix it, but right now I'm going to ignore it and hope it doesn't bite me in the arse later. Yeah, right. Ugh.

Anyway, so what about actual number printing? Well, my first attempt was to use some temp strings and some functions to wrap sprintf, such as

{{{
    StaticStr & Int ( int i )
    {
        int written = _snprintf ( pStr, MaxLen, "%i", i );
        ASSERT ( ( written >= 0 ) && ( (unsigned)written < MaxLen ) );
        pStr[MaxLen-1] = '\0';   // _snprintf doesn't terminate if the output exactly fills the buffer.
        return *this;
    }
}}}

The way you use it is like this:

{{{
    (string1 = "Hello ") + string2.Int(1) + " world";
}}}

...and this works. string2 is "1" and string1 is "Hello 1 world". The problem comes when you want to print multiple numbers using the same temp:

{{{
    (string1 = "Hello ") + string2.Int(1) + " world " + string2.Int(2);
}}}

The problem is the two Int() functions get called before any concatenation happens, so what you get is "Hello 1 world 1". Why they're called last-first is also a mystery - I was at least expecting to get "Hello 2 world 2". C++ leaves the order in which the operands get evaluated unspecified anyway. Note that this isn't an operator precedence problem - adding lots of braces doesn't fix the problem. Nuts.

I had a cup of tea and a sit-down and a think and then an alternative came to me. It turns out it's much nicer to type, completely avoids the temporary problem, and is faster (not that speed is exactly a priority, especially when printing floats, but it's never a bad thing).

First step, add a seemingly pointless helper class:

{{{
struct StStInt
{
    int mValue;
    StStInt ( int Value )
    {
        mValue = Value;
    }
};
}}}

That's it - that's all it does - no other methods. There's also ~StStFloat. "~StSt" is just an abbreviation for "~StaticStr." You'll see why I want to shorten it in a bit. Then I add this method to ~StaticStr:

{{{
    StaticStr & operator+ ( const StStInt &Other )
    {
        int StrLen = strlen ( pStr );
        int written = _snprintf ( pStr+StrLen, MaxLen-StrLen, "%i", Other.mValue );
        ASSERT ( ( written >= 0 ) && ( (unsigned)written < ( MaxLen - StrLen ) ) );   // Only MaxLen-StrLen chars were free.
        pStr[MaxLen-1] = '\0';
        return *this;
    }
}}}

Most of the complexity here is dealing with the completely bonkers behaviour of _snprintf and various overflow checking - the actual conversion stuff is simple. Now you can write fairly elegant stuff like:

{{{
    string2 = " world ";
    (string1 = "Hello ") + StStInt(1) + string2 + StStInt(2);
}}}

You still need the extra braces, annoyingly, but it works just fine. There's no temporaries either - the values are just _snprintf'ed right onto the end of the existing string. The fact that string2 doesn't get modified is nice as well, though I worry that might be undefined behaviour, so maybe don't push it.
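
If your compiler doesn't have _snprintf, the same append can be sketched with the standard std::snprintf, which is rather saner. ~AppendInt is a hypothetical free function written for this example, not part of the ~StaticStr code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Appends the decimal form of 'value' to the NUL-terminated string in pStr,
// which has room for maxLen chars including the terminator.
// Returns true if it all fitted, false if the result was truncated.
inline bool AppendInt ( char *pStr, std::size_t maxLen, int value )
{
    std::size_t strLen = std::strlen ( pStr );
    assert ( strLen < maxLen );
    std::size_t space = maxLen - strLen;
    // Standard snprintf always terminates (space is at least 1 here) and
    // returns the length it *wanted* to write, so truncation shows up as
    // written >= space rather than as a negative number.
    int written = std::snprintf ( pStr + strLen, space, "%i", value );
    return ( written >= 0 ) && ( (std::size_t) written < space );
}
```

Appending 42 to "Hello " in a 16-char buffer gives "Hello 42" and returns true. Appending 12345 to "abcdef" in an 8-char buffer gives "abcdef1", still terminated, and returns false so you can assert on it.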

The next step was to handle formatting, because sometimes you do want it. The routines get only slightly more complex:

{{{
struct StStFloat
{
    float mValue;
    int mPrecision;
    int mMinWidth;
    StStFloat ( float Value, int Precision = -1, int MinWidth = -1 )
    {
        mValue = Value;
        mPrecision = Precision;
        mMinWidth = MinWidth;
    }
};
}}}
...and...
{{{
    StaticStr & operator+ ( const StStFloat &Other )
    {
        int StrLen = strlen ( pStr );
        int written = -1;
        if ( Other.mMinWidth < 0 )
        {
            if ( Other.mPrecision < 0 )
            {
                written = _snprintf ( pStr+StrLen, MaxLen-StrLen, "%f", Other.mValue );
            }
            else
            {
                written = _snprintf ( pStr+StrLen, MaxLen-StrLen, "%.*f", Other.mPrecision, Other.mValue );
            }
        }
        else
        {
            if ( Other.mPrecision < 0 )
            {
                written = _snprintf ( pStr+StrLen, MaxLen-StrLen, "%*f", Other.mMinWidth, Other.mValue );
            }
            else
            {
                written = _snprintf ( pStr+StrLen, MaxLen-StrLen, "%*.*f", Other.mMinWidth, Other.mPrecision, Other.mValue );
            }
        }
        pStr[MaxLen-1] = '\0';
        ASSERT ( ( written >= 0 ) && ( (unsigned)written < ( MaxLen - StrLen ) ) );   // Only MaxLen-StrLen chars were free.
        return *this;
    }
}}}

And that means you can do elegant things like:

{{{
    (string1 = "Hello ") + StStFloat(1.0f) + ":" + StStFloat(2.0f, 2) + ":" + StStFloat(3.0f, -1, 15) + ":" + StStFloat(4.0f, 3, 10);
}}}

Which produces string1="Hello 1.000000:2.00:       3.000000:     4.000". Don't ask me why the default precision for _snprintf likes so many decimal places.

Anyway, so I got my wish - I got a zero-dynamic-allocation, typesafe, format-free and fairly readable version of sprintf. Happy coder joy joy!

Your mission, should you choose to accept it, is to go out and find the wheel that I just reinvented, thus saving me the hassle and embarrassment of using this library evermore. This blog will self-destruct when I get around to it.
Someone prodded me about this the other day, so I thought I should get on and do it. I gave a GDC 2003 talk about SH, but I was never really happy with it - half an hour isn't really enough to cover it well. I haven't changed the slides, but now I have some notes for it. I corrected an error, finally wrote those elusive fConstX values down, but mainly I talk about some surrounding details about the console implementation I actually used it in - what is probably the coolest bit of all - or at least it's the bit the artists really liked about the new system. [[Spherical Harmonics in Actual Games notes]]
The rest have been removed from the front page to save space. Here are all my blog entries in chronological order:

[[Why didn't Larrabee fail?]]
[[The sRGB learning curve]]
[[NaNs cause the craziest bugs]]
[[Memory stacks and more resizable arrays]]
[[Premultiplied alpha part 2]]
[[Texture coordinate origin]]
[[Elite Dangerous on the Rift]]
[[Display rate, rendering rate, and persistence]]
[[Wrangling enums]]
[[Simple Perforce setup for the solo coder]]
[[Even more precision]]
[[Resizable arrays without using STL]]
[[Polynomial interpolation]]
[[How not to interview]]
[[Sparse-world storage formats]]
[[Matrix maths and names]]
[[New Job]]
[[A sprintf that isn't as ugly]]
[[Saving, loading, replaying and debugging]]
[[StarTopia love]]
[[Logging, asserts and unit tests]]
[[Data Oriented Luddites]]
[[Moore's Law vs Duck Typing]]
[[Platform-agnostic types]]
[[Texture streaming systems with two levels of cache]]
[[Visibility and priority tests for streaming assets]]
[[Squares or hexes]]
[[Dwarf Minecart Tycoon]]
[[Larrabee talk roundup and media attention]]
[[GDC 09]]
[[How to walk better]]
[[Larrabee ISA unveiled at GDC 2009]]
[[CAs in cloud formation]]
[[Regular mesh vertex cache ordering]]
[[Siggraph Asia 2008]]
[[Plague]]
[[Siggraph 2008]]
[[ShaderX2 available for free]]
[[Larrabee and raytracing]]
[[Renderstate change costs]]
[[Larrabee decloak]]
[[Blog linkage]]
[[Smart coder priorities]]
[[Rasteriser vs Raytracer]]
[[Texture formats for faster compression]]
[[Knowing which mipmap levels are needed]]
[[SSE]]
[[Patently evil]]
[[Trilights]]
[[Bitangents]]
[[More vertex cache optimisation]]
[[Reinventing the wheel - again!]]
[[Shadowbuffers and shaders]]
[[Utah Teapot]]
[[GDC survival guide]]
[[Pixomatic slides]]
[[Notepad++]]
[[More on SH]]
[[Licenses]]
[[Added some hindsights and notes on Spherical Harmonics]]
[[Cellular Automata article added]]
[[Impostor article added]]
[[Shadowmap vs shadowbuffer]]
[[Vertex Cache Optimisation]]
[[Strippers]]
[[Scene Graphs - just say no]]
[[Dodgy demos]]
[[Premultiplied alpha]]
[[Babbage was a true genius]]
[[Someone cited my VIPM article]]
[[VIPM article in HTML]]
[[RSS part 3]]
[[VGA was good enough for my grandfather]]
[[RSS part 2]]
[[A matter of precision]]
[[Game Middleware list]]
[[RSS banditry]]
[[...there was a load of words]]
[[In the beginning...]]
A set of articles set up by Mike Acton and friends: http://altdevblogaday.com/
I've just been reading up on the Analytical Engine. I've known the background to Babbage's life and works for ages, and I knew he was incredibly clever - the Difference Engine is an amazing feat when you consider it's made of mechanical gears and powered by a human cranking on a lever. But fundamentally all it does is a bunch of parallel additions. For the time, it would have done them extremely fast and have had awesome reliability, but the actual computations were perfectly well-understood. By the way, if you're ever in London, you have to go visit the Science Museum in South Kensington and see the [[Difference Engine Mk2|http://en.wikipedia.org/wiki/Difference_engine]] that they built. It's truly fascinating watching it work, and work well. The thing I didn't realise until I read a bit more about it is that it's actually built to achievable 19th-century tolerances, which means Babbage actually could have built the whole thing, and it would have worked convincingly and usefully. His problems were political and financial rather than mechanical, and it didn't help that he was a pompous jerk (like many geniuses).

But again, perfectly well-understood maths in the Difference Engine. Couldn't do anything revolutionary, just would have done it far better than the existing mechanisms (a room full of people adding stuff manually). No, the real genius came with the Analytical Engine. Again, I've always known it was the first programmable computer, but when people say that - you always imagine that well yes, it was slightly smarter than a [[Jacquard Loom|http://en.wikipedia.org/wiki/Jacquard_loom]], and maybe if you squinted a bit and jumped through many flaming hoops you might see it was getting close to Turing-capable, and if you examined manuals a lot and were cunning you could get it to do something kinda like a proper algorithm. Certainly when I've looked at the functionality of some of the 1940s and 50s computers, that's what they always looked like.

No, that's not the Analytical Engine at all. It's not a bag of bolts that some mathematician can show in 200 pages of jargon can be made to be Turing-complete. It's much better than that - it's basically a 6502 with bells on (literally).

[[Here are lots of details|http://www.fourmilab.ch/babbage/cards.html]] (including an emulator!), but basically it has a fairly straightforwards machine language - it does addition, subtraction, multiplication and division. It has two input registers and an output register. You can do loads from a store (memory) with 1000 entries, to the input registers, and when you load the second register, it does the operation you ask, and puts the result in the output register, which you can move to the store. You can also do shifts left and right to move the decimal point around in fixed-point maths. If the result of a computation has a different sign to the first input, or overflows, it makes a "run-up lever" move upwards. One could imagine tying a small bit of cloth to this lever, and then one might term this "raising a flag". You can issue commands that say to move backwards or forwards by a set number of commands, and you can tell it to do this all the time, or only if the run-up lever is raised. Hey - unconditional and conditional branches.

I mean that's it - it's so absurdly simple. It has load, store, the basic six arithmetic operations, and conditional branches. Right there in front of you. It's a processor. You could code on it. Look, here's some code (with my comments):

{{{
N0 6  ;; preload memory location 0 with the number 6
N1 1  ;; preload memory location 1 with the number 1
N2 1  ;; preload memory location 2 with the number 1
x     ;; set the operation to multiply
L1    ;; load memory location 1 into the ALU
L0    ;; load memory location 0 into the ALU - second load causes the multiply operation to happen
S1    ;; store the result in location 1.
-     ;; set the operation to subtract
L0    ;; load location 0
L2    ;; load location 2 - causes subtraction to happen.
S0    ;; store to location 0
L2    ;; load location 2
L0    ;; load location 0 - causes subtraction to happen.
      ;; If the result sign is different to the first argument, the run-up-lever is raised
CB?11 ;; if the run-up-lever is raised, move back 11 instructions.
      ;; Like today's CPUs, the location of the "instruction pointer" has already moved past this instruction,
      ;; so back 11 means the next instruction is the "x" above.
}}}

The result of this function is in location 1. Notice that location 2 never gets stored to - it's the constant value 1. Still having a bit of trouble? Let me translate it to C:

{{{
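/* Sign helper: like the Engine, zero counts as positive. */
static int sgn ( int x )
{
  return ( x < 0 ) ? -1 : 1;
}

int mystery_function ( int loc0 )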
{
  int loc1 = 1;
  const int loc2 = 1;
keep_looping:
  loc1 = loc1 * loc0;
  loc0 = loc0 - loc2;
  if ( sgn(loc2 - loc0) != sgn(loc2) )
  {
    goto keep_looping;
  }
  return loc1;
}
}}}

...and now change "loc2" to be "1" and massage the "if" conditional appropriately:

{{{
int mystery_function ( int loc0 )
{
  int loc1 = 1;
keep_looping:
  loc1 *= loc0;
  --loc0;
  if ( loc0 > 1 )
  {
    goto keep_looping;
  }
  return loc1;
}
}}}

You can't still be having trouble. It's a factorial function! Apart from the slight wackiness of the loop counter, which is an idiom just like any other language, it's all absurdly simple and powerful. The only big thing he was missing from the ALU was direct boolean logic for doing complex control logic. But it's not a huge lack - first, it can all be emulated without too much pain with decimal logic (OR is an add, AND is a multiply, etc), and secondly after about five minutes playing with the thing he would have realised that he needed something like that, and the mechanicals for boolean logic are trivial compared to decimal multiplication and division.
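For what it's worth, here is one way to read "emulated with decimal logic", as a small C sketch. It assumes truth values are stored as the numbers 0 and 1, and the function names are mine. Note that a plain add only works as OR if you treat any non-zero result as true; with strict 0/1 values you need to subtract the product as well.

```c
/* Truth values are the numbers 0 and 1 (an assumption of this sketch).
 * Everything below uses only add, subtract and multiply. */
int logic_and(int a, int b) { return a * b; }
int logic_or (int a, int b) { return a + b - a * b; }   /* plain a + b gives 2 for 1 OR 1 */
int logic_not(int a)        { return 1 - a; }
int logic_xor(int a, int b) { return a + b - 2 * a * b; }
```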

Just to gild the lily, Babbage wanted to attach a plotter to it. You could send the most recent computed result to the X-coordinate or Y-coordinate of the plotter, and you could raise or lower the pen. He could have coded up Tron light-cycles on the damn thing.

The one thing he missed (and to be fair so did half the computer pioneers of the 1940s) is the ability to compute addresses, i.e. instead of always looking up the data in location N where N is punched into the card itself, the ability to calculate N with maths. However, he did think of precomputed lookup-tables, which is one form of this, but only loading from it, not storing to it. Amusingly, he then decided that actually, since the table - e.g. of logarithms - had been generated by the AE in an earlier run, and that doing the lookup involved the AE displaying a number and then ringing a bell and getting a man to go rummage through a stack of punched-cards for the right one, which would take time and be error-prone (he also had a primitive error-checking system!), it might just be better if the machine recomputed the desired value there and then, rather than using a LUT at all. Which is almost exactly where we are in computing now, where LUTs are frequently slower and less accurate than just re-doing the computation!

And he did all of this in a total vacuum. He had nothing but paper and pen and brass and plans and his brains, and he invented a frighteningly useful, powerful and easy-to-program processor. Including comments and debugging support. They didn't even have a machine as powerful as a cash-register at the time - nothing! So when I say Babbage was a genius - he's one of the greats. If he'd succeeded even slightly, if the Difference Engine had got half-finished and produced useful results and people had taken him seriously - all that Turing and von Neumann stuff that we think of as the birth of modern computing - obsolete before they even thought of it, because Babbage got there nearly a century before.

Put it like this - if he had been born almost exactly 100 years //later// and instead of brass and steam had worked in relays and electrickery in the 1950s, he would have made almost exactly the same breakthroughs and probably built almost exactly the same machine - it just would have been a bit faster (but less reliable - bloody valves!). It's almost shameful that in those 100 years, nobody advanced the art of computing in the slightest. And yet think of all the things that have happened in the 50 years since then. I read Sterling and Gibson's book "The Difference Engine" ages ago, and thought it was all a bit of a flight of fancy - pushing the probability curve to breaking-point - the way SF authors should. Now - I'm not sure it was at all fanciful. Imagine if we were living in a world that had had computers for three times as long. Gordon Moore's famous law would have gone totally ballistic by now. Microsoft's motto would have been "a computer in every can of coke", because we'd have had a computer on every desk since about 1910, and the big thing that helped the Allies beat the Nazis wouldn't have been radar, it would have been the Web.

There's just one bizarre thing I can't figure out. Babbage initially specified the Analytical Engine to 40 //decimal// digits. Later, he upped it to 50. 50 decimal digits is about 166 bits. That's gigantic, even for fixed-point arithmetic. These days we find 32 bits just fine (9-and-a-bit decimal digits) - 64 in some specialised places (19 digits). But 166 bits is nuts - that is colossal over-engineering. And it's not because he didn't understand the maths - it's extremely clear from his writings that he understood fixed-point math perfectly well - shifting stuff up and down after multiplication and division, etc. In two separate places he explained how to do arbitrary-precision arithmetic using the "run-up lever" (i.e. the AE version of "add-with-carry") in case people needed 100 or 200 digit numbers. To put this in perspective, the universe is at least 156 billion light-years wide - that's 1.5 x 10^27 meters. A single proton is 10^-15 meters across. So the universe is roughly 1.5 x 10^42 protons in size - which only takes 43 decimal digits to represent. Babbage decided that wasn't enough - he needed the full 50 digits. Also, he specified a memory of 1000 entries of these 50-digit numbers, and that didn't include the code, just data - that's 20kbytes of random-access data. For a smart guy whose major problems were finding good enough large-scale manufacturing technologies and finding the cash to build his gigantic room-sized engine, that seems pretty dumb. If he'd built a 3-digit AE, his instruction set would have exceeded the computing capabilities of every 8-bit machine in the home microcomputer revolution of the 1980s. 5 digits and he'd have beaten the 16-bit PDP11, ST, Amiga and 8086 for per-instruction computing power.
If he'd only aimed a little lower, he could almost have built the thing in his own workshop (his son constructed a three-digit Difference Engine from parts found in his house after his death!), instead of spending half his time arguing with his fabrication engineers and begging parliament and the Royal Society for ever-increasing sums of money. What a waste. But still - a genius of the finest calibre.
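The arbitrary-precision trick mentioned above is the same one every 8-bit coder knows: add one word at a time and feed the carry into the next word. A minimal C sketch, assuming numbers are stored least-significant limb first and each limb holds 4 decimal digits. The limb size and all the names are illustrative choices of mine, not anything from Babbage's notes.

```c
#include <stddef.h>

#define LIMB_BASE 10000   /* each limb holds 4 decimal digits (hypothetical choice) */

/* Adds two n-limb numbers (least-significant limb first) into sum.
 * Returns the carry out of the top limb (0 or 1). */
int add_with_carry(const int *a, const int *b, int *sum, size_t n)
{
    int carry = 0;
    for (size_t i = 0; i < n; ++i) {
        int t = a[i] + b[i] + carry;
        carry = (t >= LIMB_BASE);          /* this limb overflowed: "raise the lever" */
        sum[i] = carry ? t - LIMB_BASE : t;
    }
    return carry;
}
```

With a scheme like this, a machine with short words can still chew through 100 or 200 digit numbers, just more slowly.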

My theory is, Babbage was actually given a Speak-and-Spell by Dr. Who. Or maybe Sarah Connor. It's the only rational explanation.
I heard an eye-opening rant by Chris Hecker the other day.

No, not //that// rant - I already knew the Wii was one and a half GameCubes stuck together with duct tape. No, the one about bitangents.

"The what?" I hear you ask. "You mean binormals?" As it turns out - no. The rule about what "bi" means is that some objects have an arbitrary set of properties, and you need two of them. It doesn't matter which two - you just need them to be two different examples. So a curve in 3D space has not just one normal, but loads of normals - a whole plane of them in fact. But you only need two of them - pretty much any two. You call one the "normal" and you call the other the "binormal". The standard classical-language hijack of "biX" just means "two X" or "the second X". And that's where binormal comes from - curves. Note that a curve only has a single tangent vector (well, OK, it has two, but one is the negative of the other).

OK, so now on surfaces, you have a single normal. There really is only one normal. But there's loads of tangents - an entire plane of them - the plane of the surface. And so you need to pick two of them (typically if the surface is textured we're concerned with the lines of constant V and the lines of constant U). So one is called the tangent, and logically the other one should be called the "bitangent". Yes, you heard me - not "binormal".
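For a textured triangle, the usual way to pick those two vectors is to solve for the directions in which U and V increase across the surface. Here's a C sketch of that standard calculation. The struct and function names are made up for the example, and the results are left unnormalised.

```c
struct vec3 { float x, y, z; };
struct vec2 { float u, v; };

/* Computes the (unnormalised) tangent and bitangent of one textured triangle.
 * The tangent points along increasing U (a line of constant V), the
 * bitangent along increasing V (a line of constant U).
 * Returns 0 if the UV mapping is degenerate, 1 on success. */
int triangle_tangent_bitangent(const struct vec3 p[3], const struct vec2 uv[3],
                               struct vec3 *tangent, struct vec3 *bitangent)
{
    /* Triangle edges in position space and in texture space. */
    struct vec3 e1 = { p[1].x - p[0].x, p[1].y - p[0].y, p[1].z - p[0].z };
    struct vec3 e2 = { p[2].x - p[0].x, p[2].y - p[0].y, p[2].z - p[0].z };
    float du1 = uv[1].u - uv[0].u, dv1 = uv[1].v - uv[0].v;
    float du2 = uv[2].u - uv[0].u, dv2 = uv[2].v - uv[0].v;

    /* Solve  e1 = du1*T + dv1*B  and  e2 = du2*T + dv2*B  for T and B. */
    float det = du1 * dv2 - du2 * dv1;
    if (det == 0.0f) {
        return 0;
    }
    float r = 1.0f / det;

    tangent->x = (e1.x * dv2 - e2.x * dv1) * r;
    tangent->y = (e1.y * dv2 - e2.y * dv1) * r;
    tangent->z = (e1.z * dv2 - e2.z * dv1) * r;

    bitangent->x = (e2.x * du1 - e1.x * du2) * r;
    bitangent->y = (e2.y * du1 - e1.y * du2) * r;
    bitangent->z = (e2.z * du1 - e1.z * du2) * r;
    return 1;
}
```

Both outputs lie in the plane of the triangle, which is exactly why the second one is a tangent of some sort and not a normal of any sort.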

And that's the rant. When doing bumpmapping on a surface, you should talk about the normal, tangent and //bitangent//. Nobody's quite sure why people cottoned on to the word "binormal", and I've certainly spent the last ten years talking about binormals, but you know what - it's still wrong. From now on, I will speak only of bitangents (though that's going to cause chaos with Granny3D's nice backwards-compatible file format).

Here endeth the lesson.
The game of the film, made by [[MuckyFoot]]. Except it had almost nothing to do with the film because we were making it at the same time as the film, didn't know much about their plot, so we just used the same characters in a different story. It had some neat features. The 360-degree fighting was kinda nifty - it used dual-stick "Robotron melee" controls - left one moved Blade, right one would attack in the direction pushed, and the exact attacks used were all down to timing and which weapon you had out.

It came out on Xbox and ~PS2, and I wrote the Xbox rendering engine and all the other platform-specific stuff (sound, controllers, save/load, etc). From the start we decided the two platforms were too different to share much code, and we were mostly proved right, though in the end we did share more than we thought - mainly the streaming code and some of the lighting stuff. Originally the ~PS2 wasn't going to try to stream data, because we assumed the DVD seek times would kill us, but it worked much better than we thought, and the extra memory really helped. I'm fairly proud of the graphics engine - got a lot of good tech in it such as streaming almost all rendering resources, some neat code to deal with multiple dynamic light sources, compile-on-demand shaders, and some really good VIPM stuff. Other Xbox-only features were some nice self-shadowing with my first experiments with shadowbuffers, and a cool-looking cloth simulation for Blade's cloak.

Incidentally, the lighting was a huge pain. You see, Blade is this black guy who wears black leather and black shades, and he hunts vampires, so we're not going to have any bright sunlit levels - we're pretty much always in sewers at night. Result - he's basically invisible. It's fine to show him as a lurking shadow for a film, but in a game you kinda have to be able to see what's going on. Pretty much the only reason you could see him at all was because we cranked up the specular highlights like crazy, and made everything else in the scene fairly bright colours, so at least he's silhouetted.

Despite being happy with it on a technical level, it was still a rather rushed game, and had some pretty rough edges. Some of the levels were rather dull - it would have been better if it had been a shorter game, but back then it wasn't acceptable to ship anything less than 20 hours of play, even if a lot of that play was a bit dull. C'est la vie.
If you're going to add this blog to your blogroll, and thanks very much for doing so, please use http://www.eelpi.gotdns.org/blog.wiki.html, and not the address you actually see in your browser - that one will change if/when I move ~ISPs. Thanks.
Pascal Sahuc emailed me a link to a neat paper on cloud simulation with a cheap CA. [["A Simple, Efficient Method for Realistic Animation of Clouds"|http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.4850]] by Dobashi and Kaneda. In fact it's absurdly cheap - each cell has only three bits, and it's all boolean tests. I'm amazed that such wonderful results come out of such simple rules and three bits of state. They do smooth the field before rendering, but it still [[looks really good|http://video.google.com/videoplay?docid=3133893853954925559]].
Another article got converted to HTML, and some hindsights. I wish I'd found a fun game to put the code in - it seemed to work really well, but I couldn't think of anything really cool to do with it. [[Papers and articles section|http://www.eelpi.gotdns.org/papers/papers.html]]
My hobby game [[Dwarf Minecart Tycoon]].
I went over some of the fundamentals behind the Data Oriented Design movement in [[Moore's Law vs Duck Typing]]. I'm not going to actually cover what DOD is because I honestly don't (yet) know, and I'm not sure even the main proponents do either. We understand the problem, and we have some examples of solving specific cases, but I'm not sure there's yet a good framework of rules or (wouldn't it be nice) set of libraries to make sure you don't have the problem in the first place while also keeping code readable and flexible. I've also not been paying //that// much attention since I'm no longer a full-time games coder - although I still write plenty of code, the complex stuff doesn't have to be super fast, and the stuff that has to be fast isn't very complex. So I'll leave it to others to explain DOD since I'll just screw it up.

But what I do want to make is a meta-comment. The DOD folk are widely seen as a bunch of ninja curmudgeon bit-heads. Although I think they'd probably be quite proud of that label, the image it conveys is of folks who can hand-tune any assembly routine known to man but loathe anything higher-level than C99. They can argue about set associativity, DTLB walking and tell you how many clock cycles it is to main memory. Which is all very clever, but if //you// don't know that stuff they think you're Not A Well Rounded Programmer, and if you use the word "virtual" they'll commence much wailing and gnashing of teeth. In the other corner we have hipsters coming out of university courses with heads full of Java and Python and looking blankly at you when you talk about bitfield manipulation and the justifiable use of "goto". When you say "compile" they think you mean "tokenize", and they regard exception-handling as part of standard control flow rather than an admission of failure in code design.

So unfortunately, the DOD message gets conflated with the other two big language flame-wars that have been going on for a while. So let me clearly enumerate them all:

1. Object Oriented Programming is not the answer to everything.
2. C++ is a terrible implementation of an OOP language.
3. Using lots of indirections can trash your caches, destroy performance, and you'll never be able to find it again.

.#3 is the big new DOD thing we're all ranting about.

.#2 is really difficult to argue too hard against. There's so many examples of better languages out there, and so many examples of dumb bugs caused by C++ implementation decisions. And amusingly it's one thing the hipsters and the 8-bitters can completely agree on. But it's also difficult to do anything about it. It's a practical reality that nobody is going to switch to another language any time soon, and the current leading contenders (Java, Python) smack into problem #3 pretty hard. Of course we should all switch to LISP - let me know when that happens and I'll join you dancing around the bonfires of our editions of K&R, but until then I need it by my side to debug some muppet's lack of parentheses or to look up for the zillionth time how to declare a function returning a pointer to an array of const function pointers. I put bitching about C++ in the same bucket as complaining about the frailties of the human spine and why Macs only have one mouse button - there's nothing you can do about it, so just shut up and deal with it.

.#1 is really the only interesting discussion. It's certainly not new and the DOD folk didn't start it, but for most games coders the DOD context may be the first place they've seen people being quite so brazenly ~OOP-phobic. Even then it's not //that// interesting a discussion, because the answer is clearly not "OOP has no uses", nor is it "OOP is the answer to everything", but somewhere in the middle, so it's the position of that line that matters.

I happen to think that OOP is overused. In particular I think OOP is used when the thing you want is not //objects// but //abstractions//. Unfortunately the major way C++ knows how to do abstractions is with objects. Thus, any time you want an abstraction (e.g. a cache of previous results) the path you end up getting pushed down by C++ is OOP ("make an object that represents the cache, and now make it inherit from..." aaaarg''gghhh!''), and so what we have is a bunch of ~OOPy shitty unreadable code. But it's shitty because it's ~OOPy and it shouldn't be, not because OOP is inherently a bad design paradigm (when it's appropriate).

The other thing I see is people thinking DOD and OOP are opposite ends of the spectrum. This is C++'s fault again, though to be fair a bunch of other ~OOPy languages screw it up as well. In C++ when you have an object (a real game object that is - like a tank or a person) with a bunch of member values such as (position, orientation, health, ammo), it puts all those values contiguous in memory. But DOD says that the rendering only cares about the first two items, so why are you polluting your caches with the other two when rendering? You should have two separate arrays (pos+orient) and (health+ammo) and have one object reference separate items in each array. That's actually totally fine by OOP - all it says is that you want to be able to type obj->pos and obj->ammo, you want to be able to call obj1->~ShootAt(obj2) and so on - it doesn't say anything at all about how the data structures should be laid out. So in theory DOD and OOP should be completely compatible - OOP is a design and syntax philosophy, while DOD cares about data layout and traversal order. A good example of these two concepts being kept separate is databases - how you access the data (i.e. views/queries) is decoupled from how the data is stored. Unfortunately, C++ makes it really difficult to separate data layout from syntax. I suspect you can do it with gratuitous overloading and/or templating, or you can force everything to use accessor functions, but then the maintainability of the code plummets rapidly. So yes, //in C++// DOD and OOP tend to tug in different directions. But again, that's just K&R&S kicking you in the nuts. You really should be used to it by now.
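Here's a rough sketch of the two-array layout in plain C, going down the "accessor functions" route. Everything in it (the names, the fixed capacity, using an index as the object handle) is an assumption for illustration, not a recommendation for how a real engine should do it.

```c
#include <stddef.h>

#define MAX_OBJECTS 1024                    /* arbitrary capacity for the sketch */
#define INVALID_OBJECT ((object_id)-1)

typedef size_t object_id;                   /* an "object" is just an index into both arrays */

struct hot_data  { float pos[3]; float orient[4]; };   /* walked every frame by rendering */
struct cold_data { int health; int ammo; };            /* only touched by gameplay code */

static struct hot_data  g_hot[MAX_OBJECTS];
static struct cold_data g_cold[MAX_OBJECTS];
static size_t g_count = 0;

object_id object_create(float x, float y, float z, int health, int ammo)
{
    if (g_count >= MAX_OBJECTS) {
        return INVALID_OBJECT;
    }
    object_id id = g_count++;
    g_hot[id].pos[0] = x;
    g_hot[id].pos[1] = y;
    g_hot[id].pos[2] = z;
    g_hot[id].orient[0] = 0.0f;             /* identity quaternion */
    g_hot[id].orient[1] = 0.0f;
    g_hot[id].orient[2] = 0.0f;
    g_hot[id].orient[3] = 1.0f;
    g_cold[id].health = health;
    g_cold[id].ammo = ammo;
    return id;
}

/* Accessors stand in for "obj->pos" and "obj->ammo": the caller never
 * sees which array the data actually lives in. */
float *obj_pos(object_id id)  { return g_hot[id].pos; }
int   *obj_ammo(object_id id) { return &g_cold[id].ammo; }

/* A stand-in for the render pass: it walks only the hot array, so health
 * and ammo never get pulled into the cache. */
float sum_x_positions(void)
{
    float total = 0.0f;
    for (size_t i = 0; i < g_count; ++i) {
        total += g_hot[i].pos[0];
    }
    return total;
}
```

The calling code still thinks in terms of objects with a position and some ammo. Only the accessors know the data has been split, which is the separation of syntax from layout the paragraph above is asking for, bought at the price of writing an accessor for everything.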
...in progress...

FP precision problems between machines.

Compiler optimizations.

Changing "constants".

Recording & holding RNGs

Player input record (now need to append to a savegame?) and playback.

A way of compressing vertex data using the principles of displacement mapping, but without needing complex hardware support.

The idea is to encode the vertex as three indices to other vertices, plus two barycentric coordinates, plus a displacement along the interpolated normal. This gives you a vertex that is about 8 bytes in size, which is far smaller than most representations, reducing memory footprint and bandwidth requirements. This stream can be easily decoded by ~VS1.1 shader hardware, which is now ubiquitous.
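As a CPU-side reference for the maths (the real decode would live in the vertex shader), here is a hedged C sketch. The exact 8-byte packing below is just one plausible layout I've assumed for illustration: 8-bit indices into a small base mesh, 8-bit barycentrics and a 16-bit displacement. The names and the {{{disp_scale}}} parameter are likewise invented, and the interpolated normal is left unnormalised to keep it short.

```c
#include <stdint.h>

struct float3 { float x, y, z; };

/* Hypothetical 8-byte packing; the real format may well differ. */
struct packed_vertex {
    uint8_t index[3];   /* three base-mesh vertices (assumes <= 256 of them) */
    uint8_t bary0;      /* weight of index[0], 0..255 maps to 0..1 */
    uint8_t bary1;      /* weight of index[1]; the third weight is implied */
    uint8_t pad;
    int16_t disp;       /* signed displacement, in units of disp_scale */
};

/* Rebuilds a position from the base mesh's positions and normals. */
struct float3 decode_vertex(const struct packed_vertex *v,
                            const struct float3 *base_pos,
                            const struct float3 *base_nrm,
                            float disp_scale)
{
    float b0 = v->bary0 / 255.0f;
    float b1 = v->bary1 / 255.0f;
    float b2 = 1.0f - b0 - b1;
    float d  = v->disp * disp_scale;

    const struct float3 *p0 = &base_pos[v->index[0]];
    const struct float3 *p1 = &base_pos[v->index[1]];
    const struct float3 *p2 = &base_pos[v->index[2]];
    const struct float3 *n0 = &base_nrm[v->index[0]];
    const struct float3 *n1 = &base_nrm[v->index[1]];
    const struct float3 *n2 = &base_nrm[v->index[2]];

    /* Barycentric blend of the positions, plus d along the blended normal. */
    struct float3 out;
    out.x = b0 * p0->x + b1 * p1->x + b2 * p2->x
          + d * (b0 * n0->x + b1 * n1->x + b2 * n2->x);
    out.y = b0 * p0->y + b1 * p1->y + b2 * p2->y
          + d * (b0 * n0->y + b1 * n1->y + b2 * n2->y);
    out.z = b0 * p0->z + b1 * p1->z + b2 * p2->z
          + d * (b0 * n0->z + b1 * n1->z + b2 * n2->z);
    return out;
}
```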

For more details see:
* Dean Calver's article in ~ShaderX2 "Using Vertex Shaders For Geometry Compression" [[(available for free now)|http://www.realtimerendering.com/blog/shaderx2-books-available-for-free-download/]]
* [[My GDC2003 paper on Displacement Mapping|http://www.eelpi.gotdns.org/papers/papers.html]]
I originally wrote this as a response to [[a blog post by Eric Haines|http://www.realtimerendering.com/blog/60-hz-120-hz-240-hz/]], but it turned into a bit of an essay along the way. I thought I'd repost it here and also add some clarifications.



There's two separate concepts that people frequently conflate - flicker fusion and motion fusion. They're sort-of similar, but they're absolutely not the same, and they occur at different frequencies. At time of writing, [[the Wikipedia article on flicker fusion|http://en.wikipedia.org/wiki/Flicker_fusion_threshold]] is a mess, constantly switching between the two without warning. It's also misleading - there are plenty of people that can see flicker well above 60Hz (and I am one of them), and the variance is large in humans. In this post I'll try my best to be clear - I'll use "Hz" when talking about flicker, and "fps" when talking about motion.

Although this is something we study a huge amount in VR ~HMDs, I'm not going to say too much about them in this post, because any stuff I post about VR goes on the work blog site, not here! So most of this is centered around ~TVs, monitors and cinema.

''Flicker fusion''

Flicker fusion is simple - the lowest speed at which a flickering light stops flickering. It varies between about 40Hz and 90Hz for most people. There is also a distinction between conscious flicker and perceptible flicker. Using myself as an example, I cannot consciously see flicker above about 75Hz. However, if I look at a 75Hz flicker for more than about an hour, I'll get a headache - which is certainly "perceptible". The flicker needs to be above 85Hz to not hurt. This 10Hz gap is not uncommon, though it's largely driven by anecdotal evidence from ~CRTs, as setting people in front of flickering displays for hours on end is not something we do much any more. In my case, this is all experience gained from my youth in the early 90s. High-speed ~CRTs were expensive - but I finally found a nice second-hand iiyama monitor that would consistently give me 1600x1200 pixels of text and also run at more than 75Hz. We believe that a display running at 90Hz is "fast enough" to make everyone happy, though it's possible there's a few rare exceptions out there (and if you know you are one, email me - we'd love to use you as a guinea-pig!)

People are significantly more sensitive to flicker in their peripheral vision than in the middle (also about 10Hz-15Hz difference). This may be why some people can tolerate 60Hz CRT ~TVs and monitors, but fewer can tolerate 60Hz fluorescent lighting flicker.

A lower duty cycle raises the threshold of flicker fusion. I don't have good data on this, but from memory going from 50% duty-cycle to 25% needs about another 10Hz to reach flicker fusion. But the curve on this is funky and non-linear (fairly obviously, 100% duty cycle = zero flicker at any frequency), so don't try to extrapolate too much - just know that duty cycle is important.

''Motion fusion''

Motion fusion is the rate at which successive images appear to start moving smoothly. Note that unlike flicker, "moving smoothly" is a continuum, and higher framerates absolutely do improve the sense of motion. For most people, smooth motion starts at around 20fps (which is why most film is still at 24fps), but it consciously increases in "quality" to 120fps and possibly beyond.

''Combining the two''

So how do these interact? Well, cinema typically has 24 frames per second, which is usually enough for motion, but typically has a 50% duty cycle, and so would flicker like crazy if each frame were simply flashed once, at 24Hz. So cinemas flash each frame two (48Hz), three (72Hz) or four (96Hz) times to reduce the flicker. I don't have great data on this - so projectionists let me know if I'm wrong here - but I believe in the past most cinemas used 48Hz, but modern cinemas mostly use 72Hz, as 48Hz is too low for a "surround" experience - it causes too much flicker in most people's peripheral vision.

The choice of 50/60Hz for CRT ~TVs was driven by flicker fusion limits, and with no frame-storage ability in early ~TVs, they were forced to actually display that many frames. They have a lower duty cycle than cinema - usually around 25% (and it's an exponential drop-off rather than a square wave, so even that number is hand-wavy), which may explain why they needed a higher frequency than cinema. However they cheated with resolution and interlaced to get 30 frames per second. It's not a true 30fps in the cinema sense though, because even with interlacing you can get 60 "motions" a second - though this does depend on the camera and the recording equipment. I'm not that well-versed in TV tech - I don't know how often you get 60fps of half-frame motion, and how often it gets squashed down to 30fps of full-frame motion.

We know from video games that 60fps looks visibly better than 30fps for most people on fast-moving content. Whether they care enough is a different question, and one that continues to rage. For myself, I can easily see the difference, but one doesn't really look "better" than the other to me on a monitor. However, in a VR HMD it's a very different matter for a variety of reasons!

''What about 120 and 240?''

Many modern LCD ~TVs advertise that they have 120fps and 240fps modes, even with regular 60fps inputs. How do they do that, and why?

120fps is fairly obvious - it just looks smoother. Just like 60fps looks smoother than 30fps, 120fps looks smoother than 60fps for most people. It's a more subtle effect than 30fps->60fps, but it's certainly there - I see the difference mainly in smooth pans. Of course the TV has to get that extra frame data from somewhere, and they do it by extracting motion vectors between successive 60fps frames and then inventing new frames by tweening. This causes some odd artifacts sometimes, and although the algorithms are much better these days I still prefer to have them off and just watch 60fps data. The tweening also adds latency - fine for a film, bad for interactive content.

OK, so what's the deal with "240fps"? Well, first up, it ain't 240fps - that's marketing bullshit. It's 120 frames of data, interleaved with 120 frames of black. Or, as an engineer might put it - 120fps with a 50% duty cycle on illumination. But that doesn't look as good on a billboard. LCD screens do this by turning the (usually LED) backlight off for those black frames, which also gives the LCD pixels a chance to switch states while they're not being illuminated.

So why do this? Why turn the display off for half the time? Normally, LCD displays are full-persistence - the pixels are lit all the time by a backlight. They take a while to switch between colours, but they're always lit - they're always showing something. That means that in things like smooth pans, your eyes are following the pan across the screen, but the pixels are of course moving in discrete 120fps jumps, not continuously. This "smears" the image across your retina and results in blur. We get this really badly in VR - for more details see [[Michael Abrash's great post on the subject|http://blogs.valvesoftware.com/abrash/why-virtual-isnt-real-to-your-brain-judder/]] - but it also happens on ~TVs when your eyes are following moving objects. The solution is the same - low-persistence, otherwise known as "turning the pixels off between frames". If you turn the pixels off, they won't smear.

So a 50% duty cycle is good for sharp images, so why hasn't this been done before on LCD screens? The problem is if you do a 50% duty cycle at 60Hz, it will cause flicker (at least with the widescreen TV monstrosities we have these days). That's why LCD ~TVs had to get to 120Hz before going low-persistence. And it really does look better for a lot of content - the edges get sharper even though you can sometimes lose some brightness.

But this prompts a question - why not just show 60fps data at 120Hz with low persistence? Just show each frame twice, like cinema does. Why do you need the motion-interpolation tech? Well, coz otherwise the pixels don't smear, they double-image. Think of the same smearing effect, but with the middle of the smear missing - just the two ends. So now any time something moves, you get a double image. This is a very visible effect - even people who are perfectly happy with 30fps rendering and can't see flicker over 50Hz can still see this double-edge effect at 120fps low-persistence! Cinema works around this by adding masses of blur - gigantic gobs of it any time anything moves. But existing TV content isn't authored with this much motion-blur, so it would double-image. And that's why the tweening is generally forced on if you go 120Hz low-persistence.
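To put rough numbers on the smear and the double image, here is a deliberately simplified C model. It assumes the eye tracks the pan perfectly and the pixels switch instantly, and the function names are mine.

```c
/* Width in pixels of the smear across the retina while the eye tracks a pan:
 * how far the eye travels during the time each frame stays lit. */
double smear_pixels(double pan_px_per_sec, double refresh_hz, double duty_cycle)
{
    return pan_px_per_sec * duty_cycle / refresh_hz;
}

/* Distance in pixels between the repeated images when the same frame is
 * flashed again one refresh later: how far the eye has moved in between. */
double double_image_gap_pixels(double pan_px_per_sec, double refresh_hz)
{
    return pan_px_per_sec / refresh_hz;
}
```

In this model a 960 pixels-per-second pan on a full-persistence 120Hz display smears each frame across 8 pixels, and a 50% duty cycle halves that to 4. Flash 60fps content twice at 120Hz without tweening and the second flash lands 8 pixels away from the first, which is the double edge.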

(It's an interesting historical question why ~CRTs - which were inherently low-persistence, with about a 25% duty cycle depending on the phosphors - did not look terrible. My theory is that we just didn't know any better - the competition was 24fps/48Hz cinema, and that sucked even more!)
If you're going to release a demo, try and minimally playtest it first on some people who (a) have not already played your game and (b) have $50 they can spend on anything they like. If for some bizarre reason they fail to immediately hand over the $50 in exchange for a full copy of the game, you might want to think about tweaking your demo.

Here's the checklist. It's not exactly rocket-science, but it's truly astounding how many manage to violate not one point on it, but every single one. It's even more astounding how random downloads from third- and fourth-tier publishers on Fileplanet or wherever manage to consistently score several bazillion points above the first-tier publishers parading their stupidity on the heavily-controlled and scrutinised Xbox Live. Maybe that's because those lowlier publishers have to work for their money - who knows?

(1) To paraphrase Miyamoto - a late demo can eventually be good, while a rushed demo is bad forever. To put it another way - the demo isn't for the fanbois who read the gaming press and would buy a turd-in-a-box from company X. The demo is for the non-hardcore looking for something new. If you release the demo early and it sucks, you will turn away everyone, including the fanbois. If you ship it after the game goes to master and make sure it's not horrible, it will have the most glaring bugs fixed, be more balanced, and you might pull in some longer-term interest and keep yourself from being hurled into the bargain-bin two weeks after release.

(2) If at all possible, include the tutorial. I know you think your precious snowflake is so intuitive it's just pick-up-and-play. But again remember - your demo is NOT for the fanboy who bought the first eight episodes of your game series. It's to find the people who don't have preorders. They may not know how to play your game. They may have never played a game like yours. They may have never played any game. That is probably why they have a surplus of $50 bills. One of them could be yours.

(3) If you really can't be bothered with a tutorial, at least have some help screens that tell people what the buttons on the controller do. Maybe even what the strange bars on the screen do, or what the icons on the mini-map do, and what to do about each of them. Especially when they flash - flashing says "I'm important". Not explaining what important stuff is shows disdain for the player. And don't make those screens be the loading screens because guess what - they vanish while the player is reading them. Nothing shows a new player the middle finger quite so directly as a screen full of important text that you then don't let them read properly. They haven't even pressed a button and already they hate you!

(4) Make the demo nice and easy. Demos are not there to present a challenge to the player, they're there to demonstrate the full set of features to them. So for example in a certain fighting game renowned for its boob physics, you might want to slow everything down a bit and make the computer opponents not quite so ... well ... ninja. Then the player will actually be able to experiment with all the cool throws you spent so long doing the game code and animation for, instead of being mid-air juggled for 75% of the health bar the instant they do anything but madly bash the uppercut button.

(5) What is your Unique Selling Point? Why is your game teh ace and all others teh suk? After all, that is why you spent 500 man-years on this product, and that is why your publisher is beating you over the head about deadlines. I may be talking crazy here, but I'm thinking you may wish to explain it to the player. If you don't, the player will be looking at your product and thinking deep philosophical questions such as "why would I spend $50 on this?" You do not want them thinking those sort of questions - you want them thinking "wow - I'm glad I spent $50 on this brilliant game." If you can't manage that, at least go for "shame I spent $50 on this game - the USP sounded cool but it wasn't that much fun after level 8". Your bank manager is equally happy with either.

This is your first contact with a large chunk of your potential audience. They have $50, and they are deciding whether to give it to you, so you can feed your wife/husband/kids/cat/llama/habit. Marketing has done its job, and the punter is finally devoting their full and undivided attention to your game and playing your demo. Now it's all up to you - this is your fifteen minutes of fame. If that person doesn't like your demo, it doesn't matter how hard you crunched to get the game out the door.
As a side hobby from the SuperSecretProject, I've been tinkering with writing a game. When I was writing games professionally I really didn't feel like doing more games at home, but now it's a refreshing change from designing instruction sets. However, the question was - what game to do? I didn't want to try to think up something new (dammit Jim, I'm a programmer not a game designer!), so I just blatantly stole concepts from two of my favourite games - [[Dwarf Fortress|http://www.bay12games.com/dwarves/]] and [[Transport Tycoon|http://en.wikipedia.org/wiki/Transport_Tycoon]]. I like them for very similar reasons - you set up a system, and little autonomous agents run around in that system. That sort of indirect influence is really appealing to me as a programmer as well as a game-player. So starting with the high-level concept of "Dwarf Fortress with trains", it could go pretty much any place really.

The point of this is to do things I've not really done before, such as writing map structures, managing inventory, path-finding, simple UI, load/save. Whereas I have already done tons of graphics and sound stuff, so for this project I literally don't care about graphics. I literally have a bunch of teapots running around in a gouraud-shaded fixed-function-vertex-shader environment. I'm just using the ~D3DXCreate* functions and abusing my poor graphics card with the number of draw calls, and I refuse to care! This is very much in the spirit of Dwarf Fortress, which uses a full ~OpenGL pipeline to emulate a 16-colour ASCII terminal - though I do at least have a proper 3D camera for the sake of playability. Much as I love DF, the 2D-only interface is a huge barrier to gameplay.

I have no pretensions to ever ship the game - it may not even ever be fun to play (since my goal is far more that it should be fun to program!). It's certainly not playable now. But as a programmer there's lots of interesting little things that have kept me intrigued that have never really come up in my life as a graphics coder before. So I thought I'd write about some of them.
''[Edit - this was all good info at the time, but now everything Just Works so most of this post is obsolete. But I don't like to remove stuff from the web, so it remains here for posterity.]''

I've been waiting for this a long time. Elite on the BBC Model B (at school) and the Speccy (my home machine) was the game that made me sit up and realise that 3D rendering was a thing, and it was awesome, and I had to figure it out. I also rapidly discovered that it was a thing that is very difficult, but it inspired me to keep pushing my coding knowledge from BASIC to FORTH to assembly and learning all sorts of tricks. It was only years later on the Atari ST that I felt I actually knew what I was doing and was finally a competent graphics coder, and that moment came when I had a Viper rotating on my screen as fast as the ST version of Elite. It was only much later when I went back and learned some of the details of the 8-bit versions that I realized just how many tricks had had to be played, and why the Amiga/ST versions were straightforward by comparison.

Later at university I met David Braben a few times, and despite my rampant fanboyism, he helped me get my start into the industry. And I've always been a graphics programmer at heart through my various jobs, and on almost every project I have ever worked on, the very first thing I get running is a rotating Viper - I can type the coordinates and faces in from memory (which isn't actually that difficult, but it's a neat party trick).

And now I find myself at Oculus working on VR. And lo - the streams have crossed, and two awesome things are coming together very nicely. The ~DK2 version of Elite is still a work in progress, but it's already mighty impressive. I thought I'd write a guide to getting it working. I did this on my "virgin" home machine rather than my work one, using only the public resources, so hopefully you should not have too many surprises replicating it. These instructions are for ~DK2 on a Win7 machine - I haven't tried Elite on ~DK1, or on any other versions of Windows yet.
* I recommend a reasonably fast desktop machine for VR. In general, very few laptops give a good VR experience. I have a machine with a ~GeForce 670 and a machine with a Radeon 7870 - neither are bleeding edge, but both work very well.
* Install the latest public Rift runtime or SDK. At the time of writing this is SDK v0.4.4, display driver v1.2.2.0, camera driver v0.0.1.7, but later versions should also work.
* A common mistake is to forget to plug in the sync cable between the camera and the junction box on the HMD's cable. The junction box end can be a little stiff at first, so make sure it's pushed all the way in.
* Also remember to remove the sticky plastic shipping cover on the camera's lens. In a stroke of genius on our part, it's transparent to visible light which makes it easy to miss, but completely opaque to IR. Do NOT remove the silver-coloured cover - that is meant to be there (it reflects visible light, but it's transparent to IR).
* Make sure the Rift is turned on. The button on the top right is the power button - press it so the light is yellow or blue (not off). Be careful when putting the Rift on - it's easy to press this button again and turn the Rift off just as you put it on your head. I do this at least once a week.
* If all is ready, the light on the Rift should be blue, and the light on the camera should be blue, and both serial numbers should be displayed in the Oculus Config Util. If not, check your cables!
* Open the Oculus Configuration Utility, set up a user profile, enter your details, and calibrate it to your head shape using the "Advanced..." button. It is important you follow the instructions about the side adjustment dials correctly - set them all the way out while actually doing the calibration, and then set them back to where they are comfortable, and set the slider in the Config Util to match the dial setting you chose (note the picture in the Config Util is of the dial on the right side of the HMD).
* Check everything is working by clicking "Show Demo Scene" in the Oculus Config Util. It should track your head rotation and translation smoothly, and the desk and objects on it should seem to be a realistic size.

OK, now you have the Rift set up in the default "Direct" mode. However, at the time of writing Elite is still using the older "Extended" mode like a lot of other apps, which can be a little fiddly to set up. The Elite chaps say they're working on the Direct-mode support, but right now it's experimental, and I haven't tried it yet.
* In the Oculus Config Util, in the Tools menu, select Rift Display Mode and select "Extend Desktop to the HMD" and hit Accept. This will make the Rift show up as a monitor in Windows.
* Make sure the Rift is turned on (blue light) when doing this next bit and hasn't gone to sleep.
* Right-click on your desktop, select Screen Resolution, and you should see at least two monitors are connected, one of which will be a monitor called "Rift ~DK2"
* This will be a monitor that is 1080x1920 in size, and the orientation will say "Landscape" even though it's clearly taller than it is wide. This is normal.
* Change the orientation to "Portrait" (not "Portrait (Flipped)"), which confusingly will change it to a 1920x1080 monitor. All apps use it in this mode, not just Elite - you'll only need to do this once.
* Make the "Multiple Displays" setting say "Extend these displays" (this is why we call this "Extended" mode - because you're extending your desktop onto the Rift. Don't try to actually use it as a desktop though!)
* Don't worry if the Rift shows up as monitor 1 or monitor 2 - either is fine. Your Windows desktop will still be on your primary monitor either way.
* Do NOT click "Make this my main display" for the Rift ~DK2 monitor otherwise all your icons and start bars will move to the Rift in horrible ways. Seriously, don't try this.
* Re-run the Oculus Config Util demo scene (the desk with the cards on it) - it should work correctly in Extended mode the way it did in Direct.

Now the Rift is set up for Extended mode, which is a handy thing to know as lots of apps still use it. You can flip it back to Direct mode by using the Oculus Config Util app as before. Now let's get Elite working.
* Install and start Elite.
* On the main menu, select Options, then Graphics.
* Scroll down to the "Quality" section and put "Preset" to "Low". You can fiddle with the various settings later, but Elite looks gorgeous even in low, and it will give you the best possible VR experience. Remember - in VR, framerate is king. The most gorgeous shaders in the world are no good if it's not making framerate. On my machines framerate still drops below 75fps inside the really pretty spaceports (you will see this as a double-image "judder"), but during combat where it matters it's mostly smooth as silk.
* Hit "Apply". Coz this is the tricky bit.
* Go into Options, Graphics again.
* Find the heading called "3D" and select one of the "Oculus Rift" modes. I highly recommend using headphones rather than speakers for the full awesomeness.
* Do NOT hit apply yet (if you do, make sure the HMD is sitting horizontal, e.g. put it on your lap, or you won't be able to see anything!)
* Now go to the heading called "Monitor" (3rd from top) and select the other one - usually called "Secondary". If you have more than two options, I'm not sure what to pick, sorry.
* If you did it right, the "Fullscreen", "Refresh Rate", "Vertical Sync" and "Frame Rate Limit" lines should be redded out, and the resolution should be 1920x1080. This means Elite correctly detected the Rift as being a special display.
* OK, now read the next few lines of directions before hitting "Apply".
* The screen will go black - quickly put on the Rift!
* Waggle the mouse until you get a cursor. You may need to waggle it a long way left or right to get it to appear.
* It is asking you "are you sure you want to keep these changes", click "yes".
* If you don't do it in ten seconds, which is a surprisingly short time if you're not expecting it, it will drop back to the normal monitor and you have to do it all again.
* OK, you're wearing the Rift, and hopefully you see a menu screen in the top left of your view! Yes, it's difficult to read. Welcome to experimental software, test pilot!
* Now play the game.
* If at any time you want to switch back to the desktop, remember to select "3D" to "Off" AND "Monitor" to "Primary" both together before hitting "Accept", otherwise things get difficult. If you do cock things up - don't panic, just wait 10 seconds, it will revert to previous settings, and you can try again.
* At any time you can press F12 to reset the Rift orientation - this includes in menu screens. Very useful if you switched to the Rift when it was pointing in a strange direction.

Some things to help quality of life with a Rift in Elite:
* If like me you're playing with a mouse (my ~X52Pro is on backorder), I would remap some keys:
** "Set speed to 50%" to "C" (and remember "X" is 0% by default)
** "Cycle next fire group" to "Caps Lock"
** "UI Back" to "Left Shift"
** Remember that most keyboards have a knobble or blob on the F and J keys, so whenever you need keys on that row like "G"="Next Target" or "J"="Jump", they're pretty easy to find.
** Most other functions are non-time-critical (landing gear, cargo scoops, etc) and so can be accessed by going through the right-hand panel.
* The default colour of the HUD is quite reddish. The ~DK2 gives optimal resolution in colours with more green or white in them. Elite has some limited control over the colour palette. If you find the file {{{GraphicsConfiguration.xml}}} (in a folder such as {{{C:\Users\YourUserName\AppData\Local\Frontier_Developments\Products\FORC-FDEV-D-1003\}}}) and open it in a text editor, you can tweak it a bit. A really handy interactive page for choosing your own set is at [[Elite: Dangerous HUD colour theme editor|http://arkku.com/elite/hud_editor/]] and they have more instructions. However, many of the possible colour choices leave you with problems such as:
** Can't tell in-station menus items that are redded out from ones that are not.
** Can't tell red, blue and yellow dots on the ~Friend-or-Foe tracker.
* My own colour palette is a fairly modest change, but it does turn the text a bit more yellow (i.e. it adds some green), which helps readability without breaking those other things too much:
{{{
	<GUIColour><Default><LocalisationName>Standard</LocalisationName><MatrixRed> 1, 0.4, 0 </MatrixRed><MatrixGreen> 0, 1, 0.4 </MatrixGreen><MatrixBlue> 0, 0, 1 </MatrixBlue></Default>
}}}
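
If you want to get a feel for what a given set of rows will do before restarting the game, here's a minimal sketch. It assumes each {{{<Matrix...>}}} row says where that input channel ends up as an (R, G, B) triple, so the output is just the weighted sum of the three rows. That reading of the file is my assumption (it matches "adds some green to the red"), and the function names are made up for illustration.

```python
# Sketch only. ASSUMPTION: each Matrix row gives the output (R, G, B)
# contribution of one input channel; the game may differ in detail.

def apply_hud_matrix(colour, matrix_red, matrix_green, matrix_blue):
    """Map an input (r, g, b) colour through the three matrix rows."""
    r, g, b = colour
    out = []
    for i in range(3):
        value = r * matrix_red[i] + g * matrix_green[i] + b * matrix_blue[i]
        out.append(min(1.0, max(0.0, value)))  # clamp to the displayable range
    return tuple(out)


# Rows copied from the palette above.
MATRIX_RED = (1.0, 0.4, 0.0)
MATRIX_GREEN = (0.0, 1.0, 0.4)
MATRIX_BLUE = (0.0, 0.0, 1.0)


def preview(colour):
    """Hypothetical helper: run one colour through the palette above."""
    return apply_hud_matrix(colour, MATRIX_RED, MATRIX_GREEN, MATRIX_BLUE)
```

Under that assumption, pure red picks up some green and drifts towards orange-yellow, while blue is left alone, which is why the tracker dots stay distinguishable.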

Elite is already an amazing experience in the Rift, if a little klunky at times. I'm sure with some tweaking it will become even more amazing. Happy flying, remember to look up, and see you out there. Regards, CMDR FOURSYTH
With the usual anti-bot obfuscations, my address is: tom ~~dot~~ forsyth ~~at~~ eelpi ~~dot~~ gotdns ~~dot~~ org.
Bonus karma points to Ben Garney of Garage Games who spotted a silly math error of mine in [[Knowing which mipmap levels are needed]]. Now corrected.
In 2006 when starting this blog, I wrote a moan about double precision and why storing absolute time and position should be done with fixed point instead. Re-reading it, there were a bunch of shorthands and bits I skipped, and it was unstructured and hard to read - like most rants. Also, it talked about the ~PS2 and early ~DX9-class ~GPUs, and nobody really cares about them any more. And somehow I managed to only talk about positions, and completely forgot about time. And the typos - so many typos! So I thought it could do with a cleanup, and here it is: [[A matter of precision]]
Lots of fun. Slightly quieter than last year, but according to many it was because companies are only sending the key people, rather than the massed hordes. Also seemed like a lot fewer of the random mousemat manufacturers, DVD duplicators and the like.

Michael Abrash and I gave our talks. He almost filled his gigantic 1000-person room, and my little 300-person room was totally full - not even standing room available. I made sure I left 10 minutes at the end of my talk for questions, but we ended up with half an hour of questions and they had to kick us out of the room for the next session. So I call that a success.

A general site for all your Larrabee needs is: http://www.intel.com/software/larrabee/

Our slides are available here (scroll down near the bottom): http://intel.com/software/gdc

Michael Abrash's article in Dr. Dobb's Journal is now live: http://www.ddj.com/architect/216402188

The "C++ Larrabee Prototype Primitives" are available from here: http://software.intel.com/en-us/articles/prototype-primitives-guide/  These allow you to write Larrabee intrinsics code, but then compile and run it on pretty much anything just by adding a #include into the file. The intent was to make it as easy as possible to prototype some code in any existing project without changing the build.

I went to see [[Rune|http://runevision.com/blog/]]'s talk on locomotion (see [[How to walk better]]), and it was really superb. Great tech, great (interactive!) slides, and he's a really good presenter. If you've read his stuff before, go see it again as there's a bunch of stuff I hadn't seen before - tricks like how to stop the ankle flexing bizarrely as people step down from high places.
I was asked recently what the best way to approach [[GDC|http://www.gdconf.com/]] was for the newbie. GDC is a big busy place, and it's easy for people to get a bit lost. Here's a few handy hints:

* Take an hour beforehand to sit down and go through the entire timetable. In each timeslot, mark the top three things you'd like to see, and give them a score from "might be interesting" to "must see". Remember to at least look at the tracks that aren't your main discipline - they sometimes have interesting or crossover stuff. Write these scores down, and when you actually get to GDC, transfer them onto the at-a-glance schedule thingie you get in the welcome pack, then you can refer to it quickly during the day.

* Prioritise sessions you don't know much about but sound interesting over things that are right in the middle of your primary field. For example, I'm a graphics programmer who's already worked a lot with DX10, so there's probably not a great deal of point going to a talk called "D3D10 Unleashed: New Features and Effects" - it's not going to teach me much new stuff. But I've not done all that much general-purpose multi-processor work, so the talk called "Dragged Kicking and Screaming: Source Multicore" sounds fascinating and will probably give me some new insight when I do start doing multi-processor work.

* Be prepared to miss a lot of the lectures - GDC is pretty chaotic. Allow it to be chaotic - don't try to control it too much. If you're talking to someone interesting, and they're not rushing off to something, keep talking to them. You can always go into the lecture for the second half, or at worst you can pick up the lecture notes later. But you might never meet that interesting person again.

* Don't spend too long on familiar stuff. GDC's meant to be about new stuff, so don't just hang out at a bar and gossip with your friends all day. Hang out at a bar and gossip with //new// people instead.

* Don't spend too long in the expo. Have a quick walk around fairly early on and check the place out, see what might be interesting, but don't hang around or play with stuff. There's two halls this year, and with E3 gone it's even bigger and probably noisier than previous GDCs. If you do find something interesting, note it down for later so when you do find yourself with a spare half hour, or during lunch, you can go back and have a better look. Similarly, if the expo is crowded, go do something else. It's probably lunch-time or something. Come back in an hour and it will be quieter. The booths and the people on them will still be there, and you'll actually be able to see something and talk without shouting. Also don't do the expo on the first day or the last day - that's what everyone else does.

* Remember that there's two expo halls this year.

* If after five minutes of a lecture you find the lecturer is dull, or presents badly, or is just reading off the slides, leave. It's obvious you're going to get just as much from reading the slides later as you will from seeing the talk in person, and it'll be quicker. Your time is valuable - go to the next-best-rated lecture on your list. Some lectures are 10x better in person - those are the ones you should be watching. If you feel too embarrassed to just walk out, take your phone out as if you had it on vibrate and just got a call or a text message, and walk out glaring at it.

* During the lectures, have a notepad and pen out. When you have a question, scribble it down. At the end of the lecture, the presenter will ask for questions, and there's always a deafening silence for a few minutes while everyone remembers what their questions were. That's when you can stick your hand up or go to the microphone and ask your question.

* Fill in those feedback forms. The GDC committee do collate them and do take them pretty seriously when considering who to invite back to speak next year. The authors do also get the feedback and some of the comments, so if they have good info but are a terrible speaker, say so. And use the whole scale. I forget what it's out of, but if it's 1-10, 1 = bad, 5 = good, 10 = excellent. No point in messing around with 6, 7, 8 as scores - your voice will be lost in the statistical noise.

* Everyone always asks me about the parties and which ones to go to. I'm not a party animal, so I don't enjoy them for what they are. And as far as socialising goes, they're terrible. Talking is almost impossible and unless you're very wealthy or persistent, you'll be hungry and thirsty. Much better to hang out in a hotel bar or lobby with some friends. You'll still see plenty of game geeks doing the same, but you'll be able to talk and eat. And it's far more relaxing, which means you can gossip til the early hours and still be reasonably awake for the first lecture next day.

* Don't worry if you think you're missing all the cool stuff. Everybody always misses all the cool stuff. This year they're in SF rather than San Jose, so nobody knows where anything is, so even old pros will be confused.

* Go to the Indie Game Festival if you can. Definitely visit the IGF pavilion to check out the games and chat to the developers. The standard is awesomely high every year.

* Play the meatspace MMO game - this year it's "Gangs of GDC". Doesn't usually take up a significant amount of time. It's a good excuse to be silly and talk to people.

* The Programmers Challenge is back! Always amusing for the uber-geek.

* The Developer's Choice Awards are worthwhile and not as bogus as most of the other awards. Go to them and cheer for people that don't suck.
''WORK IN PROGRESS'' - which is why it's not on the main page yet.

A topic guaranteed to cause an argument amongst the shipping veterans is where you should do editing of levels, scripts, shaders, etc. There's three obvious options, although different aspects may use different ones, e.g. shader editing may be done in one place while level editing done in another:

*A plugin for your DCC tool of choice (Max, Maya, XSI, etc)
*A standalone editor.
*A part of your game.

The second two are obviously close - the difference being whether you have to do a save/convert/reload cycle every time you want to test something in the actual game. For some console developers, this is a more obvious distinction as it means they don't have to make a PC version of their game simply to test & edit it. To simplify the argument, I'm going to lump the second two together and say there's two choices - embedded in your DCC tool, or as a fully-custom editor that shares as much as possible with your game engine.

The advantages of making your game be the editor are obvious. You get to see things as they really are, almost instantly. You're using the real rendering engine, you're using the real asset-management system, and you're using the real physics engine. If some cunning design or model won't work inside your engine, you'll know about it immediately, not three months later when you've already authored a bunch of content. Your artists and designers can experiment however they like, because they see the results immediately and know if they look good or play well. The biggest advantage, if you make it right, is that level designers can set up objects and scripts, hit a button, and they're instantly playing the game. Keeping that design/test/tweak iteration cycle as fast and tight as possible is the key to excellent gameplay. The same is true of lighting (whether realtime or lightmap-driven) and sound environments - both of which rely on having large chunks of the environment around to see whether they work or not, and are almost impossible to do in small unconnected pieces, or without very fast previews.

The disadvantages are that it's more complex to make an engine that can run in two different modes - editing, and playing. The editing mode will be running on a powerful PC, often with unoptimised assets (optimising takes time and leads to longer turnaround cycles - you don't want that).

The argument for putting the editing tools inside the DCC tool is obvious - your artists already know how to use the tool, and it's a bunch of editing & rendering code that's already been written. Why reinvent the wheel? Ordinarily I'd agree - wheel reinvention is a plague, and burns far too much programmer time in our industry. But hang on a bit - is that really true?





Fundamentally, there's a few big problems we have to deal with:
*Previewing assets in the actual game environment. That means shaders, textures, lighting models, shadows, lightmaps. If the artists can't see all that as a fast preview while they're building the model, they're going to be doing a lot of guesswork.
*Level design. Placeholder artwork, hacked-in control schemes, pre-scripted actor movement - doesn't matter - you need to give the designers tools to work with so they can start getting the broad-strokes design stuff up and running so that everyone can get a feel for what the game is going to be about.
*Sound & animation design. Seems odd to put these two together? Not really. Both have a time component to them, and they need smooth blending. This makes them fundamentally different things to rendering and lighting. But you still need to be able to easily preview them in the real game. Also, animations frequently have sound triggers on them - footsteps, equipment sounds, etc. For bonus annoyance, typically the animator and the sound guy aren't the same person, don't know how to use each other's tools, and will want to edit the files independently.
*Keep iteration times down. And by "down" I mean less than 15 seconds from tweaking a model/texture/light/object placement/trigger/script to trying it out in the game. We all know that iteration and evolution is at the heart of a good game, so we need to keep the dead time in that iteration loop to the lowest possible.

My preferences:

Make your engine able to reload assets at the press of a button. Having to shut it down and start it up wastes time, especially as those are the areas you tend to optimise last in a game. You need your artists to be able to edit something in the DCC tool, hit a button, and have the results visible right then. This requires a decent amount of infrastructure, though a lot of that is shared with things like streaming systems (which are really good things to have), and stuff like dealing with Alt+Tab on ~PCs. Even if all you can do is reload existing assets, that's a big time-saver. If you can go further and import new models with new texture names, etc. that's even better.
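
To make the "reload existing assets" case concrete, here's a rough sketch of one way to do it: remember each file's modification time when you load it, and on the button press reload only the files whose time has changed. The {{{AssetCache}}} class and its methods are hypothetical names for illustration, not anything from a real engine, and a real one would also have to deal with assets that are in use by the renderer at the time.

```python
import os


class AssetCache:
    """Hypothetical cache that reloads a file when its modification time changes."""

    def __init__(self, loader):
        self._loader = loader  # callable: path -> loaded asset
        self._entries = {}     # path -> (mtime_ns, asset)

    def get(self, path):
        """Return the asset for path, loading it if it is new or has changed."""
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is None or cached[0] != mtime:
            self._entries[path] = (mtime, self._loader(path))
        return self._entries[path][1]

    def reload_changed(self):
        """Call on the 'reload' button press; returns the paths that were reloaded."""
        reloaded = []
        for path, (old_mtime, _) in list(self._entries.items()):
            mtime = os.stat(path).st_mtime_ns
            if mtime != old_mtime:
                self._entries[path] = (mtime, self._loader(path))
                reloaded.append(path)
        return reloaded
```

The loader is passed in so the same cache works for textures, meshes or scripts. Only files that were already loaded get checked, which is the "reload existing assets" level of ambition and nothing more.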

Make sure your engine can read assets directly from the exported data, or do so with minimal and fully-automated processing. It doesn't have to render them quickly, and you obviously need another optimised format that you will actually ship data in, but you also need something with a very quick turnaround time. It doesn't matter if the model the artist is currently editing is loading all its textures from raw ~TGAs - we all have big video cards these days. Once they're happy with the textures, a batch script or tool can compress them down to your preferred format. This can either be done overnight, or on another preview button so the artist can see what the compressed version actually looks like.

Ditto with things like vertex cache reordering and optimal animation-bone partitioning - none of that stuff matters while the artist is actually editing the model - do it in a batch process overnight. One of the really nice things about Granny3D that we've focussed on a fair bit over the years is the forward and backward compatibility of the file format - you can store all sorts of data in a .~GR2 file, both raw and compressed, and they should all be loadable by anything at any time.

One subtlety here is that a processor should never modify original data. The stuff that comes out of the DCC tool should be minimally processed during the export process - convert it to a standard format, but leave as much data in there as you can. Later processing steps can strip that data out, but they will do so with a different file as a target, not an in-place modification. That way when the optimised format changes, you don't need to fire up the DCC tool and do a whole bunch of re-exporting - you can just run a batch file on all the raw exported data and have it in the new and improved format.
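
As a sketch of that rule, here's the shape of a batch step that only ever reads the raw exports and writes its results somewhere else. The function name, the directory layout and the {{{.opt}}} suffix are all assumptions for illustration, and the actual processing is whatever bytes-in, bytes-out function you hand it.

```python
from pathlib import Path


def process_all(raw_dir, out_dir, process, suffix=".opt"):
    """Run `process` (bytes -> bytes) over every file in raw_dir and write the
    results into out_dir. Raw files are only ever opened for reading."""
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    if raw_dir.resolve() == out_dir.resolve():
        raise ValueError("output directory must differ from the raw export directory")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(raw_dir.iterdir()):
        if not src.is_file():
            continue
        dst = out_dir / (src.name + suffix)  # hypothetical naming scheme
        dst.write_bytes(process(src.read_bytes()))
        written.append(dst)
    return written
```

When the optimised format changes, you swap in a new {{{process}}} function and re-run this over the same raw directory. Nobody has to open the DCC tool.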

The level & lighting editor should be the game. Using Max/Maya/XSI/Lightwave as a level editor never works well. I've tried it both ways, and I've discussed this with others. Think about it - all you're getting out of your DCC tool is a camera movement model (easily replicated), a bad renderer (you already have a good renderer), some raycast mouse-hit checking (every game has these already) and a bad button UI that I guarantee you will annoy the snot out of you when you actually go to use it. Trying to use the DCC tool's interface to do any serious previewing is horrible - just think about trying to implement some sort of shadowbuffer rendering, or implementing the movement logic of the character inside a DCC tool - it's pretty absurd. This is not reinventing the wheel - there's just so little of a DCC tool you actually need or want in something of the scope of a level editor.

The animation & sound editor should absolutely be the game. There's just no doubt here - DCC tools don't help at all. Because you need to be able to see the blending and transitions between different animations, and the 3D sound environment as it really will be, a DCC tool doesn't bring anything to the table. The way I've seen animation editors done really well in the past is that the game is playable (ideally with control over any character) and the last thirty seconds are continuously recorded. At any time the animator/sound guy can pause and scrub through that time window, tweaking blends and when sound events happen.
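
The "last thirty seconds" recording is less scary than it sounds. A minimal sketch, assuming you can snapshot whatever state the animation and sound systems need into one object per frame: keep the snapshots in a queue, throw away anything older than the window, and look up the nearest earlier frame when scrubbing. The {{{RollingRecorder}}} name and its methods are made up for illustration, not taken from any editor I've seen.

```python
from collections import deque


class RollingRecorder:
    """Hypothetical recorder that keeps only the last `window_seconds` of frames."""

    def __init__(self, window_seconds=30.0):
        self.window = window_seconds
        self._frames = deque()  # (timestamp, state) pairs, oldest first

    def record(self, timestamp, state):
        """Append this frame's snapshot and drop frames older than the window."""
        self._frames.append((timestamp, state))
        while self._frames and timestamp - self._frames[0][0] > self.window:
            self._frames.popleft()

    def span(self):
        """Return (oldest, newest) recorded timestamps, or None if empty."""
        if not self._frames:
            return None
        return (self._frames[0][0], self._frames[-1][0])

    def scrub(self, timestamp):
        """Return the latest state recorded at or before timestamp, else None."""
        result = None
        for t, state in self._frames:
            if t > timestamp:
                break
            result = state
        return result
```

The linear search in {{{scrub}}} is fine for a few thousand frames. What goes into each snapshot is the hard part, and that depends entirely on your engine.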

So that's my rant - the smartest coders should be tools coders, and they need to write level editors and makefile-driven processing chains. DCC tools are for Digital Content Creation on a micro scale - individual characters, individual animation sets. They're not for entire levels, and they're really bad at previewing.
This is pretty handy - [[GameMiddleware|http://www.gamemiddleware.org/middleware/]] - a website that is just a big list of games middleware. I'm so tired of people reinventing perfectly good wheels. Unless your game is going to focus on having the best system X in the world, why not go and look if someone's already got a perfectly good X you can just buy? Then you can spend your time solving some more interesting problems.

Also tinkered with the titles. Not much point having the titles just be the date when the date's written just below them. And the RSS looks more sensible that way.
Granny3D is a big bag of goodness from [[RadGameTools]]. It's an exporter of all sorts of data from Max, Maya and XSI, it's an asset pipeline manipulation and conversion tool, it's a normal-map generator, it's a really cool file system that copes with endianness, 32 or 64 bit pointers, and versioning, it ships on 11 different platforms, and above all it's the world's most superlative runtime animation system.

Granny was written by [[Casey Muratori|http://www.mollyrocket.com/]] originally, handed over to me for two exciting years, and then taken over by [[Dave Moore|http://onepartcode.com/]]. Even though I haven't worked on Granny in a while, I still pimp it like crazy to anyone within earshot. Seriously - it's a bunch of really cool tech, a bunch of rather tedious tech, a bunch of gnarly problems solved by the brightest minds in the business (and me), and it's all wrapped up into a lovely little bundle that if you were to replicate yourself would make you officially insane. Pay the money and get on with something more interesting! When I think of all the man-years it would have saved me when working at MuckyFoot, ooh it makes me mad. As Dave said once "we can't replace your existing animation coder, but we can paint a [[big red S|http://en.wikipedia.org/wiki/Superman]] on his chest".

As time goes on, Granny becomes more and more about the entire toolchain - taking your mesh, texture, tangent space, animation and annotation data from the various DCC tools, exporting it in a reliable robust cross-platform format, processing that data in a large variety of ways, and then finally delivering the data in optimised form to your game engine. This is an under-appreciated area of games programming. It's not viewed as a sexy job for coders, and it tends to be done hastily, often by junior programmers. As a result, it is usually a source of much pain, not least for the artists and designers who have to use it. Don't let this happen to you - start your toolchain on a solid foundation.
Hello and welcome to my blog. Always fresh, always technobabble. Read the chronological blog that is usually listed below, or use some of the links to the left. Or EmailMe. Although this is technically a wiki, you'll just be editing your own personal copy, not the public one.
I have just finished a round of interviewing with various companies, and for the most part it was a pretty positive experience. However, last time I did this in 2004, it was not so pleasant, and I wrote up a bunch of moans for my dev friends, and we had a good natter and a laugh about them. It may have been nine years ago, but they're all still very relevant, and if you interview people at your company, it would be worth a look. So here's the list:

Interviews go two ways. In many cases, I'm interviewing you more than you're interviewing me. The quality and pertinence of questions you ask reflects very strongly on you and your company. This is sort of the "meta message" to apply to all the following. Simply asking "any questions for me" at the end of the interview does not cover your arse. There's a big difference between interviewing a newbie out of college and an experienced veteran - try to adapt your style. 

Read my CV/resume. Dammit, this could be the start of a long and lasting relationship, at least do me the courtesy of reading my CV. Asking me about A* algos when it's blindingly obvious I've been a graphics coder all my life says you really haven't paid attention. So why should I work for you? If I list various papers/articles I've written, at least skim-read some of them. They'll give you stuff to talk to me about, and stop you asking questions to which the answer is "read my paper" - that's just embarrassing for all concerned. 

Do not ask me arbitrary logic questions, especially ones with "tricks" in them. You know - crossing rivers with various objects and animals. Or reversing linked lists in-place - things like that. There's three options. First, I've heard the question before, know the answer, and admit it, and so what does that prove, except that I've done a lot of interviews recently? Second, I've heard the question before and pretend I don't know it and fake working it out so I look like a genius. So now I'm a devious arsehole - cool! Third, I don't know the question and I flounder away trying to spot the trick under time pressure - it is random how long this takes. Logic puzzles are good for testing how good you are at logic puzzles. But like IQ tests and cryptic crosswords, they are only loosely related to how good you are at practical problems. The fact that I've shipped games and designed large systems should tell you more than these tests. 

Don't ask me maths degree questions when you can see that I did a ~CompSci degree. My answer is (and was) "I don't know, I didn't do a maths degree, but I know where to find the answer in a book/on the web". Especially if the answer is (a) not the same way a computer would do it and (b) in dispute within the company anyway (you know who you are - I "won" that question because I knew I didn't know the right answer :-)

Don't ask me to grind through the detailed maths on questions. Answers along the lines of "you do A and B and C and it all boils down to a line/sphere intersection check" should show I know all a coder needs to know - the performance characteristics, memory impact, robustness, etc. Making me then actually write down the exact equations is tedious, I'll make some trivial sign error because of the pressure, I can't put ~ASSERTs or test harnesses in to check it, and in practice I'd just use a library call anyway. 

Don't make me write code on a whiteboard or piece of paper (rough algos and sketches, sure - but not C). Interactive editors have been the norm for longer than I have been alive. Most of us now do not write code in a particularly linear fashion (I'm not even writing this article in a linear fashion), and in fact the order in which you write things can be extremely revealing. I think the days of punched cards are finally behind us. Plus my handwriting on a whiteboard is illegible. 

Questions with specific answers are tedious. If my CV shows I'm a competent coder who has shipped titles and done GDC talks and so on, assume I am indeed competent. Let's skip the basic C and "what would you use a 4x3 matrix for" crap and get down to some interesting questions like "how would you write a graphics system where you had to see out to the curvature of the earth?" - relevant stuff like that. Surely this is far more relevant and revealing than knowing Eratosthenes' Sieve or reimplementing a hash table? 

Be especially wary if asking me specific-answer questions on fields I know very well. I may know the field better than you. Which may mean the correct answer is not the one you are expecting. If I have to correct you during an interview, that gets extremely awkward for everyone concerned.

If you really must ask me questions to "see how my brain works", and simply discussing the state of the industry and general coding issues hasn't done this already, at least pick ones that obviously don't have a trick, or an answer you actually know. Stuff like one I was asked in an exam many years ago - "how fast does the tip of a bee's wing move" (hint - bees buzz). These are reasonable ways to show how someone's brain works, but be careful not to have a preconceived notion of the correct method, as documented with the well-known [[barometer question|http://en.wikipedia.org/wiki/Barometer_question]]. Still no substitute for asking them about more relevant topics, and they really annoy some people because of their irrelevance to the job in hand, but at least they're not actively insulting or demoralizing.

If you do give me your standard "do you know C/C++" programming test, have the decency to look embarrassed and say "sorry, we give this to everyone". Also, the answer "I'd look it up in the spec/ask the guy who wrote it, then change it so it was unambiguous" is almost always the right answer, if not the correct answer. 

Don't ask questions that are thinly disguised versions of "are you a psychopath", "do you hate management", "are you antisocial", "do you have NIH", etc. It's lame. I guess if someone is any of those, then there's a slim chance they won't realise it's lame and blurt out revealing answers, but I doubt it. If you can't tell these (and a whole lot more) from general discussion, you won't get it from crappy direct questions like that. If I'm half as intelligent as you think I am, I'll lie and give you the answer you want to hear. These questions are especially eye-rolling if I personally know some of your staff or indeed have worked with them before - why don't you ask them? Why haven't you already asked them? Don't you trust them? 

I will happily sign whatever ~NDAs you like, but do not tell me there are things you can't tell me. Is it OK for me to answer "I'm afraid I can't discuss that proprietary information with you" to some of your questions? Thought not. Especially true of your future plans and your current projects - the very things you will want me to work on. Imagine I'm investing in your company. Because I am. 

Attempt not to look bored. This is my ninth interview in as many days - I easily out-trump you. 

Feed me lunch. Cheapskates. 

Always give me some time where I can randomly chat with the people I will be working with. Lunch is ideal for this. If all I get is a grilling in an interview room, I'm not going to be very flexible on pay, etc. I have no hesitation in making higher pay demands to companies I don't feel very comfortable with - if I think I'm going to have to buy my happiness outside work, you get to pay for it. 

Try not to flinch when I say what money I'm after. This is the start of negotiation, it's not my final demand. I'm obviously going to start at the high end - nobody ever started low. This is especially true if I'm going to move country - it's extremely difficult to do a quality-of-life comparison when converting from one currency to another - it's hard enough doing it when you're moving from one part of the UK to another. So I'm going to start higher than I think, just in case.
First up, read part 1: [[How not to interview]] and remember that was from experiences nine years ago. Of course I just finished another round of interviews, and decided to continue my VR adventures and join the rather spiffy [[OculusVR|oculusvr.com]]. That interview went fine - shipping the first game to use a company's hardware tends to make the interview fairly simple. However there were others that did not go as smoothly.

First the good news - everybody is doing a lot better than last time. The bad news - some people still don't really understand that interviews go two ways, and don't realize they can not only fail, but really faceplant dramatically. And gamedevs like to gossip as much as any folk. Don't let this happen to you. But on to the concrete tips.

* Whiteboard coding. Yeah, it's such an obvious fail. Fortunately only one company did this, and at least looked sheepish about it. Another suggested it, I looked pained, and counter-proposed I pull out my laptop and open up Notepad, which seemed acceptable to all. And the question was at least super simple, no tricks, and clearly just designed to check that yes, I know what a bit-shift is, and also whether I can look for and track down off-by-one errors. It actually turned out to be a decent question ''when done on an actual computer''.

* Asking questions to which the interviewer knows the answer. The best questions are simply an opening to a discussion. This sort of question is the opposite - it quells discussion because the interviewer usually just wants to home in on the answer. It also encourages the feeling that the interviewer is "in charge" while the interviewee just has to sit there and be interrogated, which again - that's not the way an interview should actually be. I had a lot of interviews like this, and none of them went well - that style of interviewing prevents people getting to build a rapport. One particular interview was excruciating. The questions were about animation. I kinda know animation - I wrote a book on it, and I wrote Granny3D for two years. The company I was interviewing for use Granny3D. So it was fairly clear the interviewer hadn't read my resume, which isn't a great start. Or they had and were determined to ask me a bunch of basic questions anyway. It went on in a rather stilted manner for a while, and then at one stage they asked me a question to which the answer was "it depends which way you're building your tangent space". This is a somewhat obscure fact - there are two different things that people call "tangent space", but usually people only know one way, and they think it's the only way. As the interviewer did. So now there's a hugely awkward couple of minutes while I explain to the interviewer just how much they don't understand about the questions they're asking me. If the interview dynamic had not been forced into the interrogator/subject dynamic, but been allowed to fall into a situation where two peers were discussing a complex subject, then this would have been totally fine - nobody knows everything, and swapping knowledge is fun. Worst thing - after this interlude, we plowed right back into the question-and-answer format. I could have facepalmed.





A while back I made a "How to Walk" demo for Granny3D (slides available here: [[http://www.eelpi.gotdns.org/papers/papers.html|http://www.eelpi.gotdns.org/papers/papers.html]]) that showed how to do interpolation of walking stances, how to cope with uneven terrain and do foot IK. It's trickier than it looks. I always meant to keep updating that demo with more features - most notably missing was how to stop! I'm sure many of you remember it from GDC as "the one with Granny in a Xena: Warrior Princess outfit". It's certainly... eye catching (thanks to [[Steve Theodore|http://www.bungie.net/Inside/MeetTheTeam.aspx?person=stevet]] for the memorable artwork). Anyway, I got distracted by Larrabee, and never did add those features, which I always felt bad about.

[[Rune Skovbo Johansen|http://runevision.com/blog/]] has done a similar system, but taken it a lot further. He uses fewer animation cycles and copes with more terrain. Cool features include 8-way motion, ledge avoidance, stop/start, leaning into slopes, four-legged animals, and foot placement on stationary rotation. It's a very comprehensive locomotion system, and I'm going to see if I can get to [[his GDC talk|https://www.cmpevents.com/GD09/a.asp?option=C&V=11&SessID=9004]] on it.
I've converted my old article on impostors to HTML so it's "live" and searchable and so on. Also added a few hindsights. The article was written before [[StarTopia]] was published, so there's a few implementation notes added about what did and didn't work in practice. It's in the [[papers and articles section|http://www.eelpi.gotdns.org/papers/papers.html]]
People have been bugging me for ages, so I guess it's time to start a blog. I'll try not to ramble too much.

First up - [[The Elder Scrolls IV: Oblivion|http://www.elderscrolls.com/]] - what a totally fantastic game. I'm playing it on the Xbox 360 and it's gorgeous. Yes, yes, it's got gameplay up the wazoo as well (that's why I'm 30 hours in and barely feel like I've started), but I'm a graphics tech whore, so I reserve the right to spooge over the pixels. It's amazingly immersive - they've worked out that immersion is defined not by your best bit, but by your worst, and they've spent a lot of time filing off all the rough edges. It's one of the few games where the screenshots on the website are what it looks like //all the time//. Other games, you walk around and there's texture seams all over the place, and when you get close to some things it's really obvious they're just a big flat billboard, or whatever. Not in this. I spent quarter of an hour yesterday just walking around looking at trees and plants and fallen logs and lakes and little stone bridges and the way the sunlight went through the leaves - totally forgot about playing the game until a mud crab almost tore off my leg.

I probably need to get out more into the real world - I hear they have some pretty neato trees and stuff as well, not to mention the world's most powerful photon-mapper. And fewer mud crabs.
There's a fundamental signal-processing concept called various different names, mostly with the word [[Nyquist]] in them, which (if you wave your hands a lot and ignore the screams of people who actually know what they're talking about) says that an object that is 32 pixels high on screen only needs to use a 32x32 texture when you're rendering it. If you give it a 64x64 texture, not only will the mipmapping hardware ignore that extra data and use the 32x32 anyway, but //it's doing the right thing// - if you disable the mipmapping and force it to use the larger version, you will get sparkling and ugliness.

Obviously it's more complex than that, and I'll get into the details, but the important point is - the size of texture you need for an object is directly proportional to the object's size on-screen. And that's what texturing hardware does - it picks appropriate mipmap levels to ensure that this is the size it's actually using. Every graphics programmer should be nodding their head and saying "well of course" at this point. But there's another thing that is subtly different, but important to realise. And that is that so far I haven't said how big my texture is, i.e. what size the top mipmap level is - because it doesn't matter. Whether you give the graphics card a 2048x2048 or a 64x64, it's always just going to use the 32x32 version - the only difference is how much memory you're wasting on data that never gets used.

If you are streaming your textures off disk, or creating them on-demand (e.g. procedural terrain systems), this can be incredibly useful to know. If you don't need to spend cycles and memory loading big mipmap levels for objects in the distance, everything is going to load much quicker and take less memory, which means you can use the extra space and time to increase your scene complexity. Obviously my approximation above is very coarse - it's not just size in pixels that matters, because we use texture coordinates to map textures onto objects. So how do you actually find the largest mipmap you need?

As a thought experiment (this is going to start out absurd - but bear with me), take each triangle, and calculate how many texels the artist mapped to that triangle for the top mipmap level. Then calculate how many pixels that triangle occupies on the screen. If the number of texels is more than the number of pixels, then by the above rule you don't need the top mipmap level - throw it away, and divide the number of texels by 4. Keep doing that until the number of texels in the new top mipmap level is no more than the number of screen pixels. (keen readers will spot that trilinear filtering means you actually need one more mipmap level than this - to keep things simple, I'm going to ignore that for now, and we can add one at the end). Do this for all the triangles that use a certain texture, and throw away all the mipmap levels that none of the triangles need.

That's the concept. Obviously far too expensive to do in practice. The first step is to get rid of the loop and say that the number of mipmap levels you can throw away = 0.5*log2 (top_mipmap_texel_area / screen_area). Quick sanity check - 128x128 texture has area 16384 texels. If you draw it to a 32x32 pixel square on screen, that's 1024 pixels, so you need to throw away 0.5*log2(16384/1024) = 0.5*log2(16) = 2 mipmap levels, which is correct. Why multiplying by half? Because we actually don't want log2(x), we want log4(x), because the number of texels drops by 4x each mipmap level, not 2x. But log4(x) = 0.5*log2(x).
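
That sanity check is easy to run. A minimal sketch in Python (the function names are mine, purely for illustration) comparing the thought-experiment loop with the closed form:

```python
import math

def levels_to_drop_loop(texel_area, screen_area):
    """Thought-experiment version: drop levels while texels outnumber pixels."""
    dropped = 0
    while texel_area > screen_area:
        texel_area /= 4.0   # each mipmap level has a quarter of the texels
        dropped += 1
    return dropped

def levels_to_drop(texel_area, screen_area):
    """Closed form: log4(x), written as 0.5 * log2(x)."""
    return 0.5 * math.log2(texel_area / screen_area)

print(levels_to_drop_loop(128 * 128, 32 * 32))  # 2
print(levels_to_drop(128 * 128, 32 * 32))       # 2.0
```

The closed form gives fractional answers for in-between sizes; rounding down is the conservative choice, since it throws away fewer levels.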

Obviously, you can precalculate the value "top_mipmap_texel_area" - once a mesh is texture-mapped, it's a constant value for each triangle. But calculating the screen-area of each triangle each frame is expensive, so we want to approximate. If we pick an area approximation that is too high, then the value of screen_area we use will be higher than the true one for the current frame, and we'll throw away fewer mipmap levels than we could have done. This isn't actually that bad - yes, we waste a bit of memory, but the texturing hardware still does the right thing, and will produce the right image. So approximating too high doesn't change image quality at all. Given that, the approximation we make is to assume that every triangle is always parallel to the screen plane - that its normal is pointing directly at the viewer. This is the largest the triangle can possibly be in screen pixels. In practice it will be at an angle and take fewer screen pixels.

So how large is this maximum size? Well, if we assume mesh deformation and skinning don't do crazy things and ignore them, a triangle always stays the same size in world units (e.g. meters). At distance D from the viewer, with a horizontal screen field of view of F radians, and a horizontal pixel size of P, the screen length of m world meters is (m/D)*(P/tan(F)) pixels in length. That's world length to screen length, but we want world area to screen area, so we square the result. But if we feed in world area of the triangle, M=m^2, then we get the screen area is (M/(D^2))*((P/tan(F))^2) = M*((P/(D*tan(F)))^2). As a quick sanity-check: double the distance away = quarter the screen pixels. Double the screen resolution = four times the pixels. Widen the FOV = smaller screen area.
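
Here are those three sanity checks as a small Python sketch. The function name and the numbers are made up for illustration. It uses the formula exactly as written above; note that it matches a standard perspective projection if you read F as the half-angle of the field of view and P as half the screen width in pixels, which is the reading I'm assuming here.

```python
import math

def screen_area(world_area, distance, fov, pixels):
    """Upper bound on the screen area (in pixels) of a screen-facing triangle.

    Implements M * (P / (D * tan(F)))**2 from the text.
    """
    scale = pixels / (distance * math.tan(fov))
    return world_area * scale * scale

base = screen_area(1.0, 10.0, 0.6, 1280)
assert math.isclose(screen_area(1.0, 20.0, 0.6, 1280), base / 4)  # double the distance
assert math.isclose(screen_area(1.0, 10.0, 0.6, 2560), base * 4)  # double the resolution
assert screen_area(1.0, 10.0, 0.8, 1280) < base                   # widen the FOV
```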

(P/tan(F))^2 gets calculated once a frame, so that's a trivial cost. And we'll make another approximation that the distance D is not done at every triangle, it's measured from the closest point of the bounding volume of the object. Again, we're being conservative - we're assuming the triangles are closer (i.e. larger) than they really are. So for a single mesh, we can calculate ((P/(D*tan(F)))^2) just once, and then use it for all the triangles. For brevity, I'll call this factor Q, so for each triangle, screen_area <= (world_area * Q).

So looking back at the mipmap calculation, we calculate 0.5*log2 (top_mipmap_texel_area / screen_area ) for each triangle, take the minimum of all those, and that's how many mipmap levels we can actually throw away.

.... min_over_mesh (0.5*log2 (top_mipmap_texel_area / screen_area))
 >= min_over_mesh (0.5*log2 (top_mipmap_texel_area / (world_area * Q)))
..= min_over_mesh (0.5*log2 ((top_mipmap_texel_area / world_area) / Q))
..= 0.5*log2 (min_over_mesh(top_mipmap_texel_area / world_area) / Q)

And of course the value (top_mipmap_texel_area/world_area) is a constant for each triangle in the mesh (again, assuming skinning and deformation don't do extreme things), so the minimum value of that is a constant for the entire mesh that you can precalculate. The result is that we haven't done any per-triangle calculations at all at runtime. If you grind through the maths a bit more, you find that it all boils down to A+B+log2(D), where A is a per-frame constant, B is a per-mesh constant, and D is the distance of the mesh from the camera (remember that log2(x/y) = log2(x)-log2(y)). Again, quick sanity check - if you double the distance of a mesh from the viewer, log2(D) increases by 1, so you drop one mipmap level. Which is correct.
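
If you'd rather not grind through the maths yourself, here's one way the A+B+log2(D) split falls out, as a Python sketch. The function names and the representation of a mesh as a list of (texel_area, world_area) pairs are my own assumptions for illustration: expanding 0.5*log2(R/Q) with Q = (P/(D*tan(F)))^2 gives A = -log2(P/tan(F)) and B = 0.5*log2(R), where R is the per-mesh minimum of texel area over world area.

```python
import math

def per_mesh_constant(triangles):
    """B: precalculated offline from (texel_area, world_area) per triangle."""
    min_ratio = min(texels / world for texels, world in triangles)
    return 0.5 * math.log2(min_ratio)

def per_frame_constant(fov, pixels):
    """A: calculated once per frame from the camera settings."""
    return -math.log2(pixels / math.tan(fov))

def levels_to_drop(a, b, distance):
    """Per mesh, per frame: one log2 and two adds."""
    return a + b + math.log2(distance)

mesh = [(4096.0, 0.5), (1024.0, 0.25)]
a = per_frame_constant(0.6, 1280)
b = per_mesh_constant(mesh)
print(levels_to_drop(a, b, 10.0))
```

In a real system you'd round the result down and clamp it to the range of levels the texture actually has, then keep one extra level for trilinear filtering as mentioned earlier.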

This is pretty nifty. If you're doing a streaming texture system, it means that for each mesh that uses a texture, at runtime you can do a log2 and two adds each frame and you get a result saying how many mipmap levels you didn't need. If you remember that throwing away just one mipmap level saves 75% of the texture's memory, this can be a huge benefit - who doesn't want to fit four times as many textures in the same memory!
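
The 75% figure is easy to check by adding up a mipmap chain. A quick Python sketch, assuming square power-of-two textures with a full chain down to 1x1 and counting texels rather than bytes:

```python
def mip_chain_texels(size):
    """Total texels in a square mipmap chain from size x size down to 1x1."""
    total = 0
    while size >= 1:
        total += size * size
        size //= 2
    return total

full = mip_chain_texels(128)     # chain starting at 128x128
trimmed = mip_chain_texels(64)   # same chain with the top level thrown away
print(1.0 - trimmed / full)      # very nearly 0.75
```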

This isn't just theory - I implemented it in the streaming system used for [[Blade II]] on the Xbox and ~PS2 engines at [[MuckyFoot]], and it saved a lot of memory per texture. This meant we could keep a lot more textures in memory. As well as allowing more textures per frame, it also meant that we could keep a lot more textures cached in memory than we actually needed for that frame. This allowed us to stream far more aggressively than we originally intended. The initial design for the ~PS2 engine was to not stream - we assumed that DVD seek times would cripple a streaming system. But because there was a lot more memory available for textures, we could prefetch further ahead and reorder the streaming requests to reduce seek times. And the system actually worked pretty well.

''Flip the Question''

The next neat trick is instead of asking "how many mipmap levels do I throw away", you ask "how many do I need". This is obviously just the same number, subtracted from log2(texture_size). If we go back to the original equation of 0.5*log2 (top_mipmap_texel_area / screen_area) - look at how you'd actually calculate "top_mipmap_texel_area" for a triangle. You find the area in the idealised [0-1] UV texture coordinate space, and then you multiply by the number of texels in the texture. So the answer to the new question "how many mipmaps do I need" is:

...log2(texture_size) - 0.5*log2 ((texture_size^2) * area_in_uv_space / screen_area)
= log2(texture_size) - 0.5*log2 (area_in_uv_space / screen_area) - 0.5*log2(texture_size^2)
= log2(texture_size) - 0.5*log2 (area_in_uv_space / screen_area) - log2(texture_size)
= - 0.5*log2 (area_in_uv_space / screen_area)
= 0.5*log2 (screen_area / area_in_uv_space)

Hey! What happened to the texture size - it all canceled out! Yep - that's right. It goes back to something I mentioned right at the top. The texture selection hardware does not care what size the top mipmap level is - unless it wants one that doesn't exist, obviously. But if you give it a large enough texture, it doesn't matter how big that texture is.
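
If the algebra feels like a magic trick, plug some numbers in. A small Python sketch (the function name and the sample values are mine) that evaluates the first line of the derivation for wildly different texture sizes:

```python
import math

def levels_needed(texture_size, area_in_uv_space, screen_area):
    """First line of the derivation, before anything cancels."""
    top_texel_area = texture_size ** 2 * area_in_uv_space
    levels_to_drop = 0.5 * math.log2(top_texel_area / screen_area)
    return math.log2(texture_size) - levels_to_drop

# A triangle covering a quarter of UV space and 1024 pixels of screen.
for size in (64, 2048, 16384):
    print(size, levels_needed(size, 0.25, 1024.0))  # 6.0 every time
```

The answer matches 0.5*log2(screen_area / area_in_uv_space) = 0.5*log2(4096) = 6 regardless of the size fed in.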

Again, every graphics coder is saying "right, I knew that, so?" But this is actually quite a non-obvious thing. It means that if you have a streaming texture system, the only thing that matters is "how many mipmaps do I need to load into memory" - the rest are left on the DVD and not loaded into memory. And the answer to that question is independent of the size of the original texture - it only depends on the mesh and the UV mapping and the distance from the camera. So the actual graphics engine - the streaming and rendering - //doesn't care how large the textures are//. And that means that the rendering speed, the space taken in memory, and the time needed to read it off disk are all identical even if the artists make every texture in the entire game 16kx16k.

To say it again: the only limit to the size of textures the artists can make is the total space on the DVD, and the time taken to make those textures.

I finally grokked this when I was making the [[MuckyFoot]] engine to be used for the games after [[Blade II]] and I kicked myself. It had actually been true for the [[Blade II]] engine - I just hadn't realised it at the time. But it was a pretty cool thing to go and tell the artists that there were no practical limits to texture sizes. They didn't really believe me at first - it's an almost heretical thing to tell 15 artists who have spent their professional lives with coders yelling at them for making a texture 64x64 instead of 32x32. Nevertheless, if you have a streaming system that only loads the mipmap levels you need, it is absolutely true.

''Practical Problems''

Of course there's still some practical limits. One thing that breaks this is the standard artist trick of using fewer texels for less important things, such as the soles of shoes or the undersides of cars. What happened was that they would use all this new texture space for interesting things like faces and hands, so the texture size would grow. But the soles of the feet would stay the same size in texels - because who cares about them? Then the preprocessor would calculate the mipmap level it would need for the texture, and the soles of the feet would dominate the result, because it would try to still map them correctly - it doesn't know that they're less important than everything else. There's a few ways to solve this - ideally the artists would be able to tag those faces so that they are ignored in the calculation, but this can be tricky to retrofit to some asset pipelines. A way that I found worked pretty well was to take the minimum linear texel density for each triangle (this copes with stretches and skewed texture mappings), and then for the mesh take the maximum of those minimums. This will find triangles on things like the face and hands, and that will be the density you assume the artist intended. If they didn't give as many texels per meter to other triangles, you assume that was intentional - that they deliberately wanted lower resolution on those areas.
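
For the max-of-minimums trick, here's a rough Python sketch. How you define "minimum linear texel density" is up to you; I'm assuming here that it's the smaller singular value of the 2x2 matrix that maps positions in the triangle's plane to texel coordinates, which is one reasonable reading rather than a record of what the original preprocessor did. The function names and input layout are also made up for illustration.

```python
import math

def min_linear_density(p0, p1, p2, t0, t1, t2):
    """Smallest texels-per-world-unit stretch of one triangle's mapping.

    p* are 3D world positions, t* are 2D texel coordinates.
    Assumes the triangle is not degenerate.
    """
    e1 = [b - a for a, b in zip(p0, p1)]
    e2 = [b - a for a, b in zip(p0, p2)]
    # Edge vectors expressed in an orthonormal 2D basis of the triangle's plane.
    ax = math.sqrt(sum(c * c for c in e1))
    bx = sum(c * d for c, d in zip(e1, e2)) / ax
    by = math.sqrt(max(sum(c * c for c in e2) - bx * bx, 0.0))
    # Matching edge vectors in texel space.
    u1 = (t1[0] - t0[0], t1[1] - t0[1])
    u2 = (t2[0] - t0[0], t2[1] - t0[1])
    # Jacobian J of the plane-to-texel mapping: J*(ax,0) = u1, J*(bx,by) = u2.
    j11 = u1[0] / ax
    j21 = u1[1] / ax
    j12 = (u2[0] - j11 * bx) / by
    j22 = (u2[1] - j21 * bx) / by
    # Smaller singular value of J.
    s = j11 * j11 + j12 * j12 + j21 * j21 + j22 * j22
    det = j11 * j22 - j12 * j21
    disc = math.sqrt(max(s * s - 4.0 * det * det, 0.0))
    return math.sqrt(max((s - disc) / 2.0, 0.0))

def mesh_texel_density(triangles):
    """Maximum over the mesh of each triangle's minimum density."""
    return max(min_linear_density(*tri) for tri in triangles)
```

A triangle with a stretched mapping reports its worst direction, and a low-density shoe sole simply loses out to the face and hands when the maximum is taken.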

One counter-intuitive thing is that the dead-space borders between areas in a texture atlas need to grow as well. If your artists were making 256x256 textures with 4 texels of space between parts, when they up-rez to 1024x1024, the space needs to grow to 16 texels as well. The border space is there to deal with bleeding when you create mipmaps, so you need to make sure that e.g. the 64x64 mipmap is the same in both cases.
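
The arithmetic behind that, as a tiny Python sketch (the function name is mine): work out how wide the border is once the texture has been mipmapped down to a given size.

```python
def border_at_mip(border_texels, top_size, mip_size):
    """Width of an atlas border, in texels, at the mip_size x mip_size level."""
    return border_texels * mip_size / top_size

print(border_at_mip(4, 256, 64))    # 1.0
print(border_at_mip(16, 1024, 64))  # 1.0
print(border_at_mip(4, 1024, 64))   # 0.25
```

Leave the border at 4 texels on the 1024x1024 version and it has shrunk to a quarter of a texel by the 64x64 level, so neighbouring parts bleed into each other.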
''WORK IN PROGRESS'' - which is why it's not on the main page yet.

Paint mipmap levels with alpha=1 for "valid", alpha=0 for "invalid". Sample twice, once with LOD clamp, and blend based on alpha.
Codename for an Intel processor architecture that I work on. There's been some disclosure, but not everything yet. Coming soon! Top links at the moment:

[["Larrabee: A Many-Core x86 Architecture for Visual Computing" - presented at Siggraph 08|http://www.siggraph.org/s2008/attendees/program/item/?type=papers&id=34]]
[[Three short Larrabee presentations at Siggraph 08 as part of the "Beyond Programmable Shading" course|http://s08.idav.ucdavis.edu/]]
[[A pretty comprehensive AnandTech article|http://anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3367&p=1]]
[[The Wikipedia page|http://en.wikipedia.org/wiki/Larrabee_(GPU)]]
Yay! Michael Abrash and I are finally going to do a talk each about the instruction set we helped develop: [[Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action|https://www.cmpevents.com/GD09/a.asp?option=C&V=11&SessID=9138]] and [[SIMD Programming with Larrabee: A Second Look at the Larrabee New Instructions (LRBni) in Action|https://www.cmpevents.com/GD09/a.asp?option=C&V=11&SessID=9139]]. They're pretty much a pair - going to mine without seeing Mike's first won't mean very much for example. Note that these aren't graphics talks, they're for anybody who wants to program these cores in assembly or C. We will be using parts of the graphics pipeline as examples, because that's what we've spent most of our time doing, but there's no graphics knowledge required at all. Just bring your assembly head - we're going all the way to the metal.

There's very few programmers that can say they got to invent their own [[ISA|http://en.wikipedia.org/wiki/Instruction_set]], and I'm really looking forward to finally being able to talk about it in public after about three years of secrecy. Creating an ISA is all about compromises between programmer flexibility and how difficult the hardware is to build, but I am always astonished how much we managed to pack in without too much screaming from the hardware folks. It will be interesting seeing how people react to it - it's got a lot of funky features not found in other ~ISAs that I know of.
I've been trying to keep quiet, but I need to get one thing very clear. Larrabee is going to render ~DirectX and ~OpenGL games through rasterisation, not through raytracing.

I'm not sure how the message got so muddled. I think in our quest to just keep our heads down and get on with it, we've possibly been a bit too quiet. So some comments about exciting new rendering tech got misinterpreted as our one and only plan. Larrabee's tech enables many fascinating possibilities, and we're excited by all of them. But this current confusion has got a lot of developers worried about us breaking their games and forcing them to change the way they do things. That's not the case, and I apologise for any panic.

There's only one way to render the huge range of ~DirectX and ~OpenGL games out there, and that's the way they were designed to run - the conventional rasterisation pipeline. That has been the goal for the Larrabee team from day one, and it continues to be the primary focus of the hardware and software teams. We take triangles, we rasterise them, we do Z tests, we do pixel shading, we write to a framebuffer. There's plenty of room within that pipeline for innovation to last us for many years to come. It's done very nicely for over a quarter of a century, and there's plenty of life in the old thing yet.

There's no doubt Larrabee is going to be the world's most awesome raytracer. It's going to be the world's most awesome chip at a lot of heavy computing tasks - that's the joy of total programmability combined with serious number-crunching power. But that is cool stuff for those that want to play with wacky tech. We're not assuming everybody in the world will do this, we're not forcing anyone to do so, and we certainly can't just do it behind their backs and expect things to work - that would be absurd. Raytracing on Larrabee is a fascinating research project, it's an exciting new way of thinking about rendering scenes, just like splatting or voxels or any number of neat ideas, but it is absolutely not the focus of Larrabee's primary rendering capabilities, and never has been - not even for a moment.

We are totally focussed on making the existing (and future) DX and OGL pipelines go fast using far more conventional methods. When we talk about the rendering pipeline changing beyond what people currently know, we're talking about using something a lot less radical than raytracing. Still very exciting for game developers - stuff they've been asking for for ages that other ~IHVs have completely failed to deliver on. There's some very exciting changes to the existing graphics pipelines coming up if developers choose to enable the extra quality and consistency that Larrabee can offer. But these are incremental changes, and they will remain completely under game developers' control - if they don't want to use them, we will look like any other fast video card. We would not and could not change the rendering behaviour of the existing ~APIs.

I'll probably get in trouble for this post, but it's driving me nuts seeing people spin their wheels on the results of misunderstandings. There's so much cool stuff on the way that we're excited about, and we think you're going to love them too.
Well, since [[Beyond3D|http://www.beyond3d.com/content/news/557]] picked up the story, I guess I should comment. I've not technically moved to Intel just yet, but the plans are in motion. Being one of those scary foreigners, there's inevitably some paperwork to sort out first. The wheels of bureaucracy grind slowly.

The SuperSecretProject is of course [[Larrabee|http://www.google.com/search?q=Larrabee+intel]], and while it's been amusing seeing people on the intertubes discuss how sucky we'll be at conventional rendering, I'm happy to report that this is not even remotely accurate. Also inaccurate is the perception that the "big boys are finally here" - they've been here all along, just keeping quiet and taking care of business.

That said, there's been some talented people joining the Larrabee crew - both as Intel employees and as external advisers. That includes "names" that people have heard of, but also many awesomely smart people who haven't chosen to be quite as visible. It's immensely exciting just bouncing ideas off each other. Frankly, we're all so used to working within the confines of the conventional GPU pipeline that we barely know where to start with the new flexibility. It's going to be a lot of fun figuring out which areas to explore first, which work within existing game frameworks, and which things require longer-term investments in tools and infrastructure - new rendering techniques aren't that much use if artists can't make content for them.

Expect to hear much much more later (you just try and stop me!), but for now we must focus on getting things done. There's plenty of time for public discussions of algorithms and suchlike in the future. Thanks for the attention, but I need to engage the Larrabee cloaking field once more. I now return you to your irregularly scheduled blog.
I gave a talk at Stanford University recently, and it reminded me that I should make a list of links to public Larrabee info, so [[here it is|http://www.eelpi.gotdns.org/larrabee/larrabee.html]].

The Stanford talk was fun, and there's video of it available (link on the page above). So now you can watch me say "um" and "er" a lot. I'm usually better about that, but this was a small auditorium, I knew a surprising number of the audience, and so I forgot to put my "public presenter" head on.

On the recent media attention: we're not cancelled - far from it. The first version still needs some work, and we have a lot of software to write to turn a bunch of x86 cores into something that "looks" like a GPU to Windows before we can ship it to the public. That all takes a lot of time, and we won't be ready for the hoped-for ship date this year. But it's not nearly as dramatic as some of the stories I've seen out there - people do love a sensational headline. I've never been on the inside of media attention before, but it's taught me that you need to remember the old adage - never believe anything you read on the internet.
One of my papers turns out to be useful to someone. And they asked me what my license was - or rather, their legal department asked. This then forced me to find a suitable license, which is very annoying, because the real one is of course "don't be a dick". I don't think the lawyers liked that one. After much browsing at [[http://www.opensource.org/licenses/|http://www.opensource.org/licenses/]] and finding a lot of wordy annoying ones, I settled on the MIT license, without the middle paragraph that requires attribution, because that's just pushy. Short and sweet - and the only objectionable thing is that LAWYERS SEEM TO NEED YOU TO TALK IN CAPS A LOT. Why is that? Maybe it makes judges like you more or something? Whatever - just go out there and do good stuff with it, people.
This is a bit of a random list of stuff. I tried to organise it into a nice narrative and made a mess of it. So instead I'll just write a bunch of words.


''What is ASSERT for?''

A recent discussion on AltDevBlogADay revealed that the question "what is ASSERT for?" has many possible answers and lots of people have different views on this. Sometimes people get hung up on what the word "assert" means, so let's list a bunch of useful tools we'd like to have without using that word. Chris Hargrove wrote a great taxonomy, so I'm just going to steal it:

''Premortem'': Immediately fatal, and not ignorable. Fundamental assumption by an engineer has been disproven and needs immediate handling. Requires discipline on the part of the engineer to not add them in situations that are actually non-fatal (rule of thumb being that if a crash would be almost certain to happen anyway due to the same condition, then you’re no worse off making an assert).

''Errors'': Probably fatal soon, but not necessarily immediately. Basically a marker for “you are now in a f*cked state, you might limp along a bit, but assume nothing”. Game continues, but an ugly red number gets displayed onscreen for how many of these have been encountered (so when people send you screenshots of bugs you can then point to the red error count and blame accordingly). Savegames are disabled from this point so as not to make the error effectively permanent; you should also deliberately violate a few other ~TCRs as soon as an error is encountered in order to ensure that all parties up and down the publisher/developer chain are aware of how bad things are. Errors are technically “ignorable” but everyone knows that it might only buy you a little bit of borrowed time; these are only a small step away from the immediately-blocking nature of an assert, but sometimes that small step can have a big impact on productivity.

''Warnings'': Used for “you did something bad, but we caught it so it’s fine (the game state is still okay), however it might not be fine in the future so if you want to save yourself some headache you should fix this sooner rather than later”. Great for content problems. Also displayed onscreen as a yellow number (near the red error number). You can keep these around for a while and triage them when their utility is called into question.

''Crumbs'': The meta-category for a large number of “verbose” informational breadcrumb categories that must be explicitly enabled so you don’t clutter everything up and obscure stuff that matters. Note that the occurrence of certain Errors should automatically enable relevant categories of crumbs so that more detailed information about the aforementioned f*cked state will be provided during the limp-along timeframe.

(Chris actually called the first one an "assert" but I want to use that as a more generic term).

I tend to use the actual ASSERT macro for the first three, and they'll all also get put into a log. I also don't differentiate between the first two - it's hard to predict what's fatal and what's not, so I just let the cards fall where they will. Sometimes I'll put in the text of an "Error" or "Warning" something like "probably not fatal, but will cause strange effects." - generally "Premortems" will be something like "pThingie != NULL". I like the idea of the big red/yellow numbers on the screen - never done it, but it's an idea well worth stealing.


''Fixing ASSERT''

The basic ASSERT most runtimes give you is way too heavy-handed and noisy. It'll do if you have nothing else, but it's got a whole bunch of problems:
* When you fail, you get a dialogue box. This drives me nuts - I just want to go straight into the debugger as if I had hit a breakpoint.
* But if someone is running without a debugger (e.g. an artist) they do want a slightly helpful message before the machine detonates in a shower of sparks.
* Some ~ASSERTs are really just warnings, and sometimes I only want to be warned once, not every time they're hit. Especially if someone else wrote it and I don't understand what it's warning me about. This also includes things like "object went faster than the speed of sound" or "player intersects ground" warnings that you want to know about once, but will probably persist for a few frames before (hopefully) fixing themselves.

The things I do to make ASSERT more usable are, in rough order of easy to hard:

* In debug mode, an assert becomes the debugger breakpoint, e.g. int3 on x86, and it's inlined in the code, not in a subroutine you have to pop the stack on. In release builds it is a dialogue box.
* Invent ASSERTONCE. Does exactly what it says on the tin.
* Each assert should have a useful text message for release build. This should at the very minimum tell the user who to poke about it. If a designer or artist can't immediately find which coder to poke about an assert, it simply increases noise and is worse than useless.
* If an assert is an advisory about an asset, e.g. "texture is not power-of-two" then it should always say which asset it refers to. Again, there's no point telling an artist that a texture is not power-of-two if you don't say which one.
* Even better is to tag each asset with the creator's name. Then the assert should say not only what the asset is but who created it and ideally when. So if it's an asset made within the last day or so, it's not a big deal. A month old - fix that shit! Also it helps if the build knows who is currently running it. We found it helpful if people could easily ignore their own warnings if they were just trying something out, but wanted to know about stuff they hadn't immediately changed. Conversely, if checking changes in, they want to know about their screwups, not other people's.
* People who aren't coders tend to ignore warnings. Fact of life. I clearly remember listening to two artists running the daily build - one said "oh, what's that" to which the other replied "oh it's another ignoreall" <click> ... <crash> "what happened?" That evening I removed the "Ignore All" button from the assert dialogue and made it something slightly less trivial to brush aside. The eventual solution was to log every assert and warning to somewhere useful to be analysed later. The ghetto way was to log it to a consistently-named file on the HDD in a public directory. Then the person in charge of the build (hi there!) periodically scraped all the work machines for this file, threw it into a database and found the most common and/or oldest-occurring asserts and went and bitched at people until they were fixed. It's surprisingly effective. A more elegant way would be to connect to a bug tracking system and auto-file bugs. Do the simple thing first - you'll be glad you did - if only because the hassle of dealing with it will prompt you to do the right thing eventually.
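
As a rough illustration of the first few items in that list, here is a minimal sketch of an assert that breaks inline at the failing line, plus an ASSERTONCE variant. Every name in it ({{{GAME_ASSERT}}}, {{{GAME_ASSERT_ONCE}}}, {{{ReportAssert}}}, {{{DEBUG_BREAK}}}) is made up for this example and is not from my actual codebase. The breakpoint is left empty by default so the sketch runs anywhere, and the reporter just prints to stderr where a real one would log to a file and name the coder to poke.

```cpp
#include <cstdio>

// Hypothetical names throughout - illustrative only, not the original macros.

static int g_AssertReportCount = 0;   // how many failures have been reported so far

// Out-of-line reporter: says what failed, a human-readable message, and where.
// A real build would also append this to a log file for later scraping.
inline void ReportAssert(const char *expr, const char *msg, const char *file, int line)
{
    ++g_AssertReportCount;
    std::fprintf(stderr, "ASSERT FAILED: %s\n  %s\n  at %s:%d\n", expr, msg, file, line);
}

// The breakpoint is expanded at the call site, so the debugger stops on the
// failing line rather than inside a helper. Empty by default (an assumption
// made so this sketch runs anywhere); define it to your platform's
// breakpoint instruction or intrinsic.
#ifndef DEBUG_BREAK
#define DEBUG_BREAK() ((void)0)
#endif

#define GAME_ASSERT(cond, msg)                                \
    do {                                                      \
        if (!(cond)) {                                        \
            ReportAssert(#cond, msg, __FILE__, __LINE__);     \
            DEBUG_BREAK();                                    \
        }                                                     \
    } while (0)

// Same check, but each call site only reports the first time it fails.
#define GAME_ASSERT_ONCE(cond, msg)                           \
    do {                                                      \
        static bool s_alreadyFired = false;                   \
        if (!(cond) && !s_alreadyFired) {                     \
            s_alreadyFired = true;                            \
            ReportAssert(#cond, msg, __FILE__, __LINE__);     \
            DEBUG_BREAK();                                    \
        }                                                     \
    } while (0)

// Example call site: a warning that would otherwise fire every frame.
inline void CheckSpeed(float speed)
{
    GAME_ASSERT_ONCE(speed < 340.0f, "Object went faster than sound - poke the physics coder");
}
```

Calling {{{CheckSpeed}}} with a silly speed for several frames in a row reports once, while a plain {{{GAME_ASSERT}}} reports every time it fails.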

To give you an idea of how aggressively I use asserts, I picked a 2056-line C file of a private project that's mostly to do with gameplay. 235 are whitespace and 482 are lines with just a curly brace on them. 189 lines are comments - I tend to write more comments than most people, especially in complex code. 115 of those lines are asserts. In this case since the file doesn't deal with external assets, almost all of them are "premortem" types like pointer-checking, or sanity checks like sensible ranges for values. That leaves 1035 lines of actual code. So it's roughly a 10:2:1 ratio of code:comments:asserts, which feels about right. Another file has 2182:444:266 which is very similar.


''Degrees of paranoia''

I have a #define PARANOIA which can be set to 0 (shipping builds), 1 (normal development/debug), 2 (debugging something tricky) and 3 (evil things like memory scribblers). This is like the various DEBUG defines, except the value is important. Some of the more expensive debug checks only happen at the higher paranoia levels, and it's also somewhat orthogonal - you can enable paranoia without using debug, e.g. for trying to find bugs that only show up in release.

All standard ~ASSERTs are on PARANOIA>=1. Malloc/free/new/delete all do things like zapping data to 0xdeadbeef when PARANOIA>=1. More expensive checks go on PARANOIA>=2, and really expensive ones on PARANOIA>=3. The idea is that as you write code, you freely hurl ~ASSERTs and debug checks around like crazy. As the framerate starts to suffer, don't remove the checks, just wrap the older ones in higher and higher levels of PARANOIA. If you have a bug that the standard ~ASSERTs don't find, raise the paranoia levels. Every now and then (e.g. on checkin), do some runs on the higher paranoia levels as well.

Most of my classes (or similar groups of functionality) have a method called Check() which does a general consistency check on the object - do its values lie in sensible ranges, if it has pointers to objects do they still exist, if there's meant to be a pointer back does it do so, etc. This function is nulled out in the header unless PARANOIA>=1. There is also a function ~ParanoidCheck() that calls Check() only if PARANOIA>=2. So in general I call Check() at creation, destruction and when changing pointer relationships e.g. when a character picks up a weapon, I'll call it on the weapon & the character to make sure nobody else has the weapon, the character doesn't already have a weapon in that hand, and that both actually exist. ~ParanoidCheck() can be sprinkled anywhere you have even the slightest suspicion there may be a problem, because unless PARANOIA>=2, it doesn't do anything. Also if PARANOIA>=2, once a game tick I will call Check() on every object in the entire world.

If there's any really expensive tests (e.g. checking every single pointer on every single object after every single object is processed every time), then I'll put them on PARANOIA>=3. That's basically the setting where it's such a nasty bug to find that I'm happy to leave it running for an hour a frame if in exchange an ASSERT fires to show me where it is. You can also put more expensive code inside Check() methods hidden behind PARANOIA>=3.
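
Here's a small sketch of how the Check() / ~ParanoidCheck() pair might look for the weapon-pickup case. The {{{Character}}} and {{{Weapon}}} structs, their fields and the value ranges are all invented for the example, and it uses the standard {{{assert}}} rather than a custom macro just to stay self-contained.

```cpp
#include <cassert>

// 0 = shipping, 1 = normal development, 2 = tricky bug, 3 = evil memory scribbler.
#ifndef PARANOIA
#define PARANOIA 2
#endif

// Hypothetical game types, for illustration only.
struct Character;

struct Weapon
{
    Character *pOwner = nullptr;
    int        damage = 1;

    void Check() const;            // defined below, once Character is complete
    void ParanoidCheck() const
    {
#if PARANOIA >= 2
        Check();                   // does nothing at all below level 2
#endif
    }
};

struct Character
{
    Weapon *pWeaponInHand = nullptr;
    int     health = 100;

    void Check() const
    {
#if PARANOIA >= 1
        assert(health >= 0 && health <= 1000);          // sensible range
        if (pWeaponInHand != nullptr)
            assert(pWeaponInHand->pOwner == this);      // pointer back must match
#endif
    }

    void ParanoidCheck() const
    {
#if PARANOIA >= 2
        Check();
#endif
    }

    // Changing a pointer relationship: check both objects before and after.
    void PickUp(Weapon &weapon)
    {
        Check();
        weapon.Check();
#if PARANOIA >= 1
        assert(pWeaponInHand == nullptr);   // hand must be empty
        assert(weapon.pOwner == nullptr);   // nobody else may hold it
#endif
        pWeaponInHand = &weapon;
        weapon.pOwner = this;
        Check();
        weapon.Check();
    }
};

inline void Weapon::Check() const
{
#if PARANOIA >= 1
    assert(damage > 0 && damage < 10000);               // sensible range
    if (pOwner != nullptr)
        assert(pOwner->pWeaponInHand == this);          // pointer back must match
#endif
}
```

With PARANOIA set to 0 every one of these bodies compiles to nothing, so the calls can stay in the code forever.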


''Deterministic playback''

I have found this to be such a powerful debugging tool. Make sure your game is deterministically replayable. Note - here I only mean in single-player mode, only replayable on the same machine it was recorded on, and only for the same build. I'm just talking about debugging here, not as a gameplay feature. This can be tricky at times, but the basic steps are:

* Decouple gameplay ticks & processing from machine-dependent things like rendering (this is good for all sorts of other reasons - I should write a post about it one day)
* Record all player inputs to a log.
* Make sure all random numbers are only pseudo-random, and store the seeds in the log.
* Careful about using pointer addresses (which will change from run to run) as data. You can get subtle things e.g. using the pointer for a hash table, then traversing every object in the hash table - you'll get different traverse orders from run to run.
* Record your debug levels in the log - different optimisation levels can produce different numerical results. This is another reason I keep PARANOIA distinct from DEBUG (though you can still get side-effects that screw things up).
* Have a "no rendering" mode on a key that will just replay the game as fast as possible - that way you can fast-forward to the time you're investigating.
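
A toy sketch of the core of those steps - fixed gameplay ticks, a logged seed and logged inputs. All the names ({{{Lcg}}}, {{{ReplayLog}}}, {{{GameState}}}, {{{PlayAndRecord}}}, {{{Replay}}}) are hypothetical and the "game" is just two integers, but the shape is the same: the tick function only ever sees the input for that tick and a generator whose state came from the log.

```cpp
#include <cstdint>
#include <vector>

// All names here are hypothetical - this is a toy, not the real game's code.

// Small self-contained pseudo-random generator, so results depend only on the seed.
struct Lcg
{
    std::uint32_t state;
    std::uint32_t Next()
    {
        state = state * 1664525u + 1013904223u;   // a common 32-bit LCG
        return state >> 16;
    }
};

struct ReplayLog
{
    std::uint32_t    seed = 0;       // stored once at the start of the log
    std::vector<int> inputPerTick;   // one player input per gameplay tick
};

struct GameState
{
    Lcg rng{0};
    int playerX  = 0;
    int monsterX = 100;

    // One fixed gameplay tick: no rendering, no wall-clock time, no pointer values.
    void Tick(int playerInput)
    {
        playerX  += playerInput;                              // -1, 0 or +1
        monsterX += static_cast<int>(rng.Next() % 3u) - 1;    // random walk
    }
};

// Play "live": consume inputs and record each one as it is used.
inline GameState PlayAndRecord(std::uint32_t seed, const std::vector<int> &liveInputs, ReplayLog &log)
{
    log.seed = seed;
    log.inputPerTick.clear();
    GameState game;
    game.rng.state = seed;
    for (int input : liveInputs)
    {
        log.inputPerTick.push_back(input);
        game.Tick(input);
    }
    return game;
}

// Replay from the log alone, as fast as possible (the "no rendering" mode).
inline GameState Replay(const ReplayLog &log)
{
    GameState game;
    game.rng.state = log.seed;
    for (int input : log.inputPerTick)
        game.Tick(input);
    return game;
}
```

If {{{Replay}}} ever produces a different state from the live run, something non-deterministic has leaked into the tick, and that is a bug worth finding on its own.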

Every so often (e.g. 1000 gameplay ticks) autosave the current game state (use a rotating series of saves), append the log to the previous auto-save, and flush the log. This gives you a bunch of cool things:

* Easy reproducibility of bugs, even ones that didn't get noticed for many frames such as AI bugs ("how did he get into that bit of the map?", "why is that henchman trying to stab someone with a pineapple instead of a knife?", etc). If you get a bug, backtrack through your autosaves, watch the object/character in question and find where the bug originally manifested. This has saved me so much time in the past, it's worth it for just this one feature.
* Makes sure your save/load works! It's amazing how easy it is to break save/load and not realize it for quite a while. This way, it's in constant use. It also encourages people to keep it fast!
* Stress tests. You always have a big pool of saved games and player actions to replay whenever you feel nervous about the build integrity (of course they get obsolete as the gameplay changes, but it's still useful for folks not connected with gameplay e.g. graphics, sound, etc).
* Mitigation for catastrophic bugs or machine instability (e.g. dodgy drivers) in the final released version. If people can always resume, having lost only a few minutes of play, they will be far less angry when asking for support.
* Easy demo record/playback. Especially useful when you want to fly the camera to cinematic locations and engage the "high-quality rendering" mode with all the bells and whistles at print resolution that runs at 10 seconds per frame.

Another handy tip - if the game loads from an autosave, have it first copy it somewhere else with a unique name. Otherwise when you're debugging something from an autosave, it's a bit too easy to autosave over it while doing so, which is very frustrating. Also have a rotating sequence of autosaves covering at least half an hour of play - again for those places where the bug happened a long time before it was noticed.


''Unit tests''

You can read tons of stuff about unit tests and how great they are, and I agree. In theory. In practice they are this big thing that says "write tons of code now, and maybe in a year you will fix an extra bug". It's a really hard pill to swallow, and as a result I've never managed to get a codebase of any significant size to use them. I can't even persuade myself to use them much. Result - nice idea, but it just doesn't seem to work in practice. You need something almost as effective without the big chunk of up-front work.

That's why I like doing things more like asserts. Everyone knows what they do, they're not scary, you write them right alongside the code, but used properly they can be a very powerful documentation and self-checking weapon, and with a few tricks you can gradually grow things into what amount to unit tests without actually sitting down and saying "right - today I'm going to spend four hours writing code that, on a good day, doesn't do anything."

Once you have good asserts, you've scattered your code with aggressive Check() and ~ParanoidCheck() calls, and you're 0xdeadbeefing your code, you've got 95% of the goodness. Now your unit tests don't really need to be that fancy because they don't actually do the checking, they just feed inputs into the game and the game self-checks. In many cases the unit test might just be loading and replaying a savegame and log. Most of the time I find it much easier to "justify" writing a unit-test that would have caught a bug I actually found, whereas writing unit tests for bugs I might never create is mentally draining - I find it far easier to write tons of asserts and self-checkers instead as I go along.
I am TomF
I've been doing a lot of complex transformations between various coordinate systems recently, and it's crystallised a practice I've always only done half-heartedly - naming your transformations! If you don't name them right, you can get into some amazing bugs that can take days to figure out. If you do name them right, and do so consistently, it's actually quite difficult to make a mistake. I was reminded of this by [[an aside in a blog post that Fabian Giesen did|http://fgiesen.wordpress.com/2012/06/03/linear-algebra-toolbox-1/]], but I thought I'd expand on it a bit. By the way, you should absolutely read that series of three posts - it can clear up some misconceptions that even experienced coders have.

So - the naming convention. I'll use under_score_naming, but ~CamelCapsNaming works fine as well of course.

A transformation is typically represented by a matrix, though it can be a variety of things, e.g. Euler angles and an offset, a quaternion and offset, or a dual quaternion. However you do it, you'll usually have an operation to apply a transformation to a position (a vector), which if you do right-to-left multiplication of these things (which seems to be the majority):

Vec new_vec = trans1 * old_vec;

...and one to apply a transformation to another transformation:

Trans new_trans = trans2 * trans1;

And they associate freely, so that:

Vec new_vec = trans2 * ( trans1 * old_vec );
........................= ( trans2 * trans1 ) * old_vec;

However they do not commute, so ( trans2 * trans1 ) != ( trans1 * trans2 ), and not only does (trans * vec) != (vec * trans), the second term doesn't even mean anything - it won't compile.

So that's just a statement of conventions, and Fabian covered it in a lot more detail. Now what was that about naming?

All transformations convert a vector of values from one space to another. The typical transformation we're talking about in graphics is the one that describes the position and orientation of an object in the world. This transformation turns positions in model space (which the artist designed in the art package) into positions in world space, ready to be rendered. And for reasons that will become clear, we want to name things in the format "x_from_y". So this sort of transformation should be named "world_from_model". If you also name your vectors this way, you'll get expressions that look like this:

Vec vertex_world = world_from_model * vertex_model;

Note the naming here - it's slightly counterintuitive, and this can trip you up. "World_from_model" moves positions from model-space to world-space, so it's what you'd normally call "the orientation and position of the model in the world", which you'd instinctively think would be named the other way around (or at least I do). But to do the consistency tricks I'm going to talk about, you absolutely need the word "world" first and "model" second. So just remember that "it's not the way you think" and you'll be fine. I find it helps to think of transforming the vector (0,0,0) - it starts at the origin in model coordinates, and then it ends up in world coordinates at some place, which will be the "position" of the model in the world.


So why is this naming scheme so nice? Because you can immediately check you have consistent maths by checking the "nearby" parts of the names, as shown by these highlights:

Vec vertex_@@world = world@@_from_@@model * vertex_model@@;

This also applies when concatenating transformations, for example if you saw this:

Vec vertex_@@world = tank_turret@@_from_@@tank_barrel * tank_body@@_from_@@tank_turret * world@@_from_@@tank_body * vertex_barrel@@;

WAIT A MINUTE - this is a mess - none of the parts near each other match. Clearly, whoever wrote this got the order of the transformations wrong. What they should have written is this:

Vec vertex_@@world = world@@_from_@@tank_body * tank_body@@_from_@@tank_turret * tank_turret@@_from_@@tank_barrel * vertex_barrel@@;

...and those all match nicely. In fact, given this collection of matrices, there is only one valid way to combine them - you can't make a mistake!


Of course normally you'd concatenate all the transformations once into a single combined one and then transform all the vertices by the result. You can get sensible names for the intermediate transformations by taking the first part from the leftmost particle, and the last from the rightmost particle. So:

Trans ~XXX_from_YYY = world_from_tank_body * tank_body_from_tank_turret * tank_turret_from_tank_barrel;

First part:

Trans @@world@@_from_YYY = @@world@@_from_tank_body * tank_body_from_tank_turret * tank_turret_from_tank_barrel;

Last part:

Trans world_from_@@tank_barrel@@ = world_from_tank_body * tank_body_from_tank_turret * tank_turret_from_@@tank_barrel@@;

...and then of course we'll do:

for ( i = ...each vertex... )
    vertex_@@world[i] = world@@_from_tank_@@barrel * vertex_barrel@@[i];

Easy and foolproof.


This also helps in reverse. When you need to find a specific transformation, how do you know which transformations to combine, in what order, and do you need to invert any of them? When you invert a transformation, you swap the ends of the name. So say I have the transformations world_from_body and world_from_turret, but I need to know body_from_turret. Well you can't stick the two transformations together either way around and have the parts match - must need to invert one, and taking the inverse of a transformation swaps the ends. But which one to invert? Again, there's only one combination where everything will work:

Trans body_from_world = world_from_body.Invert();
Trans body_from_turret = body_from_@@world * world@@_from_turret;

...and again notice that the result takes its first part from the leftmost word "body" and the second part from the rightmost word "turret".
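
To make that concrete, here's a tiny runnable sketch. The {{{Vec}}} and {{{Trans}}} types are invented for the example: 2D, integer-only, and limited to rotations by multiples of 90 degrees so the numbers come out exact. The placements of the body and turret are made-up values too. The point is just that the names line up and the answer is what you'd expect.

```cpp
// Hypothetical minimal 2D types, using integers so the results are exact.
struct Vec { int x, y; };

struct Trans
{
    int m00, m01, m10, m11;   // 2x2 rotation part (multiples of 90 degrees only)
    int tx, ty;               // translation part

    // Inverse of a rigid transform: transpose the rotation, then
    // rotate and negate the offset. Swaps the ends of the name.
    Trans Invert() const
    {
        Trans inv;
        inv.m00 = m00; inv.m01 = m10;
        inv.m10 = m01; inv.m11 = m11;
        inv.tx = -(inv.m00 * tx + inv.m01 * ty);
        inv.ty = -(inv.m10 * tx + inv.m11 * ty);
        return inv;
    }
};

// Apply a transformation to a position.
inline Vec operator*(const Trans &t, const Vec &v)
{
    return Vec{ t.m00 * v.x + t.m01 * v.y + t.tx,
                t.m10 * v.x + t.m11 * v.y + t.ty };
}

// Concatenate two transformations: (a * b) * v == a * (b * v).
inline Trans operator*(const Trans &a, const Trans &b)
{
    Trans r;
    r.m00 = a.m00 * b.m00 + a.m01 * b.m10;
    r.m01 = a.m00 * b.m01 + a.m01 * b.m11;
    r.m10 = a.m10 * b.m00 + a.m11 * b.m10;
    r.m11 = a.m10 * b.m01 + a.m11 * b.m11;
    r.tx  = a.m00 * b.tx + a.m01 * b.ty + a.tx;
    r.ty  = a.m10 * b.tx + a.m11 * b.ty + a.ty;
    return r;
}

inline Vec TurretOriginInBodySpace()
{
    // Body sits at (10, 0) in the world, rotated 90 degrees counter-clockwise.
    const Trans world_from_body   = { 0, -1, 1, 0, 10, 0 };
    // Turret sits at (10, 2) in the world, unrotated.
    const Trans world_from_turret = { 1, 0, 0, 1, 10, 2 };

    // Inverting swaps the ends of the name...
    const Trans body_from_world  = world_from_body.Invert();
    // ...and now the inner "world" parts match up.
    const Trans body_from_turret = body_from_world * world_from_turret;

    const Vec origin_turret = { 0, 0 };
    const Vec origin_body   = body_from_turret * origin_turret;
    return origin_body;
}
```

The turret is 2 units along world +Y from the body, and the body's X axis points along world +Y, so the turret origin lands at (2, 0) in body space.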

So sticking closely to this sort of naming scheme minimises confusion and means that it's fairly simple to check the maths by checking the grammar.
This is another in the series of posts where I discuss things that are really common and well-known inside the games industry, but which don't seem to be well-documented outside it, and may be useful to people who have learned to code and are now starting on some larger projects. As always with these, there are loads of different ways to do things, and there's no "always right" answer, so feel free to tailor them to your specific situation.


My game world mechanics are now far enough along that I'm trying bigger worlds. This leads to two obvious problems - memory and performance. After a bunch of neat algorithmic stuff which I'll write up in good time, both were still annoyingly poor. In particular, loading, saving and even quitting the game was annoyingly slow. Quitting the game took up to 90 seconds on a large world even in release build! What could it be?

The answer was memory allocations, and the lion's share of that time was in the C runtime (CRT) debug heap. Every time you run under the debugger (release or debug build), it hooks your allocations and does things like fill new allocations with 0xCDCDCDCD, and on free it checks you haven't scribbled off the end. This is great and catches some otherwise really hard-to-find bugs, but costs performance.

You can turn it off if you really need to, but I prefer not to. Finding evil scribblers or helping find uninitialized data is really useful, but I find the performance hit also emphasizes when you're doing way too much memory management and burning performance. It's much better to know about the problem early than to hide it and discover it late in the day. And it did indeed help me find the problem! So let's fix it now, rather than wait until the code is so complex that doing something as fundamental as changing memory allocation strategies is too terrifying to contemplate.

It turns out the reason I'm doing so much memory allocation and freeing is due to two patterns that will crop up many times in many projects. It's really no surprise to see them here, and I always knew I'd have to address this eventually. Now is as good a time as any.

''Fixed-size block pools, aka "Memory Stacks"''

Every object in the game - a tree, a puddle of water, a rock, and so on - is in a single struct called a ~MapObject. Some special objects need more specialized storage for more data, such as dwarves and trains, and they have pointers to optional data structures hanging off ~MapObjects. But the vast majority of things in the world share this one structure. And of course as things move around - trees grow fruit, the fruit falls on the ground then becomes a seed and then a tree, rain falls and forms puddles, puddles merge to form water that flows - there's a lot of allocation and freeing of these structures. But there's really no need to return this data to the general heap all the time - these objects comprise the majority of memory in the game, and it's unlikely that I'll need a ton of memory at some point when there ISN'T also a ton of objects in the world.

The solution is to keep a special pool of just these fixed-size objects - of just ~MapObjects. These have lots of names, but I call them "memory stacks" because the free items are kept in a LIFO stack. (the words "pool" "heap" "stack" and "arena" in theory have specific technical meanings, but in practice everyone means a completely different specific thing by them, and even dusty old textbooks disagree, so the nomenclature is a complete mess)

The first item in the free stack is pointed to by a global inside the ~MapObject class:
{{{
    static MapObject *g_pFirstFreeObject;
}}}
When I "free" a ~MapObject, instead of calling free() or delete, I push it onto the stack like so:
{{{
    *((MapObject**)this) = g_pFirstFreeObject;
    g_pFirstFreeObject = this;
}}}
...that is, I reuse the first 4/8 bytes of my now-unused ~MapObject memory as a pointer to the next item in the free stack.

To "allocate" a new ~MapObject is usually the reverse - return the top of the stack, and move the top-of-stack pointer down the linked list.
{{{
    if ( g_pFirstFreeObject == NULL )
    {
        ...allocate some more...
    }

    MapObject *pObj = g_pFirstFreeObject;
    ASSERT ( pObj != NULL );
    g_pFirstFreeObject = *((MapObject**)g_pFirstFreeObject);
}}}
The details of "allocate some more" vary according to your taste, and right now I'm just being greedy and grabbing another chunk of ({{{PoolBlockAllocCount}}} == 256), with no attempt to be able to recover the data in the future and actually free it. I can change this later, since it's now hidden from the rest of the code. So the full allocation routine is as follows:
{{{
    if ( g_pFirstFreeObject == NULL )
    {
        char *pMemBlock = (char*)malloc ( sizeof(MapObject) * PoolBlockAllocCount );
        g_pFirstFreeObject = (MapObject*)pMemBlock;
        for ( int i = 0; i < PoolBlockAllocCount-1; i++ )
        {
            char *pNextBlock = pMemBlock + sizeof(MapObject);
            *((MapObject**)pMemBlock) = (MapObject*)pNextBlock;
            pMemBlock = pNextBlock;
        }
        *(MapObject**)pMemBlock = NULL;    // end of the linked list of free objects
        g_TotalNumObjectsAllocated += PoolBlockAllocCount;     // just for debug information
    }

    MapObject *pObj = g_pFirstFreeObject;
    ASSERT ( pObj != NULL );
    g_pFirstFreeObject = *((MapObject**)g_pFirstFreeObject);
}}}

Note that recovering this memory later is tricky - you'd need to look at each chunk of 256 objects and search for them all on the free stack. If you find them, unlink them all and only then can you actually free() that chunk. In my case that's far more effort than I'm willing to put in right now, and I'm unlikely to need to recover the memory.

I could also add debugging features just like the CRT, so that I write 0xCDCDCDCD when I pop new ~MapObjects off the free stack, and write 0xDDDDDDDD to them when I push them back on the free stack. But I haven't had any scribbler problems with ~MapObject-related code, so it seems low-priority. I can do this later if I need to.
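
Putting the push and pop snippets together, the whole memory stack can be sketched as a pair of free functions. This is a minimal sketch, not the game's actual code: the ~MapObject fields are made up, {{{AllocMapObject}}} and {{{FreeMapObject}}} are hypothetical names, and the debug fills are the optional ones described above.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the real MapObject. It only needs to be at
// least as large as a pointer so the free-stack link fits inside it.
struct MapObject
{
    int   type;
    float x, y;
    void *pExtraData;
};
static_assert(sizeof(MapObject) >= sizeof(void*), "free-stack link must fit");

static const int PoolBlockAllocCount = 256;
static MapObject *g_pFirstFreeObject = nullptr;
static int g_TotalNumObjectsAllocated = 0;     // just for debug information

// Pop one object off the free stack, growing the pool if it is empty.
MapObject *AllocMapObject()
{
    if (g_pFirstFreeObject == nullptr)
    {
        // Grab a new chunk and thread every slot onto the free stack.
        char *pMemBlock = (char*)malloc(sizeof(MapObject) * PoolBlockAllocCount);
        assert(pMemBlock != nullptr);
        g_pFirstFreeObject = (MapObject*)pMemBlock;
        for (int i = 0; i < PoolBlockAllocCount - 1; i++)
        {
            char *pNextBlock = pMemBlock + sizeof(MapObject);
            *((MapObject**)pMemBlock) = (MapObject*)pNextBlock;
            pMemBlock = pNextBlock;
        }
        *((MapObject**)pMemBlock) = nullptr;   // end of the linked list
        g_TotalNumObjectsAllocated += PoolBlockAllocCount;
    }

    MapObject *pObj = g_pFirstFreeObject;
    g_pFirstFreeObject = *((MapObject**)pObj);
    memset(pObj, 0xCD, sizeof(MapObject));     // optional debug fill: "uninitialised"
    return pObj;
}

// Push an object back onto the free stack. Nothing is returned to the heap.
void FreeMapObject(MapObject *pObj)
{
    memset(pObj, 0xDD, sizeof(MapObject));     // optional debug fill: "freed"
    *((MapObject**)pObj) = g_pFirstFreeObject; // link goes in after the fill
    g_pFirstFreeObject = pObj;
}
```

Because the stack is LIFO, the object you freed most recently is the next one handed out, which is also nice for the cache since that memory was probably touched a moment ago.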

This helps in four ways:
1. I never free() this memory, so the perf hit from that is gone.
2. Most of the malloc() calls now just happen at start of day rather than every turn, so that perf hit is almost gone.
3. When I do malloc() more, I do so in big blocks of 256 at a time, so it's called far less frequently.
4. When you malloc() any chunk of memory, there's some extra memory used inside the allocator to track that block. Doing 256 times fewer allocations means there's much less of that overhead, so less total memory is used for a given map size.

So this was a nice win. On highly dynamic scenes (floods, rain, landslides) where there's a lot of ~MapObjects being created and freed, I get a decent perf win. Also, on larger maps the memory saving is handy - it's only about 5%, but every little helps.

This technique can be used for any fixed-size block, and usually you'll have a separate heap for each size of block, and there's a few other things in the game I might apply this method to later. The first thing everyone uses these for in a game is usually particle systems, since the number of those can bounce up and down dramatically in just a few frames, but they're all pretty tiny. But I don't have any particles in this game yet (which is a total pixel-porn fail, I know), so that's yet to come. And as it turned out, a much bigger win came from the second very common data structure:

''Revisiting resizable arrays''

I've talked about these before in my post on [[Resizable arrays without using STL]]. And I use them all over the place, and they're great. The specific place I use it most is that each hex in the world has a list of pointers to the ~MapObjects in that hex. Each hex can have many objects in it at once, for example a single hex can contain a tree, a dwarf, a log of wood and a puddle of water all at the same time. Hexes that are store-rooms can contain tens or hundreds of objects. That's why I need a resizable list of pointers.

The thing is that 99% of my hexes don't contain multiple things - they contain either nothing, or a bit of dirt/rock/sand, or a bit of water. And 0.9999% of the remainder contain maybe only two or three items. As noted in the other post, my resizable arrays have a reserved size and a used size that can be different, and a pointer to the allocated list itself. It looks like this:
{{{
template <class T>
class ArbitraryList
{
    int iReservedSize;          // The current reserved size of the list.
    int iSize;                  // The current size of the list.
    T   *pT;                    // The list.
}}}
...where T in this case is (~MapObject*), i.e. pT points to an array of pointers. The memory that pT references needs to be a separate allocation so I can resize it if I need a bigger array.

Having a separate used-size and reserved-size means I start the array off at 4 pointers in size, so when an object moves from one hex to another, although the used size might change from 1 to 2, I'm not resizing the array itself, that remains 4 pointers long. But nevertheless every hex on the map that has anything at all in it not only has the ~ArbitraryList object inside it, it also has that allocated block of memory pointed to by pT, and that is a call to the allocator, and a memory-tracking overhead, just to (in the usual case) get space for 4 pointers.

The change I made to reduce this cost is to have a small amount of space pre-reserved inside the ~ArbitraryList structure itself - it gains one new member:
{{{
template <class T>
class ArbitraryList
{
    T   Static[PreallocSize];   // pT may point to the start of this.
    int iReservedSize;          // The current reserved size of the list.
    int iSize;                  // The current size of the list.
    T   *pT;                    // The list.
}}}
The Static[] array is new, and is a place to store data if the list is small. Previously, at start of day pT=NULL and iReservedSize=0. Now, pT=Static and iReservedSize=~PreallocSize. Almost all the list-manipulation functions don't know or care - they just indirect off pT as usual. If we ever need to resize the allocated list (either up or down), all we need to do is instead of always calling free(pT), we first check if pT==Static, and if it does, don't try to free that memory! When allocating space for the new list, if the new size is ~PreallocSize or less, instead of allocating new memory, just point pT back at Static.

But why is this a win? I just grew every list in the game by sizeof(T)*~PreallocSize - isn't that a waste of memory? Well, not really. Most of the times I use a list, it has at least one item in it - very few lists are actually empty. And I already have a minimum size that I allocate that list to, to avoid constant reallocation. So almost all my lists previously had an array of that same size allocated - it was just a separate chunk of memory instead of being inline. And whenever you allocate memory, the memory manager needs to reserve a header to track that allocation, which is extra overhead. So the only time I actually waste memory is if the list is larger than ~PreallocSize, at which point I have that static array AND the real allocated list that pT is pointing to. But this is the uncommon case, and of course if I do have a large list, I also have a large number of things hanging off that list, so the overhead of the static array (in this case, it's the size of 4 pointers) isn't that significant. But the common case is improved in a bunch of ways:

1. When allocating the object containing the ~ArbitraryList, it's a single allocation, not two, so it's faster.
2. Similarly when freeing the object, it's only one free, not two.
3. Avoiding the list allocation means we avoid the memory-manager header overhead.
4. If the ~ArbitraryList is inside something that uses fixed-size object pools (see above), there's no malloc()/free() calls at all, the whole chunk just gets pushed or popped from the stack of free objects.
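
The pT==Static check is easier to see in code. Here is a cut-down sketch of the idea, assuming T is something trivially copyable like a pointer. The method names ({{{AddItem}}}, {{{RemoveLast}}}, {{{Resize}}}, {{{IsInline}}}) and the template parameter for ~PreallocSize are made up for this example and are not the real ~ArbitraryList interface.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Sketch only: T must be trivially copyable (e.g. a pointer), because
// the contents are moved with memcpy.
template <class T, int PreallocSize = 4>
class ArbitraryList
{
    T   Static[PreallocSize];   // pT may point to the start of this.
    int iReservedSize;          // The current reserved size of the list.
    int iSize;                  // The current size of the list.
    T   *pT;                    // The list.

    // Move the contents into storage big enough for iNewReserved items.
    void Resize(int iNewReserved)
    {
        T *pNew = (iNewReserved <= PreallocSize)
                    ? Static
                    : (T*)malloc(sizeof(T) * iNewReserved);
        assert(pNew != nullptr);
        if (pNew != pT)
        {
            memcpy(pNew, pT, sizeof(T) * iSize);
            if (pT != Static) free(pT);      // never free the inline array!
            pT = pNew;
        }
        iReservedSize = (pNew == Static) ? PreallocSize : iNewReserved;
    }

public:
    ArbitraryList() : iReservedSize(PreallocSize), iSize(0), pT(Static) {}
    ~ArbitraryList() { if (pT != Static) free(pT); }
    // pT may point into *this, so a plain member-wise copy would be wrong.
    ArbitraryList(const ArbitraryList&) = delete;
    ArbitraryList &operator=(const ArbitraryList&) = delete;

    void AddItem(const T &item)
    {
        if (iSize == iReservedSize) Resize(iReservedSize * 2);
        pT[iSize++] = item;
    }
    void RemoveLast()
    {
        assert(iSize > 0);
        iSize--;
        // Small enough again: drop back to the inline storage.
        if (iSize <= PreallocSize && pT != Static) Resize(PreallocSize);
    }
    int  Size() const      { return iSize; }
    bool IsInline() const  { return pT == Static; }
    T   &operator[](int i) { assert(i >= 0 && i < iSize); return pT[i]; }
};
```

One thing this sketch does badly: a list that bounces between 4 and 5 items will malloc and free on every bounce. A real version would probably want some hysteresis before shrinking back to the inline array.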

There are lots of extra enhancements you can make. If zero-sized lists are a common thing and the static array is too much of a memory hit, you could combine the two techniques. Instead of each list having a static array inside it, have a fixed-size-block heap of sizeof(T)*~PreallocSize. Then each list can be in three states:

If iReservedSize == 0, pT is NULL.
If iReservedSize == ~PreallocSize, pT points to a fixed-size-block from that heap.
If iReservedSize > ~PreallocSize, pT points to actual allocated memory.

Alternatively you could hide iReservedSize inside the Static[] block, since you never need both at once. Which saves 4 bytes, but makes the logic and debugging trickier.

However do realize that good memory managers also have things like fixed-size-block heaps internally, so don't get too complex or you're just replicating work - and they've probably done far more analysis than you have on memory allocation patterns. The reason to do simple things like these two techniques is because you have implicit semantic knowledge at compile time that a general allocator can't get at, or can make assumptions (such as never needing to aggressively free up ~MapObject memory) that a general system can't.

The end result of both these bits of work was that a large map roughly halved in memory footprint, which is pretty dramatic, and the time taken to delete everything in the world dropped from 90 seconds to 10 (which still sounds a lot, but it's using a very generic routine that maintains the consistency of the world after removing every single object, which is obviously a waste when just destroying the entire thing).
Moore's Law (even when properly stated!) is a wonderful thing. Code gets faster and faster and it allows us to do more stuff than before, or we can do the same stuff but with less effort. One of the ways we can use less effort is to use higher-level languages, and they let us do the same amount of coding but quicker (time to market) or we can do more of it and handle more complex projects.

That's all great, but //by how much?// Moore's Law implies a doubling every 18 months or so, but what do we actually get? I'm going to use five systems I know pretty well - the Z80 in the ZX Spectrum, the 68000 in the Atari 520STFM, the Pentium, the ~PowerPC inside the Xbox 360, and the latest Nehalem. Numbers might be approximate in places for various reasons (e.g. my memory is crap), and I'm going to only focus on integer performance to keep the playing field level.

Z80: The [[ZX Spectrum|http://en.wikipedia.org/wiki/ZX_Spectrum]] launched in 1982 with a 3.5MHz [[Zilog Z80|http://en.wikipedia.org/wiki/Z80]], and it was my first computer, and it was pretty awesome for its time. It certainly launched a lot of British coding careers. Each simple instruction needed around 4 clocks to execute. No cache.

68000: [[The Atari ST|http://en.wikipedia.org/wiki/Atari_st]] launched in 1985, though I was a late-adopter and got mine in 1987 (just months before the Amiga A500 came out!). I'm going to call 1986 the first year it was cheap enough to become common with the launch of the first real "games machine" version - the 520STFM. It had an 8MHz [[Motorola 68000|http://en.wikipedia.org/wiki/Motorola_68000]], and each simple instruction also took 4 clocks to execute. No cache.

Pentium ~P54C: I remember consciously buying my first PC with a 386SX25 and at some point I had a 486DX66 (I remember because the turbo button toggled the stupid 7-segment display on the front panel between 66 and 8 - how cheesy!), but my memory of specific machines is hazy after that as I changed them so often at both work and home, but at some point I clearly must have owned a Pentium 1 of some description. However, the reason I pick the [[P54C|http://en.wikipedia.org/wiki/P5_%28microprocessor%29#P54C]] is that's what we based the first Larrabee core on, and in terms of integer performance we didn't change much apart from frequency, so I know it well. It can issue two simple instructions (load, store, arithmetic) per clock, and I'm going to choose the first 75MHz version as the data point. It was released in 1993, but only widely available in 1994. These had a built-in L1$, but most motherboards did not have the optional external L2$. I have been informed that memory latency was around 15 clocks on most motherboards, though I never measured it myself.

Xenon: the [[360's PowerPC|http://en.wikipedia.org/wiki/Xenon_%28processor%29]] is not a great core - I think it would have run twice as fast if they'd designed it to be clocked at half the speed. But for these purposes I'm going to ignore my gripes with it and (like all the other cores here) focus on reasonable peak throughput. Released in late 2005, it's dual-issue at 3.2GHz. It had an L1$ and L2$, and main memory latency was around 500 clocks (rather high even for the time because all memory requests had to go via the GPU, and there was also some encryption hardware adding a bit of latency in there).

Nehalem: the latest released Intel CPU architecture (Westmere is a shrink, so not an //architecture// as such, and Sandy Bridge is not yet officially released), [[Nehalem|http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29]] is capable of sustaining 4-wide execution of simple instructions (to compare properly with all the other cores in this list). It has L1$, L2$, and a big shared L3$. The data from a helpful [[Anandtech test|http://www.anandtech.com/show/2542/5]] says a 2008 Nehalem clocked at 2.93GHz has a main memory latency of 46.9ns (or 137 core clocks), and if you think that number is really low compared to the Xbox, you're right - keeping it low was a big focus for that team.

OK, so where did that trip down memory lane get us?

{{{
Name     Year    MHz   CPI      MIPS     Mem latency    Latency in ns
Z80      1982    3.5    4          0.9      4           1143
68k      1986      8    4          2        4            500
P54C     1994     75    0.5      150       15            200
Xenon    2005   3200    0.5     6400      500            156
NHM      2008   2930    0.25   11720      137             47
}}}
(~MHz = clock speed, CPI = Clocks Per Instruction, MIPS = Million Instructions Per Second)

We basically got 10,000 times more instructions per second, but memory latency only improved by a factor of about 25. So that's about a 1.44x growth in instruction speed per year, but only a 1.13x growth in memory speed. Practical processing power has grown even faster because of things like having more bits, floating point, SIMD, etc but ignore those for now.
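
For anyone who wants to check the per-year figures, they are just the 26th root of the overall ratio between the Z80 (1982) and Nehalem (2008) rows of the table. A small sketch, where {{{AnnualGrowth}}} is a made-up helper name:

```cpp
#include <cassert>
#include <cmath>

// Annualised growth factor that turns `first` into `last` over `years` years,
// i.e. the value g such that first * g^years == last.
double AnnualGrowth(double first, double last, double years)
{
    return std::pow(last / first, 1.0 / years);
}
```

Feeding it the MIPS column (0.9 to 11720 over 26 years) gives about 1.44, and the latency column (1143ns down to 47ns, so pass them the other way round) gives about 1.13.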

We should be able to use all that power to help us write more complex programs, not just run the same programs faster. But just how far can we go with the languages? I took two common operations - member access of an object, and a function call, and looked at how many instructions and memory accesses that were likely to miss the cache it took to do each of them in three different languages - plain old C (standing in for assembly on the oldest two ~CPUs), C++ with multiple inheritance, and a duck-typed language such as Python. I came up with the following:

{{{
                Instructions   Memory misses
Member access
C                     1               1
C++ MI                4               2
Duck-type            20               3

Function call
C                     5               1
C++ MI                8               2
Duck-type            30               3
}}}

The instruction counts for the duck-typing are basically a guess, but it turns out they don't affect the answer that much anyway. I'm also not sure if there's 3 dependent memory accesses or 4 for duck-typing - I couldn't actually find a low-level description of what the common operations are. Saying "it's a hash table" doesn't narrow it down all that much. If anyone does actually know what specific algorithms are used in common duck-type languages (and I ignored the fact that most of them are bytecode being run through an interpreter rather than native code), let me know what the numbers should be.

I assumed the "language of choice" on the first three ~CPUs was C, on the Xenon it's C++ and on Nehalem it's duck-typing. So a table showing thousands of the operation in question per second, and the speedup relative to the Z80:

{{{
              Member access    Function call
Z80 (C)         440    1.0       150    1.0
68k (C)        1000    2.3       330    2.3
P54C (C)       4800   11.1      4300   29.4
Xenon (C++)    3200    7.3      3200   21.9
NHM (DT)       7000   16.1      7000   48.0
}}}
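
The numbers in that table can be reproduced, to within rounding, with a very simple cost model: the time for one operation is the instruction time plus one full memory latency per cache miss. That formula is inferred from the tables rather than spelled out anywhere above, so treat this as a sketch under that assumption ({{{KiloOpsPerSecond}}} is a made-up name):

```cpp
#include <cassert>
#include <cmath>

// Assumed model (inferred from the tables, not stated in the text):
//   time per operation = instructions / MIPS + cache misses * memory latency
// Returns thousands of operations per second.
double KiloOpsPerSecond(double mips, double latencyNs, int instructions, int misses)
{
    double instrNs = instructions * 1000.0 / mips;  // one instruction at 1 MIPS takes 1000ns
    double totalNs = instrNs + misses * latencyNs;
    return 1.0e6 / totalNs;                         // 1e9 ns per second, divided by 1000
}
```

For example, member access in C on the ~P54C is 1 instruction at 150 MIPS plus 1 miss at 200ns, which comes to about 4840 thousand per second and is shown as 4800 in the table. Notice how the instruction term is almost noise next to the miss term on the newer ~CPUs, which is why the guessed instruction counts for duck-typing barely matter.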

So from Z80 to ~P54C we got 11x the perf in member access and 30x the perf in function calls sticking with the same language. We can clearly see that using multiple inheritance on the Xenon is a bad idea - it has pretty large memory latency and it suffered. According to these (admittedly rather pessimistic) numbers, absolute performance actually decreased from using plain C on the Pentium, and that reflects what a lot of people actually see if you don't watch what you're doing. But the Nehalem has much better memory latency, so that should save us, right? Well, if you switch to duck-typed languages at the same time, it doesn't actually help much. Function calls are a bit faster because the overhead of the actual call itself (pushing stuff onto the stack etc) took significant time on the older ~CPUs, but now the 4-wide Nehalem just gobbles them up. What it can't get rid of are the extra indirections, and it means there's no significant difference between accessing a member and calling a member function!

More interestingly, this implies that performance has barely moved going from C on a Pentium to Python on a Nehalem. Is it really true that taking code you wrote in 1994 and porting it to Python means it will run at basically the same speed? No, of course it's not that bad - I've clearly approximated wildly for simplicity and to make my point. Nehalem has some gigantic caches on it - its combined L1 caches are as large as the entire addressable memory on a Z80. The L3$ is 2Mbytes per core and it's distributed so if you have a four-core chip it does actually have 8Mbytes total. A lot of those memory accesses will manage to hit in the L3$, and that's only 39 cycles away.

But nevertheless the point is still there. Memory speeds are scaling very slowly compared to our common perception of processing speed, and even though we all know this, we sometimes lose track of the relative magnitudes, assuming that the new features are only a slight performance hit. The recent "Data Oriented Design" movement from folks like Mike Acton is basically trying to remind people that all this extra performance is contingent on caches working well, and that even code we think of as a minor processing load can start becoming a significant performance hog if it misses the cache all the time. These chicken-scratch numbers show that the sophisticated language features are basically becoming mainstream as fast as processing power can increase.

(warning - extreme speculation ahead!) That might not be too surprising though if the size of the job is limited by other things. Like, say, the amount of complexity that the player's brain can cope with in a minute. Maybe what's happening is that the amount of gameplay-related code happening per minute is asymptotically approaching these various limits, and as it does we find that we can use this extra power to write that code in higher-level languages and increase productivity, iteration speed and robustness. That's not a bad thing in and of itself when confined to code that is inherently limited by the human on the other end of the controller (and thus not following Moore's Law), but it does cause problems as other areas of code adopt these languages, either because of coder choice (the languages are more productive) or because of interoperability with the gameplay code (as opposed to using two different languages duct-taped to each other with something like SWIG).

That's when Acton and the DOD folk get ranty, as it can be very difficult to recover this performance. It's not performance tuning as we used to understand it - focused in the traditional "kernel" areas of high number crunching (graphics, animation, sound, raycasting, etc), or even in certain "hot functions" that you can see in a profiler and sit and stare at for a week. It's all over the code - every function call, every member access. None of it shows up well on traditional profiling tools, and because it's all to do with how nice you are to caches, which are unpredictable and temperamental beasts, it's incredibly sensitive to both Heisenberg and Butterfly effects.

The traditional method of games production is to write the thing, run it at low rez or with simple assets, pretty much ignore the fact that the thing runs slowly because 90% of that time is the unoptimized number-crunching kernels, and assume that in the last quarter or so the performance ninjas will optimize the hell out of them and quadruple the speed. Instead what's happening is that during development of the game only about half the execution time is spent in kernels, and when the ninjas come in and solve that half you only get a modest framerate boost, a profile as flat as a pancake, no idea where to even start looking for perf, and now the ninjas are screaming at some sobbing level designer about the speed of his Python hash table lookups.
Volker Schoenefeld took pity on my pathetic maths knowledge. I used to know all this stuff back in university, but it's atrophied. He sent me some links to some excellent tutorials and papers he's done on SH and the maths behind them. Probably the best place to start is his paper at [[http://heim.c-otto.de/~volker/prosem_paper.pdf|http://heim.c-otto.de/~volker/prosem_paper.pdf]]. He also has a Flipcode entry about it: [[http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=65274|http://www.flipcode.com/cgi-bin/fcarticles.cgi?show=65274]] and the source to go with it [[http://heim.c-otto.de/~volker/demo.tgz|http://heim.c-otto.de/~volker/demo.tgz]]. Many thanks to you, Volker.
Well, it never rains but it pours. OK, not a true reinvention this time, but a sort of co-invention. Sony just announced their EDGE libraries recently and in them was a snazzy new vertex-cache optimiser, and some people at GDC asked me whether it was my algorithm. I had no idea - haven't written anything for the ~PS3 for ages, but I knew Dave Etherton of Rockstar ported [[my algorithm|Vertex Cache Optimisation]] to the ~PS3 with some really good results (he says 20% faster, but I think that's compared to whatever random ordering you get out of Max/Maya).

Anyway, the EDGE one is actually a version of the [[K-Cache algorithm|http://www.ecse.rpi.edu/~lin/K-Cache-Reorder/]] tweaked and refined by [[Jason Hughes|http://www.dirtybitsoftware.com/]] for the particular quirky features of the post-transform cache of the RSX (the ~PS3's graphics chip). Turns out we do very similar broad-scale algorithms, but with slightly different focus - they tune for a particular cache, I try for generality across a range of hardware. It would be academically interesting to compare and contrast the two. However, they both probably achieve close enough to the ideal ACMR to ensure that vertex transformation is not going to be the bottleneck, which is all you really need.

Remember - after ordering your triangles with vertex-cache optimisation, you then want to reorder your vertex buffer so that you're using vertices in a roughly linear order. This is extremely simple - it's exactly the algorithm you'd expect (scan through the tris, first time you use a vert, add it to the VB) but it's as much of a win in some cases as doing the triangle reordering.
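
In case "exactly the algorithm you'd expect" isn't what you expected, here is a sketch of that vertex buffer reordering pass. The function name and the choice of 32-bit indices are just for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rewrites `indices` in place so that vertices are numbered in order of first
// use, and returns the remap table: result[newIndex] = oldIndex.
// The caller then builds the new VB as newVerts[i] = oldVerts[result[i]].
std::vector<uint32_t> ReorderVerticesByFirstUse(std::vector<uint32_t> &indices,
                                                uint32_t numVerts)
{
    const uint32_t kUnused = 0xFFFFFFFFu;
    std::vector<uint32_t> oldToNew(numVerts, kUnused);
    std::vector<uint32_t> newToOld;
    newToOld.reserve(numVerts);

    for (uint32_t &index : indices)
    {
        assert(index < numVerts);
        if (oldToNew[index] == kUnused)
        {
            // First time this vert is used: it gets the next slot in the new VB.
            oldToNew[index] = (uint32_t)newToOld.size();
            newToOld.push_back(index);
        }
        index = oldToNew[index];
    }
    return newToOld;
}
```

A side effect is that any vertices not referenced by a triangle simply don't appear in the remap table, so they get dropped from the new VB.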
Mucky Foot Productions Ltd - a games company in Guildford, a few miles south west of London. Guildford used to be the heart of UK indie games development, but it's all gone a bit quiet now. I worked there from late 1999 through late 2003, when it all went pear-shaped and the company folded. Worked on some fun games.

I joined in the closing stages of [[Urban Chaos]] (PC, ~PS1), and then did the job of porting it to the Dreamcast (which was a fabulous machine to work on). A fun game, if a little random in places. But full of character. To my knowledge, Darci is still the gaming world's only female black cop protagonist.

Then I joined the StarTopia (PC) team and took over the graphics engine, which I still play with today (http://www.eelpi.gotdns.org/startopia/startopia.html). It's a fantastic "little computer people" game - available for about $3 on Amazon these days.

Then we did [[Blade II]] (Xbox1, ~PS2) - I was mainly the Xbox1 engine side of things. Not a great game, but I'm pleased with the way the graphics turned out - I managed to put in a lot of nice tech such as auto-generated normal maps, sliding-window VIPM for most meshes, and almost every asset was streamed from disk (meshes, textures, sounds, physics hulls).

After that we started two games in parallel with a new cross-platform engine which I headed up, but we only got a year or so into development before everything collapsed. Bit of a shame really - the engine was shaping up fairly well. Over the years I've tried to document some of the techniques used - click on the Streaming and VIPM tags in the right-hand menu.
This is the story of a fun bug. The challenge for the reader here is twofold. First, spot the bug before I give the secret away (and the title should have given a huge clue already, and filled those in the know with both dread and giggles). But second, and more pragmatically interesting to working programmers - figure out why this one took me several hours to narrow down and fix, rather than being immediately obvious after a bit of standard single stepping and sprinkling of asserts and all the standard bug triage that usually works pretty quickly.

So the context is some simple image processing. The high-level flow is this:
{{{
Read RGB image
Convert to YUV
Do some simple processing
Convert to sRGB
Quantise to lower-bit-per-pixel format
Dequantise and display to verify it all worked
}}}
The result was that the image was kinda right, but the quantisation was very jumpy, and the contrast of the image tended to go up and down unpredictably. I expected it to slowly change as the dynamic range in the visible scene changed - as very bright or dark things came in and out of view - that's what it's meant to do. But what actually happened was even tiny changes in the input caused large changes in the quantisation quality. The first thing I tried was disabling the "simple processing" part so it did nothing (which was the normal state anyway - it was for special effects). Same result.

I then narrowed it down to the part of the quantisation where I scanned the image and found the min and max of each channel. If I didn't do that and just set the min and max values to 0.0 and 1.0 for each channel, everything was stable - though obviously the quantisation isn't giving particularly good results if we do that. Now as it happens, what the code actually did was step through each pixel, do the YUV conversion and processing, and while there I found the min/max values, i.e.
{{{
    vec3 max = -FLT_MAX;
    vec3 min = FLT_MAX;
    for ( int i = 0; i < imagesize; i++ )
    {
        vec3 pix = image[i];
        pix = to_yuv(pix);
        pix = processing(pix);
        pix = to_rgb(pix);
        vec3 gamma = powf ( pix, 1.0f/2.2f );
        max = Max ( gamma, max );
        min = Min ( gamma, min );
        ... more stuff ...
    }
    ... use min+max to quantise the image... 
}}}
(note the use of -~FLT_MAX, and ''not'' ~FLT_MIN. That has bitten me before as well. Go look it up. It's not the value you think it is! Use of ~FLT_MIN is almost always a bug. Maybe go grep your codebase for it just to check. Anyway, that's not the bug here)
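
If you can't be bothered to look it up: ~FLT_MIN is the smallest //positive// normalised float, so it is a terrible starting value for a running max. A quick sketch that makes the compiler confirm it (the {{{kMaxSeed}}} and {{{kMinSeed}}} names are made up for this example):

```cpp
#include <cassert>
#include <cfloat>
#include <limits>

// FLT_MIN is a tiny positive number, not the most negative float.
static_assert(FLT_MIN > 0.0f, "FLT_MIN is positive");

// The most negative finite float is -FLT_MAX, also spelled lowest() in C++11.
static_assert(std::numeric_limits<float>::lowest() == -FLT_MAX,
              "lowest() is the most negative finite float");

// Safe starting values for a running max and min over finite data.
const float kMaxSeed = -FLT_MAX;   // every finite value is >= this
const float kMinSeed =  FLT_MAX;   // every finite value is <= this
```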

Well, debugging a bunch of input images showed pretty sensible values for min and max coming out, so it wasn't horribly broken. But it was still unstable, and further investigation showed the quantisation was having to clamp values outside the range of min...max (that code was in there as a safety net, but shouldn't have been needed in this case). I single-stepped through this loop for a bunch of representative pixels and it all seemed to be working as expected, and yet min and max were clearly not producing the right answers in the end. How could this possibly be? The code's just not that complex! I started to worry about a memory scribbler playing with the values of min and max on the stack, and other terrifying thoughts.

Then I happened to pick an image where the very first pixel was a fairly primary colour - let's say it was red: (1,0,0). It gets converted into YUV, which is some unintuitive triplet of numbers, and then back to RGB. The conversion to and from YUV was not quite perfect, because hey welcome to floating point, and what we got back was more like (0.99997, 0.00001, -0.00001). But that's easily close enough for image work, especially if we're going to quantise the result. But then we convert to sRGB by the standard method of raising each number to the power of 1.0/2.2 (this is not strictly correct, but for these purposes it was close enough).

* Hey fellow floating-point nerds - spidey-sense tingling yet? Good. So that's the first part - the what - but why did it take so long for me to track this down? What would you expect the end results to have been? Surely I would have spotted that almost immediately, no? So there's clearly more fun to come.

For those not as familiar with floating-point arcana, let me explain. -0.00001 isn't very negative. Only a tiny bit. But that's enough. If you raise a negative number to a fractional power like 1.0/2.2, the result is undefined. powf() gives you back a result called a "~NaN" which stands for "not a number" and it's a floating point value that... isn't a number. That is, it's a value that says "you screwed up - we have no idea what this is". Another place they come up is dividing zero by zero (1/0 is infinity, but 0/0 just has no sensible answer). In practice in "normal" code I also regard infinity ("inf") in the same category as ~NaN. In theory inf is slightly better-behaved, but in practice it rapidly turns into a ~NaN, since 0.0*inf=~NaN, inf-inf=~NaN, and so on. They're still usually a sign you have buggy unstable code.

And once you have a ~NaN, usually they spread through your data like wildfire, because unless you're deliberate about filtering them out, you can't easily get rid of them. 1+~NaN=~NaN. 0*~NaN=~NaN, and so on. So once you get one ~NaN, almost any operation it does with another value will produce a ~NaN, and then any operations it does will cause ~NaNs and so on - they "infect" your data.
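
All of that is easy to check for yourself. This sketch assumes ~IEEE754 floats and no fast-math style compiler flags (those are allowed to break every one of these checks). {{{NaNDemo}}} is a made-up name, and it uses std::pow on floats rather than powf() purely for portability:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true if NaNs are generated and spread the way the text describes.
// Assumes IEEE 754 arithmetic with fast-math optimisations turned off.
bool NaNDemo()
{
    volatile float slightlyNegative = -0.00001f;  // volatile discourages constant folding
    volatile float zero = 0.0f;
    const float inf = std::numeric_limits<float>::infinity();

    // Negative base, fractional power: this is where the NaN is born.
    float bad = std::pow((float)slightlyNegative, 1.0f / 2.2f);

    return std::isnan(bad)
        && std::isnan(1.0f + bad)     // 1 + NaN = NaN
        && std::isnan(zero * bad)     // 0 * NaN = NaN
        && std::isnan(zero * inf)     // 0 * inf = NaN
        && std::isnan(inf - inf)      // inf - inf = NaN
        && std::isnan(zero / zero);   // 0/0 has no sensible answer
}
```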

In one respect this is maddening, but in other ways you do start to rely on it. If you see a ~NaN in the output of a function, you know that somewhere in that function you screwed up, and there's only a few places you can generate a ~NaN, so you check them first. Square roots, divides, power functions - things like that. Normal add/sub/mul can't make a ~NaN/inf without first being fed a ~NaN/inf. And if you don't see a ~NaN in the result, you're probably fine - because if one ''ever'' got generated, they're usually pretty hard to get rid of and they hang around.

And that's what should intuitively have happened here. I took the power of a negative number, got a ~NaN, oops my bad, and of course taking the min and max of a range of numbers, one of which is junk, is clearly also junk, and then all the stages after that should have produced utter gibberish, and I would have tracked it down really quickly, right? Of course! Except that didn't happen. The min and max results weren't junk - they were totally plausible values for an image - they were just not the correct values.

* You hard-core floating-point nerds are all laughing your asses off by now. But if you think you have the whole story - I promise - there's still more subtlety to come! For the rest - lemme 'splain some more.

I said ~NaNs don't usually keep quiet and stay where they are in your data - they infect and spread. But there is one very notable exception. Well, I guess two. The functions min() and max(). In fact there are lots and lots of versions of min() and max(), and all do subtly different things in the presence of ~NaNs. I could write a long and deeply nerdy post on all the varieties because we spent a few days figuring them all out when devising the Larrabee instruction set (and no less than three of those variants are ~IEEE754. That's the good thing about standards - there's so many of them!)

The details don't really matter to most coders, just be aware that all standard lessthan/equals/greaterthan comparisons with ~NaNs return false, except for "not equals" which is true. That is, when you ask the question "is x greater than a ~NaN" the answer is "no" because we have literally no idea what a ~NaN is - we don't even know its sign. Additionally, two ~NaNs of exactly the same bit pattern don't equal each other either, because they're not a "number" - not even a strange one like infinity - they're "don't know".
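
A small sketch of that rule, again assuming ~IEEE754 behaviour (the helper names are hypothetical):

```c
#include <math.h>
#include <stdbool.h>

/* Counts how many of the six comparison operators say "true" for x vs y. */
static int true_comparisons(float x, float y)
{
    return (x < y) + (x <= y) + (x > y) + (x >= y) + (x == y) + (x != y);
}

/* With a NaN on either side, "not equals" should be the only true one. */
static bool only_not_equals_is_true(float x, float y)
{
    return true_comparisons(x, y) == 1 && (x != y);
}
```

Feed it NAN (from math.h) on either side, or on both sides, and only "not equals" fires. Feed it two ordinary numbers and you always get exactly three true answers.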

OK, so the burning question is - how were Min() and Max() implemented? It turns out, in a fairly obvious way, by doing result=(x<y)?x:y. Some people don't like the ?: operator, and I'm not a great fan, but it turns out not to be the problem here, you'd get the same result with traditional if() statements. For ease of discussion, let me expand those macros out, and also change the loop so it's scalar rather than vector, and give it a very small input "image".
{{{
    float image[5] = { 0.0f, 0.1f, -0.0001f, 1.0f, 0.9f };
    float max = -FLT_MAX;
    float min = FLT_MAX;
    for ( int i = 0; i < 5; i++ )
    {
        float pix = image[i];
        float gamma = powf ( pix, 1.0f/2.2f );
        max = ( max > gamma ) ? max : gamma;
        min = ( min < gamma ) ? min : gamma;
    }
}}}
OK, well that's pretty simple - any decent coder should be able to have a go at figuring out the various possible values of min and max it might produce. This example also compiles, and you could run it and print out min and max and try to figure out why you get the results you do. I'll save you the bother - the answer is min=0.953 and max=1.0. How nerdy you are should determine how many layers of WTF this should cause.

The first quick scan would expect something like min=-0.0001, max = 1.0.

But I've already pointed out that raising -0.0001f to a power will produce a ~NaN, so next guess is you'd expect max and min to end up with ~NaNs, right? Perfectly reasonable guess. Just wrong. That would have been far too quick to track down, and I wouldn't have written this post about it.

OK, but I also told you how special Min() and Max() are, and indeed in some definitions they explicitly discard ~NaNs, giving you the other answer - the one that ''is'' a number. So as I said, the floating-point nerds probably thought they had this solved long ago. They'd expect the ~NaNs to be ignored, and so the expected result there is min=0.0, max=1.0. And that would have been a very subtle bug as well, but probably so subtle we'd have never even seen it, because that is really really close to the actual range of those numbers, and this being image-processing (especially with quantisation), close enough is just fine.

Ah, but this isn't ~IEEE754-2008 Min() and Max(), this is just ?: And they do ''awesome'' things in the face of ~NaNs. Let's single-step.

After the first loop, min=0.0, max=0.0. Easy, no problems.

After the second loop, min=0.0, max=0.351 (which is the surprisingly large result of gamma-correcting 0.1). So far so good.

Third loop, gamma=~NaN, and here's the obvious problem.
{{{
max = ( max > gamma ) ? max : gamma;
min = ( min < gamma ) ? min : gamma;
}}}
Is (0.351 > ~NaN)? No, it is not. Comparisons with ~NaN return false. So the new value of max is gamma, i.e. max=~NaN. Same thing happens with min. Well, that's what we were expecting alright. Our data is now screwed - they're both ~NaN, the world is a disaster, game over, the output is junk.

But wait, the answer wasn't junk. Eh? Somewhere, the ~NaNs vanished again. Let's do the next loop. So now gamma = 1.0, min&max=~NaN, and again:
{{{
max = ( max > gamma ) ? max : gamma;
min = ( min < gamma ) ? min : gamma;
}}}
Is (~NaN > 1.0)? Again, it is not - comparisons with ~NaN return false. So the new value of max is... oh, it's actually 1.0. The ~NaN vanished! And again, same with min. So now they're both 1.0.

And last loop is as you'd expect, max=1.0, min=0.953 (the result of gamma-correcting 0.9)

So if you think through the overall behaviour, the effect of the specific way this was coded was that ~NaNs ''were'' discarded (as long as the very last pixel wasn't a ~NaN-producer) and we ''did'' get the min/max of a range of values - but only the range of values ''after the last ~NaN''. How subtle is that? And the ~NaNs were due to slight math wobbles on certain colours during ~RGB->~YUV->RGB conversion, so they'd pop up in random places and come and go unpredictably, and every time they did an entire chunk of the image suddenly got ignored or added for the purpose of min/max and the quantisation and contrast would pop. This finally explains everything!

It's worth pointing out that this only happened because of the specific way min() and max() happened to be implemented in the library I was using. It is left as an exercise for the reader to consider what the results would have been for the equally valid, and you'd think ''completely identical'':
{{{
max = ( gamma > max ) ? gamma : max;
min = ( gamma < min ) ? gamma : min;
}}}
or a number of other possible alternatives such as using the SSE min/max assembly intrinsics.

Oh, and the actual bugfix - the solution to all this? Very simple - just clamp the result of the ~YUV->RGB conversion to 0.0 to 1.0, and a nice chunky comment block about WHY.
I will shortly be leaving Intel. Working on the architecture known variously as SMAC, Larrabee, MIC, Knights, and now Xeon Phi has been enormously rewarding - there is absolutely no other company on this planet that can take a few people's ideas and use them to change computing to such an extent. But I've discovered I'm really not a big-company sort of person, and seven years on one project is a personal record for me, so it's time to move on. And when Michael Abrash asks if you want to come work on [[wearable computing|http://blogs.valvesoftware.com/abrash/valve-how-i-got-here-what-its-like-and-what-im-doing-2/]] with him, there's really only one correct answer. It's the same answer I gave seven years ago when he asked me if I had any ideas about new x86 instructions...

I am immensely proud of what we achieved (and almost achieved) on Larrabee, and I think the architecture is clearly the blueprint for the future. Knights Corner is a great chip, and I'm really excited about what Intel will be able to show everyone when it starts shipping. I'm sad not to be able to be a part of that any more, but I know the team will go on to refine the architecture, and also proliferate the concepts further into general-purpose computing. The next few years are going to be really exciting as the different architectures battle it out.

In one way I'm sad to be leaving the world of hardware and instruction set design. But the stakes are high, and the pressure is too. I'm looking forward to being able to go back to what I was before - a games coder, cranking out pixels and maxing out the hardware you're given. There's significantly fewer 7am meetings for a start.
Look, please try to remember, my surname ''DOES NOT HAVE AN E IN IT''. There are approximately three times as many Forsyths in the London phone directory as Forsythes, so it's not exactly uncommon. I wouldn't mind as much if people didn't add it when I'm //spelling// my name out for them. I'm saying "eff oh are ess why tee aych........." and they write an "e" and then ask me "oh, no E?" Well - there was that big silent bit where I could have easily said "E". I didn't suddenly forget or anything. Now you're going to have to write that cheque/receipt/permit out for me again, aren't you?
[[http://notepad-plus.sourceforge.net/|http://notepad-plus.sourceforge.net/]]. Like notepad, but better. Does word-wrap right, has a bunch of neat bracket-matching options, and a whole bunch of useful stuff I haven't played with yet. Launches in an instant. It doesn't replace a good custom text editor, but it's ace for hacking text files in arbitrary languages (XML, etc) when you don't have a spare few hours for Visual Studio to start up.

And no, I'm not going to use emacs. Perverts.
I just know I'm going to get endless emails on the pedantry of this subject, but here's my brief round up of the background. You don't need to know any of this, it's just for geeky interest.

There's the [[Nyquist rate|http://en.wikipedia.org/wiki/Nyquist_rate]]. This has two meanings, but the interesting one is that if you have a limited bandwidth medium, you absolutely can't represent a signal of a frequency more than twice that of the medium. Put into graphics terms, you only have a certain number of pixels on the screen - that's your limited bandwidth medium. So you absolutely can't represent a signal (the texture) that is more than twice that size - there's no point even trying.

There's also the [[Nyquist frequency|http://en.wikipedia.org/wiki/Nyquist_frequency]]. It says a subtly different thing, which is that if you want to represent a signal of a certain frequency, you have to sample it at twice that frequency to avoid aliasing. However, this is not actually that relevant to texturing, because generally our signal is real life, which is much higher than can be achieved on a pixel display. So we're concerned with how to give a reasonable representation of a signal that is far beyond the capabilities of our hardware, not how to choose hardware to completely represent a signal.

Some people also talk about the "Nyquist limit", but it appears to be an informal term. They are usually referring to how much information you can pump through a system without it looking bad. This is a different thing to the two above, which talk about theoretical limits, not practical ones. As ever in game graphics, the academia can give you a guide and set limits, but in the real world we tend to make large assumptions and go with what looks good. But it's nice to know in this case that there is a real basis for the approximations that seem to work fine in practice.
A handy list of rules that I use to instantly decide if someone is an idiot coder or not. The way I decide this is that I tell them the list of rules, and if they're offended, then they're an idiot. If they give me useful arguments against any of my rules then they're not idiots, they are thoughtful but misguided. If they agree with all the rules instantly, they're sheep. If they argue but then agree, they are now true disciples. It's a very handy classification system, since it completely eliminates the possibility of people who might be smarter than me.

This list will grow with time. The chance of anybody agreeing with the entire list therefore tends towards zero. Thus the number of idiots in the world grows with every update. Fear the blog.

* Double precision has no place in games. If you think you need double precision, you either need 64-bit fixed point, or you don't understand the algorithm. [[A matter of precision]]
* Euler angles are evil, unless you're actually modelling a tank turret. And even then they're evil, they just happen to be an evil hard-wired into every tank turret. Try not to spread TEH ~EV1L further than strictly necessary.
* Angles in general are evil. Any time you feel the urge to use angles, a vector formulation is likely to be more robust and have fewer transcendentals or branches. What that means is any time you have sin/cos/tan or their inverses inside inner loops, you need to [[finish your derivations|https://fgiesen.wordpress.com/2010/10/21/finish-your-derivations-please/]].
* If you can't write the assembly for the code you're writing, you're a danger to the project and other coders. Try something a bit more low-level until you do understand what you're writing.
* If you think you can remember precedence rules beyond +-*/, you're only 99% right. Also, nobody else can remember them, so your code is now 1% buggy, and 100% unmaintainable. Use brackets.
* ~OpenGL is crazy. ~DirectX is flaky, transient, confusing and politically-motivated, but it's not crazy.
* [[Scene Graphs|Scene Graphs - just say no]] are useless and complex. Occam's Razor makes mincemeat of them.
* [[Premultiplied alpha]] is the right way to do blending. If you're still doing lerp blending, you're 30 years out of date. Get some help converting your pipeline in [[Premultiplied alpha part 2]].
We all know and love loose octrees, right? Really useful things. Have been used for around ten years in games - about twice that long in raytracers. Well, some doofi at Intel have patented them. [[http://www.google.com/patents?id=sH54AAAAEBAJ&dq=7002571|http://www.google.com/patents?id=sH54AAAAEBAJ&dq=7002571]] Filed in 2002. Not only is this well after they were in common use, but it's two years after the publication of Game Programming Gems 1 where [[Thatcher Ulrich|http://www.tulrich.com/geekstuff/index.html]] wrote the definitive article on them. The title of the article is the fairly obvious one - "Loose Octrees". Not exactly difficult to google for, should you be investigating a patent application on the subject, and indeed the top page found is Thatcher's. Understandably, he's not that thrilled about this either: [[9th May 2007 entry|http://www.tulrich.com/rants.html]]

Normally I simply avoid reading about patents entirely, and I try not to pollute anyone else's brains with them because of the incredible abuses they are put to in this utterly broken system. But in this case it's too late - you already violated this one. You thought you were safe didn't you? Just because it was invented before most of us started programming, just because loads of people have used it in tons of shipped code, just because it's been discussed ad nauseam all over the internet, and just because it's been several years since actual publication in a popular hardback book - none of that stops some idiots from patenting the existing contents of your brain. But that's fine - the sooner someone gets sued by Intel for violation, the sooner the patent can be revoked //from orbit// for gratuitous and wanton disregard for prior art and obviousness. As Kevin Flynn put it - "the slimes didn't even change the name, man!".
Somebody asked for the slides to Michael Abrash's Pixomatic GDC 2004 talk the other day, and we at RadGameTools realised that they weren't online anywhere. So [[here they are|papers/pixomatic_gdc2004.ppt]] in a somewhat temporary place until we get them on the [[main site|http://www.radgametools.com/pixomain.htm]].

Someone also pointed out that the Dr. Dobbs Journal articles on Pixomatic are also available: [[part1|http://www.ddj.com/184405765]] [[part2|http://www.ddj.com/184405807]] [[part3|http://www.ddj.com/184405848]]. A fascinating journey into the world of the software renderer. Even in today's GPU-dominated world, it's not dead yet - and it's getting better.
A nice chap called Benedict Carter has been working on a fun zombies-vs-marines game called "Plague" - check the blog here: [[http://plague-like.blogspot.com/|http://plague-like.blogspot.com/]], and there's a download link there as well - you can get source or just a Windows EXE. Handy hint - the concussion grenades are really powerful, but have a huge range and you can easily kill yourself with them if you're not careful - stick with the incendiaries to start with (press the "9" key at the start of the game!). And before you start moaning it gets slow when you lob a bunch of molotovs around and set fire to huge forests - it's written in Python, and it's a fun home project, so shush.

Anyway, why do I mention this? Because he got some inspiration for the way fire spreads from an old article on [["Cellular Automata for Physical Modelling"|http://www.eelpi.gotdns.org/papers/cellular_automata_for_physical_modelling.html]] that I wrote a while back, and was kind enough to tell me about it. I'm still not super happy with the article, and particularly the way it conflates "temperature" with "heat energy" - they're not the same thing at all of course. But it works after a fashion, and it's fairly explicit about how you need to balance having just enough realism to create the effect you want, without too much realism that it's hard to tweak or too slow to run. Game physics is not like real physics and should not be!

So, fun to see some old ideas made into an actual enjoyable game. And of course always fun to see what people get up to in their spare time. My brain has been so full of the SuperSecretProject for so long, my outside projects have basically stopped - hence no blog posts for so long. But the veil of secrecy is slowly lifting, so hopefully I'll be able to talk more about that stuff in a bit.
Everyone should have something like this in a header file in every project. It just makes life simpler:

{{{
// MSVC flavour: "long" is 32 bits on this compiler and __int64 is its 64-bit extension.
typedef signed char sint8;
typedef unsigned char uint8;
typedef signed short sint16;
typedef unsigned short uint16;
typedef signed long sint32;
typedef unsigned long uint32;
typedef signed __int64 sint64;
typedef unsigned __int64 uint64;
}}}

...and you have one of these for each platform and/or compiler of course. Note the pedantic use of "signed" and "unsigned" because the default is not that well-defined for things like chars.

You then use "int" when you need something fast with a poorly-specified number of bits, but otherwise use one of the above. If you have data you're going to save and load, you ONLY use data types that have specified bit sizes. Anything else is just asking for trouble and woe. There's also a reasonable justification for having bool8 and bool32, since the size of "bool" is also not well-defined. Some people also define something like "Bool" to cope with older C-only compilers that don't support "bool" natively at all (many used to semi-officially support "BOOL").

You can also do the same with float32 and float64, though there's far less confusion there - "float" and "double" are very consistent across platforms.

Edit: Reaven Khali pointed me to a BOOST header file that does most of this for you: [[boost/cstdint.hpp|http://www.boost.org/doc/libs/1_38_0/libs/integer/cstdint.htm]]. I don't personally like the use of _t at the end, or making it implicit that ints are signed, but you can very easily typedef the names again to your own preference (some people prefer even shorter names like s8, u16, etc), but still leave the BOOST header to figure out all the compiler config problems for you.

Edit 2: Mark Baker (hi Mark!) pointed out that C99 has the same header as a standard lib file: [[Wikipedia entry on stdint.h|http://en.wikipedia.org/wiki/Stdint.h]]. They appear to use the same naming convention (not an accident, I'm sure). I don't know the pros and cons of using the <stdint.h> version versus the BOOST version, and //I don't want people to email me about it// because I am long past caring about code-standard wars (especially not the ~BOOST-is-awesome vs ~BOOST-is-the-devil one). I'm just writing down the options.
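
For what it's worth, here's a sketch of how you might keep the names from the top snippet while letting <stdint.h> do the per-platform work. It assumes a C11 compiler for the compile-time checks. It's one option, not a recommendation over the others.

```c
#include <limits.h>
#include <stdint.h>

/* Same names as the hand-rolled header, but sized by the standard library. */
typedef int8_t   sint8;
typedef uint8_t  uint8;
typedef int16_t  sint16;
typedef uint16_t uint16;
typedef int32_t  sint32;
typedef uint32_t uint32;
typedef int64_t  sint64;
typedef uint64_t uint64;

/* C11 compile-time checks: the build breaks if a size is ever wrong. */
_Static_assert(sizeof(sint8)  * CHAR_BIT == 8,  "sint8 must be 8 bits");
_Static_assert(sizeof(sint16) * CHAR_BIT == 16, "sint16 must be 16 bits");
_Static_assert(sizeof(sint32) * CHAR_BIT == 32, "sint32 must be 32 bits");
_Static_assert(sizeof(sint64) * CHAR_BIT == 64, "sint64 must be 64 bits");
```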

Edit 3: Sean Barrett solves problems. He solved this one as well: [[sophist.h|http://nothings.org/stb/sophist.h]]. Brian Hook also solves problems: [[POSH|http://www.bookofhook.com/poshlib/posh_8h.html]]

The most important thing is your code uses some sort of reasonably-named standard and isn't just littered with "unsigned int" that you have to struggle to search-and-replace later. Exactly what that convention is and what header you use to define it is easily changed later as you change platforms. So don't stop to think - just cut'n'paste the top snippet into your code RIGHT NOW and start using the types, and you can go investigate changing it later.
In work recently I've been dealing with fitting polynomial curves to a set of points. Normally for this job you'd reach for Mathematica or some other heavy-math package and use the right incantations. It would do Magic Math Stuff and spit out the right answer. The problem is that you didn't really understand the question, and so you don't really understand the solution, and certainly it's hard to do things like implement it at runtime in your own code, or explain why it doesn't work in some cases. So I do like to at least try to do it myself with my high-school math skills before reaching for the Math Wand.

The question here is - I have a function, and a bunch of data points, and they're not that badly-controlled, but I do need to approximate them with a nice simple polynomial I can put in something like a pixel shader. To keep it concrete, say it's a cubic. I'm quite happy to stick down a few control points myself and do the noise-smoothing approximation process by eye (humans are very good at this!), but how does that help me find the right numbers to put in the shader code?

In a shader, a cubic polynomial looks like this:
{{{
y = A + B*x + C*x^2 + D*x^3;
}}}
Actually, if you want the ultimate efficiency, you should write it this way to express its fundamental "fused multiply-add" nature. To us they're the same, but to ~IEEE754 rules they're not EXACTLY the same, so it sometimes pays to write it out very specifically.
{{{
y = A + x*( B + x*( C + x*( D ) ) );
}}}
OK... but that doesn't help me. I have four points (p1,q1) (p2,q2) (p3,q3) (p4,q4) that I want this curve to pass through. How do I find A, B, C and D? Intuition says there should be a solution - we have four constraints (the four values of q) and four constants (A, B, C, D). There should be a solution!

I have seen many people throw their hands up at this stage of problems like it and say "I'm not a mathematician - this is too hard for me". Except it's not - it's not actually difficult maths. And in fact there's plenty of mathematicians that will make a mess of this too, precisely because they know too much maths - so they'll start to use a higher-order sledgehammer to crack this fairly simple nut and soon you'll be #include-ing all sorts of libraries with legally intimidating licenses.

So now I've let the cat out of the bag and told you it doesn't require anything resembling graduate maths, let's smash our foreheads into it like the dirty stubborn hackers we are.

The truly stubborn way to approach this is to write out four equations and use high-school math to solve them simultaneously by adding and subtracting stuff, so:
{{{
q1 = A + p1*( B + p1*( C + p1*( D ) ) );
q2 = A + p2*( B + p2*( C + p2*( D ) ) );
q3 = A + p3*( B + p3*( C + p3*( D ) ) );
q4 = A + p4*( B + p4*( C + p4*( D ) ) );
}}}
So we can... um... multiply everything by... er... oh that's all going to be a mess I think. Maybe stubborn isn't all you need. This looks like the wrong path. Let's see if we can solve some simpler cases - see if that helps us.

Well... what if p1 was zero? Oh, then life gets much simpler, because from the first equation:
{{{
q1 = A + p1*( B + p1*( C + p1*( D ) ) );
   = A;
}}}
Well that was easy! Except p1 probably isn't zero. But what if we reformatted the polynomial so it was all relative to zero, i.e. rewrite it:
{{{
y = A + (x-p1)*( B + (x-p1)*( C + (x-p1)*( D ) ) );
}}}
So by using (x-p1) instead of x we can make all those extra terms vanish at x==p1. But this isn't the same A as before because we changed things. We could probably multiply it all out and find the old A from the new A, once we've found the new B, C and D. And we haven't found them yet. This is still looking ugly. Though - that's a neat trick multiplying things you don't like by (x-p1). What if we did it for all the other P values? What if we wrote the polynomial like this:
{{{
y = 
     f1 *          (x-p2) * (x-p3) * (x-p4) +
     f2 * (x-p1) *          (x-p3) * (x-p4) +
     f3 * (x-p1) * (x-p2) *          (x-p4) +
     f4 * (x-p1) * (x-p2) * (x-p3)           ;
}}}
Clearly this is still a cubic polynomial, and I can theoretically convert it back to the one I want to put in the shader as long as I find out f1-f4. The nice thing is that when you set x=p1 and y=q1, the second, third and fourth lines are zero and you just get a fairly simple thing to solve:
{{{
q1 = f1 *           (p1-p2) * (p1-p3) * (p1-p4);
}}}
Therefore:
{{{
f1 = q1 / (           (p1-p2) * (p1-p3) * (p1-p4) );
}}}
Best of all, it's symmetrical - I don't have to really figure out the other three lines, they're just obvious extensions of the first:
{{{
f2 = q2 / ( (p2-p1)           * (p2-p3) * (p2-p4) );
f3 = q3 / ( (p3-p1) * (p3-p2)           * (p3-p4) );
f4 = q4 / ( (p4-p1) * (p4-p2) * (p4-p3)           );
}}}
Symmetry often hints that we're on the right tracks. OK, so now I have f1-f4, how do I get A,B,C,D? Well, now we do have to multiply out and gather terms together for constants, x, x^2, x^3. But it's really not that hard, and you can still use lots of symmetry to help yourself. Taking each line of the new polynomial separately:
{{{
f1 *          (x-p2) * (x-p3) * (x-p4) = -f1*p2*p3*p4 + x * ( f1*(p2*p3+p3*p4+p4*p2) ) + x^2 * (-f1*(p2+p3+p4)) + x^3 * (f1)
f2 * (x-p1) *          (x-p3) * (x-p4) = -f2*p1*p3*p4 + x * ( f2*(p1*p3+p3*p4+p4*p1) ) + x^2 * (-f2*(p1+p3+p4)) + x^3 * (f2)
f3 * (x-p1) * (x-p2) *          (x-p4) = -f3*p1*p2*p4 + x * ( f3*(p1*p2+p2*p4+p4*p1) ) + x^2 * (-f3*(p1+p2+p4)) + x^3 * (f3)
f4 * (x-p1) * (x-p2) * (x-p3)          = -f4*p1*p2*p3 + x * ( f4*(p1*p2+p2*p3+p3*p1) ) + x^2 * (-f4*(p1+p2+p3)) + x^3 * (f4)
}}}
...and then gather vertically by powers of x to get:
{{{
A = -( f1*p2*p3*p4 + f2*p1*p3*p4 + f3*p1*p2*p4 + f4*p1*p2*p3 );
B = f1*(p2*p3+p3*p4+p4*p2) + f2*(p1*p3+p3*p4+p4*p1) + f3*(p1*p2+p2*p4+p4*p1) + f4*(p1*p2+p2*p3+p3*p1);
C = -( f1*(p2+p3+p4) + f2*(p1+p3+p4) + f3*(p1+p2+p4) + f4*(p1+p2+p3) );
D = f1 + f2 + f3 + f4;
}}}
Hey - that's actually pretty elegant! And of course if you want to get perf-crazy you can gather some terms together and so on, but sometimes it really is fine to just let the compiler do that for you. There's always the danger that you fail to [[Finish Your Derivations|http://fgiesen.wordpress.com/2010/10/21/finish-your-derivations-please/]], but in this case I don't see one (there is an obviously elegant way to express all this symbolically, but it's a bit dodgy because when you actually go to evaluate it you end up dividing by zero unless you pre-cancel terms, and if you allow those sort of shenanigans you can accidentally prove that 1=2)
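
Put together as C, the whole thing might look like the sketch below. The Cubic struct and the function names are mine. The arithmetic is a straight transcription of the f1-f4 and A-D formulas above, with no attempt at gathering terms.

```c
typedef struct { float A, B, C, D; } Cubic;

/* Finds the cubic through (p[0],q[0]) .. (p[3],q[3]).
   All four p values must be distinct or the divides blow up. */
static Cubic fit_cubic(const float p[4], const float q[4])
{
    float p1 = p[0], p2 = p[1], p3 = p[2], p4 = p[3];

    float f1 = q[0] / ((p1 - p2) * (p1 - p3) * (p1 - p4));
    float f2 = q[1] / ((p2 - p1) * (p2 - p3) * (p2 - p4));
    float f3 = q[2] / ((p3 - p1) * (p3 - p2) * (p3 - p4));
    float f4 = q[3] / ((p4 - p1) * (p4 - p2) * (p4 - p3));

    Cubic c;
    c.A = -(f1*p2*p3*p4 + f2*p1*p3*p4 + f3*p1*p2*p4 + f4*p1*p2*p3);
    c.B = f1*(p2*p3 + p3*p4 + p4*p2) + f2*(p1*p3 + p3*p4 + p4*p1)
        + f3*(p1*p2 + p2*p4 + p4*p1) + f4*(p1*p2 + p2*p3 + p3*p1);
    c.C = -(f1*(p2 + p3 + p4) + f2*(p1 + p3 + p4)
          + f3*(p1 + p2 + p4) + f4*(p1 + p2 + p3));
    c.D = f1 + f2 + f3 + f4;
    return c;
}

/* Evaluates the fitted cubic in the nested shader-friendly form. */
static float eval_cubic(Cubic c, float x)
{
    return c.A + x * (c.B + x * (c.C + x * c.D));
}
```

A quick sanity check is to sample a cubic you already know at four x values, fit those samples, and see that you get your original coefficients back to within float rounding.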

Also, what are the failure points? Well, the obvious ones are the divides when calculating f1-f4. They fail when any of the P values are the same as any of the other P values, e.g. p1==p2, and you divide by zero and get infinities all over the place. That's a pretty reasonable failure - you can't have a simple polynomial having two different values at the same input value. There's also clear numerical instability if two P values are close to each other - the polynomial is going to go pretty bonkers - it might actually go through the points you want, but its behaviour between them is going to go up and down all over the place. Other than that, it seems pretty well-behaved.
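
If you're doing this at runtime on data you don't control, a cheap guard up front is probably worth having. This is only a sketch: control_points_ok is a hypothetical helper, and min_gap is a tuning value you'd have to pick for your own data.

```c
#include <math.h>
#include <stdbool.h>

/* Returns false if any two P values are closer together than min_gap.
   The comparison is written so that a NaN input is rejected as well. */
static bool control_points_ok(const float p[4], float min_gap)
{
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++)
            if (!(fabsf(p[i] - p[j]) >= min_gap))
                return false;
    return true;
}
```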

Now of course the mathematically-inclined will point out that these are simply the products of some matrix multiplies or start (ab)using Kronecker deltas. That's all fine, but that's not the point of my post. Of course you folks can solve it, and many harder problems as well - that wasn't in doubt. The point is that for simple things like this, hackers like me don't have to go asking the maths gods for help - we can solve this on our own by applying some basic problem-solving skills. The key is knowing just how far your maths knowledge will spread, and the trick is to develop the intuition to know, as I did, that this was well-constrained enough that there probably was a simple solution, and to at least explore towards a solution without prematurely invoking post-high-school maths. This intuition can often be wrong - there are some really simple-seeming problems that are indeed evil bastards that need heavy number-crunching. But in this case the solution is so fast and so elegant we could perform it millions of times a frame, and that opens up all sorts of interesting applications. Whereas if you handed it off to Maple, you'd of course get the right answer, but it would be far slower, and when it failed it would not be particularly obvious why it failed or what to do about it.

For some easy visualisation (and to prove I didn't screw it up) I put the above into an Excel spreadsheet which you can play with [[here|Polynomial interpolation.xlsx]]. Change the values in orange, get the results in yellow, and look at the evaluation with the curve on the right.
What we think of as conventional alpha-blending is basically wrong. The SRCALPHA:INVSRCALPHA style blending where you do FB.rgb = texel.rgb * texel.a + FB.rgb * (1-texel.a)  is easy to think about, because the alpha channel blends between the texel colour (or whatever computed colour comes out the pixel shader - you know what I mean) and the current contents of the framebuffer. It's logical and intuitive. It's also rubbish.

What you should use instead is premultiplied alpha. In DX parlance it's ONE:INVSRCALPHA, or FB.rgb = texel.rgb + FB.rgb * (1-texel.a) (and the same thing happens in the alpha channel). Notice how the texel colour is NOT multiplied by the texel alpha. It's a seemingly trivial change, but it's actually pretty fundamental. There's a bunch of reasons why, which I'll go into, but I should first mention that of course I didn't think of this and nor is it new - it's all in a 1984 paper [[T. Porter & T. Duff - Compositing Digital Images|http://keithp.com/~keithp/porterduff/]]. But twenty years later, most people are still doing it all wrong. So I thought I'd try to make the world a better place or something.

Anyway, Porter & Duff talk about a lot of things in the paper so you might not want to read the whole thing right now, but the fundamentals of premultiplied alpha are easy. "Normal" alpha-blending munges together two physically separate effects - the amount of light this layer of rendering lets through from behind, and the amount of light this layer adds to the image. It ties the two together - you can't have a surface that adds a bunch of light to a scene without it also masking off what is behind it. Physically, this makes no sense - just because you add a bunch of photons into a scene, doesn't mean all the other photons are blocked. Premultiplied alpha fixes this by keeping the two concepts separate - the blocking is done by the alpha channel, the addition of light is done by the colour channel. This is not just a neat concept, it's really useful in practice.
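
Here are the two blend equations written out as C, as a sketch. The Rgba struct and function names are made up. The text only gives the RGB half of the conventional equation, so leaving framebuffer alpha alone there is my assumption.

```c
typedef struct { float r, g, b, a; } Rgba;

/* SRCALPHA:INVSRCALPHA - colour is scaled by alpha at blend time.
   Framebuffer alpha is left untouched (an assumption, see above). */
static Rgba blend_conventional(Rgba texel, Rgba fb)
{
    float inv = 1.0f - texel.a;
    Rgba out = { texel.r * texel.a + fb.r * inv,
                 texel.g * texel.a + fb.g * inv,
                 texel.b * texel.a + fb.b * inv,
                 fb.a };
    return out;
}

/* ONE:INVSRCALPHA - colour is added as-is, alpha only does the blocking. */
static Rgba blend_premultiplied(Rgba texel, Rgba fb)
{
    float inv = 1.0f - texel.a;
    Rgba out = { texel.r + fb.r * inv,
                 texel.g + fb.g * inv,
                 texel.b + fb.b * inv,
                 texel.a + fb.a * inv };
    return out;
}

/* Converts a conventional texel to premultiplied form. */
static Rgba premultiply(Rgba c)
{
    Rgba out = { c.r * c.a, c.g * c.a, c.b * c.a, c.a };
    return out;
}
```

Premultiply a texel and blend_premultiplied() gives the same RGB as blend_conventional() on the original. The interesting bit is that a premultiplied texel with colour but alpha=0 just adds light without blocking anything, which the conventional equation can't express.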

''Better compression''
Consider a semi-translucent texture with a variety of colours in it. With conventional blending, the colour of alpha=0 texels is irrelevant you think - because hey - they get multiplied by alpha before being rendered. But it's not true - consider when the bilinear filtering is half-way between RGBA texels (1,0,0,1 = solid red) and (0,0,1,0 = transparent blue). The alpha value is obviously interpolated to 0.5, but the colour is also interpolated to (0.5, 0, 0.5), which is a rather ugly shade of purple. And this will be rendered with 50% translucency. So the colours of texels with alpha=0 are important, because they can be "pulled in" by the filtering.
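
You can check the purple problem numerically. The sketch below takes the two texels from the example, filters half-way between them, and blends the result over a background, once the conventional way and once premultiplied. Names are hypothetical. Note that a premultiplied alpha=0 texel is necessarily (0,0,0,0), so there's no stray blue left to pull in.

```c
typedef struct { float r, g, b, a; } Rgba;

/* Linear interpolation of all four channels, as a bilinear filter does. */
static Rgba lerp_rgba(Rgba x, Rgba y, float t)
{
    Rgba out = { x.r + (y.r - x.r) * t, x.g + (y.g - x.g) * t,
                 x.b + (y.b - x.b) * t, x.a + (y.a - x.a) * t };
    return out;
}

/* Multiplies colour by alpha, giving the premultiplied form of a texel. */
static Rgba premultiply(Rgba c)
{
    Rgba out = { c.r * c.a, c.g * c.a, c.b * c.a, c.a };
    return out;
}

/* Conventional path: filter straight-alpha texels, then SRCALPHA:INVSRCALPHA. */
static Rgba filtered_conventional(Rgba t0, Rgba t1, Rgba bg)
{
    Rgba m = lerp_rgba(t0, t1, 0.5f);
    Rgba out = { m.r * m.a + bg.r * (1.0f - m.a),
                 m.g * m.a + bg.g * (1.0f - m.a),
                 m.b * m.a + bg.b * (1.0f - m.a), bg.a };
    return out;
}

/* Premultiplied path: premultiply first, filter, then ONE:INVSRCALPHA. */
static Rgba filtered_premultiplied(Rgba t0, Rgba t1, Rgba bg)
{
    Rgba m = lerp_rgba(premultiply(t0), premultiply(t1), 0.5f);
    Rgba out = { m.r + bg.r * (1.0f - m.a),
                 m.g + bg.g * (1.0f - m.a),
                 m.b + bg.b * (1.0f - m.a),
                 m.a + bg.a * (1.0f - m.a) };
    return out;
}
```

Over a white background, the conventional path leaves some of the "invisible" blue in the result, while the premultiplied path gives a clean half-red-over-white.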

The answer with normal alpha-blending is to do a flood-fill outwards, where texels with non-zero alpha copy their colours to texels with zero alpha. This is a pain to write though, and of course some alpha=0 texels have multiple neighbouring texels, and they have to do an average of them or something heuristic like that.

Then you go to encode it as a ~DXTn texture, and you hit more problems. First of all, ~DXT1 can encode alpha=0 texels, but it can't encode them with any colour other than black. So all your flood-filling was pointless, and black is going to bleed in. I've actually seen this in shipping games - the edges of translucent stuff are bizarrely dark. For example, you get a brightly-lit green leaf rendered against a bright blue sky, so the result of any blending between them should be bright, but there's a dim "halo" around the leaves. Silly developer!

Now OK, ~DXT1 isn't all that useful for alpha textures because although the top-level mipmap might only have alpha=0 or alpha=1, as soon as you start to make mipmap levels you need some intermediate levels. So let's try ~DXT3 or ~DXT5. Same problems with both of those. First of all, the bleeding of course, so fine, you write a flood-filler, whatever. The next problem is when you come to compress the texture. Your flood filler has made sure neighbouring texels are something like the solid ones, but they will be different, especially if the flood-filler has had to blend them. But ~DXTn compression does badly as the number of colours increases, and the flood-filler just added texel colours. What's worse, they're not particularly significant colours (as mostly they're invisible, it's just the "halo" effect you're trying to prevent). But it's hard to tell most ~DXTn compressors about this, so they see a bunch of texels that have very different colours, and try to satisfy all of them. What happens visually is that you get significantly worse compression around the translucent edges of textures than in the opaque middle, which is very counter-intuitive.

OK Mr. Smarty-pants, what's the answer? Well, premultiplied alpha. All you do is before you compress, you multiply the colour channel by the alpha channel, and during rendering change blend mode from SRCALPHA:INVSRCALPHA to ONE:INVSRCALPHA. That's it - it's not rocket-science.
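As a rough sketch of that preprocessing step (Python, with texels as plain (r, g, b, a) tuples in the 0..1 range - the name {{{premultiply}}} is made up for illustration, and a real tool would do this on whatever image representation it already has before handing the result to the compressor):

```python
def premultiply(texels):
    """Convert straight-alpha (r, g, b, a) texels to premultiplied form."""
    return [(r * a, g * a, b * a, a) for (r, g, b, a) in texels]

texels = [
    (1.0, 0.0, 0.0, 1.0),  # solid red: unchanged by the multiply
    (0.0, 0.0, 1.0, 0.0),  # transparent blue: becomes transparent black
    (0.0, 1.0, 0.0, 0.5),  # half-transparent green: colour is halved
]
pma = premultiply(texels)
```

The alpha channel itself is left alone - only the colour channels get scaled.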

So what now happens is all the alpha=0 texels are now black. Wait - but you'll get bleeding and halos! No, you don't. Let's do the half-texel example again. Let's say we have an entirely red texture, but with some bits alpha'd, and we render onto a green background. You'd expect to get shades of red, green and yellow - but the darkest yellow should be around (0.5,0.5,0) - no dark halos of something like (0.25,0.25,0), right? So (1,0,0,1 = solid red) and (1,0,0,0 = transparent red). The second one gets premultiplied before compression to (0,0,0,0). Now we bilinear filter between them and get (0.5, 0, 0, 0.5). And then we render onto bright green (0,1,0).

FB.rgb = texel.rgb + (1-texel.a) * FB.rgb
= (0.5, 0, 0) + (1-0.5) * (0,1,0)
= (0.5, 0, 0) + (0, 0.5, 0)
= (0.5, 0.5, 0)
which is exactly what we were expecting. No dark halos. And since when alpha=0, the colour we WANT to encode is black, ~DXT1 does exactly the right thing. Lucky that, eh? Well no, not really - ~DXT1 is ''meant'' to be used with premultiplied alpha (even though the docs don't say this).
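If you want to poke at the numbers yourself, here's a small Python sketch of the same half-texel case done both ways. The helper names are mine, not from any API, and the "conventional" path assumes the transparent texel got stored as black (as ~DXT1 would force), which is the case that produces the halo.

```python
def lerp(a, b, t):
    """Linear interpolation between two tuples, standing in for the texture filter."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def blend_lerp(texel, fb):
    """Conventional SRCALPHA:INVSRCALPHA blend of an RGBA texel onto an RGB framebuffer."""
    a = texel[3]
    return tuple(c * a + d * (1.0 - a) for c, d in zip(texel[:3], fb))

def blend_pma(texel, fb):
    """Premultiplied ONE:INVSRCALPHA blend of an RGBA texel onto an RGB framebuffer."""
    a = texel[3]
    return tuple(c + d * (1.0 - a) for c, d in zip(texel[:3], fb))

green = (0.0, 1.0, 0.0)
solid_red = (1.0, 0.0, 0.0, 1.0)
transparent_black = (0.0, 0.0, 0.0, 0.0)

filtered = lerp(solid_red, transparent_black, 0.5)  # (0.5, 0.0, 0.0, 0.5)
halo = blend_lerp(filtered, green)   # colour multiplied by alpha a second time
clean = blend_pma(filtered, green)   # colour used as-is
```

With the conventional blend the red comes out at 0.25 instead of 0.5 - that's the dark fringe. With the premultiplied blend the very same filtered texel gives (0.5, 0.5, 0).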

Better still, when you encode the texture into ~DXT3, all the edge texels are black, whatever their neighbours are. So that's just a single colour for the encoder to fit, rather than a bunch of them. In general, this leads to more consistent compression quality for translucent textures.

By the way - ~DirectX has the two "premultiplied alpha" compressed texture formats ~DXT2 and ~DXT4. I'm not sure why they added them - they are rendered identically to ~DXT3 and ~DXT5. The fact that they're different formats is meant to be used as meta-data by the application using them. Thing is - you usually need far far more meta-data than that - is the texture a normal map, does the alpha channel hold translucency or is it a specular map instead, etc. Seems pointless to me. It looks like they're going away in ~DX10 which is good (~DXT2/3 become ~BC2, ~DXT4/5 become ~BC3). In general I'd avoid using ~DXT2 and ~DXT4 - I have seen (early) drivers that did something strange when you used them - I think it was actually trying to do the multiply for me, which is not what I wanted. Stick to ~DXT3 and ~DXT5 for sanity.

''Compositing translucent layers''
Some rendering techniques want to composite two translucent layers together, and then render the resulting image onto a third. The common one I've seen (and used) is for impostors. This is where you take a high-polygon object in the distance and render it to a texture once. Then, as long as the camera doesn't move too much, you can just keep drawing the texture as a billboard, without re-rendering the object itself and burning all that vertex & pixel shader power on a visually small object.

It's rather more complex than that, but the fundamental point is that you're doing a render to a texture of an image, then rendering that to the screen. The problem comes when the object itself has multiple translucent textures that might overlap. So the effect you're looking for is as if you did a standard render - background, then translucent tex1, then translucent tex2. The problem is, what you're actually doing is rendering tex1 then tex2 to an intermediate, then rendering that onto the background. Let's do some maths. Assuming normal alpha-blending, what we want is:

Render tex1:
fb2.rgb = (tex1.rgb * tex1.a) + ((1-tex1.a) * fb1.rgb)
Render tex2:
fb3.rgb = (tex2.rgb * tex2.a) + ((1-tex2.a) * fb2.rgb)
= (tex2.rgb * tex2.a) + ((1-tex2.a) * ((tex1.rgb * tex1.a) + ((1-tex1.a) * fb1.rgb)))

So now we try to emulate this with our impostoring:

Render tex1:
impostor1.rgb = tex1.rgb
impostor1.a = (1-tex1.a)
Render tex2:
impostor2.rgb = (tex2.rgb * tex2.a) + ((1-tex2.a) * impostor1.rgb)
impostor2.a = well ... er ... that's the question - what do I do with the alpha channel? Add? Multiply?
Render impostor to FB:
fb2.rgb = (impostor2.rgb * impostor2.a) + ((1-impostor2.a) * fb1.rgb)

...and you'll find that whatever blend modes you choose, you basically get a mess.

But if we use premultiplied alpha everywhere, it's a doddle. What we want is:

Render tex1:
fb2.rgb = (tex1.rgb) + ((1-tex1.a) * fb1.rgb)
Render tex2:
fb3.rgb = (tex2.rgb) + ((1-tex2.a) * fb2.rgb)
= (tex2.rgb) + ((1-tex2.a) * ((tex1.rgb) + ((1-tex1.a) * fb1.rgb)))
= (tex2.rgb) + ((1-tex2.a) * (tex1.rgb)) + ((1-tex2.a) * (1-tex1.a) * fb1.rgb)

And we're actually doing:

Render tex1 (impostor starts at (0,0,0,0)):
impostor1.rgb = (tex1.rgb) + ((1-tex1.a) * (0,0,0))
impostor1.a = (tex1.a) + ((1-tex1.a) * 0)
Render tex2:
impostor2.rgb = (tex2.rgb) + ((1-tex2.a) * impostor1.rgb)
= (tex2.rgb) + ((1-tex2.a) * tex1.rgb)
impostor2.a = (tex2.a) + ((1-tex2.a) * (impostor1.a))
= (tex2.a) + ((1-tex2.a) * (tex1.a))
Render impostor to FB:
fb2.rgb = (impostor2.rgb) + ((1-impostor2.a) * fb1.rgb)
= (tex2.rgb) + ((1-tex2.a) * tex1.rgb) + ((1-((tex2.a) + ((1-tex2.a) * (tex1.a)))) * fb1.rgb)

That scary-looking (1-((tex2.a) + ((1-tex2.a) * (tex1.a)))) bit looks horrendous but actually reduces to a nice simple ((1-tex1.a) * (1-tex2.a)), so the final result is:
fb2.rgb = (tex2.rgb) + ((1-tex2.a) * tex1.rgb) + ((1-tex1.a) * (1-tex2.a) * fb1.rgb)
...which is exactly what we want - yay! So you can see that premultiplied-alpha-blending is associative, i.e. ((a @ b) @ c) is the same as (a @ (b @ c)) (where @=the alpha-blend operation), which is incredibly useful for all sorts of image compositing operations.
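A quick numeric sanity check of that, as a Python sketch. The function name {{{over}}} and the texel values are arbitrary picks of mine; the only thing that matters is that the same ONE:INVSRCALPHA equation is applied to all four channels, alpha included.

```python
def over(src, dst):
    """Premultiplied-alpha blend (ONE:INVSRCALPHA) applied to R, G, B and A."""
    k = 1.0 - src[3]
    return tuple(s + d * k for s, d in zip(src, dst))

fb1 = (0.2, 0.4, 0.6, 1.0)       # opaque background
tex1 = (0.3, 0.1, 0.0, 0.5)      # premultiplied texel, drawn first
tex2 = (0.0, 0.25, 0.25, 0.25)   # premultiplied texel, drawn second

# Standard render: background, then tex1, then tex2.
direct = over(tex2, over(tex1, fb1))

# Impostor render: tex1 then tex2 into a cleared (0,0,0,0) target,
# then that target onto the background.
impostor = over(tex2, over(tex1, (0.0, 0.0, 0.0, 0.0)))
via_impostor = over(impostor, fb1)
```

Both routes land on the same framebuffer value (up to float rounding), which is the associativity property in action.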

''Multipass lighting techniques''
A common thing in rendering is to render an object multiple times, once per light. This means you can do shadowing with things like stencil volume shadows that can only deal with a single light per pass. The problem comes with translucent textures. You still want them to be shadowed, but the math doesn't work:

What we want is:
fb2.rgb = ((tex.rgb * (light1.rgb + light2.rgb))*tex.a) + ((1-tex.a) * fb1.rgb)

But pass 1:
fb2.rgb = (tex.rgb * light1.rgb * tex.a) + ((1-tex.a) * fb1.rgb)
Pass 2:
fb3.rgb = (tex.rgb * light2.rgb * tex.a) + ((1-tex.a) * fb2.rgb)
= (tex.rgb * light2.rgb * tex.a) + ((1-tex.a) * ((tex.rgb * light1.rgb * tex.a) + ((1-tex.a) * fb1.rgb)))
Oh dear, that's a mess. Let's try using additive blending for the second pass instead. After all, that's what we'd use for opaque objects.
fb3.rgb = (tex.rgb * light2.rgb) + (fb2.rgb)
= (tex.rgb * light2.rgb) + ((tex.rgb * light1.rgb * tex.a) + ((1-tex.a) * fb1.rgb))
= (tex.rgb * (light2.rgb + (light1.rgb * tex.a))) + ((1-tex.a) * fb1.rgb)

...which is closer, but still not right. OK, let's try with premultiplied alpha. What we want is:
fb2.rgb = (tex.rgb * (light1.rgb + light2.rgb)) + ((1-tex.a) * fb1.rgb)

Pass 1:
fb2.rgb = (tex.rgb * light1.rgb) + ((1-tex.a) * fb1.rgb)
Pass 2 uses additive blending, just like opaque objects:
fb3.rgb = (tex.rgb * light2.rgb) + (fb2.rgb)
= (tex.rgb * light2.rgb) + ((tex.rgb * light1.rgb) + ((1-tex.a) * fb1.rgb))
= (tex.rgb * (light1.rgb + light2.rgb)) + ((1-tex.a) * fb1.rgb)

Magic! Again, it actually makes perfect sense if you think about each pass and ask how much light each one lets through, and how much light each adds, and then realise that actually there's three passes - the first one occludes the background by some amount, the second adds light1, the third adds light2. And all we do is combine the first two by using the ONE:INVSRCALPHA blending op.
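Here's the premultiplied version of the two-pass maths as a small Python sketch, just to check it against the single-pass result we wanted. All the colour and light values are made up, and the helper names are mine.

```python
def mul(a, b):
    return tuple(x * y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

tex_rgb, tex_a = (0.4, 0.2, 0.1), 0.5   # premultiplied texture colour and its alpha
light1 = (1.0, 0.5, 0.25)
light2 = (0.5, 0.5, 1.0)
fb1 = (0.2, 0.6, 0.8)

# What we want: both lights applied in one go.
wanted = add(mul(tex_rgb, add(light1, light2)), scale(fb1, 1.0 - tex_a))

# Pass 1: ONE:INVSRCALPHA - occludes the background and adds light1.
fb2 = add(mul(tex_rgb, light1), scale(fb1, 1.0 - tex_a))
# Pass 2: ONE:ONE additive - adds light2 and touches nothing else.
fb3 = add(mul(tex_rgb, light2), fb2)
```

{{{fb3}}} matches {{{wanted}}} to within float rounding, and further lights would just be more additive passes.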

''Additive and blended alpha in a single operation''
In some places in your game, you want "lerp" materials - standard alpha-blending ones that occlude the background to some extent and add their own colour in to an extent - for example, smoke particle systems are like this. In other places, you just want to do an additive blend without occluding the background - for example a fire particle system. So we have two very common blending modes usually called "add" and "lerp" or somesuch. But what if you want a material with bits of both? What if you want a single particle system that has additive flame particles turning into sooty lerping particles as they age? You can't change renderstate in the middle of a particle system, that's silly. Who can help us now? Why - it's Premultiplied Alpha Man - thank god you're here!

Because of course if you're using premultiplied alpha, what happens if you just set your alpha value to 0 without changing your RGB value? In normal lerp blending, what you get is a fully translucent surface that has no effect on the FB at all. In premultiplied alpha, you... er ... oh, it's additive blending! Blimey, that's a neat trick. So you can have particles change from additive to lerp as they get older - all you do is move the alpha value from 0 towards 1, and the texture colour from a fiery red/yellow towards a dark sooty colour. No renderstate changes needed, and it's a nice smooth transition. The same trick works for mesh textures where you want both glows and opaque stuff - one example is a neon sign where the actual tubes themselves and things like the support brackets or whatever are opaque, but the (faked) bloom effect from the tubes is additive - you can do both in a single pass.
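A minimal Python sketch of the flame-to-soot particle, assuming a simple linear fade driven by a normalised age. The colours and the {{{particle_colour}}} function are invented for illustration - a real particle system would have its own curves.

```python
def particle_colour(age):
    """Premultiplied (r, g, b, a) for a particle; age 0 = fresh flame, 1 = old soot."""
    fire = (1.0, 0.6, 0.1, 0.0)     # alpha 0: adds light, blocks nothing
    soot = (0.05, 0.05, 0.05, 1.0)  # alpha 1: dark and fully occluding
    return tuple(f + (s - f) * age for f, s in zip(fire, soot))

def blend_pma(src, fb_rgb):
    """ONE:INVSRCALPHA - the one blend mode used for every particle."""
    k = 1.0 - src[3]
    return tuple(s + d * k for s, d in zip(src[:3], fb_rgb))

background = (0.2, 0.2, 0.2)
young = blend_pma(particle_colour(0.0), background)  # background plus fire colour
old = blend_pma(particle_colour(1.0), background)    # background replaced by soot
```

At age 0 the result is a pure add (it can exceed 1.0 here, and would simply clamp in an 8-bit target); at age 1 it is a pure replace; in between you get a smooth mix of the two.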


So anyway, yeah, premultiplied alpha. Use it, love it, pass it on. Then maybe we'll only take another 20 years til everyone's doing this stuff correctly.
A long time ago I wrote a post on why [[Premultiplied alpha]] was the right way to do blending. It's worth a re-read, so go have a look. The summary is that PMA does two important things:

1. It makes texture filtering and mipmapping work correctly even when the textures have an alpha channel, and avoids black/purple edges.

2. It makes alpha-blending associative. That is, ((A blend B) blend C) == (A blend (B blend C)). This is an important property of many modern engines, especially when working with particle systems, impostor rendering, and some varieties of deferred rendering.

It's arguable that these are actually the same property expressed different ways, but in terms of how they are implemented in the hardware, they're still two very different bits of the chip and worth considering separately. And the fact that they're at "opposite ends" of the pipeline - one in texture filtering, the other in alpha-blending - actually has some interesting properties. In fact there's a bunch of places in the overall gamedev asset-to-pixels production pipeline that you can think of as being either traditional "lerp" blending, or using PMA. And to an extent you can mix and match them and convert from one to the other. There's lots of other styles of blending of course, notably additive blending (and PMA combines both lerp and additive blending in a single operation - see the other post for details), but just put that aside for the moment as it complicates this particular discussion.

Converting from lerp to PMA is pretty simple - you just multiply the RGB channels by the alpha channel. Converting the other way is harder, since you need to divide the RGB channels by the alpha. Obviously, two bad things can happen. The first is that alpha could be zero, in which case the new RGB values are undefined. Which wouldn't be a problem as any actual use of lerp blending then zeros them again - unless you want to filter them with other RGB neighbours, in which case it does matter - and this is exactly the problem with filtering non-PMA data mentioned above. The other problem is that the new RGB values can be bigger than 1.0, which is a problem if you want to store them in a limited-precision buffer. This happens when the data you have is expressing a blend that isn't just a lerp - it was more like an additive operation. In this case it's unsurprising that you have problems expressing it as a lerp! For both those reasons, let's just say that ~PMA->lerp conversion is full of problems, and only consider the lerp->PMA conversion.
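To make the two failure cases concrete, here's a Python sketch of the ~PMA->lerp direction. The function names are mine, and returning {{{None}}} for the alpha=0 case is just one arbitrary way of flagging "undefined".

```python
def unpremultiply(texel):
    """Try to convert a premultiplied (r, g, b, a) texel back to lerp form."""
    r, g, b, a = texel
    if a == 0.0:
        return None  # RGB is undefined: any colour times zero gave this texel
    return (r / a, g / a, b / a, a)

def fits_unorm(texel):
    """True if every channel fits in a 0..1 limited-range buffer."""
    return all(0.0 <= c <= 1.0 for c in texel)

plain_lerp = unpremultiply((0.5, 0.0, 0.0, 0.5))   # fine: (1.0, 0.0, 0.0, 0.5)
mostly_add = unpremultiply((0.8, 0.4, 0.0, 0.25))  # RGB comes out bigger than 1.0
pure_add = unpremultiply((0.8, 0.4, 0.0, 0.0))     # no answer at all
```

The second and third texels are perfectly good PMA data (mostly-additive and purely additive respectively), they just have no sensible lerp-form equivalent.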

So, the areas of the production and rendering pipeline to consider are the following:

1. Texture authoring in a paint package.

2. Export via various processes to the final shipping images on a disk (or card or streamed over the internet or whatever).

3. The textures stored in memory after being loaded off disk at runtime. These will be sampled & filtered by the GPU's texture units.

4. The math inside the shader pipeline.

5. The alpha-blend unit that blends shaded pixels with the rendertarget buffer.

In current pipelines, it turns out to be annoying and difficult to author data directly in PMA format. Paint packages don't really understand it, many artists have a tough time getting their heads around it, and it's just generally an uphill struggle, so in practice phase #1 almost always results in lerp-blended format data.

Many development studios then convert to PMA format almost immediately in their internal toolchain, and everything from stage #2 onwards is PMA. This is a fantastic way to do things, and gives you all the advantages of PMA - good filtering, good compression, a flexible rendering architecture, and everything is consistent and you never get confused about whether you're dealing with lerp or PMA data. This is the recommended way to do things if you can. However, there are some cases where this is not possible.

''When disk images might not be PMA''

The most obvious case is where for whatever reason you don't have a conversion step in the middle. The most common example of this is user-generated content, where users can supply their own textures and import them straight into the engine. Now you need to keep track of which image formats are PMA and which are lerp. Sometimes this can be as simple as a naming or format convention - for example maybe all DDS files are PMA because they've been through the asset-processing pipeline, but all ~TGAs are lerp because they've come straight out of a paint package. The other reason to do this is so that artists can skip a preprocessing step during development and get textures directly from art package to screen as quickly as possible (this is a good thing!). Either way, when you load a file, you may need to convert it to PMA after loading, and still keep the rest of the pipeline (#3 onwards) fully PMA. This is still good - it means most of the pipeline is still all PMA, and only the file loaders need to know about lerp blending. And that's fine because anything dealing with file formats is generally pretty messy anyway.

''When textures might not be PMA''

But there's cases where that won't work either. Sean Barrett raised this problem in a tweet that started this recent discussion - he has a texture with an alpha channel, but he's not using alpha blending, he's using alpha test instead - presumably against a value such as 0.5. In this case, he's after a nice clean "cutout" look with the full texture colour right up to the edge where the alpha-test happens. But he also wants texture filtering to get round curves on that alpha-test edge, rather than a pixellated look. If he does the PMA preprocessing, this will turn the texels off the edge to RGB=0, and that will be filtered inwards, turning his edges black, or at least darker. So this is a reason not to automatically PMA textures - because what the alpha channel represents is not the concept of blending, it's something else. Of course we have this all over the place in modern engines - we pack lots of arbitrary data in the four channels known as RGBA or XYZW that are nothing to do with colours, positions or directions. But in this case it's still slightly surprising because alpha test is a similar concept to blending.
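Here's roughly what goes wrong, as a Python sketch. I'm assuming an alpha-test threshold of 0.5 and a sample a quarter of the way from an inside texel to an outside one - both numbers are arbitrary, and the helper name is mine.

```python
def lerp4(a, b, t):
    """Linear interpolation between two RGBA texels, standing in for the texture filter."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

ALPHA_REF = 0.5  # assumed alpha-test threshold

inside = (1.0, 0.0, 0.0, 1.0)        # red texel inside the cutout
outside_lerp = (1.0, 0.0, 0.0, 0.0)  # outside texel, authored with the same red
outside_pma = (0.0, 0.0, 0.0, 0.0)   # same texel after PMA preprocessing

sample_lerp = lerp4(inside, outside_lerp, 0.25)
sample_pma = lerp4(inside, outside_pma, 0.25)

# Both samples pass the alpha test, so both pixels get written opaquely...
passes_lerp = sample_lerp[3] >= ALPHA_REF
passes_pma = sample_pma[3] >= ALPHA_REF
```

...but the lerp-form sample writes full red (1, 0, 0) while the PMA one writes (0.75, 0, 0), which is the darkened edge.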

There is another case where this can happen which I came across a while back. The texture was an image with alpha-blending data, but in some cases it would be used with the blending, and others it would be used without. Converting it to PMA would have broken the non-blended case, since all the blended areas would be darkened or black. Note that the first advantage of PMA format textures - that you don't get weird edges when the texture is filtered - doesn't matter here because the source art has already been specifically authored to make sure those alpha=0 pixels still contain sensible data when alpha-blending is turned off.

One solution to this dual-mode problem would be to create two copies of the texture - one converted to PMA and used for the blending case, and the other left alone and used when not blending (we could even not store the alpha channel at all, since it's not used). But that duplication is a waste of texture memory, and it complicates some asset-handling systems to have two copies of a texture around. The alternative - of storing it just once in lerp format and having a flag that warns me of that - is also complexity, but it's a far simpler thing to deal with. So that's a reason for stage #3 to not be pure PMA - to have some textures in PMA format and others in lerp format.

Even so, we can still keep most of stage #4 and #5 in pure PMA form by changing the shaders a little. After sampling a lerp-form texture, we convert to PMA form simply by multiplying RGB by the alpha value using shader code. Three multiplies is really cheap on even mobile hardware, so this isn't a big deal. Then the rest of the shader math is all done assuming PMA data, and this may include compositing with other textures, and those textures can be PMA or lerp as long as we also convert those appropriately after sampling. There's obviously the complication that we need to do the multiplies or not depending on the source texture type, and this can mean a significant growth in the number of shaders, but many engines already have to deal with this combinatorial explosion for other reasons anyway, so it's just one more option to include in that system. Note we've now lost the nice filtering properties of PMA textures, so the folks authoring them need to remember the old rules about edges and mipmaps.

''And the last two stages...''

But maybe there's some funky math in the shader that does require lerp data. I can't really think of an example, but never say never. Even then, it's obvious the RGB multiply could be done after that in the shader, and we still preserve the alpha-blend stage #5 in PMA form, and so still retain the nice associativity property, meaning we can mix and match shaders in a scene and not worry that some are using PMA blending and some are not.

If even adding that final multiply causes problems (maybe because of shader combinations), then we can change the alpha-blend mode to cope. As a reminder, PMA blending performs this operation to all four channels - sometimes called ONE:INVSRCALPHA blending:
{{{
for x=RGBA: Rendertarget(x) := Shader(x)*1 + Rendertarget(x)*(1-Shader(A))
}}}
It is important to note that the specified blending happens on the alpha channel as well, otherwise we lose the associativity property - the framebuffer alpha result is important because it can be used later to blend the framebuffer onto other things with PMA.

But if the shader doesn't want to multiply RGB by alpha before it sends it to the alpha-blend unit to do PMA blending, we could get the alpha-blend unit to do it for us. We just need to change the blending equation to this:
{{{
for x=RGB: Rendertarget(x) := Shader(x)*Shader(A) + Rendertarget(x)*(1-Shader(A))
for x=A:   Rendertarget(x) := Shader(x)*1         + Rendertarget(x)*(1-Shader(A))
}}}
Note the alpha channel didn't change! That's important. So this uses SRCALPHA:INVSRCALPHA blending for the RGB channels, in other words standard lerp blending. But it retains the PMA blending for the alpha channel. To do this uses a little-known feature of graphics cards usually called "separate alpha blending". It may be little-known, but it's been around a long time, and it's required for ~DirectX10 cards, so it's basically ubiquitous on desktop.

~DX9: set ~D3DRS_SEPARATEALPHABLENDENABLE to true and set ~D3DRS_SRCBLENDALPHA / ~D3DRS_DESTBLENDALPHA appropriately.
~DX10: specify the RGB and alpha blends separately in ~D3D10_BLEND_DESC
~DX11: specify the RGB and alpha blends separately in ~D3D11_RENDER_TARGET_BLEND_DESC1
~OpenGL: depending on version, use ~GL_EXT_blend_func_separate or the more modern ~glBlendFuncSeparate()
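In ~OpenGL terms the separate blend above would be {{{glBlendFuncSeparate(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, GL_ONE, GL_ONE_MINUS_SRC_ALPHA)}}}. To convince yourself it gives the same rendertarget contents as doing the multiply in the shader, here's a little Python emulation of the two blend equations. The function names and pixel values are mine, purely for illustration.

```python
def blend_pma(shader, rt):
    """ONE:INVSRCALPHA on all four channels; expects premultiplied shader output."""
    k = 1.0 - shader[3]
    return tuple(s + d * k for s, d in zip(shader, rt))

def blend_separate(shader, rt):
    """SRCALPHA:INVSRCALPHA on RGB, ONE:INVSRCALPHA on A; expects lerp-form shader output."""
    a = shader[3]
    k = 1.0 - a
    rgb = tuple(s * a + d * k for s, d in zip(shader[:3], rt[:3]))
    alpha = a + rt[3] * k  # the alpha channel keeps the PMA equation
    return rgb + (alpha,)

rt = (0.1, 0.3, 0.5, 0.25)         # existing rendertarget pixel, in PMA form
straight = (0.8, 0.4, 0.2, 0.5)    # shader output that skipped the multiply
premult = (0.4, 0.2, 0.1, 0.5)     # the same colour with RGB already times alpha

result_a = blend_pma(premult, rt)
result_b = blend_separate(straight, rt)
```

The two results agree on all four channels, so the rendertarget stays in PMA form either way and can still be composited onto other things later.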

So there's lots of options for adopting PMA filtering and blending, and ways to adopt it bit by bit to satisfy some of the more traditional parts of the authoring pipeline. There's even a few cases where you actually don't want to force the entire pipeline to be all PMA all the time. But those are the exceptions rather than the defaults, and modern engines really should be using PMA as standard, or at the very least giving you the option. It's just the right thing to do and removes so many annoying problems with translucency. It's now thirty years since Porter & Duff's paper - let's get the job done properly.
It's the bit of circuitry on a video card that reads the framebuffer out of the RAM, puts it through a Digital to Analogue Convertor, and feeds it to your monitor. RAM+DAC you see. In the old days when VGA was a card, not a cable standard, that's pretty much all a video card was - some memory and a thing to output it to the monitor. People still use the term RAMDAC to refer to the chip that outputs DVI, even though it doesn't actually have a DAC (Digital to Analogue Convertor) in it.
Now with RSS feeds - which is handy in case I accidentally post something useful.

I'm a total newb on interweb tech, but RSS is fascinating - there appear to be at least five different versions, apart from the one called ATOM which isn't a different version, it's a completely different thing altogether (...and cue Airplane gag). "Open Standards" are such an oxymoron. Precisely because they're open, they're not standard. I mean, if they were standard, you couldn't change them at will, so they wouldn't be open. So the only truly Open Standard is one you just invented that nobody knows about yet.

If an Open Standard is RFC'd in a forest, but there's no slashdot post, does anyone file a submarine patent on it?
Hmmm... not that impressed by the automagic RSS feed generator on these TiddlyWikis. There's probably lots of cunning ways to tweak it with embedded commands, but I'd have to RTFM. The problem is that every time I touch a page it gets put to the top, even if it's just minor tweaks (e.g. adding tags or correcting typos), or adding tiddlers that are not directly part of the blog. Also, the "description" field isn't much of a description - it's the entire text of the entry. I'm sure that's not the intent of that field. So I might try doing it by hand for a bit - just add the blog bits myself. The format isn't exactly rocket-science now I have a template.

Tell you what, I'll set up both - the really verbose automatic one and a hand-done one that will just have the blog headlines. Then you can choose which you like best. By the way, the RSS feed for this page is the same as the web address but with .xml instead of .html. Firefox does that automagically and gives you a little orange icon to click, but it looks like a lot of other things don't, so I've added a manual link at the left hand side.

If you're still having problems with the RSS feed, yell. I'm not a big RSSer myself, so I don't really know what standard practice is for that stuff, and I'd like to tweak it now, get it right and then forget about it :-)
In other news - I think RSS is a good idea waiting for a decent implementation. There's two ways to do this - I can make the manually-edited feed have the complete text with the Wiki stuff edited out, but then you can't use the Wiki links, which is kinda the whole point of using a Wiki. Or I can continue to make it just have like a single-line description of the post, and any decent RSS reader should include the link to the full post you can click on if the one-line teaser intrigues you.

So, both of you out there using RSS, bet now! Bet bet bet bet bet! It's even easier now I added an email address to the menu on the left :-)
http://www.radgametools.com/  Where I used to work before joining Intel. Rad products include [[Bink|http://en.wikipedia.org/wiki/Bink_video]], Miles, Pixomatic and the thing I worked on for two years - [[Granny3D]].
Deano Calver has a neat article comparing ray-tracing and rasterisation. http://www.beyond3d.com/content/articles/94

I basically agree with it - ray-tracing has a limited number of things it's good at, and a large number of things it's very bad at, and a bunch of things people think it's good at that are not raytracing at all and can be applied to either system. The arguments I always have with people are about scalability - how well do the two methods scale as your amount of content rises?

To put it in a very simplistic way, raytracing takes each pixel and then traverses the database to see what object hit that pixel. Whereas rasterisation takes the database and traverses the screen to see what pixels each object hit. In theory they sound the same, but with the inner and outer loops swapped (pixels vs objects) - should be the same, right?

The problem is that rasterisation has some very simple ways of not checking every single pixel for each object, e.g. bounding boxes of objects and triangles, and rasterisation algorithms that know what shape triangles are. Raytracing also has a bunch of ways of not checking every object for each pixel, but those methods are not as simple for the same win, and can take time to reconstruct when the scene changes.

The other fundamental problem is the thing in the inner loop - changing from one pixel to the next is pretty easy for a rasteriser - pixels are pretty simple things, laid out in a nice regular grid. The rasteriser can keep around a lot of information from pixel to pixel. The raytracer on the other hand is stepping through objects in the scene - they're not inherently nicely ordered, and the state of object #1 says nothing about the state of object #2. So logically you'd want objects in your outer loop and pixels in your inner loop.
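To make the loop-swapping concrete, here's a toy one-dimensional version in Python: a 16-pixel "screen" and three objects that are just depth-sorted spans. Everything about it is invented for illustration, and the raytracer is the naive kind with no acceleration structure, so treat the test counts as a picture of the argument rather than a benchmark.

```python
WIDTH = 16
objects = [  # (first pixel, one-past-last pixel, depth, name)
    (2, 6, 5.0, "A"),
    (4, 12, 3.0, "B"),
    (10, 14, 7.0, "C"),
]

def raytrace(objects, width):
    """Pixels in the outer loop, objects in the inner loop."""
    image, tests = [None] * width, 0
    for x in range(width):
        nearest = None
        for (x0, x1, depth, name) in objects:
            tests += 1
            if x0 <= x < x1 and (nearest is None or depth < nearest[0]):
                nearest = (depth, name)
        image[x] = nearest[1] if nearest else None
    return image, tests

def rasterise(objects, width):
    """Objects in the outer loop, and only the pixels inside each object's bounds."""
    image, zbuf, tests = [None] * width, [float("inf")] * width, 0
    for (x0, x1, depth, name) in objects:
        for x in range(max(x0, 0), min(x1, width)):
            tests += 1
            if depth < zbuf[x]:
                zbuf[x], image[x] = depth, name
    return image, tests

rt_image, rt_tests = raytrace(objects, WIDTH)
ra_image, ra_tests = rasterise(objects, WIDTH)
```

Both produce the same image, but the naive raytracer does 48 inner-loop tests against the rasteriser's 16, because the rasteriser's "bounding box" is trivially cheap and its inner loop just steps along a regular grid.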

That's an absurdly simplistic way of looking at things of course, but it's a handy O(n)-notation way of thinking about it. You can then start digging and find lots of other reasons why one is slightly more efficient than the other (after all, bubblesorts are quicker than mergesorts in small lists in architectures with high branch penalties), but you still have to work very hard to overcome the ~BigO problem. And of course the real world shows a lot of harsh realities.

We'll put aside the massive dominance of hardware rasterisers, because it is somewhat of a local-minimisation problem, in that what the hardware companies know how to build is rasterisers, and they don't know how to build raytracers, so that's what they build, and therefore that's what people program for. But in the offline rendering world, where the hardware is the same - general-purpose ~CPUs - the dominant mechanism is still the rasteriser, in the form of the REYES system. It's not your standard rasteriser of course - there's a long way between what it does and what graphics cards currently do - but nevertheless the fundamental principle holds - for each object, they see what pixels it hits. Not the other way round.

Raytracing has its places - it's just not The Future. But then it's possible neither is rasterising - currently neither has a great solution to the lighting problem.
Ignacio Castaño recently looked into some explicit ordering methods for regular grids. [[http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/|http://castano.ludicon.com/blog/2009/02/02/optimal-grid-rendering/]]. It was kinda neat to see that [[my algorithm|http://www.eelpi.gotdns.org/papers/fast_vert_cache_opt.html]] does really well considering it's cache-size-blind. I'm not quite sure why the weakness at size=12, but when writing it I did observe that on large regular grids it would get a bit degenerate and start to do long double-wide strips. Real game meshes have sections of regular grid (e.g. 10x10 chunks), but those are small enough that the algorithm quickly hits discontinuities that kick it out of the double-wide mode and into the more Hilbert-like mode.

I have thought about removing what I think is the algorithm's biggest weakness - the first few triangles, when it doesn't have any data in the cache and it's basically making random choices. So my theory was - run the algorithm until half the mesh is done. Then throw the first 100 tris back in the "available" pool and run until the end.

The fun thing is that when I originally wrote the algorithm, it just needed to be ~OKish, it didn't need to be good. The real priority was it had to be fast and space-efficient, because it was running as part of the Granny3D export pipeline from Max/Maya, and any sort of half-decent ordering was better than the essentially random ordering we were getting. It was a nice accident that it turned out to be really good with only an extra day's work.
Someone just pointed out Kurt Pelzer's article in Game Programming Gems 4 called "Combined Depth and ID-Based Shadow Buffers". I don't have the book (which is stupid of me), so I can't check, but there's really no magic to the algorithm except the realisation that you //can// combine the two. After that it's all fairly obvious, so we probably have much the same technique. So I just reinvented somebody else's wheel, which is stupid of me - sorry about that Kurt.
Renderstate changes cost valuable clock cycles on both the CPU and GPU. So it's a good idea to sort your rendering order by least-cost. But you need to be careful you know what "least-cost" means - it's not just the number of ~SetRenderState calls you send down!

(note - I'm going to suggest some numbers in this section - all numbers are purely hypothetical - they are realistic, but I have deliberately avoided making them the same as any hardware I know about - they're just for illustration)

Typically, graphics hardware is driven by a bunch of internal bits that control the fixed-function units and the flow of the overall pipeline. The mapping between those bits and the renderstates exposed by the graphics API is far from obvious. For example, the various alpha-blend states look simple in the API, but they have large and wide-ranging implications for the hardware pipeline - depending on what you set, various types of early, hierarchical and late Z will be enabled or disabled. In general, it is difficult to predict how some render states will affect the pipeline - I have written graphics drivers, I have an excellent knowledge of graphics hardware, and I'm responsible for some right now, and I //still// have a hard time predicting what a change in a single renderstate will do to the underlying hardware.

To add to the complexity, a lot of modern hardware can have a certain limited number of rendering pipeline states (sometimes called "contexts", in a slightly confusing manner) in flight. For further complexity, this number changes at different points in the pipeline. For example, some hardware may be able to have 2 pixel shaders, 2 vertex shaders, 4 sets of vertex shader constants and 8 different sets of textures in flight at once (again, numbers purely for illustration). If the driver submits one more than this number at the same time, the pipeline will stall. Note the differing numbers for each category!

In general, there are some pipeline changes which are cheaper than others. These are "value" changes rather than "functional" changes. For example, changing the alpha-blend state is, as mentioned, a functional change. It enables or disables different things - the ordering of operations in the pipeline changes. This can be expensive. But changing e.g. the fog colour is not a functional change, it's a value change. Fog is still on (or off), but there's a register that changes its value from red to blue - that's not a functional change, so it is usually fairly cheap. Be aware that some changes that look like "value" changes can be both. For example, Z-bias looks like just a single value that changes - should be cheap, right? But if you change that value to 0, then the Z-bias circuitry can be switched off entirely, and maybe the pipeline can be reconfigured to go faster. So although this looks like a value change, it can be a functional change as well.

With that in mind, here's a very general guide to state change costs for the hardware, ordered least to most:

* Changing vertex, index or constant buffers (~DX10) for another one ''of the same format''. Here, you are simply changing the pointer to where the data starts. Because the description of the contents hasn't changed, it's usually not a large pipeline change.

* Changing a texture for another texture identical in every way (format, size, number of mipmaps, etc) except the data held. Again, you're just changing the pointer to the start of the data, not changing anything in the pipeline. Note that the "same format" requirement is important - in some hardware, the shader hardware has to do float32 filtering, whereas fixed-function (i.e. fast) hardware can do float16 and smaller. Obviously changing from one to the other requires new shaders to be uploaded!

* Changing constants, e.g. shader constants, fog colour, transform matrices, etc. Everything that is a //value// rather than an enum or bool that actually changes the pipeline. Note that in ~DX10 hardware, the shader constants are cheap to change. But in ~DX9 hardware, they can be quite expensive. A "phase change" happened there, with shader constants being moved out of the core and into general memory.

* Everything else - shaders, Z states, blend states, etc. These tend to cause widescale disruption to the rendering pipeline - units get enabled, disabled, fast paths turned on and off. Big changes. I'm not aware of that much cost difference between all these changes.

(note that changing render target is even more expensive than all of the above - it's not really a "state change" - it's a major disruption to the pipeline - first and foremost, order by render target)

OK, so that's the ''hardware'' costs. What about the ''software'' costs?

Well, there's certainly the cost of actually making the renderstate change call. But that's usually pretty small - it's a function call and storing a DWORD in an array. Do not try to optimise for the least number of ~SetRenderState calls - you'll spend more CPU power doing the sorting than save by minimising the number of calls. Total false economy.

Also note that the actual work of state changes doesn't happen when you do the ~SetRenderState. It happens the next time you do a draw call, when the driver looks at the whole state vector and decides how the hardware has to change accordingly - this is called "lazy state changes", and every driver does it these days. Think of it as a mini-compile stage - the driver looks at the API specification of the pipeline and "compiles" it to the hardware description. In some cases this is literally a compile - the shaders have to be changed in some way to accommodate the state changes. This mini-compile is obviously expensive, but the best drivers cache states from previous frames so that they don't do the full thinking every time. Of course, the driver still has to actually send the new pipeline to the card, and stall when the card has too many contexts in flight (see above). So what tends to happen is that on the CPU side, there's states that cause a new pipeline (these tend to be the expensive ones towards the end of the list above), and states that don't (these tend to be the cheap ones towards the start). But on the CPU side the difference in cost between the two ends of the list is much smaller than it is for the hardware.
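To make the "lazy" part concrete, here's a minimal sketch of the idea. None of this is any real driver or API - the struct, the member names and the state count are all invented for illustration. The point is just that setting a state is a compare and a store, and the expensive bit is deferred until a draw call finds something has changed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout - a toy, not a real driver.
enum { kNumStates = 8 };

struct LazyStateDevice
{
	uint32_t	aState[kNumStates];	// What the app last asked for.
	bool		bDirty;				// Has anything changed since the last draw?
	int			iPipelineBuilds;	// How many times the expensive bit ran.

	LazyStateDevice() : bDirty ( true ), iPipelineBuilds ( 0 )
	{
		for ( int i = 0; i < kNumStates; i++ )
		{
			aState[i] = 0;
		}
	}

	// Cheap: a compare and a store. No hardware work happens here.
	void SetRenderState ( int iState, uint32_t value )
	{
		assert ( ( iState >= 0 ) && ( iState < kNumStates ) );
		if ( aState[iState] != value )
		{
			aState[iState] = value;
			bDirty = true;
		}
	}

	// The deferred work happens at draw time, and only if something changed.
	void Draw ()
	{
		if ( bDirty )
		{
			// Stand-in for the "mini-compile" of the whole state vector.
			iPipelineBuilds++;
			bDirty = false;
		}
		// ...the draw itself would be submitted here...
	}
};
```

Three state calls followed by one draw cost one pipeline build, not three, which is why counting ~SetRenderState calls tells you very little.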

(note that although the SuperSecretProject is a software renderer, you still have a certain cost to state changes, and those costs are reflected in the above data. It's not exactly a coincidence that a software renderer and hardware renderers have similar profiles - they have to do the same sort of things. Our consolation is that we don't have any hard limits, so in general the cost of changing renderstates is much lower than hardware).

Of course it's also a good idea to sort front-to-back (for opaque objects), so that has to be reconciled with sorting by state. The way I do it is to sort renderstates hierarchically - so there's four levels according to the above four items. Generally I assign a byte of hash per list item, and I concatenate the bytes to form a uint32 (it's a lucky accident there's four items - I didn't plan it that way). Also note this hash is computed at mesh creation time - I don't go around making hashes at runtime.
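The byte-per-level key might look something like the sketch below. The function names are invented for this example, and which level goes in which byte is an assumption too - here the most expensive level sits in the top byte, so the two expensive levels can be pulled out together with a single shift.

```cpp
#include <cassert>
#include <cstdint>

// One byte of hash per cost level, packed into a uint32.
// Assumption of this sketch: cheapest level in the low byte, most
// expensive level in the top byte.
uint32_t PackStateKey ( uint8_t hashBuffers, uint8_t hashTextures,
                        uint8_t hashConstants, uint8_t hashEverythingElse )
{
	return ( (uint32_t)hashEverythingElse << 24 ) |
	       ( (uint32_t)hashConstants      << 16 ) |
	       ( (uint32_t)hashTextures       <<  8 ) |
	       ( (uint32_t)hashBuffers );
}

// Pulls one level's byte back out. Level 0 is the cheapest kind of change.
uint8_t LevelOf ( uint32_t key, int iLevel )
{
	assert ( ( iLevel >= 0 ) && ( iLevel < 4 ) );
	return (uint8_t)( key >> ( iLevel * 8 ) );
}

// The two expensive levels together, i.e. the last two items in the list.
uint16_t ExpensiveStateOf ( uint32_t key )
{
	return (uint16_t)( key >> 16 );
}
```

Since each byte is only a hash, two different states can collide on the same value, so treat a matching key as "probably the same state" rather than a guarantee.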

So now I have a bunch of hashed renderstates, what order do I put them in? What I generally do is assume the first two types of state change - changing constant buffers and texture pointers - are much cheaper than the others (also, you don't tend to change shader without changing shader constants and textures anyway, so it's a reasonable approximation). So I make some "sub-buckets" of the objects that share the same state in the last two items. In each sub-bucket I pick the closest object to the camera. Then I sort the sub-buckets according to that distance (closest first), and draw them in that order. Within each sub-bucket, I draw all the objects that have each renderstate in whatever order they happen to be - all I care about is getting all the objects for one state together. This gets me a fair bit of early-Z rejection without spamming the hardware with too many expensive changes.

In theory by sorting within each sub-bucket you could get slightly better driver/hardware efficiency, but I suspect the effort of sorting will be more expensive than the savings you get. I've not measured this, so this is just my experience talking. By all means if you have the time, test these hunches and let me know if my intuition is wide of the mark. It wouldn't be the first time.

Note that obviously I do the "first nearest object" as I'm putting the objects into the sub-buckets, not as a later sorting stage, and there's a bunch of other obvious implementation efficiencies. And I use a bucket-sort for Z rather than trying for high precision. So although it sounds complex (it's surprisingly difficult to explain in text!), the implementation is actually fairly simple and the runtime cost is very low.
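Since it really is easier to show than to explain, here's a rough sketch of the sub-bucket ordering. All the names are invented for this example, and it assumes the expensive state lives in the top 16 bits of a 32-bit key. It also cheats in two places to stay short: it finds the sub-bucket with a linear search, and it uses std::stable_sort on the nearest distance where the text above describes a bucket-sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawItem
{
	uint32_t	stateKey;	// Expensive state in the top 16 bits (assumption of this sketch).
	float		fDistance;	// Distance from the camera.
	int			iObject;	// Whatever identifies the thing to draw.
};

struct SubBucket
{
	uint16_t			expensiveState;
	float				fNearest;	// Distance of the closest object in this sub-bucket.
	std::vector<int>	objects;
};

static bool NearestFirst ( const SubBucket &a, const SubBucket &b )
{
	return a.fNearest < b.fNearest;
}

// Returns object IDs in draw order: sub-buckets nearest-first, and within
// a sub-bucket the objects in whatever order they arrived.
std::vector<int> BuildDrawOrder ( const std::vector<DrawItem> &items )
{
	std::vector<SubBucket> buckets;
	for ( size_t i = 0; i < items.size(); i++ )
	{
		assert ( items[i].fDistance >= 0.0f );
		const uint16_t state = (uint16_t)( items[i].stateKey >> 16 );

		// Linear search keeps the sketch short; real code would index by state.
		size_t b = 0;
		while ( ( b < buckets.size() ) && ( buckets[b].expensiveState != state ) )
		{
			b++;
		}
		if ( b == buckets.size() )
		{
			SubBucket fresh;
			fresh.expensiveState = state;
			fresh.fNearest = items[i].fDistance;
			buckets.push_back ( fresh );
		}

		// Track the nearest object while filling, not as a later pass.
		if ( items[i].fDistance < buckets[b].fNearest )
		{
			buckets[b].fNearest = items[i].fDistance;
		}
		buckets[b].objects.push_back ( items[i].iObject );
	}

	std::stable_sort ( buckets.begin(), buckets.end(), NearestFirst );

	std::vector<int> order;
	for ( size_t b = 0; b < buckets.size(); b++ )
	{
		order.insert ( order.end(), buckets[b].objects.begin(), buckets[b].objects.end() );
	}
	return order;
}
```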
STL is fine if you like that sort of thing, and I get what it's trying to do, but the actual implementation drives me up the wall because it's so unbelievably hard to debug anything about it - compile errors, runtime errors, performance. I do still use STL when it's doing something complex that I don't want to write - maps and hash tables are pretty handy - but when all I want is a frikkin auto-resizing array, and I want to use it in perf-sensitive code, using vanilla STL drives me nuts.

So over the years I - like every other decent coder on this planet - have reinvented my own particular brand of wheel. I'm not going to claim this is the best wheel ever. It's not even the best wheel I could make. But it is where I have ended up after a decade or so of "I really should take the time to fix this" fighting against the "I really should just frikkin shut up and use it". Consider it a point on the continuum, and use or abuse it as you will.

At the very least, if you don't already have one of these, for Turing's sake use this and stop hurting yourself. STL vector<> is hideously over-specced for such a simple thing.

Most of it is boringly obvious, included only for completeness. The things I particularly like are:

1. Explicit ordered and unordered functions for item removal. In an ideal world I'd specialize and make it so the ordered/unordered nature of the list was in its declaration, not just in the functions you use on it, but then I'd have twice as much code to maintain, so no thanks (and before you ask, inheritance and/or nested templates would probably solve that, but at the cost of my sanity).

2. The debug versions of the foreach() loops that do their very best to catch common bugs like adding or removing values inside the loop. Most iterators I've seen just happily work most of the time, and every now and then a resize will happen and they'll point to free'd memory and punch you in the face well after the add/remove operation, and you have no idea why. Since this is an unfortunately common bug I seem to code up all the time, I really like this addition, because it helps me not punch myself in the face.

3. Puns. Everything is better with puns.

Here's the code - steal as appropriate:

{{{
// An arbitrary-sized list template class.
// Holds ordered data, but some functions also have unordered versions for speed.
template <class T>
class ArbitraryList
{
	// Careful with changing these - needs to agree with ArbitraryListGrannyTypeDef
	int		iReservedSize;		// The current reserved size of the list.
	int		iSize;			// The current size of the list.
	T		*pT;				// The list.

public:

	// Optional initial size setting.
	void Init ( int iInitialSize = 0, int iInitialReservedSize = 0 )
	{
		pT = NULL;
		iSize = 0;
		iReservedSize = 0;
		if ( iInitialReservedSize > iInitialSize )
		{
			ReserveTo ( iInitialReservedSize, true );
			SizeTo ( iInitialSize, false );
		}
		else if ( iInitialSize > 0 )
		{
			SizeTo ( iInitialSize, true, true );
		}
	}

	void Fini ()
	{
		if ( pT == NULL )
		{
			ASSERT ( iReservedSize == 0 );
			ASSERT ( iSize == 0 );
		}
		else
		{
			ASSERT ( iReservedSize > 0 );
			ASSERT ( iSize >= 0 );
			delete[] pT;
			iReservedSize = 0;
			iSize = 0;
			pT = NULL;
		}
	}


	// Constructor, with optional initial size setting.
	ArbitraryList ( int iInitialSize = 0 )
	{
		Init ( iInitialSize );
	}

	// Destructor.
	~ArbitraryList ( void )
	{
		Fini();
	}

	// Returns the pointer to the given item.
	T *Item ( const int iItem )
	{
		ASSERT ( iItem >= 0 );
		ASSERT ( iItem < iSize );
		return ( &pT[iItem] );
	}

	const T *ConstItem ( const int iItem ) const
	{
		ASSERT ( iItem >= 0 );
		ASSERT ( iItem < iSize );
		return ( &pT[iItem] );
	}

	// Or use the [] operator.
	T &operator [] ( const int iItem )
	{
		return *Item ( iItem );
	}

	const T &operator [] ( const int iItem ) const 
	{
		return *ConstItem ( iItem );
	}

	// Returns the pointer to the first item.
	T *Ptr ( void )
	{
		return ( pT );
	}

	// Returns the size of the list
	int Size ( void ) const
	{
		return iSize;
	}

	// Returns the pointer to the last item plus one - same sense as STL end()
	T *PtrEnd ( void )
	{
		return ( pT + iSize );
	}

	// Grows or shrinks the list to this number of items.
	// Preserves existing items.
	// Items that fall off the end of a shrink may vanish.
	// Returns the pointer to the first item.
	// Set bBinAtZero to false if you don't want the memory to be freed,
	// even though the size is 0. This speeds things up for a list that is
	// frequently used, even though it's zeroed between uses.
	// Set bAllocExactly to true if you want iReservedSize to be exactly the same as iSize.
	T *SizeTo ( int iNum, bool bBinAtZero = true, bool bAllocExactly = false )
	{
		ASSERT ( iNum >= 0 );
		int iOldSize = iSize;
		iSize = iNum;

		if ( iNum == iReservedSize )
		{
			// Already have exactly the right space - all is well.
			if ( iReservedSize == 0 )
			{
				ASSERT ( pT == NULL );
			}
			else
			{
				ASSERT ( pT != NULL );
			}
			return pT;
		}
		else if ( iNum < iReservedSize )
		{
			// We have enough space.
			if ( ( iNum == 0 ) && bBinAtZero )
			{
				// Shrunk to 0 - bin the memory.
				delete[] pT;
				pT = NULL;
				iReservedSize = 0;
				return NULL;
			}
			if ( !bAllocExactly || ( iNum == iReservedSize ) )
			{
				// ...and we don't need to resize down.
				return pT;
			}
		}

		// We got here, so we need to resize the array in some way, either up or down.
		ASSERT ( iNum > 0 );
		int iNewSize = iNum;
		if ( !bAllocExactly )
		{
			// Usually grow by 50% more than needed to avoid constant regrows.
			iNewSize = ( iNum * 3 ) >> 1;
			if ( iNewSize < 8 )
			{
				iNewSize = 8;
			}
			ASSERT ( iNewSize > iReservedSize );
		}
		if ( pT == NULL )
		{
			ASSERT ( iReservedSize == 0 );
			pT = new T [iNewSize];
		}
		else
		{
			ASSERT ( iReservedSize > 0 );
			T *pOldT = pT;
			pT = new T[iNewSize];
			int iCopySize = iOldSize;
			if ( iCopySize > iNewSize )
			{
				iCopySize = iNewSize;
			}
			for ( int i = 0; i < iCopySize; i++ )
			{
				pT[i] = pOldT[i];
			}
			delete[] pOldT;
		}
		ASSERT ( pT != NULL );
		iReservedSize = iNewSize;
		return ( pT );
	}

	// Preallocates to at least this big.
	void ReserveTo ( int iNum, bool bAllocExactly = false )
	{
		ASSERT ( iNum >= iSize );
		if ( iNum <= iReservedSize )
		{
			return;
		}
		int iOldSize = iSize;
		SizeTo ( iNum, false, bAllocExactly );
		iSize = iOldSize;
	}

	// Frees all the array memory. Just a fancy way of saying SizeTo(0).
	void FreeMem ( void )
	{
		SizeTo ( 0, true );
	}

	// Removes the given item number by copying the last item
	// to that position and shrinking the list.
	// Looking for the old RemoveItem() function? This is the new name.
	void RemoveItemUnord ( int iItemNumber )
	{
		ASSERT ( iItemNumber >= 0 );
		ASSERT ( iItemNumber < iSize );
		pT[iItemNumber] = pT[iSize-1];
		SizeTo ( iSize - 1 );
	}

	// Removes the given item number by shuffling the other items up
	// and shrinking the list.
	void RemoveItemOrd ( int iItemNumber )
	{
		ASSERT ( iItemNumber >= 0 );
		ASSERT ( iItemNumber < iSize );

		const int iStopPoint = iSize-1;
		for ( int i = iItemNumber; i < iStopPoint; i++ )
		{
			pT[i] = pT[i+1];
		}
		SizeTo ( iSize - 1 );
	}

	// Adds one item to the list and returns a pointer to that new item.
	T *AddItem ( void )
	{
		SizeTo ( iSize + 1 );
		return ( &pT[iSize-1] );
	}

	// Adds the supplied item to the list.
	void AddItem ( T const &t )
	{
		*(AddItem()) = t;
	}

	// Inserts an item in the specified place in the list, shuffles everything below it down one, and returns a pointer to the item.
	T *InsertItem ( int Position )
	{
		ASSERT ( ( Position >= 0 ) && ( Position <= iSize ) );
		SizeTo ( iSize + 1 );
		for ( int i = iSize - 1; i >= Position + 1; i-- )
		{
			pT[i] = pT[i-1];
		}
		return ( &pT[Position] );
	}

	// Inserts an item in the specified place in the list, shuffles everything below it down one.
	void InsertItem ( int Position, const T &t )
	{
		*(InsertItem ( Position ) ) = t;
	}

	// Copy the specified data into the list.
	void CopyFrom ( int iFirstItem, const T *p, int iNumItems )
	{
		if ( iSize < ( iFirstItem + iNumItems ) )
		{
			SizeTo ( iFirstItem + iNumItems );
		}
		for ( int i = 0; i < iNumItems; i++ )
		{
			*(Item ( i + iFirstItem ) ) = p[i];
		}
	}

	// A copy from another arbitrary list of the same type.
	void CopyFrom ( int iFirstItem, const ArbitraryList<T> &other, int iFirstOtherItem = 0, int iNumItems = -1 )
	{
		if ( iNumItems == -1 )
		{
			iNumItems = other.Size() - iFirstOtherItem;
		}
		if ( iSize < ( iFirstItem + iNumItems ) )
		{
			SizeTo ( iFirstItem + iNumItems );
		}
		ASSERT ( other.Size() >= ( iFirstOtherItem + iNumItems ) );
		for ( int i = 0; i < iNumItems; i++ )
		{
			*(Item ( i + iFirstItem ) ) = *(other.ConstItem ( i + iFirstOtherItem ) );
		}
	}

	// A copy from another list, but it always adds new items to the end of the current list.
	void AddFrom ( ArbitraryList<T> &other, int iFirstOtherItem = 0, int iNumItems = -1 )
	{
		if ( iNumItems == -1 )
		{
			iNumItems = other.Size() - iFirstOtherItem;
		}
		int iFirstItem = iSize;
		SizeTo ( iFirstItem + iNumItems );
		CopyFrom ( iFirstItem, other, iFirstOtherItem, iNumItems );
	}

	// A simple find. Returns the list position, or -1 if not found.
	int FindItem ( const T &t )
	{
		for ( int i = 0; i < iSize; i++ )
		{
			if ( pT[i] == t )
			{
				return i;
			}
		}
		return -1;
	}

	// Reallocates the memory so that the capacity is exactly the same as the size.
	// Useful for a list that has been constructed, but will now remain the same size for a long time.
	void ShrinkToFit()
	{
		SizeTo ( iSize, true, true );
	}

	// Copy constructor.
	ArbitraryList ( const ArbitraryList<T> &other )
	{
		int iNumItems = other.Size();

		pT = NULL;
		iSize = 0;
		iReservedSize = 0;
		SizeTo ( iNumItems );
		for ( int i = 0; i < iNumItems; i++ )
		{
			// other is const, so this has to go through ConstItem().
			*(Item(i) ) = *(other.ConstItem(i) );
		}
	}

	// Assignment. Without this the compiler-generated version copies the raw
	// pointer, and both lists end up calling delete[] on the same memory.
	ArbitraryList<T> &operator = ( const ArbitraryList<T> &other )
	{
		if ( this != &other )
		{
			SizeTo ( 0, false );
			CopyFrom ( 0, other );
		}
		return *this;
	}
};


// See, it's foreach, except it's my "special" version, so I called it... I'll get me coat.

#ifndef DEBUG

#define forsytheach(T,it,list) for ( T *it = list.Ptr(), *last = list.PtrEnd(); it != last; ++it )
// Same, but easier syntax for lists-of-values that you don't mind copying.
// (ref is only reloaded while "it" is still in range, so the final step never reads past the end of the list)
#define forsytheachval(T,ref,list) for ( T *it = list.Ptr(), *last = list.PtrEnd(), ref = ((it!=last)?*it:T()); it != last; ref = ((++it!=last)?*it:ref) )
// Same, but easier syntax for lists-of-pointers-to-objects
#define forsytheachptr(T,ref,list) for ( T **it = list.Ptr(), **last = list.PtrEnd(), *ref = ((it!=last)?*it:NULL); it != last; ref = ((++it!=last)?*it:NULL) )

#else //#ifndef DEBUG

// Debug versions that try to self-check that you didn't modify the list in the middle of the loop.
// If you do, that can cause a reallocate, and then everything fails horribly.
// In those cases, you need to do a loop of NumItems from 0 to thing.Size() and inside the loop
// explicitly do Thing *pThing = thing[ThingNum]

#define forsytheach(T,it,list) for ( T *it = list.Ptr(), *first = list.Ptr(), *last = list.PtrEnd(); internal_functional_assert(first==list.Ptr()), internal_functional_assert(last==list.PtrEnd()), it != last; ++it )
// Same, but easier syntax for lists-of-values that you don't mind copying.
// (ref is only reloaded while "it" is still in range, so the final step never reads past the end of the list)
#define forsytheachval(T,ref,list) for ( T *it = list.Ptr(), *first = list.Ptr(), *last = list.PtrEnd(), ref = ((it!=last)?*it:T()); internal_functional_assert(first==list.Ptr()), internal_functional_assert(last==list.PtrEnd()), it != last; ref = ((++it!=last)?*it:ref) )
// Same, but easier syntax for lists-of-pointers-to-objects
#define forsytheachptr(T,ref,list) for ( T **it = list.Ptr(), **first = list.Ptr(), **last = list.PtrEnd(), *ref = ((it!=last)?*it:NULL); internal_functional_assert(first==list.Ptr()), internal_functional_assert(last==list.PtrEnd()), it != last; ref = ((++it!=last)?*it:NULL) )

#endif
}}}
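To show what the debug loops are guarding against without dragging the whole class along, here's a cut-down sketch. ~TinyList is a made-up stand-in for the list above (ints only, no copying), and ~LoopStillValid and ~AppendDoubles are invented names. The first does the same check the debug macros do each iteration, and the second is the index-style loop you should use when you need to add items while walking the list.

```cpp
#include <cassert>
#include <cstddef>

// Minimal growable int array - a stand-in so this sketch compiles on its
// own. Copying is deliberately not supported.
class TinyList
{
	int *pT;
	int iSize;
	int iReserved;
	TinyList ( const TinyList & );
	TinyList &operator = ( const TinyList & );

public:
	TinyList() : pT ( NULL ), iSize ( 0 ), iReserved ( 0 ) {}
	~TinyList() { delete[] pT; }

	int *Ptr() { return pT; }
	int *PtrEnd() { return pT + iSize; }
	int Size() const { return iSize; }

	int &operator [] ( int i )
	{
		assert ( ( i >= 0 ) && ( i < iSize ) );
		return pT[i];
	}

	void Add ( int value )
	{
		if ( iSize == iReserved )
		{
			// Growing moves the storage, so any old pointers now dangle.
			int iNewReserved = ( iReserved == 0 ) ? 2 : ( iReserved * 2 );
			int *pNew = new int[iNewReserved];
			for ( int i = 0; i < iSize; i++ )
			{
				pNew[i] = pT[i];
			}
			delete[] pT;
			pT = pNew;
			iReserved = iNewReserved;
		}
		pT[iSize++] = value;
	}
};

// The check the debug loops make each iteration: have the start or end
// of the list moved since the loop began?
bool LoopStillValid ( TinyList &list, const int *pFirst, const int *pLast )
{
	return ( pFirst == list.Ptr() ) && ( pLast == list.PtrEnd() );
}

// The safe way to add while iterating: loop on an index and go back
// through the list each time, so no pointer is held across an Add().
void AppendDoubles ( TinyList &list )
{
	const int iOriginalSize = list.Size();
	for ( int i = 0; i < iOriginalSize; i++ )
	{
		list.Add ( list[i] * 2 );
	}
}
```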
The range of names is mind-boggling.
SSE
~SSE2
~SSE3
~SSSE3
~SSE4
~SSE4.1 (maybe it's like ~SSE4 with an extra low-frequency channel?)
~SSE4.2
...and now ~SSE5

I'll be so glad when this instruction set is replaced by something more sensible. Although ~SSE5 does finally add ternary instructions and multiply-add.
I finally finished a fairly tedious but necessary part of [[Dwarf Minecart Tycoon]] - save and load. There's two parts to this - how, and why.

''Why?''

Well duh - so I can load and save games! No - it's not that simple of course. The real question is why do it now, when there's barely any game there? Why not wait until the end of the project when everything's more stable? The answer is that the ability to snapshot and restore the game gives you a bunch of really useful abilities that will make life simpler while developing the game. Of course you have the extra hassle of keeping the save/load system up to date as you change the game, but I still think these are worth the effort.

1. Easier testing. If I need to test the pathfinding abilities of a dwarf through a complex sequence of bridges and doors, I need that complexity in a map. If I have to create that map from scratch each time, I'm going to skimp on it and just hope the algo works. And as we all know, if you didn't test it, it doesn't work :-) Whereas with save/load I can just create the map once and reload each time as I'm testing or debugging.

2. Stress-testing. Once I have the dwarves doing good pathfinding in a variety of complex environments, I can periodically check that they still work as expected in these environments, and that a change I made recently to fix one bug didn't just break lots of other things. Indeed, I can do a complete rewrite of the system if I need to, but if it still behaves well in the fifteen different environments I have as savegames, then I can be reasonably sure I didn't miss any major features during the rewrite. Taken to the extreme, this is "test-driven development" - the idea that the only criterion for "good code" is "does it pass the tests". And if it passes the tests but still doesn't work in some other situation well then - you didn't have enough tests :-) I'm not quite that radical, but it's an interesting principle to bear in mind, and the principle does scale well to large teams and complex codebases.

3. Easy debugging. This is the big one for me. For complex games like this, bugs can hide and only be apparent many minutes after they actually happened. For example, the bug will manifest as two different dwarves walking to mine out the same section of rock, when in theory you should only allow one to do it. The bug happened when they both decided to walk over there (there should be some sort of marker or arbitration to make sure the second one gets on with some other task), but it only actually manifests when the second one gets there and finds no rock to dig! Now you're sitting in the debugger, you know what happened, but the actual bug happened tens of seconds and hundreds of game turns earlier, so good luck doing the forensics on that.

So the way we solved similar problems on [[StarTopia]] will work fine here as well. Every 15 seconds or so, you auto-save the game. Use a rotation of filenames so you have the last 10 or so saves around. Then, if you get a gnarly bug like this, you can always find an earlier savegame before the bug happens, then watch the respective dwarves as they make the incorrect decisions. The other key component to this is of course repeatability - if you load an earlier savegame and it doesn't happen because the dwarf rolled his RNG differently and goes to a different bit of rock, you'll never find the bug. So it's important to make sure the game is deterministic and repeatable. I've got another blog post on the boil about [[Deterministic gameplay]]. The only thing that isn't deterministic right now is the player - although the dwarves will follow existing queued commands and directives, I'm not recording and playing back the new commands that I issue. That's not too difficult, and I might do it in the future, but because in this style of game 99% of things (and therefore bugs) happen autonomously, it should still catch most of the bugs.
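The filename rotation is about as simple as it sounds. A minimal sketch, where the function names, the filename pattern and the slot count are all made up for illustration:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical naming scheme: autosave_00.sav to autosave_09.sav, reused
// in rotation so the last ten autosaves are always on disk.
const int kNumAutosaveSlots = 10;

// iAutosaveCount is how many autosaves have been written so far this session.
std::string AutosaveFilename ( int iAutosaveCount )
{
	assert ( iAutosaveCount >= 0 );
	char buf[32];
	std::snprintf ( buf, sizeof ( buf ), "autosave_%02d.sav", iAutosaveCount % kNumAutosaveSlots );
	return std::string ( buf );
}

// True when it's time for another autosave. All times are in seconds;
// fInterval would be about 15.
bool AutosaveDue ( float fNow, float fLastSave, float fInterval )
{
	return ( fNow - fLastSave ) >= fInterval;
}
```

With 10 slots at 15 seconds each, that's about two and a half minutes of history to rewind through when hunting a bug.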

''How?''

The obvious way to do save and load is to write a routine that traverses through your whole world writing every object into some data structure on disk. Doing this yourself is a pain in the arse. You need to figure out what to turn pointers into - in the file they need to be an object index, or an offset from the start of the file or something like that. You also need to at the very least version-number everything, because you'll almost certainly want to change your data structures as you develop the game, but still keep the ability to load older versions of save files. Keeping that code to load old files around is really tedious and it gets grubby pretty fast.

But I have a cunning plan! Since I used to be Mr Granny3D, I happen to know that Granny has a bunch of routines for doing 90% of the boring work, and it does it really well. It's so good that Granny can still load meshes, anim, etc files created from the very first release of Granny version 2, back in 2002. It does versioning by matching field names. If fields move that's no problem. If new fields are added it initializes them to defaults (zero or NULL), and if fields vanish it ignores them. It does all the pointer fixup for you - you can save directly from in-memory structures, and structures it loads are all ready to use. It handles 32<->64 bit transitions, and also endian changes. The good thing is the routines are designed to work for non-Granny data formats - you can define your own and use them for any data you like. Yeah, I know I sound like an advert, but it is very useful.

Anyway, I added granny_data_type_definition thingies to all my structures, and now Granny reads and writes them. But it's not quite that simple - still a few things to do. Granny will store out the canonical definitions of the base types, e.g. I have a train type called "Fast Red Train" which has stats for power, acceleration, sprite use, etc. And all of the trains of that type reference it. Granny will happily save that structure out, but when I load it back in, I don't want to create a new base type - they're meant to be hard-coded into the game. So I need to go through the train types, see which ones are of that type and map them to whatever the current game has hard-coded as the "Fast Red Train", even if it has slightly different stats. Of course if that name doesn't exist any more I can choose to map it some other way - closest similar stats or somesuch.

Probably the biggest hassle was the enums. I have a lot of enums in the game - peep types, object types, train types, order types, block types, etc etc etc. I'll certainly want to add to the list, and sometimes change the list, and maybe even rearrange the ordering. In the save file they're just stored as integers, so how do I deal with this? Well, I have a neat hack from [[Charles Bloom|http://cbloomrants.blogspot.com/]] and others to create both the enum and the list of enum strings at the same time. This is normally intended for printing the names out:

{{{
#define TrainState_XXYYTable           \
	XX(TS_Stuck) YY(=0),           \
	XX(TS_Going),                  \
	XX(TS_Waiting),                \
	XX(TS_Stopped),                \
	XX(TS_LAST)

#define XX(x) x
#define YY(y) y
enum TrainState
{
    TrainState_XXYYTable
};
#undef XX
#undef YY

#define XX(x) #x
#define YY(y)
const char *TrainState_Name[] =
{
    TrainState_XXYYTable
};
#undef XX
#undef YY
}}}

...and then you can do things like

{{{
printf ( "Movement type %s", TrainState_Name[Train->Movement] );
}}}

(there's more about enums in the post [[Wrangling enums]])

But you can also make part of the savegame state point to these arrays of strings and have Granny save them out as an array of strings. So now when you load the savegame structure in, Granny will have a table of the strings loaded. You can then compare the two tables with strcmp(). If they match, fine. If they don't you can construct an old->new mapping table and remap them. Obviously if an old enum is removed you need to write some code to deal with that, but that's not very common, and probably the cleanest thing to do is write the conversion code, convert all your old save games (by loading them and then saving them again) and then delete that code.
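Building the old->new table is only a few lines. This sketch assumes the enum values run contiguously from 0, so a value is also its index into the name table, and the function name is invented for the example. It doesn't touch Granny at all - it just takes the two string tables however you got hold of them.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Builds a table mapping old enum values (indices into the saved name
// table) to current enum values. An entry of -1 means the old name no
// longer exists and needs hand-written conversion code.
// Assumes enum values are contiguous from 0.
std::vector<int> BuildEnumRemap ( const char * const *ppOldNames, int iNumOld,
                                  const char * const *ppNewNames, int iNumNew )
{
	assert ( ( iNumOld >= 0 ) && ( iNumNew >= 0 ) );
	std::vector<int> remap ( iNumOld, -1 );
	for ( int iOld = 0; iOld < iNumOld; iOld++ )
	{
		for ( int iNew = 0; iNew < iNumNew; iNew++ )
		{
			if ( std::strcmp ( ppOldNames[iOld], ppNewNames[iNew] ) == 0 )
			{
				remap[iOld] = iNew;
				break;
			}
		}
	}
	return remap;
}
```

After loading, each stored enum value v in the world becomes remap[v], with the -1 case handled however makes sense for that enum.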

The biggest annoyance with Granny's loading routines is actually a positive feature of them in normal circumstances. When Granny loads a file, it does so using just one big allocation. That's a good thing for reducing fragmentation if the file is a mesh or an animation or something like that that is not usually manipulated by the game once loaded. The problem is in my case it's a world, and the parts of that world need to be mutable and separately allocated, so they can be separately freed and so on. I can't free memory in the middle of that big allocation - that'll cause nasty problems. The solution is I have a ~CopyWorld routine (and a ~CreateWorld and ~DestroyWorld). This does what it says on the tin - it copies the world. So the loading process is:

* Load Granny world file.
* Get Granny to do version converts if needed.
* Check if the enum string arrays changed and convert the in-object enums if so.
* Remap base types (e.g. point the loaded "Fast Red Train" to the global "Fast Red Train")
* ~CopyWorld from the loaded Granny one to create a new one made of lots of allocations.
* ~DestroyWorld on the currently running world.
* Free the world in the single block of memory Granny loaded.
* Switch to the world we just loaded.

Saving is the opposite:

* Copy the current world to a new one.
* Clear any cached data (no point in saving that and bloating the file).
* Point the "savegame" state at the enum string arrays so they get saved.
* Get Granny to write the file out.
* ~DestroyWorld on the copied version.

Wait... why do I need to copy the world first? Two reasons, both kinda minor. First - I want to nuke the cached data so Granny doesn't save that, and do some pointer fixup that maybe the real game won't appreciate. Second - after the copy, I could fork the savegame stuff into a separate thread. So if the Granny conversions or the disk access do cause thread stalls, they won't have much of an effect on gameplay speed, and that's actually kinda cool - it means you can have autosaves all the time without a nasty stutter as it does so. So yeah, not that important, but once you have a ~CopyWorld routine, you just want to use the sucker everywhere. OK, I'll admit it - it's actually ~DestroyWorld I like calling the most. "You may fire when ready."

''Are you sure this works?''

Oh yes. I'm well aware that one of the problems with normal save/load code is it doesn't get called much, so it doesn't get tested much. To fix this, I have a debug mode that every single game turn calls ~CopyWorld twice, calls ~StepGameTime() on the original and the first copy, compares the two to make sure they're the same, and then calls ~DestroyWorld on the two copies. Why make the second copy? Well, so if the comparison fails, I can step the second copy and watch what changes. Then if the two copies agree with each other but not with the original, I screwed up the ~CopyWorld routine. And if all three disagree, then I probably have a dependency on a global somewhere (e.g. a RNG I forgot to move into the world, or dependency on traversal order or somesuch). It's not fast - the game runs at about quarter speed - but it does function well enough to do it every now and then for a few minutes.
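Here's the shape of that debug mode as a toy. ~CopyWorld, ~DestroyWorld and ~StepGameTime are the names used above, but the signatures, the World struct and everything inside it are invented for this sketch - the real world obviously has rather more in it than two dwarves and an RNG seed. ~WorldsMatch and ~StepAndVerify are made-up helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy world. Everything that affects the simulation, including the RNG
// state, lives inside it - that's what makes a step repeatable.
struct World
{
	uint32_t			rngState;
	std::vector<int>	dwarfPositions;
};

World *CopyWorld ( const World *pSrc ) { return new World ( *pSrc ); }
void DestroyWorld ( World *pWorld ) { delete pWorld; }

bool WorldsMatch ( const World *pA, const World *pB )
{
	return ( pA->rngState == pB->rngState ) &&
	       ( pA->dwarfPositions == pB->dwarfPositions );
}

// One game turn: every dwarf takes a pseudo-random step.
void StepGameTime ( World *pWorld )
{
	for ( size_t i = 0; i < pWorld->dwarfPositions.size(); i++ )
	{
		// Simple LCG (the Numerical Recipes constants), state kept in the world.
		pWorld->rngState = pWorld->rngState * 1664525u + 1013904223u;
		pWorld->dwarfPositions[i] += (int)( pWorld->rngState >> 30 ) - 1;
	}
}

// Steps the real world and a copy, and reports whether they still agree.
bool StepAndVerify ( World *pWorld )
{
	World *pCopy1 = CopyWorld ( pWorld );
	World *pCopy2 = CopyWorld ( pWorld );	// Spare, for debugging a mismatch.
	StepGameTime ( pWorld );
	StepGameTime ( pCopy1 );
	const bool bMatch = WorldsMatch ( pWorld, pCopy1 );
	// On a mismatch, this is where you'd step pCopy2 in the debugger
	// and watch what it does differently.
	DestroyWorld ( pCopy1 );
	DestroyWorld ( pCopy2 );
	return bMatch;
}
```

Move rngState out of the struct and into a global and the check starts failing straight away, which is exactly the sort of mistake it's there to catch.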

Then to test save/load I do the obvious thing - I save the first copy, destroy the first copy, step the original world, load the first copy, step it, and compare it. Now if there are mismatches, I screwed up the save/load routines. I can run it so this save/destroy/load/compare happens every frame. It's not exactly quick (though not as slow as you'd think - about a timestep every half second in debug mode - Windows file caching seems to work pretty well!), but I have left games running for days with my little dwarves and trains running around with queued orders building, delivering, etc. It seems pretty robust.

''Conclusion''

It was a lot of rather tedious work, but actually not as horrible or hard to debug as it could have been. And using Granny's routines should make it moderately easy to maintain when I change the structures. And I'm glad I now have the easier debugging that autosaves and deterministic gameplay give you, because I've already had evil bugs like that and had to fix them through just thinking hard. I hate thinking hard, because you never really trust the result - maybe you fixed it, maybe you just changed the initial conditions enough to avoid it this time. I'd rather watch it actually happen in a debugger so I know exactly what to fix, then replay it and watch it not happen the second time.
A [[Scene Graph|http://en.wikipedia.org/wiki/Scene_graph]] is essentially a method of rendering where you place your entire world into this big graph or tree with all the information needed to render it, and then point a renderer at it. The renderer traverses the graph and draws the items. Often, the edges between the nodes attempt to describe some sort of minimal state changes. The idea is fairly simple - we're computer scientists, so we should use fancy data structures. A tree or graph is a really cool structure, and it's great for all sorts of things, so surely it's going to be good for rendering, right?

It's a great theory, but sadly it's completely wrong. The world is not a big tree - a coffee mug on a desk is not a child of the desk, or a sibling of any other mug, or a child of the house it's in or the parent of the coffee it contains or anything - it's just a mug. It sits there, bolted to nothing. Putting the entire world into a big tree is not an obvious thing to do as far as the real world is concerned.

Now, I'm not going to be a total luddite and claim that trees are not useful in graphics - of course they are. There's loads of tree and graph structures all over the place - BSP trees, portal graphs, skeletal animation trees, spatial octrees, partitioning of meshes into groups that will fit in memory (bone-related or otherwise), groups of meshes that are lit by particular lights, lists of objects that use the same shader/texture/post-processing effect/rendertarget/shadowbuffer/etc. Tons of hierarchical and node/edge structures all over the place.

And that's the point - there's tons of them, they all mean different things, and (and this is the important bit) they don't match each other. If you try to put them all into one big graph, you will get a complete bloody mess. The list of meshes that use shadowbuffer X is completely unrelated to the list of meshes that use skeleton Y. I've seen many Scene Graphs over the years, and typically they have a bunch of elegant accessors and traversers, and then they have a gigantic list of unholy hacks and loopholes and workarounds and suchlike that people have had to add to get them to do even the simplest demos. Try and put them in a big, real, complex game and you have a graph with everything linked to everything. Either that, or you have a root node with fifty bazillion direct children in a big flat array and a bunch of highly specialised links between those children. Those links are the other structures that I mentioned above - but now they're hard to access and traverse and they clutter up everything.

Fundamentally of course you do have to resolve a traversal order somehow - the objects need rendering, and there's some sort of mostly-optimal way to do it. But you're just never going to get that ordering by just traversing a single tree/graph according to some rules. It's fundamentally more complex than that and involves far more tradeoffs. Do you traverse front to back to get early-Z rejection? Do you traverse by shadowbuffer to save switching rendertargets? Do you traverse by surface type to save switching shaders? Do you traverse by skeleton to keep the animation caches nice and warm? All of this changes according to situation and game type, and that's what they pay us graphics coders the big bucks for - to make those judgement calls. You can't escape these problems - they're fundamental.
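
To make the tradeoff concrete, here is a small Python sketch that treats the scene as a flat list and picks a traversal order by choosing a sort key. The items, depths and shader names are invented, and counting shader switches is just one crude cost measure among the several mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawItem:
    name: str
    depth: float    # distance from the camera
    shader: str


ITEMS = [
    DrawItem("mug", 2.0, "glossy"),
    DrawItem("desk", 1.5, "wood"),
    DrawItem("lamp", 6.0, "glossy"),
    DrawItem("floor", 9.0, "wood"),
]


def front_to_back(items):
    """Order that favours early-Z rejection."""
    return sorted(items, key=lambda item: item.depth)


def by_shader(items):
    """Order that favours fewer shader switches, depth as a tie-break."""
    return sorted(items, key=lambda item: (item.shader, item.depth))


def shader_switches(ordered):
    """Count how often consecutive items use different shaders."""
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a.shader != b.shader)
```

On this toy data the front-to-back order costs two shader switches and the by-shader order costs one, but gives up the depth ordering across shaders. Neither falls out of walking a single tree; it is a judgement call about which key matters most.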

But trying to fit them all into one uber-graph seems like madness. This is compounded by the fact that most Scene Graphs are implicitly a "retained mode" paradigm, and not an "immediate mode" paradigm. My esteemed friend [[Casey Muratori|https://mollyrocket.com/forums/]] has some uncomplimentary comments on doing things with retained mode concepts, and I'm very inclined to agree. Yes, they can work, and sometimes they're necessary (e.g. physics engines), but it's not what I'd choose given an alternative.

The one major argument I hear presented for Scene Graphs is the "minimal state change" concept. The idea is that the edges between your nodes (meshes to be rendered) hold some sort of indication of the number or cost of state changes going from one node to the other, and by organising your traversal well, you achieve some sort of near-minimal number of state changes. The problem with this is it is almost completely bogus reasoning. There's three reasons for this:

1. Typically, a graphics-card driver will try to take the entire state of the rendering pipeline and optimise it like crazy in a sort of "compilation" step. In the same way that changing a single line of C can produce radically different code, you might think you're "just" changing the AlphaTestEnable flag, but actually that changes a huge chunk of the pipeline. [[Oh but sir, it is only a wafer-thin renderstate...|http://en.wikipedia.org/wiki/Mr._Creosote]] In practice, it's extremely hard to predict anything about the relative costs of various changes beyond extremely broad generalities - and even those change fairly substantially from generation to generation.

2. Because of this, the number of state changes you make between rendering calls is not all that relevant any more. It used to matter in the DX7 and DX8 eras, but it matters far less in these days of DX9, and it will be basically irrelevant on DX10. The card treats each unique set of states as an indivisible unit, and will often upload the entire pipeline state. There are very few //incremental// state changes any more - the main exceptions are rendertarget and some odd non-obvious ones like Z-compare modes.

3. On a platform like the PC, you often have no idea what sort of card the user is running on. Even if you ID'd the card, there's ten or twenty possible graphics card architectures, and each has a succession of different drivers. Which one do you optimise for? Do you try to make the edge-traversal function change according to the card installed? That sounds expensive. Remember that most games are limited by the CPU, not the GPU, and you've just added to the asymmetry of that load.
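
One way to picture the "indivisible unit" idea from point 2 on the application side is to intern each complete set of states and hand out one ID per unique combination. This is only a sketch of that bookkeeping, not a description of what any driver does, and the state names are invented.

```python
class StateCache:
    """Interns complete pipeline states so each unique combination gets one ID."""

    def __init__(self):
        self._ids = {}

    def intern(self, **states):
        # The whole set of states is the key; there is no notion of a
        # "small" difference between two entries.
        key = frozenset(states.items())
        return self._ids.setdefault(key, len(self._ids))

    def __len__(self):
        return len(self._ids)
```

Two states that differ only in one flag get two unrelated IDs, which matches the point that flipping a single renderstate can amount to a whole new pipeline.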

Anyway, this is something we can argue til the cows come home. I just wanted to give my tuppen'orth against the "prevailing wisdom" of using a Scene Graph. I'm not saying there aren't some good reasons for using one, but after writing a fair number of engines for a fair number of games, I have yet to find any of those reasons particularly compelling. They invented a lot of cool stuff in the 70s and 80s, and there's plenty of it that we continue to [[ignore at our peril|Premultiplied alpha]]. But some of it was purely an artifact of its time, and should be thwacked over the head and dumped in a shallow grave with a stake through its heart.
Wolfgang Engel and Eric Haines have managed to get the ~ShaderX2 books made [[free and downloadable as PDFs|http://www.realtimerendering.com/blog/shaderx2-books-available-for-free-download/]]. I wrote three articles that were included in ~ShaderX2 Tips & Tricks, and it's nice to have them "live" and on the net. I'd also like to convert them to HTML (as I have done some previous articles) and put them on my website - much easier to search than ~PDFs - but honestly I'm so busy with all things [[Larrabee]] that it may be some time before I work up the energy. But a quick self-review is possibly in order:

''Displacement mapping''

Oh dear - a lot of talking, and almost no content! It's a "Tom talks for a while about the state of the industry" article. Unfortunately, the industry still hasn't figured out how to do displacement mapping, which is immensely frustrating. We're still stuck in the chicken-and-egg of having to solve three difficult problems before we get any progress.

In my partial defence, all the content is in the very first article in ~ShaderX2 - "Using Vertex Shaders for Geometry Compression" by Dean Calver. We realised early on that our articles would overlap, and since Dean had shader code ready to go, he talked about that while I talked about tools. You can think of them as two parts of a larger article.

Aside from moaning about why hardware doesn't do displacement mapping (and it still doesn't!), I think the main goodness from the articles is [[Displacement Compression]], which I think is still an interesting technique. There's some more details on the algorithm in my [[GDC2003 paper|http://www.eelpi.gotdns.org/papers/papers.html]], and Dean shows some good shader code. Sadly, [[MuckyFoot]] dissolved before we were able to battle-test the pipeline described in the article, so we'll never know how elegantly it would have worked in practice. I always wanted to bolt the system into [[Granny3D]]'s export pipeline, but it just wasn't a natural fit. I still think it's an excellent way to render high-polygon images without waiting for [[Larrabee]] to change the world.

''Shader Abstraction''

It's always difficult to talk about engine design in articles without gigantic reams of code. I think this is one of my more successful efforts, but it's always hard to be sure. A lot of these concepts were expanded upon by others in the later ~ShaderX books (where I was sub-editor of the "3D Engine Design" section).

Probably the one part of this article that I haven't seen discussed elsewhere is the texture compositing system. That was used in a production pipeline (a prototype in [[Blade II]], and more extensively in the following project), and it worked well. The biggest benefit was that the artists could use any file and directory structure they liked, and they could copy and rename textures at will, store them in any format they liked, and so on. This made life much simpler for them, as they didn't need to constantly worry about bloating disk or memory space. When it was time to bake the assets together on the disk, all that duplication was removed by hashing, and the fiddly stuff like format conversion and packing specular maps into alpha channels all happened automatically. From discussing this with other game devs, I know others have invented similar systems - I've just never seen anybody write about them.
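
The de-duplication step can be sketched in a few lines of Python. This assumes SHA-256 over the raw bytes, which is my choice for the example and not necessarily what the original pipeline used; format conversion and channel packing are left out entirely, and the paths are invented.

```python
import hashlib


def bake(textures):
    """textures maps an artist-chosen path to raw file bytes.

    Returns (blobs, index): blobs holds each unique payload once, keyed by its
    content hash, and index maps every original path to the hash of its payload.
    """
    blobs = {}
    index = {}
    for path, data in textures.items():
        digest = hashlib.sha256(data).hexdigest()
        blobs.setdefault(digest, data)   # identical content is stored only once
        index[path] = digest
    return blobs, index
```

Because the key is the content and not the name, artists can copy and rename files freely and the baked output still carries one copy of each texture.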

''Post Process Fun with Effects Buffers''

This was the result of me trying to figure out how to structure an engine that did more than just the usual opaque and alpha-blended stuff. It adds a lot of structure on top of a standard pipeline. In one sense I'm not that happy with it - it turned out far more complex than I originally hoped, with lots of structures, dynamic allocation of rendertargets, and object callbacks. It feels like overkill for a couple of cheesy post-processing effects. On the other hand, once you have that complexity, it does make some things pleasingly elegant. What is still unproven is whether that extra complexity plays well with a full-blown game rendering engine rather than a simple demo. I still think there's some interesting ideas here.
GDC2007 was a lot of fun, though tiring. On the first day, I did two talks.

The first was about shadowbuffers. Yes, I know, again. What can I say - a GDC wouldn't be complete without me giving yet another talk about shadowbuffers - can't fight tradition. Anyway, for this one I figured out how to combine ID and depth shadowbuffers to get the best of both worlds - no surface acne, and no Peter Pan problems. Additionally, this is robust - it doesn't fall over according to the size of your world or the distance to the light or what objects are in the scene, which is a constant problem with depth-only shadowbuffers. As a nice bonus, it only needs 8 bits of ID and 8 bits of depth, so not only do you only need a 16-bit-per-texel shadowbuffer, but it works on pretty much every hardware out there. Certainly everything with shaders or register combiners (including Xbox1 and GameCube), and you could probably even get it to work on a PS2 if you tried hard enough. I think this problem can now be marked SOLVED, and combined with Multi Frustum Partitioning (which is still more expensive than I'd like, but does work), I think hard-edged shadows are also a SOLVED problem. Now we just have to make the edges nice and soft - plenty of methods in the running, but still no clear winner. Lots of people researching it though, so I think it shouldn't be too long.
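
The 16-bit arithmetic is simple enough to show. The sketch below only covers packing the two 8-bit values into one texel; putting the ID in the high byte is an assumption for illustration, and the actual shadow test from the talk is not reproduced here.

```python
def pack_texel(object_id, depth):
    """Pack an 8-bit object ID and an 8-bit depth into one 16-bit value."""
    if not (0 <= object_id <= 0xFF and 0 <= depth <= 0xFF):
        raise ValueError("ID and depth must each fit in 8 bits")
    return (object_id << 8) | depth


def unpack_texel(texel):
    """Split a 16-bit texel back into (object_id, depth)."""
    return (texel >> 8) & 0xFF, texel & 0xFF
```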

The other talk was only half an hour, and it was a brief look at ways to integrate lots of complex shaders into our pipelines. Most people are moving from just a few simple shaders in previous gen hardware up to the full combinatorial craziness of today's hardware. There's a lot of methods that work for 10s of shaders that just don't scale when you get to 100s. This talk explores the methods I've used in the past when throwing large numbers of shaders at multiple platforms. It's certainly not meant as gospel - there's many ways to skin this cat - but it should be a decent start for those bewildered by the options.

Both these talks are on my [[papers|http://www.eelpi.gotdns.org/papers/papers.html]] page. The shadowbuffer one also has a demo with it - let me know if you have trouble running it, it hasn't had much testing on other machines.

It was good to get those talks out of the way early on. After that I could focus on a day or two of stuff for the SuperSecretProject, and then relax and be a RadGameTools booth-babe for the rest of the show. And now I need some sleep.
Kinda silly, but the question keeps coming up - which is it? The answer is - both or either, add a space as desired. They're all the same thing, just depends who you ask. I prefer "shadowbuffer", because the word "buffer" implies a certain dynamic quality. Framebuffers and Z-buffers are single-frame entities, and dynamically updated by the hardware, just like shadowthingies. On the other hand vertex, index and constant buffers generally aren't, so it's not a universal quality. Shadowmaps are very much like texture maps - indeed, they're sampled with texture hardware, and there is a "mapping" between what is in them and the visible frame. However, I tend to think of "shadowmaps" as the reverse of "lightmaps" - they are calculated offline and not dynamically generated. Taken all together, I prefer "shadowbuffer".

As for the rest of the world - well, it's not my fault if they're wrong. But just for information, Google says:
"shadowbuffer": 1,160 hits.
"shadow buffer": 2,790,000 hits.
"shadowmap": 20,900 hits.
"shadow map": 32,600,000 hits.

So what do I know?
Phew! That was exhausting. My first Siggraph and it was right in the public eye. But I met a lot of very smart people, and got introduced to a lot of prominent academics - it's fun to be able to meet people after reading all their papers.

There was a serious amount of "Larrabuzz" around at Siggraph, and a lot of Intel people were being swamped by questions - most of which unfortunately we're not allowed to answer yet. If we were vague on some topics, we all apologize - it's tough remembering what information is not yet public. An added restriction was the length of talks - at Siggraph we had four talks on Larrabee totalling only 90 minutes, and there's only so much information you can pack into that time.

The main paper was "Larrabee: A ~Many-Core x86 Architecture for Visual Computing" (available from the ACM or your favourite search engine), and was presented by Larry Seiler on Tuesday. It outlined the hardware architecture and some of the reasoning behind the choices we made when designing it. It was extremely exciting to see this as an official Siggraph paper after surviving the whole rigorous process of peer-review and rebuttals. It's gratifying to know that the academic community finds this an interesting and novel architecture, and not just another graphics card.

On Thursday there were three talks on Larrabee in the "Beyond Programmable Shading" course. I talked about the Larrabee software renderer, Matt Pharr talked about how the architecture opened up some new rendering techniques, and Aaron Lefohn talked about the extension from there into general-purpose computing, including the term "braided parallelism" which I think is an awesome metaphor for the mix of thread and data parallelism that real world code requires. In that same course we had similar views from the AMD and Nvidia side of things, previews of the ~APIs that will be driving them from Microsoft and Apple, and also some fascinating glimpses into the future from Stanford, UC Davis, Dartmouth and id Software (cool demo, Jon!). All of us were under the gun on time limits and everyone had to drop content to fit, so well done to all of the speakers in the session - we managed to keep it on-schedule. [[All the notes are available for download here|http://s08.idav.ucdavis.edu/]] - my notes on the site are slightly longer and more detailed than the ones I spoke to (32 minutes instead of 20), so they're worth looking at even if you went to the talk.

Supplemental: also see the [[Siggraph Asia 2008]] versions of the talks for a different view on things.
Just noticed that the Siggraph Asia 2008 course notes for the "Beyond Programmable Shading" course are [[available for download|http://sa08.idav.ucdavis.edu/]]. At first glance they're a similar structure to the [[Siggraph 2008]] ones, but Tim Foley has some new slants on the material that are pretty interesting. He's highlighted a slightly different take on things, and it might give people some more insights into the nature of The Bee.
I don't know if Perforce is the best version-control system (please stop telling me about Git, people!), but it's perfectly decent, and it's the one I've had most experience with. They have a free version for under 20 people (a decent-sized studio!), which means every indie or bedroom coder has no excuse for not having decent source version-control.

BUT - the initial setup to just get version-control on an existing codebase on a single machine is unfriendly as hell. It works perfectly well, it's just got a ton of arcana to wade through. For various reasons this is now the third time in six months I've had to do this, and every time I forgot how to do it and had to re-learn from trial and error. So to save myself doing that again, I wrote it down!

Warning - I am NOT a P4 expert by any stretch, but this seems to work for me. If anybody who is an expert spots some problems, do [[EmailMe]] or [[TweetMe]] or whatever.

For the hobbyist coder, the question is - how do I put the existing file c:\~MyCode\~ProjectAwesome\main.cpp and so on into P4 in a depot that lives somewhere sensible such as C:\Users\Tom\Documents\~P4Depots? Why is this so difficult?

Start with some concepts. There's two different places that you'll have stuff:

1. The depot, where all the past-versioned files live. P4 likes to put these in places like C:\Program Files\Perforce\Server, which is where it installs itself, which is crap because THAT'S WHERE EXECUTABLES LIVE. I don't usually back those up constantly, especially since I can usually download or reinstall most executables. So instead I want to put it somewhere I'll remember to back up, e.g. C:\Users\Tom\Documents\~P4Depots

2. The actual code you're working on. This is your workspace. I'm assuming you already have a project somewhere like C:\~MyCode\... and you want to keep using that as your working codebase. Normally P4 expects it can name the workspace any old thing because you're getting your code from a large central repository, right? Except we're not - this is a little private project, and we just want to use the benefits of P4 without changing any paths - we already have a file structure and we don't want to change it.

Workspaces (used to be called "client specs" - still some documentation using that name) are by default named something like ~Tom_ComputerName_1234, and they will be in the directory C:\Users\Tom\Perforce\~Tom_ComputerName_1234. Then inside that will be the depot name as a subdirectory. And inside THAT are your working files. So it's a long way from something you actually want to use. However you do it, the format of a file under P4 is (unless you get deep into configuration land) always:

~WorkspaceRoot/~DepotName/files

You can change the workspace root, and you can change the depot name, but you can't stop them being smooshed together to make the filenames you want, so a little trickery is needed. If you have c:\~MyCode\~ProjectAwesome\main.cpp, you can do it two ways:

Workspace root: c:\~MyCode
Depot name: ~ProjectAwesome
Filename: main.cpp

However, if you also have ~ProjectExcellent, that would need to be a separate depot, and if you have a shared library in c:\~MyCode\~SharedSpace, that can't go in either. So my solution is:

Workspace root: c:\
Depot name: ~MyCode
Filename: ~ProjectAwesome\main.cpp
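
A quick way to convince yourself that both layouts land on the same file is to compose the three parts in code. This Python sketch just models the root/depot/files rule described above; it does not talk to Perforce, and {{{SharedSpace\vector.h}}} is a made-up file name.

```python
from pathlib import PureWindowsPath


def local_path(workspace_root, depot_name, filename):
    """Compose WorkspaceRoot/DepotName/files into one local path."""
    return PureWindowsPath(workspace_root) / depot_name / filename


# Layout 1: one depot per project.
per_project = local_path("c:\\MyCode", "ProjectAwesome", "main.cpp")

# Layout 2: one depot covering everything under c:\MyCode.
shared_depot = local_path("c:\\", "MyCode", "ProjectAwesome\\main.cpp")
shared_lib = local_path("c:\\", "MyCode", "SharedSpace\\vector.h")
```

Both layouts resolve to the same main.cpp, but only the second one has room for the shared library and a second project in the same depot.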

''Step by step, set up Perforce''

Go to Perforce's website, and download and install the latest versions of "~P4D: Server", and "~P4V: Visual Client". These packages also contain things like the command-line client, a decent merge tool and ~P4Admin - you don't need to download these separately.

* Run ~P4Admin
** Create a new account with "Tom" as every name, host is "localhost", port number 1666, no password. It will make you superuser - this is fine.
** A default depot called "depot" will be set up. You can right-click and delete it.
** Set up a new "local" depot: File/New/Depot
*** Name: ~MyCode (this must match the name of the directory you put your code in!)
*** Depot type: local
*** Storage location for versioned files. This should be the place you want to actually store the depot, where you'll remember to back it up, transfer to a new computer, etc. I use: "C:/Users/Tom/Documents/~P4Depots..." - make sure you include the drive spec. By default the depot location will look like: "~MyCode/..." which is relative to the server root, which is by default C:/Program Files/Perforce/Server. To put the depot in a specific place, type in the full filespec.
** Don't panic if you don't see anything in C:/Users/Tom/Documents/~P4Depots - there won't be anything until we add some files with ~P4V.
** That's it, you can now close ~P4Admin.
* Run ~P4V
** Server will be "localhost:1666"
** User is "Tom"
** Workspace: hit New and fill in:
*** Name: it will prompt you for something like ~Tom_ComputerName_1234. This is fine, and these namings are very distinctive to P4, and I find it helps to use the same format.
*** Workspace root: C:\  or whatever is the next-highest directory up from wherever you put your code.
** If you need to select a depot, select the one you created above, otherwise it will usually pick it by default.
** The big text box below is the "mappings" window. It's a list of server to client mappings, one per line. "..." means "all directories below". You should have only one line, and it should look like "//~MyCode/... //~Tom_ComputerName_1234/..." which means a mapping from the depot ~MyCode to the workspace ~Tom_ComputerName_1234.
** It will complain that C:\ already exists, but that's fine, so reply that no, you don't want to choose a different place.
	
A wizard may pop up saying the depot is empty and would you like to add some files. You can if you want, but it's better to do it in the main ~P4V tool itself, so you learn how to do it.

So now you're in the main ~P4V tool - split into two halves. You'll see on the left the depot/workspace view, so switch that to "workspace" with the tab at the top. This is just a view of your local drive, and will just show all your files in C:\. On the right side is the list of available workspaces, of which there should be just one called ~Tom_ComputerName_1234.

The left-side workspace view is cluttered, and seeing all the files on C:\ isn't actually that useful, because the only files that you can do anything with are inside the folder of the same name as the depot, which we named ~MyCode to match where your code already lives.

So go into that folder, and ctrl+LMB select the files you want under source control, and then right-click on them and select "Mark for Add". It will ask you for a changelist to add them to, and "default" is fine. You can do this multiple times for all the files you want to add.

On the right pane, click the "Pending" tab, and open the changelist called "default", and you should see all the files with little + signs, meaning they're about to be added. To actually perform the add, right-click on the changelist's name (i.e. "default") and click "Submit". It will ask for a checkin comment, and there you go - they're under version control.

If you go to the P4 depot which in the example above I put in C:/Users/Tom/Documents/~P4Depots, you will now see a bunch of directories and files called things like main.cpp,v - and that's how you know it's working (incidentally, in a disaster you can open these in a text editor to extract the original file - it's fairly obvious what the format of stuff is).

From here on out you can switch the left pane of ~P4V to "Depot" view which is a lot more compact than "Workspace" and only shows the files in the depot. Right-click on any of them to select "Check out", they go into the default changelist, edit them, and "Submit" when you're done. For more on using P4 in general, read the guides - they're actually pretty good once you get past this excruciating setup phase.

''Integration into Visual Studio (or other editors)''

You can now check out files in ~P4V, but it's a bit of a pain while you're coding - it would be nice to do that inside Visual Studio rather than Alt+Tabbing around. You could install the VS plugin, but it just doesn't seem to work well in practice. Instead, first get the command-line client working. From the start menu, run CMD, and in it type:

{{{
p4 set
}}}

This should list something like:

{{{
P4EDITOR=C:\Windows\SysWOW64\notepad.exe (set)
P4PORT=1666 (set)
P4USER=Tom (set)
}}}

What's missing is ~P4CLIENT, which describes your workspace (I told you they still used "client spec" interchangeably, and this is one of the places). So to add that, type:

{{{
p4 set P4CLIENT=Tom_ComputerName_1234
}}}

...or whatever the workspace name you chose. Type "p4 set" again to check it's in there (easy to forget the = sign), then open up Visual Studio. Open your project, click on the "Tools" menu then "External Tools..." then "Add" and fill in:

{{{
Title: P4 edit current file
Command: C:\Program Files\Perforce\p4.exe                      (or wherever you installed P4)
Arguments: edit $(ItemPath)
}}}

...and tick "Use Output window", and hit OK. Also make sure when Visual Studio tries to save a write-protected file, it does NOT just go ahead and write over it, instead it pops up a dialogue box. Hit cancel in this box, it will make that file the current one, and you just select "P4 edit current file" from the tools menu, and that checks it out for you - you should see a confirmation in the "Output" display looking like:

{{{
//MyCode/MyCode/ProjectAwesome/main.cpp#1 - opened for edit
}}}

Once you have your code compiling and running, check it in using ~P4V as usual. Pro-tip - take the time to do at least a small comment. I know it seems silly on a one-man project, but I promise you in three months' time you won't have a clue what checkin "jfdhgdfjhgjkghd" was, or know that it was the one that caused that bug that's been plaguing you for so long.

I hope this helps folks who are not part of a big corporation get basic version control working. It's such an essential tool for any coder out there, and Perforce have given you a free version, so even if you're just doing a single-person hobby project, I hope this quick guide got you through the arcana. Also, hey - Perforce - the free version's great, but you're not going to win any hearts and minds when it needs this sort of voodoo setup.
It's only pretending to be a wiki.
http://eelpi.gotdns.org/blog.wiki.html
There's a discussion I keep having with people, and it goes something like this:

"Hey Bob - how's it going at ~UberGameSoft?"
"Hey Tom - it's going well. I'm figuring out a neat HDR system, Jim's doing a cool physics engine, and Frank's got some awesome post-processing effects."
"Cool. So you guys must be well into production by now, right?"
"Kinda, but not really. We still have trouble with the whole export pipeline and there's a lot of pain getting assets into the game."

At this point, I usually start yelling at people and preparing another ranty blog entry. And thus...

There's a bunch of stuff that is neat, and it's geekily cool, and we as engineers all love to play with it and it generates some killer GDC talks and suchlike. But it's all unimportant next to the stuff that is necessary to ''ship the game''. We're now looking at team sizes of 50-150 people for 2-4 years, and that's a lot of assets.

You need the smartest coders in the building to be making sure that everyone else can work efficiently. And that means they're tools coders. Yes, "tools" - don't look at me like I said a dirty word.

The problem is that at heart a lot of us are still bedroom hackers. And that's a great mentality to have - it keeps us honest. But at those numbers, you have to have some of the dreaded word - "management". But we're smarter than the average bear, so we can use geek management, not [[PHB|http://en.wikipedia.org/wiki/Pointy_Haired_Boss]] management. And it's not management of people so much as management of data. Good people who like the work they do and the people they work with will generally manage themselves pretty well. The problems come when the data flowing between them sets up dependencies. Now obviously that's inevitable - but people can cope with obvious dependencies - like you can't animate a character very well until the rigging is done or whatever.

The problem is dependencies that aren't as obvious. They're obvious to us coders, because we're the ones who write the systems. And unfortunately the easiest way to write most systems is to have all the source data available at once, throw it in a big post-processing hopper and out comes the DVD image, and then we ship it. Fairly obviously, this is a disaster in practice. So instead we need to think carefully about the chain of dependencies and how we can isolate one person's changes from another, and allow people to work on related objects at the same time without getting stalled.
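
As a toy picture of what "thinking about the chain of dependencies" buys you, the Python sketch below works out which built assets are downstream of a single change, so that only those need rebuilding. The asset names and the graph are entirely hypothetical.

```python
from collections import deque

# Hypothetical asset graph: each key is built from the assets listed for it.
BUILT_FROM = {
    "hero.mesh": ["hero.max"],
    "hero.rig": ["hero.mesh"],
    "hero.anim": ["hero.rig"],
    "desk.mesh": ["desk.max"],
    "level1.pak": ["hero.anim", "desk.mesh"],
}


def needs_rebuild(changed, built_from=BUILT_FROM):
    """Return every asset downstream of `changed`, and nothing else."""
    users = {}
    for product, sources in built_from.items():
        for source in sources:
            users.setdefault(source, []).append(product)
    dirty, queue = set(), deque([changed])
    while queue:
        for product in users.get(queue.popleft(), []):
            if product not in dirty:
                dirty.add(product)
                queue.append(product)
    return dirty
```

In this example an edit to the desk touches the desk mesh and the level package but leaves the whole hero chain alone, which is the kind of isolation the big post-processing hopper cannot give you.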

This is almost exactly the other big problem coders are facing - parallel processing. And we all know that parallel processing is really really difficult, so we put our smartest coders on the problem. And the same should happen with the tools - it's an even bigger parallel processing challenge, and the problem is not as well-defined as "make the physics go faster" because half the time when you start a game you don't actually know where you're going.

So I'm going to do some more posts on some of the detailed aspects once I get them straight in my head. But the big message is really that the old days of the hero-coder saving the day with leet gfx skillz are over. Shaders just aren't that difficult to write, HDR is a matter of detailed book-keeping, and shadows may be a pain in the arse but they're not fundamentally going to make or break your game. Good asset management on the other hand can mean the difference between shipping and going bankrupt. That's what the new breed of hero-coder needs to focus on.
Like... in a proper academic paper and everything! This was two years ago, but I only just found it. They took sliding window VIPM and fixed one of the major problems - the poor vertex-cache efficiency. They also tried to improve the memory footprint, but weren't as successful with that. But anyway - real academics writing real papers based on my stuff. Totally cool! I've had citations before in related papers, but never anyone deriving stuff directly and primarily from my work.

[[Topology-driven Progressive Mesh Construction for Hardware-Accelerated Rendering|http://www.graphicon.ru/2004/Proceedings/Technical/2%5B4%5D.pdf]] - Pavlo Turchyn, Sergey Korotov

Yeah, OK, so they're from an improbably-named [[Finnish university|http://www.jyu.fi/]] and their paper was published in a Russian journal and it's not even listed on Citeseer. Whataddya want - a personal recommendation from Stephen Hawking? Jeez you people...
I've been writing Dwarf Minecart Tycoon for ages now, but it's always been a tiny world - 20x20x10 - which I could just fit in a simple array. But the time has come to make it a biiiiig world. I've been careful to use wrappers at all times so I can just swap out the representation and 90% of the game doesn't care, and I've also used integer and fixed-point coordinates so there's no precision problems near the origin (see [[A matter of precision]]).

But what representation to use? The obvious thought is an octree with each node having eight pointers to child nodes, and some of the pointers can be NULL, so you can store a huge sparse world. But to do random lookups is a LOT of pointer-chasing, and computers suck at those more and more as time goes on. The other problem with standard octrees is you tend to fill a lot of memory with pointers rather than actual data. One remedy is to have more than 8 children, so each node is larger than 2x2x2, for example 8x8x8. That's a third the number of traversals, and it's probably a wash in terms of memory usage.

You can also partition the world dynamically, e.g. k-D trees - which are like quad/octrees, but the split plane isn't always in the middle of the node, it's at an arbitrary point through the node. This lets you balance the tree, so you have the minimum number of walks to get to any node. On the other hand, I ''loathe'' writing tree-balancing code, and the point of doing this project is to have fun, so I'm going to arbitrarily discard these options. (if you don't dynamically rebalance, there's not much point and you might as well use a regular quad/octree).

Another way is to use a hash table instead of a structured tree. Take your world x,y,z coordinate, hash the bits of the coordinates together to make an index, look that up in a largish array, and there's a linked list hanging off that array of the possible nodes that have that hash (and if your hash is good, these nodes will not be anywhere near each other in the world). One really nice thing about this is that there's no "edge" to the world. For an octree, the top node needs to cover the whole world extents, so if you want a bigger world you have to add more layers of nodes, so more pointer-chasing. With a hash table, it all just works - it doesn't know or care how far the world is from one side to the other (assuming it fits in the int32 or int64 you use as x,y,z coordinates) - by definition the hash is modulo the size of the table, and then all it's doing is walking the list looking for the matching coordinate - the coordinates don't have to "mean" anything.
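As a rough sketch of that hashed lookup (not the DMT code - the names ~IntPos3, ~HashPos, Chunk and ~ChunkTable are made up for illustration, and the multipliers are just arbitrary large odd constants):

```cpp
#include <cassert>
#include <cstdint>
#include <forward_list>

struct IntPos3 { int32_t x, y, z; };

static const int kTableSizeLog2 = 8;
static const uint32_t kTableSize = 1u << kTableSizeLog2;

// Mix the coordinate bits together, then reduce modulo the table size.
// The multipliers are arbitrary large odd constants; any decent bit-mixer would do.
inline uint32_t HashPos(const IntPos3 &p)
{
    uint32_t h = (uint32_t(p.x) * 73856093u) ^ (uint32_t(p.y) * 19349663u) ^ (uint32_t(p.z) * 83492791u);
    h ^= h >> 16;                   // fold the high bits into the low ones
    return h & (kTableSize - 1);    // table size is a power of two, so this is the modulo
}

struct Chunk { IntPos3 pos; int payload; };

struct ChunkTable
{
    // One linked list per hash bucket.
    std::forward_list<Chunk> buckets[kTableSize];

    // Walk the bucket's list looking for an exact coordinate match.
    Chunk *Find(const IntPos3 &p)
    {
        for (Chunk &c : buckets[HashPos(p)])
            if (c.pos.x == p.x && c.pos.y == p.y && c.pos.z == p.z)
                return &c;
        return nullptr;
    }

    Chunk &FindOrAdd(const IntPos3 &p)
    {
        if (Chunk *c = Find(p))
            return *c;
        std::forward_list<Chunk> &bucket = buckets[HashPos(p)];
        bucket.push_front(Chunk{p, 0});
        assert(Find(p) != nullptr);
        return bucket.front();
    }
};
```

Note that nothing in there knows how big the world is - a chunk at (2000000000, -2000000000, 7) costs exactly the same to find as one at the origin.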

But one of the things I'd like to do is be able to run some parts of the map slower than others. Where the dwarves are running around, you want high-speed updates - things are falling, burning, water flowing, possibly air currents, etc. But away from there not much is happening - the water has settled to a level, plants are growing very slowly. So it would be good to put some large areas on a very low update rate. When doing cellular-automata stuff in the past, I found having a hierarchy of update speeds was really useful, so it would be good to use the hierarchy of something like an octree for this purpose as well. But a hash table on its own won't give you this, precisely because a good hash tries to avoid any spatial locality.

For really huge worlds, you need to start conserving memory - you can't store full detail for every occupied cube of the world. A trick to conserve memory for dense but dull areas (e.g. deep in the bowels of the earth where you haven't dug yet, or in the middle of the sea) is to take a volume and represent it in a compressed format. For example, you can use the octree but instead of going all the way down to the single node you stop at a higher level and store the whole thing as just one sort of block - "it's the sea". Or you can have your leaves be a decent size e.g. 8x8x8 and store that in an RLE fashion (bonus points for doing RLE with Hilbert or Morton walk patterns - though I'm not sure it actually makes much difference in this context). If the compressed format is too slow to access all the time, you could decompress to a more accessible format when creatures actually go down there, and then recompress when they leave. Hey - this ties nicely into the "low-frequency update" stuff above, doesn't it?
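Here's a minimal sketch of the RLE idea for an 8x8x8 leaf stored in Morton order. It assumes one byte of "block type" per cell, and the names ~MortonIndex, Run, ~CompressLeaf and ~DecompressLeaf are invented for the example rather than taken from the game:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

static const int kLeafLog2 = 3;                       // 8x8x8 leaf
static const int kLeafCells = 1 << (3 * kLeafLog2);   // 512 cells

// Interleave the low three bits of x, y and z (x in the lowest bit of each triple),
// so cells that are close in space tend to be close in the array.
inline uint32_t MortonIndex(uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t m = 0;
    for (int bit = 0; bit < kLeafLog2; ++bit)
    {
        m |= ((x >> bit) & 1u) << (3 * bit + 0);
        m |= ((y >> bit) & 1u) << (3 * bit + 1);
        m |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return m;
}

struct Run { uint8_t blockType; uint16_t count; };

// cells[] is indexed by MortonIndex(x,y,z). Consecutive equal cells collapse into one run.
inline std::vector<Run> CompressLeaf(const uint8_t cells[kLeafCells])
{
    std::vector<Run> runs;
    for (int i = 0; i < kLeafCells; ++i)
    {
        if (!runs.empty() && runs.back().blockType == cells[i])
            ++runs.back().count;
        else
            runs.push_back(Run{cells[i], 1});
    }
    return runs;
}

// Expand the runs back into the flat, directly-indexable array.
inline void DecompressLeaf(const std::vector<Run> &runs, uint8_t cells[kLeafCells])
{
    int i = 0;
    for (const Run &r : runs)
        for (int n = 0; n < r.count; ++n)
        {
            assert(i < kLeafCells);
            cells[i++] = r.blockType;
        }
    assert(i == kLeafCells);
}
```

A leaf that is entirely "sea" compresses to a single run, which is the dull-but-dense case the paragraph above is worried about.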

OK, so which combination of the above? What are my criteria? Well, I'd like to have:

1. A world as big as int32s will allow in all 3 directions.
2. Reasonably memory-efficient, but given today's memory sizes being fast is probably more important - so not too much pointer-chasing.
3. Some sort of hierarchy to allow region-sleeping.
4. Option to region-compress (not planning to do this now, but maybe later).

Which to choose? Well, this isn't an interview question, you don't have to sit and cogitate it all yourself from first principles. The correct answer is - to the Twitternets! So I posted the question and sure enough got a bunch of great replies. I've tried to massage this into a vaguely readable format:

An interesting link: http://0fps.wordpress.com/2012/01/14/an-analysis-of-minecraft-like-engines/

Sebastian Sylvan (@ssylvan): hash. For tree, query pattern is crucial (Oct tree for uniform queries, kd-style for queries mainly near values).
Sebastian Sylvan (@ssylvan): hierarchical hash grid. Big contiguous chunks go in coarser grid.

Chris Pruett (@c_pruett): I guess I'm in the hash table camp provided you can come up with a good hash function. Oct tree + sparse data = empty nodes.
Chris Pruett (@c_pruett): Alternatively, I've seen kd trees used quite effectively for smaller datasets. Hash to nodes of kd trees?

Jon Watte (@jwatte): You answered your own question. For large solids, anything kd-like (even a BSP) will win.

Chris Pruett (@c_pruett): I'm still down for the layers-of-organizational-structure approach. Did a collision system that was a 2D grid of ~BSPs once.
Chris Pruett (@c_pruett): Theory being that resolution up close has different requirements than resolution of the world as a whole.

Pat Wilson (@pat_wilson): May not be applicable, but I saw this a few weeks ago: http://www.openvdb.org/index.html

Mark Wayland (@rompa69): Store smaller "octrees" in hashed chunks, only bits around the observer loaded? Octrees better for vis but hash for collision?
* Per Vognsen (@pervognsen): IIRC, Minecraft uses a sparse array (hash table) of 16x16x64 chunks. Coldet is voxel tracing against fine grid.
* Sebastian Sylvan (@ssylvan): also Oct tree != Oct trie. Uniform split is trie. An *actual* Oct tree reduces height while giving kd tree benefits.
* Sebastian Sylvan (@ssylvan): i.e. tree is when you use input data points to pick split plane. Most people use "Oct tree" when they mean trie.
* Sebastian Sylvan (@ssylvan): so real Oct tree: split node on a point that roughly subdivides input into 8 equal sized children.
* Sebastian Sylvan (@ssylvan): true. Hierarchical hash trees are hard to beat for simplicity. R trees are cool to, my talk/post on them http://bit.ly/T7bLmg
* Per Vognsen (@pervognsen): Sparse two-level arrays are a sweet spot for data structures.
** mmalex (@mmalex): yes! Every time I stray into something more complex than this, I come back to 2 level stuff. Suits vm too
** Per Vognsen (@pervognsen): I know Alex likes a B-tree-like root (sorted split keys) with dense subarrays. Lots of possibilities here.
** Per Vognsen (@pervognsen): Right, and ideally each structure is flat, though it need not be (e.g. sparse voxel octrees).
** Per Vognsen (@pervognsen): Not sure if you read Cyril Crassin's papers on those, but he uses an octree with leaf "bricks" (dense arrays).

Also:
Scott Nagy ‏(@visageofscott): you're like two tweets from a "valve working on minecraft clone" story showing up on kotaku 

OK, yeah - no - [[Dwarf Minecart Tycoon]] is absolutely not a Valve project. I'm not even sure I ever want to release it - if the goal is to have fun writing it, releasing it would end the fun!


Anyway taking into account all the above (I wasn't very clear on the ability to sleep large areas when I asked the question - 140-char limit!), the place I've started with is a "brick" at the lowest level, then a shallow octree above it, then a hash table. It's probably easiest to express with code. A ~MapNode is the container for a single hex of the map (holds a list of objects, etc), so going from the bottom up:

{{{
struct WorldLeaf
{
    static const int SizeXLog2 = 3;
    static const int SizeYLog2 = 3;
    static const int SizeZLog2 = 2;

    MapNode MapArray[1<<(SizeXLog2+SizeYLog2+SizeZLog2)];
};

struct WorldOct
{
    static const int MaxDepth = 4;
    static const int SizeXLog2 = 2;
    static const int SizeYLog2 = 2;
    static const int SizeZLog2 = 2;

    // Will be NULL iff not a leaf.
    WorldLeaf *pLeaf;
    // Some or all of these can be NULL if there's no child, or if this is a leaf.
    WorldOct *pChild[1<<(SizeXLog2+SizeYLog2+SizeZLog2)];
};

struct WorldHashEntry
{
    // Where the min corner (i.e. least X,Y,Z) of this entry is in the world.
    IntMapPos MinCorner;
    WorldOct Oct;
};

struct WorldHashList
{
    ArbitraryList<WorldHashEntry*> Entries;
};

struct World
{
    ...other game stuff...

    static const int HashTableSizeLog2 = 8;
    WorldHashList HashList[1<<HashTableSizeLog2];
};
}}}
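To make the three-stage lookup concrete, here is a hedged sketch of how a query might walk those structures: hash the top-level region, hop down the shallow octree, then index the brick. It is a simplified stand-in rather than the real code: ~OctLevels, the Key fields (used instead of ~MinCorner), the hash constants, the cell layout inside the brick and the ~FindMapNode function are all assumptions, and nothing is ever freed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MapNode { int contents = 0; };

struct WorldLeaf
{
    static const int SizeXLog2 = 3, SizeYLog2 = 3, SizeZLog2 = 2;
    MapNode MapArray[1 << (SizeXLog2 + SizeYLog2 + SizeZLog2)];
};

struct WorldOct
{
    static const int SizeLog2 = 2;                // 4x4x4 children per node
    WorldLeaf *pLeaf = nullptr;                   // only set on bottom-level nodes
    WorldOct *pChild[1 << (3 * SizeLog2)] = {};
};

struct WorldHashEntry
{
    uint32_t KeyX = 0, KeyY = 0, KeyZ = 0;        // region coordinates (stand-in for MinCorner)
    WorldOct Oct;
};

struct World
{
    static const int OctLevels = 2;               // assumed number of child-pointer hops
    static const int HashTableSizeLog2 = 8;
    std::vector<WorldHashEntry*> HashList[1 << HashTableSizeLog2];
};

// Returns the MapNode at (x,y,z), or NULL if it does not exist and create is false.
MapNode *FindMapNode(World &world, int32_t x, int32_t y, int32_t z, bool create)
{
    // Use the unsigned bit patterns so shifts and masks are well-defined for negative coordinates.
    const uint32_t ux = uint32_t(x), uy = uint32_t(y), uz = uint32_t(z);
    const int octBits = WorldOct::SizeLog2 * World::OctLevels;

    // 1. Hash the top-level region coordinates and walk that bucket.
    const uint32_t kx = ux >> (WorldLeaf::SizeXLog2 + octBits);
    const uint32_t ky = uy >> (WorldLeaf::SizeYLog2 + octBits);
    const uint32_t kz = uz >> (WorldLeaf::SizeZLog2 + octBits);
    uint32_t h = (kx * 73856093u) ^ (ky * 19349663u) ^ (kz * 83492791u);
    h = (h ^ (h >> 16)) & ((1u << World::HashTableSizeLog2) - 1);
    WorldHashEntry *entry = nullptr;
    for (WorldHashEntry *e : world.HashList[h])
        if (e->KeyX == kx && e->KeyY == ky && e->KeyZ == kz) { entry = e; break; }
    if (!entry)
    {
        if (!create) return nullptr;
        entry = new WorldHashEntry();
        entry->KeyX = kx; entry->KeyY = ky; entry->KeyZ = kz;
        world.HashList[h].push_back(entry);
    }

    // 2. Walk down the shallow octree, consuming SizeLog2 bits of each coordinate per level.
    WorldOct *node = &entry->Oct;
    const uint32_t mask = (1u << WorldOct::SizeLog2) - 1;
    for (int level = World::OctLevels - 1; level >= 0; --level)
    {
        const int s = WorldOct::SizeLog2 * level;
        const uint32_t idx = ((ux >> (WorldLeaf::SizeXLog2 + s)) & mask)
                           | (((uy >> (WorldLeaf::SizeYLog2 + s)) & mask) << WorldOct::SizeLog2)
                           | (((uz >> (WorldLeaf::SizeZLog2 + s)) & mask) << (2 * WorldOct::SizeLog2));
        assert(idx < (1u << (3 * WorldOct::SizeLog2)));
        if (!node->pChild[idx])
        {
            if (!create) return nullptr;
            node->pChild[idx] = new WorldOct();
        }
        node = node->pChild[idx];
    }
    if (!node->pLeaf)
    {
        if (!create) return nullptr;
        node->pLeaf = new WorldLeaf();
    }

    // 3. Index the dense brick with the low bits (X fastest, then Y, then Z).
    const uint32_t lx = ux & ((1u << WorldLeaf::SizeXLog2) - 1);
    const uint32_t ly = uy & ((1u << WorldLeaf::SizeYLog2) - 1);
    const uint32_t lz = uz & ((1u << WorldLeaf::SizeZLog2) - 1);
    return &node->pLeaf->MapArray[lx | (ly << WorldLeaf::SizeXLog2)
                                     | (lz << (WorldLeaf::SizeXLog2 + WorldLeaf::SizeYLog2))];
}
```

With these assumed sizes a random lookup costs one bucket walk plus three pointer hops, and two cells that are neighbours in X inside the same brick are adjacent in memory.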


I pulled all the statically-defined numbers out of my arse - the size and depth of the octree, the size of the bottom leaves, etc. I'll probably do some tests later on to see what the optimal sizes and depths are, but for now these feel right - enough depth to the octree to allow efficient region-sleeping, but not too much pointer-chasing.

I would have liked to make ~WorldOct something like:

{{{
struct WorldOct
{
    static const int MaxDepth = 3;
    static const int SizeXLog2 = 2;
    static const int SizeYLog2 = 2;
    static const int SizeZLog2 = 2;

    union
    {
        WorldLeaf *pLeaf[1<<(SizeXLog2+SizeYLog2+SizeZLog2)];    // If depth=2
        WorldOct *pChild[1<<(SizeXLog2+SizeYLog2+SizeZLog2)];   // If depth<2
    };
};
}}}

...to save one extra pointer indirection per traversal, but it got really messy. I might try to fix that later.

The other thing I did was have a tiny cache of the most recently-accessed *~WorldLeaf entries, and before doing the whole tree-traversal thing, you check in that cache first. This provided a very nice speedup. How big should that cache be? Well the surprise was... 4 entries. I tried 2 and 8 and they were slower. I also tried LRU and random replacement policies, and LRU was best. Again, this needs more testing later, but it was surprising how small a good size was.
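A sketch of what such a four-entry LRU cache could look like. The ~LeafKey struct (the min corner of the leaf), the ~LeafCache class and the use-counter scheme are all invented for illustration; the real thing may well be organised differently.

```cpp
#include <cassert>
#include <cstdint>

struct WorldLeaf { int dummy; };        // stand-in for the real brick

struct LeafKey { int32_t x, y, z; };    // assumed key: min corner of the leaf

class LeafCache
{
public:
    static const int kEntries = 4;

    // Returns the cached leaf, or NULL on a miss (caller then does the full traversal).
    WorldLeaf *Lookup(const LeafKey &k)
    {
        for (int i = 0; i < kEntries; ++i)
        {
            if (slots[i].leaf && slots[i].key.x == k.x && slots[i].key.y == k.y && slots[i].key.z == k.z)
            {
                slots[i].lastUsed = ++tick;     // mark as most recently used
                return slots[i].leaf;
            }
        }
        return nullptr;
    }

    // Called after a miss: overwrite the least recently used slot.
    void Insert(const LeafKey &k, WorldLeaf *leaf)
    {
        assert(leaf != nullptr);
        int victim = 0;
        for (int i = 1; i < kEntries; ++i)
            if (slots[i].lastUsed < slots[victim].lastUsed)
                victim = i;                     // empty slots have lastUsed == 0, so they go first
        slots[victim] = Slot{k, leaf, ++tick};
    }

private:
    struct Slot { LeafKey key; WorldLeaf *leaf; uint64_t lastUsed; };
    Slot slots[kEntries] = {};
    uint64_t tick = 0;
};
```

With only four slots a linear scan is about as cheap as it gets, which may be part of why going bigger didn't help.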

I also wrote an iterator object that knows about the above structure, so when you want to query a region in an arbitrary order (e.g. for physics updates, etc) you use that iterator and it does sensible coherent traversals rather than doing top-down queries every single time. The other thing to write might be macros that do "find me the block to the negative Y" and so on, since they are also very common operations and will very frequently hit in the same ~WorldLeaf and never need to traverse any levels of the tree (note that as currently written, if you have a pointer to a ~MapNode and you know its world position, with some modulo math you can find out the pointer to its parent ~WorldLeaf, which is handy). But that's possibly for a future post, when more of the game, and hence the actual access patterns, are implemented. Until I have those access patterns, there's little point in "optimising" any further - I have no data.
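For example, a "block to the negative Y" helper might look something like this. It assumes cells inside a ~WorldLeaf are laid out X-fastest, then Y, then Z, and the ~SlowLookupFn callback is a hypothetical stand-in for the full top-down query:

```cpp
#include <cassert>
#include <cstdint>

struct MapNode { int contents = 0; };

struct WorldLeaf
{
    static const int SizeXLog2 = 3, SizeYLog2 = 3, SizeZLog2 = 2;
    MapNode MapArray[1 << (SizeXLog2 + SizeYLog2 + SizeZLog2)];
};

// Hypothetical signature for the full hash + octree traversal.
typedef MapNode *(*SlowLookupFn)(int32_t x, int32_t y, int32_t z);

// Given a node and its world position, find the neighbour at y-1.
inline MapNode *NeighbourNegY(MapNode *node, int32_t x, int32_t y, int32_t z, SlowLookupFn slowLookup)
{
    assert(node != nullptr);
    // Low bits of y give the row inside the brick (works for negative y via the bit pattern).
    const uint32_t localY = uint32_t(y) & ((1u << WorldLeaf::SizeYLog2) - 1);
    if (localY != 0)
        return node - (1 << WorldLeaf::SizeXLog2);   // same brick: step back one row of X cells
    return slowLookup(x, y - 1, z);                  // crossed a brick edge: do it the slow way
}
```

With an 8-deep brick in Y, seven out of eight of these queries never touch the tree at all.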
I thought I'd better write some notes on my [[GDC 2003 talk|http://www.eelpi.gotdns.org/papers/papers.html]] on SH in games, as in hindsight there were some errors and some items cut for time.

Error - slide 5 - "To multiply two ~SHs, multiply their weights". Not true. Although you can add two ~SHs by adding their components, you can't do a modulation of one by the other that way. Doing a componentwise multiplication is a convolution, which is a completely different operation that I really don't fully understand.

Slides 12 and 13 didn't tell you what the magical fConstX values were! Sorry about that. The correct values are found by taking the canonical SH coefficients (for example http://www.sjbrown.co.uk/?article=sharmonics), and then convolving them with the clamped cosine kernel (i.e. to actually do the lighting with the standard clamp(N.L) equation we know and love). This is easy - multiply the first value by 1, the next three by 2/3 and the last five by 1/4. Finally, you'll probably want to convert the output from radiance (in some arbitrary units) to a graphics-friendly [0,1] range by scaling everything.

And the results are:
fConst1 = 4/17
fConst2 = 8/17
fConst3 = 15/17
fConst4 = 5/68
fConst5 = 15/68

Easy! Also note that of course slide 13 - the vertex shader code - can be refactored to change (3*z*z-1) into just z*z by massaging the values held in sharm[7] and sharm[0] back in slide 12. That's left as a fairly simple exercise for the coder - I didn't want to confuse the slides with non-canonical equations.
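As a sanity check on those constants, here is a small CPU-side sketch that projects a single directional light and evaluates the result for a normal. It is not the slide code: only sharm[0] (constant term) and sharm[7] (the 3*z*z-1 term) are pinned down by the text above, so the ordering of the other coefficients, the Vec3 struct and both function names are assumptions.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static const float fConst1 = 4.0f / 17.0f;
static const float fConst2 = 8.0f / 17.0f;
static const float fConst3 = 15.0f / 17.0f;
static const float fConst4 = 5.0f / 68.0f;
static const float fConst5 = 15.0f / 68.0f;

// Project a unit-brightness directional light (unit vector d points towards the light).
inline void ProjectDirectionalLight(const Vec3 &d, float sharm[9])
{
    sharm[0] = fConst1;
    sharm[1] = fConst2 * d.x;
    sharm[2] = fConst2 * d.y;
    sharm[3] = fConst2 * d.z;
    sharm[4] = fConst3 * (d.x * d.z);
    sharm[5] = fConst3 * (d.z * d.y);
    sharm[6] = fConst3 * (d.y * d.x);
    sharm[7] = fConst4 * (3.0f * d.z * d.z - 1.0f);
    sharm[8] = fConst5 * (d.x * d.x - d.y * d.y);
}

// Evaluate the lighting for unit normal n - the job the vertex shader does.
inline float EvaluateSH(const float sharm[9], const Vec3 &n)
{
    return sharm[0]
         + sharm[1] * n.x + sharm[2] * n.y + sharm[3] * n.z
         + sharm[4] * (n.x * n.z) + sharm[5] * (n.z * n.y) + sharm[6] * (n.y * n.x)
         + sharm[7] * (3.0f * n.z * n.z - 1.0f)
         + sharm[8] * (n.x * n.x - n.y * n.y);
}
```

With these values a normal pointing straight at the light evaluates to exactly 17/17 = 1.0, which is where the [0,1] scaling comes from. A normal pointing directly away gives 1/17 rather than zero, because nine coefficients only approximate the clamp.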

Thanks to [[Peter-Pike Sloan|http://research.microsoft.com/~ppsloan/]] for getting this all straight in my head. He actually //understands// the fundamental maths behind this stuff, whereas I'm more of a ... er ... general concepts man myself (in other words, I suck).

Volker Schoenefeld wrote a good paper on this - see [[More on SH]].

Robin Green also wrote a good talk called "Spherical Harmonic Lighting: The Gritty Details" [[http://www.research.scea.com/gdc2003/spherical-harmonic-lighting.pdf|http://www.research.scea.com/gdc2003/spherical-harmonic-lighting.pdf]]

And to complete the little circle of SH incest, a talk by Chris Oat of ATI/AMD at GDC04: [[http://ati.amd.com/developer/gdc/Oat-GDC04-SphericalHarmonicLighting.pdf|http://ati.amd.com/developer/gdc/Oat-GDC04-SphericalHarmonicLighting.pdf]].



What I also had to miss out from the slides were some details of the implementation that I used in the Xbox1, ~PS2 and Gamecube versions of a game that never got finished before [[MuckyFoot]] vanished. This is actually the coolest bit about using SH illumination in real games. So the situation is that we wanted to give the artists a lot of control over their lighting environment. The problem is, what that usually means is they want a lot of lights. That gets expensive - the practical limit on Xbox1 and ~PS2 lights is about three directional lights with specular (on the Xbox you'll get diffuse bumpmapping on one or two of those with a pixel shader), and the Gamecube only has vanilla ~OpenGL lights (which is a lot worse than it sounds - they're not very useful in practice), and they're not fast.

Obviously the bar has been raised since - the 360 and ~PS3 can do really quite a lot of vertex and pixel shading, and although the Wii is still the same old Gamecube hardware, it is a bit faster. However, when I say the artists wanted a lot of lights, I don't mean 5, I mean 50. Most of them are fill and mood lights and don't need anything but per-vertex diffuse shading (i.e. not bumpmapping or specular shading), but that's still a lot of lights to ask the vertex shaders to do.

The solution we came up with was elegant. First, bake all the "mood" lights into the SH. Then of the "focus" lights (that need bumpmapping and specular), pick the brightest four shining on the object. Bake the rest into the SH. Take the brightest four, order them by brightness (1st = brightest, 4th = least bright). The fourth always gets baked into the SH as well. The third one is split into two parts - a real light and an SH fraction - and the brightness of these two always adds up to the brightness of the real 3rd light. You split it up by using the relative brightnesses of the 2nd, 3rd and 4th lights so that when the 3rd and 4th lights are almost the same brightness, the 3rd light is all SH. And when the 2nd and 3rd lights are almost the same brightness, the 3rd light is all real, with no SH component. This way, as the object moves around, lights will move from being real lights to being baked into the SH nice and smoothly and you won't see any popping as they change types.
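The exact blend isn't spelled out above, so the following is only one plausible way to do the split: a straight linear ramp between the 4th and 2nd brightnesses, which hits both of the end conditions described. The function name and the handling of the all-equal case are assumptions.

```cpp
#include <cassert>
#include <cmath>

// Fraction of the 3rd-brightest light rendered as a "real" light.
// The remainder (1 - fraction) of its brightness gets baked into the SH.
// Expects brightness2 >= brightness3 >= brightness4.
inline float RealFractionOfThirdLight(float brightness2, float brightness3, float brightness4)
{
    assert(brightness2 >= brightness3 && brightness3 >= brightness4);
    const float range = brightness2 - brightness4;
    if (range <= 0.0f)
        return 0.0f;        // all three equal: arbitrary choice, treat it as all SH
    // 0 when 3rd == 4th (all SH), 1 when 3rd == 2nd (all real), linear in between.
    return (brightness3 - brightness4) / range;
}
```

The point of anchoring the ramp to the neighbours is that when two lights swap places in the sort order they have the same brightness and the same real/SH split at that instant, so nothing pops.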

So that gives you two lights and some fraction of a third that are done properly - bumpmapping, specular, point light (instead of a directional approximation), etc. You can of course use four or five or even two. I found that two looked not that great, and four or five didn't improve matters significantly - three (well, two-and-a-bit) seems to be the visually good number, but you should experiment yourself.

That's what we were going to do originally, and that remained the plan for the Xbox1. However, the ~PS2 didn't seem able to vertex-shade even three point lights with proper specular, so that was cut back to just one! Those paying attention will realise that actually that's some-fraction-of-one light - yes, if you were standing in between two equally bright lights, there was no proper lighting performed at all! Still, that's what needed to happen for speed, and it didn't actually look objectionable, just a bit washed-out at times compared to the Xbox1.

The Gamecube was a challenge - it has no vertex shading hardware at all - so what to do? We tried just feeding the brightest four lights into the standard ~OpenGL lights, and then feeding the zeroth element of the SH into the ambient term (which is something people have been doing for ages - way before SH came along). But it looked rubbish. The artists would put ten yellow lights spread out on the ceiling of a corridor and ten blue lights providing underlighting, and instead of doing the right stark cinematic yellow-opposite-blue lighting, all that would happen is you'd get a random assortment of a soft blue and/or yellow tinge in a random direction, and a muddy grey ambient. Looked awful. So instead the chap doing the GC version (Andy Firth) suggested we take the first four coefficients of the SH (rather than the first nine) and emulate them with OGL lights. So the zero term was still the ambient, the brightest real light was a real OGL light, and then the six cardinal directions (positive and negative x, y, z) each got a directional OGL light shining along that axis. You need six because the OGL lights get clamped on the backside of the light, so if your first coefficient is -0.5, then you need a -0.5 light shining along from the +X axis and a +0.5 light shining along from the -X axis. This was a super neat hack, and it brought the GC lighting quality nearly up to the ~PS2's.
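To see why six clamped lights reproduce the linear SH terms, here is a monochrome sketch. It assumes the 4-coefficient SH evaluates as sh[0] + sh[1]*n.x + sh[2]*n.y + sh[3]*n.z and that the fixed-function pipe accepts negative light intensities, as the -0.5 example above implies; the struct and function names are made up.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct DirLight { Vec3 toLight; float intensity; };

// Turn the three linear coefficients into six axis-aligned directional lights.
inline void MakeAxisLights(const float sh[4], DirLight out[6])
{
    const Vec3 axes[3] = { {1.0f, 0.0f, 0.0f}, {0.0f, 1.0f, 0.0f}, {0.0f, 0.0f, 1.0f} };
    for (int a = 0; a < 3; ++a)
    {
        // Light shining from the positive axis carries +coefficient...
        out[2 * a] = DirLight{ axes[a], sh[1 + a] };
        // ...and the one from the negative axis carries -coefficient.
        out[2 * a + 1] = DirLight{ Vec3{ -axes[a].x, -axes[a].y, -axes[a].z }, -sh[1 + a] };
    }
}

// What a fixed-function pipe does: ambient plus each light with a clamped N.L.
inline float ShadeFixedFunction(float ambient, const DirLight lights[6], const Vec3 &n)
{
    float sum = ambient;
    for (int i = 0; i < 6; ++i)
    {
        const Vec3 &l = lights[i].toLight;
        const float ndotl = n.x * l.x + n.y * l.y + n.z * l.z;
        if (ndotl > 0.0f)                       // the clamp that makes six lights necessary
            sum += lights[i].intensity * ndotl;
    }
    return sum;
}
```

For any normal, exactly one light of each opposing pair survives the clamp, and between them the pair contributes coefficient * n.axis, which is the unclamped linear term.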

Although we weren't exactly //happy// with only having one light and a 4-component SH, the important thing was that the artists now had a reliable lighting model - they could put a bunch of lights in the scene and you'd get something pretty close on all three platforms. They accepted that Xbox1>~PS2>GC in terms of quality, and it was //predictable//. Whereas the previous pre-SH lighting schemes would pick a random assortment of lights out of the ones they'd placed, which was very frustrating for them, and heavily limited the sort of "mood" they could set with lights.


But that wasn't the really cool bit. Oh no. The //really// cool bit is that you don't have to just project a directional light into the SH. You're doing this once per object - you've got a few CPU cycles to play with. So why not have some more flexible light types? The simplest tweak, which actually worked really nicely, was to give lights a volume. Normally, lights are done as point lights - infinitely small. Of course you typically set up some sort of falloff with distance to control it a bit better. I loathe the canonical OGL/DX light attenuation - it's really hard for artists to control well. Far better is a linear falloff that goes from a minimum distance (at which the light has maximum brightness, whatever the artist sets that to be) out to a maximum distance (at which the light is ZERO). The maximum distance is also nice for culling - you can actually throw the light away and not worry about it at all. That's all pretty easy, and again people have been doing that for ages.
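The linear falloff is about as simple as it sounds. A minimal sketch, with made-up function and parameter names:

```cpp
#include <cassert>
#include <cmath>

// Full brightness inside minDist, exactly zero at and beyond maxDist, linear in between.
inline float LinearFalloff(float distance, float minDist, float maxDist, float maxBrightness)
{
    assert(maxDist > minDist);
    if (distance <= minDist) return maxBrightness;
    if (distance >= maxDist) return 0.0f;
    return maxBrightness * (maxDist - distance) / (maxDist - minDist);
}

// Because the light really is zero past maxDist, it can be skipped completely.
inline bool LightCanBeCulled(float distance, float maxDist)
{
    return distance >= maxDist;
}
```

So a light with a minimum distance of 10m and a maximum of 20m is at half brightness at 15m, and can be dropped from the object's light list entirely at 20m.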

The problem is that sometimes you want a single bright light that illuminates a large area - let's say it has a brightness of the standard 1.0 at a distance of 10m. But now what happens when an object is 5m from the light? If you set the minimum distance at 10m, that object won't be any brighter, which looks odd. So move the minimum distance in and crank up the brightness. Well, your object is kinda brighter - but the lighting is still odd. Imagine the illuminated object is a sphere. So now a large part of the sphere (maybe a quarter of the surface) is saturating at full brightness, and then there's a falloff, and then still half the sphere is black. In a real room what you'd get from a bright light like that is backscatter illuminating the back side of the sphere, and indeed the shading of the sphere would still be smooth all over without any odd-looking saturation. The other effect is that really a light like that would often be physically larger than the sphere, so it wouldn't just be a directional light - the outer parts of the light would be illuminating around the edges of the sphere. And indeed as you get closer, almost the entire sphere would have some part of the light that would be able to see it directly.

Anyway, so I gave the artists a physical radius in meters that they could set the light to, rather than being a single point in space. Then I do a bit of sensible-seeming maths involving the radius of the object being illuminated, the radius of the light illuminating it, and the distance of the light from the object. This gives me an approximation to how much "wrap around" I should give this light (a small object should get more wraparound than a large one). There's also an "ambient bounce" factor that the artist can set to simulate the light being in a small, white-walled room and causing backscatter as opposed to stuck in the middle of a big open space. But notice this ambient bounce still scales according to the distance between light and object.

And then when I go to do the canonical directional-light SH evaluation as given in the slides, the ambient bounce factor just adds light into the sharm[0] ambient factor, and the wrap around factor does that and also shrinks sharm[4] through sharm[8] towards zero. This last one needs some explaining - basically what I'm doing is lerping from the real clamped cosine kernel (as given in the slides) into one that does (N.L*0.5+0.5), i.e. it has a bright spot on one side and a dark spot on the other, and a smooth blend between those, with no clamping needed. This is the fully wrapped around lighting state - when the object is right up against the light source. It should be obvious that a kernel like that is just sharm[0] = 0.5, sharm[1]-[3] = light_direction and sharm[4]-[8] = 0.0. So lerp between that and the standard directional coefficients using the wrap-around factor.
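A sketch of that lerp, reusing the fConst values from earlier. The coefficient ordering (apart from sharm[0] and sharm[7]), the Vec3 struct and the function names are assumptions. Also note that in this sketch's evaluation convention the wrapped kernel's linear terms are 0.5 * light_direction, so that it evaluates to exactly N.L*0.5+0.5.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Standard clamped-cosine directional light, using the fConst values.
inline void ProjectDirectionalLight(const Vec3 &d, float sharm[9])
{
    sharm[0] = 4.0f / 17.0f;
    sharm[1] = (8.0f / 17.0f) * d.x;
    sharm[2] = (8.0f / 17.0f) * d.y;
    sharm[3] = (8.0f / 17.0f) * d.z;
    sharm[4] = (15.0f / 17.0f) * (d.x * d.z);
    sharm[5] = (15.0f / 17.0f) * (d.z * d.y);
    sharm[6] = (15.0f / 17.0f) * (d.y * d.x);
    sharm[7] = (5.0f / 68.0f) * (3.0f * d.z * d.z - 1.0f);
    sharm[8] = (15.0f / 68.0f) * (d.x * d.x - d.y * d.y);
}

inline float EvaluateSH(const float sharm[9], const Vec3 &n)
{
    return sharm[0] + sharm[1] * n.x + sharm[2] * n.y + sharm[3] * n.z
         + sharm[4] * (n.x * n.z) + sharm[5] * (n.z * n.y) + sharm[6] * (n.y * n.x)
         + sharm[7] * (3.0f * n.z * n.z - 1.0f) + sharm[8] * (n.x * n.x - n.y * n.y);
}

// wrap = 0: ordinary directional light. wrap = 1: fully wrapped (N.L*0.5+0.5).
// ambientBounce is simply added to the constant term.
inline void ProjectVolumeLight(const Vec3 &d, float wrap, float ambientBounce, float sharm[9])
{
    assert(wrap >= 0.0f && wrap <= 1.0f);
    float directional[9];
    ProjectDirectionalLight(d, directional);
    // Fully wrapped kernel: constant 0.5, linear 0.5*d, no quadratic terms.
    const float wrapped[9] = { 0.5f, 0.5f * d.x, 0.5f * d.y, 0.5f * d.z,
                               0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 9; ++i)
        sharm[i] = directional[i] + wrap * (wrapped[i] - directional[i]);
    sharm[0] += ambientBounce;
}
```

At full wrap the side facing the light is still 1.0, the equator is 0.5 and the far side is 0.0, with a smooth blend all the way round.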

The neat bit is that this is also pretty easy to add to the per-pixel bumpmapping stuff (it's a modification of the idea of hemispherical lighting), so you can get wraparound lighting on that as well. You can also tweak the specular lighting so that your highlight size gets larger as you approach the light. This looks really nice - as objects get close to large, bright lights, they still seem to get brighter, even though the brightest-lit pixels are still maxed at full-white. It's pretty neat, and the artists loved having even more knobs to play with on their lights. Sadly, the game (and the company) was shut down before we got to the stage where they could really play with final assets and lighting, so we never found out what the system could do when given some full-time artist loving.
The first question in [[DMT]] was - what shape is the map? Squares are the obvious answer, but I've never liked them much. If you disallow diagonal movement and interaction, the code is simple, but things look ugly and distances are counter-intuitive. If you allow diagonal movement and interaction then the code's a pain - you now have to handle the diagonals, and you have to handle strange things like four people can meet and can all interact with (which in a game usually means "kill") each other, which is also counter-intuitive. And when you move to 3D you now have three cases of interactions - cube-face-to-cube-face, cube-edge-to-cube-edge, and cube-corner-to-cube-corner. It's a huge pain in the arse.

Hexes have always been the shape of choice for nerdcore games - wargamers swear by them for example - and they solve all of these problems. Movement looks better, and there's no strange corner cases. Another neat thing from rendering is if you want to render them as a heightfield (or indeed collide against them), they tessellate to triangles in obvious ways - whereas with squares you have to make a fairly arbitrary choice about which way to stick the diagonal. But it's not intuitive how to deal with them in code - we're so used to using a right-angled 2D coordinate system to refer to stuff. But I thought I'd bite the bullet and figure this stuff out right at the start. If it didn't work, well... go back to squares.

The system I decided on was to have three axes - X, Y, and W (Z is for height off the ground). X and Y are the two major ones, and that's how the storage is actually indexed - so I have a conventional 2D array in memory. X is where it normally is (aligned with Cartesian world X), and then Y is at 120 degrees to it rather than the usual 90. W is an implicit axis - it is midway between X and Y and the W coordinate value is equal to (X+Y)/2. Note that W is not an independent coordinate - you can't change it without changing either or both of X and Y. In practice W is never actually used as a coordinate or stored in any structure, it's really just a reference to a direction that you can move in. Of course as I write this I realise if I'd wanted proper symmetry, I would have pointed W the other way, so that X, Y and W were all at 120 degrees from each other. That would have been far more elegant, and (I now realise) analogous to barycentric coordinates in a triangle. Ah well, I don't think it makes much difference in practice.

So now movement and proximity is elegant again - things can only move or interact along the edges of their hexagons, so there's (literally) no corner cases any more. As long as you have a bunch of simple wrapper functions to say "given coordinate A, what is the new coordinate if I move in direction D", where there are six possible directions - positive X, Y, W and negative X, Y, W. In practice I numbered the directions as an enum going anticlockwise from positive X:

enum ~IntMapHeading
{
	~MAPHEADING_PX = 0,
	~MAPHEADING_PW,
	~MAPHEADING_PY,
	~MAPHEADING_NX,
	~MAPHEADING_NW,
	~MAPHEADING_NY,

	~MAPHEADING_LAST			/* Always last */
};

The other big choice I made was to follow my own advice in the post [[A matter of precision]] and use integers rather than floats for coordinates in space and time. Along the way, try not to ever use the X,Y coordinates directly (and thus confuse myself about where W is) - always use wrapper functions. So I have things like the following, where ~IntMapPos is just struct { int32 x, y; }

bool ~IsOnMap ( ~IntMapPos pos );
~MapNode *~GetMapNode ( ~IntMapPos pos );
~IntMapPos ~MoveBy ( ~IntMapPos pos, ~IntMapHeading Heading, ~MapSlope Slope );
~IntMapHeading ~GetDeltaInfo ( float *pHowFar, int *pHowFarManhattan, float *pFloatHeading, ~IntMapPos p1, ~IntMapPos p2 );
~D3DXVECTOR3 ~GetWorldDeltaFromMapDelta ( ~IntMapPos From, ~IntMapPos To );

"Manhattan" distance is a funny concept on a hex grid, but it should be obvious what it means - distance in terms of moves. I don't know of any cities laid out in hexes, so Manhattan it stays.

The last function is the one that all the rendering uses - the camera data is stored as looking and rotating around a certain "target" map hex, and so I pass that target hex in as the "From" argument, and the result is a standard Cartesian float32 XYZ that everything uses for rendering data. Again, the idea is to make sure nothing else cares about this wacky hex grid, and that I don't get any precision problems when the map gets really big.
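To make the hex wrappers a bit more concrete, here is a simplified 2D sketch of three of them: moving one hex in a heading, counting moves between two hexes, and converting a map delta to Cartesian. These are illustrative versions only - they ignore Z and slopes, and ~MoveOneHex, ~HexMoveDistance and ~GetWorldDelta2D are invented names rather than the functions declared above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

struct IntMapPos { int32_t x, y; };

enum IntMapHeading
{
    MAPHEADING_PX = 0, MAPHEADING_PW, MAPHEADING_PY,
    MAPHEADING_NX, MAPHEADING_NW, MAPHEADING_NY,
    MAPHEADING_LAST
};

// W lies midway between X and Y, so a step along +W is a step of +1 in both X and Y.
inline IntMapPos MoveOneHex(IntMapPos pos, IntMapHeading heading)
{
    static const int32_t dx[MAPHEADING_LAST] = { 1, 1, 0, -1, -1,  0 };
    static const int32_t dy[MAPHEADING_LAST] = { 0, 1, 1,  0, -1, -1 };
    assert(heading >= 0 && heading < MAPHEADING_LAST);
    return IntMapPos{ pos.x + dx[heading], pos.y + dy[heading] };
}

// Number of single-hex moves needed to get from p1 to p2.
inline int HexMoveDistance(IntMapPos p1, IntMapPos p2)
{
    const int dx = p2.x - p1.x, dy = p2.y - p1.y;
    if ((dx >= 0) == (dy >= 0))                          // same sign: W moves cover both at once
        return std::max(std::abs(dx), std::abs(dy));
    return std::abs(dx) + std::abs(dy);                  // opposite signs: W is no help
}

struct WorldDelta2D { float x, y; };

// Subtract in integers first, then convert: precision does not depend on distance from the origin.
inline WorldDelta2D GetWorldDelta2D(IntMapPos from, IntMapPos to)
{
    const int32_t dx = to.x - from.x, dy = to.y - from.y;
    const float kSin120 = 0.8660254f;                    // map Y axis is at 120 degrees to X
    const float kCos120 = -0.5f;
    return WorldDelta2D{ float(dx) + kCos120 * float(dy), kSin120 * float(dy) };
}
```

All six neighbours of a hex come out exactly one unit away in world space, and a hex a billion cells from the origin gets the same delta to its neighbour as one at (0,0).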

Of course as you may have spotted from the above, having made my lovely 2D hex world with no corner cases, I then wanted to have a 3D (volumetric) world, and added in a Z axis, and unfortunately that means that in a vertical cross-section I have squares and that means I now have diagonal corner cases to deal with. Doh! In theory I think I could have done some sort of "orange-stacking" arrangement to offset adjacent Z layers, for example "face-centered-cubic" (which I think is the same as the Voronoi diagram of packed tetrahedrons?), but that would have made my head explode trying to think about the connectivity. So I think horizontal hexes and vertical alignment should work out as a reasonable compromise.
The game I'm most proud of working on. A "peep sim" very much in the vein of Theme Hospital and Dungeon Keeper (not much surprise really - people from both those teams joined [[MuckyFoot]] to work on StarTopia, and indeed later worked on Lionhead's "The Movies"). Lots of interesting graphics stuff to do there, and much later I even retrofitted shadowbuffers to it. More [[here|http://eelpi.gotdns.org/startopia/startopia.html]].
In a recent "what were you doing in 2001" Twitter meme I mentioned I had just shipped StarTopia then. And I got a ton of StarTopia love back, which is immensely gratifying, but I want to make something really clear - almost none of it is for me. I joined the project nine months before ship, and by then it was basically what you saw and loved. I just took over the graphics engine from Jan Svarovsky who could then focus on finishing the non-graphics stuff. I ported the engine from ~DX5 to ~DX7 (not exactly rocket-science), tarted it up with a bit of multitexture, and did all the graphics card compatibility stuff. The latter is actually a hard job, especially in 2001 when there were still tens of graphics card manufacturers each with multiple strange variants, but it's just something that needed to be done - it didn't really affect the game. I guess I did a few gameplay things here and there - we all did - I don't really remember the details. But yeah - the real heroes are all these great folks: http://www.mobygames.com/game/windows/startopia/credits  with special kudos to Wayne Imlach who spent years "playing" and tweaking the Excel spreadsheet simulator with the millions of tweakable numbers in the game to make it the right blend of challenging but fair. I still enjoy playing the game from time to time - even when you know the magic behind it all.
''WORK IN PROGRESS'' - which is why it's not on the main page yet.

This is an attempt to gather together some of the knowledge I have about writing streaming resource systems. That is, game engines (rendering, sound, collision, etc) which do little or no bulk-loading of data in "levels", but can instead stream data as the player progresses.

This is mainly from my experience writing engines at [[MuckyFoot]], firstly with the game [[Blade II]] which shipped on ~PS2 and Xbox1, and then the engine used for the next two projects, neither of which were actually shipped, but were far enough along in production to pound out the problems with the system. It also contains the results of discussions with others about their streaming systems, and the problems they had with them. I suspect the success of my system has more to do with luck than planning, but the lessons are nevertheless useful!

Some resource types, such as textures and to a lesser extent meshes, were the space-hogs, and there were sophisticated systems to make sure we only loaded exactly what we needed at the LOD we needed.

''Stuff I wrote on forums on the internets''

One day I'll arrange these into something coherent, but might be interesting even in this form.

---

If you can see stuff streaming in, you're doing it wrong. It's not simple - good streaming requires pretty comprehensive support in the engine. I know from painful experience that it's difficult to slap one onto an engine retrospectively. You need aggressive and smart prefetching to hide seek times. You need good markup for things that might pop in suddenly (HUD elements, muzzle flashes and things associated with explosions such as particle system textures) to avoid paging out things that can be needed in a hurry. You need good fallbacks - mipmap levels or alternatives that are never unloaded. You especially need a lot of work with cutscenes and any other place with jump-cuts between cameras or game features like teleportation. You absolutely cannot just wait until the renderer needs the texture before you load it in - that looks like complete pants.

I also think that just streaming one part of your assets, e.g. just textures or just sounds or just meshes, is doing most of the work for only a small part of the benefit. Stream everything! Textures, meshes, sounds, nav data and collision hulls all stream well, and the first three also LOD well.

People have been streaming stuff for decades at least (at least I know I have!). People just never noticed before because they did it right. I always meant to write an article or something about how to do it properly, but I never did, and I've probably forgotten half the important things. Ah well.

To answer the question - the big advantage is that when you do texture streaming right ([[Knowing which mipmap levels are needed]]), you suddenly don't care about the size of textures. Aside from running out of disk space (which OK, can still be a problem), it doesn't matter if the artists make textures a bit too big, or unevenly allocate texel space in the level - if the player never gets close enough to see, those mipmaps are never loaded. Think about the time the poor pixel-pushers spend fussing over exactly how big to make their textures and whether they have the budget - they get to spend all that time making stuff pretty instead.

---

It's not simple or easy. Even smart engineers do it wrong. But many smart engineers have done it right, shipped games, and you never knew they were streaming because - they did it right! We have this sort of argument every few years with every approximation technology. Progressive meshes, texture streaming, impostors, SH lighting, etc - plenty of people do them well and nobody notices. It only takes a few to do a bodge'n'ship job and suddenly forums are full of "new technology X is a piece of shit - look at EngineY - they're using it and they look shit." Conclusion - don't do things shittily. I think we can agree on that?

The benefits of fine-grained streaming are significant. At a given visual quality you have a LOT more memory available (or at a fixed memory budget you have much higher visual quality - take your pick). That leads to far more flexibility about where you use that budget, and that means development times drop and people don't spend quite as much time obsessing over memory. It is true that this can lead to trying to stuff more in and making life hell all over again. Pro tip - don't do that.

I think chunk-based streaming is the worst of both worlds. On the surface it sounds simpler than fine-grained streaming, but in practice it's not. You have all the management headaches of streaming data and prefetching in time to avoid pops, and teleports/cuts and all that headache. But now your chunks are big monolithic blobs, so you have the same detail for the thing two blocks away and around the corner as for the thing right in front of your nose, so you're still burning huge amounts of memory on stuff you can't see.

Worse, in many systems the chunk has to contain everything that COULD be there, so if Cigar-Chomping Sergeant is in this level, his mesh, textures, sounds, etc need to be in every chunk in the level. So even if he's not in the scene, his stuff is loaded, and even worse if you're on boundaries between chunks it can be loaded three or four times.

I've had very clever people tell me chunk-streaming is a good system, and while I respect their opinion on many things, I just cannot see the win. When we discuss the problems with streaming systems they all seem to be the same whether fine-grained or coarse, but coarse leaves so many of the advantages on the table.

Or there's the hybrid - chunk-streaming loads in stuff up to a certain LOD, but then you have individual textures stream their top few mipmaps like fine-grained streaming. That one totally confuses me - you lose the two acknowledged benefits of chunks - simple memory management and predictability. And you're still capping the possible win by where you draw the line between chunk and fine-grained. Draw it at 512x512 and you're still loading a ton of stuff that isn't needed, draw it at 64x64 and now things can still look like blurry shit unless you do a really good fine-grained streaming system (in which case why not do it properly...).

(technically the system I used is on that continuum since I always kept 8x8 copies of all textures around - it's 32 bytes which was part of the texture description, filename, etc)

---

Another thing I see some games screw up is to take the view frustum into account when streaming. In almost every game style the camera can rotate far faster than a streaming system can keep up. So when doing calculations to see what's needed, it's important to just use visibility and distance, not view direction. It's also important to go past any visibility blockers that the player can get past in under 2 seconds, e.g. bends in corridors, unlocked doors, etc.

If you have something like a portal system, instead of doing the normal frustum-culling visibility test like you do for rendering, do a flood-fill out to a given movement distance (e.g. 3 seconds * max run speed). Then starting from that set of "potentially visible in 3 seconds" portals, do normal portal/portal clipping to do visibility culling, but ignore the viewer's frustum.
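Here is a minimal sketch of the flood-fill half of that, assuming a simple cell-and-portal graph where each link already knows its walking distance. The {{{Cell}}}, {{{PortalLink}}} and {{{ReachableCells}}} names are made up for illustration and are not from the Blade II engine. The result is the set of cells you would then start the frustum-free portal clipping from.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical cell/portal graph: each cell lists its neighbours and the
// walking distance needed to reach them through the connecting portal.
struct PortalLink {
    int   targetCell;  // index of the cell on the other side
    float distance;    // distance walked to get there (assumed precomputed)
};

struct Cell {
    std::vector<PortalLink> links;
};

// For each cell, could the player get there within `seconds` at `maxRunSpeed`?
// View direction is deliberately ignored - only movement distance matters.
std::vector<bool> ReachableCells(const std::vector<Cell>& cells, int startCell,
                                 float seconds, float maxRunSpeed)
{
    assert(startCell >= 0 && startCell < (int)cells.size());
    const float budget = seconds * maxRunSpeed;

    // Best remaining movement budget on arrival at each cell (-1 = not reached).
    std::vector<float> remaining(cells.size(), -1.0f);
    std::queue<int> open;
    remaining[startCell] = budget;
    open.push(startCell);

    while (!open.empty()) {
        const int c = open.front();
        open.pop();
        for (const PortalLink& link : cells[c].links) {
            const float left = remaining[c] - link.distance;
            // Only revisit a cell if we arrive with more budget than before.
            if (left >= 0.0f && left > remaining[link.targetCell]) {
                remaining[link.targetCell] = left;
                open.push(link.targetCell);
            }
        }
    }

    std::vector<bool> reachable(cells.size());
    for (std::size_t i = 0; i < cells.size(); ++i)
        reachable[i] = remaining[i] >= 0.0f;
    return reachable;
}
```

Locked doors and other slow blockers could be modelled by inflating the link distance rather than by cutting the link entirely.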

On the other hand, do grant a priority boost to things that are visible. If the streaming is stressed and not able to load everything you want, you want the visible stuff loaded first. If it's not stressed (e.g. the player hasn't moved recently), you want it to still load everything even if it's off-screen.

---

''Outline for a book or something''

_Resource framework_

Resources + priorities + LOD levels
Prod & decay
Decay done with a timestamp & logarithmic decay. Preserves the idea of "twice as important as". Timestamp allows decay without constant checks.
Priority queue
Fetch queue - 20 items long - allows optimisation of seek times. Once something is in fetch queue, it is loaded
Evict queue - can't remove items immediately (GPU may still be using them).
Avoiding constant PQ updates - stochastic updates instead.
Fragmentation
Defragmentation
Changing up a LOD level
Changing down a LOD level
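A rough sketch of the "timestamp & logarithmic decay" item above, as one possible reading of it rather than the shipped code. Priority is stored as log2 of importance plus the time of the last prod, so decay is a subtraction done lazily when somebody asks, and two resources prodded at the same time keep the same "twice as important as" gap forever. The half-life constant and all the names are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct ResourcePriority {
    float  log2Priority = -100.0f;  // log2(importance); starts effectively at zero importance
    double stampSeconds = 0.0;      // time of the last prod
};

// Assumed tuning value: importance halves every this many seconds.
const double kHalfLifeSeconds = 2.0;

// Decayed log2 priority at time `now`. Nothing is updated per frame -
// the decay is derived from the timestamp on demand.
float CurrentLog2Priority(const ResourcePriority& r, double now)
{
    assert(now >= r.stampSeconds);
    return r.log2Priority - (float)((now - r.stampSeconds) / kHalfLifeSeconds);
}

// A prod raises the priority to at least `importance` and restamps it.
void Prod(ResourcePriority& r, float importance, double now)
{
    assert(importance > 0.0f);
    r.log2Priority = std::max(CurrentLog2Priority(r, now), std::log2(importance));
    r.stampSeconds = now;
}
```

Because the stored value only changes on a prod, the priority queue only needs touching when something is prodded, which fits with the stochastic-update item in the list.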

_Setting priorities_

Scale by distance
Walk through visible portals
Walk through closed but unlocked doors
Walk through locked doors with lower priority
Flood-fill X portals after visible frustum (corridor problem)
Breaking objects (suddenly need new textures, etc)

_Cinematics_

Record & play back 5 seconds ahead
At decision points, have scripts add "future cameras" that are treated like real cameras with slightly lower priority.

_Textures_

Just-too-late
Tinytextures
Console mipmap streaming
PC mipmap streaming

_Sound_

No traditional LOD like meshes & textures.
Spot effects - JTL - if not there, not played.
Barks - JTL - if not there, played when loaded, unless later than 1 second.
Longer music & speech - traditional buffered streaming with max-out priority bump at half-empty

_Animations_

In Blade II, these didn't take that much memory, so were not a sophisticated system.
Didn't have any LODs in our engine, but some libraries (e.g. Granny) do, so use them like texture LODs.

_Meshes_

PS2 did not have VIPM (too complex for available engineering time), so were either present or not.
Xbox1 used VIPM LOD with bumpmapping.
Almost every mesh in game had VIPM chain - most were automatically generated.
For simplicity, VIPM resources only had three LODs. Example for a character was min=150 tris, normal = 5k tris (same as PS2 version), max = 50k.
No equivalent of "tinytexture" that is always resident - meshes can't go much below 150 tris, and that's still too big (~6kbytes each)
Actual Xbox rendering was continuous LOD. No popping, finer granularity saved vertex throughput as well as for streaming.
Meshes cause texture prods - they know what mipmap levels they need. Textures inherit modified priority of mesh.

_Game objects_

Prod meshes with priorities.
Prod animations with sounds.
Priority bump for player, enemy, important objects etc. Scenery and so on considered at normal priority.
Artificial prod for alternative meshes (e.g. broken versions, "used" versions) when player nearby.
Prods for equippable objects, e.g. weapons, so when you change, you don't have empty hands for a second.
Prods for animations.

_Collision objects & AIs_

Collision objects have high priority. Range-bubble dependence from player - no portals.
AIs also range-dependence. If they find a missing collision object, they freeze & request it with even higher priority.
Certain important AIs have their own range bubbles.
Scripts can raise priorities, give certain AIs larger range bubbles, etc
Keep raising priorities until there are no AI freezes. This data is usually small, and over-loading is usually not a problem.

_Refinements_

Better PQ implementation. Maybe use buckets & only finely-sort the top bucket?
Maybe try a hardware-like L1/L2 cache? L1 = resident, L2 = waiting to be resident.
Separating LOD from priority. Ideal LOD = normal priority, next one down = higher priority, next one up = low priority.
Maybe always have two "objects" in LODdable objects - the one that is loaded, and the next higher one.
AI LODs - use coarser collision geometry, or have pre-authored movement tracks that they revert to (e.g. street pedestrians)
Try low-rez sounds - maybe far-off sounds are fine to play back at higher compression levels? Drop to 22kHz, 11kHz, or more compressed MP3/Ogg levels?
Have 1st second of sound in memory, as soon as it starts playing, stream the rest? Many sounds <1second long - doesn't help them.
For sounds with multiple variants (e.g. footsteps, gunfire, etc) only have a few permanently loaded, and when they start playing, load the rest.
Same for animations, e.g. 10 different walk anims. Or use generic walk anim until specific depressed-wounded-troll-with-limp anim has been loaded.
While I'm having a rant - strippers. STOP IT. Stop writing academic papers about generating the ultimate strips. It's all totally pointless - pretty much every bit of hardware has indexed primitives (except the PS2, which has a bunch of completely bonkers rules that are ''nothing'' like standard strips anyway). The ultimate stripper will get you one vertex per triangle. But even a very quick and dirty indexer will get you that, and a good indexer (e.g. the one I put in [[Granny3D]]) will get close to 0.65 vertices per triangle for most meshes with a 16-entry FIFO vertex cache. The theoretical limit for a regular triangulated plane with an infinitely large vertex cache is 0.5 verts/tri (think of a regular 2D grid - there's twice as many triangles as vertices), so that's not too shabby.
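If you want to measure that vertices-per-triangle figure for your own index buffers, something like the sketch below will do it. It assumes a plain FIFO model of the post-transform cache (real hardware varies), and {{{VerticesPerTriangle}}} is just an illustrative name, not a Granny function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Average number of vertices that must be transformed per triangle for an
// indexed triangle list, given a FIFO vertex cache of `cacheSize` entries.
float VerticesPerTriangle(const std::vector<std::uint32_t>& indices, std::size_t cacheSize)
{
    assert(cacheSize > 0 && !indices.empty() && indices.size() % 3 == 0);
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end())
            continue;          // hit - a FIFO does not reorder on a hit
        ++misses;              // miss - this vertex has to be transformed
        fifo.push_back(index);
        if (fifo.size() > cacheSize)
            fifo.pop_front();  // evict the oldest entry
    }
    return (float)misses / (float)(indices.size() / 3);
}
```

Two triangles sharing an edge ({{{0,1,2, 2,1,3}}}) score 2.0 with a 16-entry cache, and a fully unshared list scores 3.0, so the 0.65 quoted above is a long way down from either.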

Now, it is true that once you have chosen your triangle ordering for optimal vertex-cache efficiency, you may then want to express them as indexed triangle strips. But the key is in the first word - //indexed//. You've gone to all that trouble to order your tris well - don't change that order! Given that order, you can still rotate the triangles to express things as short strips, especially when you have a strip-restart index (e.g. the newer consoles and DX10), and that is pretty efficient. Since vertices are so much larger than indices, and re-transforming them if they drop out of the cache is so expensive, it's worth spending a few extra indices to keep the vertex cache toasty warm. But generating those optimal strips is pathetically simple - precisely because you can't change the ordering of the triangles.

All those academics still writing papers on this - go find something more useful to spend your time on. We've had indexed primitive hardware for <$50 for almost a decade now. Welcome to the 21st century.
Now only secret, not //super// secret - it's [[Larrabee]].
The other day [[I asked|https://twitter.com/tom_forsyth/status/570841117727420416]]:
>Quick straw poll - UV(0,0) at top-left of texture bitmap or bottom-left? (and yes I know some ~APIs are switchable - choose!)

Direct reply results:
33 for top-left
12 for bottom-left

Favourites/retweets of direct replies (I'm far too lazy to remove duplicates, so take with a pinch of salt): 
11 for top-left
5 for bottom-left

So there you have it - a victory for top-left, and even if I disagreed and got out the Veto Pen it would still be motion carried. So say we all.

A few of the interesting supporting/dissenting/hatstand comments:

Oliver Franzke @p1xelcoder
>Definitely top-left. It makes image debugging in Photoshop much easier.

Richard Mitton ‏@grumpygiant
>Top-left of course, everything ~OpenGL does is backwards and wrong :)

David Farrell ‏@Nosferalatu
>Top left. I still imagine there's an electron gun that begins each frame at the top left corner.

Jon Watte ‏@jwatte
>Some people think ~Direct3D and ~OpenGL differ on this, which is nonsense.

Joel de Vahl ‏@joeldevahl
>I think we should learn from economics and swap x and y.

Philip Rideout ‏@prideout
>bottom-left, because Descartes was French, and the French know their shit.

Tim Sweeney @~TimSweeneyEpic
>Top left if user language is English, top right if Chinese, Arabic, or Hebrew.

Kevin Francis ‏@pinbender
>Top-left, because if it were consistent, it would be easy, and then nobody would pay us to do this stuff.

And there were also a couple of votes for "center", i.e. they're the same thing as NDC, which is actually a pretty reasonable answer to be honest. Though it still ducks the question of whether positive V is up or down, just like NDC, and people have just the same arguments about that.

There's also the question of what you even mean by "top" of the image - is it the first byte or the last? That's relatively easy - there are many file formats that allow you to switch (BMP and TGA go bottom-up by default, but can be switched), but the file formats that do not allow you to switch (e.g. PNG) are top-down only - the first byte is at the top-left of the image when displayed on-screen.
When streaming textures from disk (see entries tagged as "Streaming"), there's no reason you must have the same format on disk as you do in memory. You're limited by the speed of the drive, so you can do some decompression there. This reduces disk footprint, which is nice, but it also reduces the distance the head has to seek between resources (because they take less disk space), which is very good. Obvious things to do are Zip-style lossless compression, but ~DXTn doesn't compress that well. So you'd want some slightly lossy compression (because hey - ~DXTn is pretty lossy already - what's a bit extra?), but people are having a tough time finding ways to start with a ~DXTn image and add further lossiness - ~DXTn's sort of compression is really unfriendly to that sort of stuff.

So the other way to do it is to start with the original image and use proper image-compression methods like JPEG, PNG, Bink, etc. Those get great compression ratios. The problem is that when you load them off disk, they decompress to something like 888, which chews video memory and can slow down rendering because it takes valuable bandwidth. We really want to compress to something smaller. It would also be very useful to be able to do procedural operations either on the CPU or GPU producing 888 data, and then compress the result.

But good ~DXTn compression is notoriously slow - the problem is searching for the two endpoint colours and the interpolation through 3D colour space - it becomes a really big searching problem. There have been some interesting techniques - [["Real-Time YCoCg-DXT Compression" by J.M.P. van Waveren and Ignacio Castaño|http://developer.nvidia.com/object/real-time-ycocg-dxt-compression.html]] is a good round up, and also a nice exploration of some alternative encodings (store the Y channel in the alpha of ~DXT5, which de-correlates it from the ~CoCg part which increases quality and speeds up searches).

It's a neat trick, but I went the other way. ~DXTn-style compression of 2D and 3D colour spaces is slow because the search space is large. But compressing 1D data has a far smaller search space and is much quicker. We can either use ~DXT1 with greyscale data, or in ~DX10 there is a ~BC4 format that stores single-channel data - it's basically just the alpha-channel of ~DXT5, without the colour data, so it's also 4 bits per texel.

The core of the idea is to take the image, shrink it by 4x in each direction, quantise it, and store it in a 565 format texture. Then scale that back up with bilinear filtering, and find the luminance of both it and the original image (i.e. greyscale them). Divide one luminance by the other - now you have a bunch of values that correct the luminance of the low-rez 565 texture. Find a texture-wide scale&bias to let you store these luminances well - most of them are around 1.0, and we're only interested in the deviations from 1.0. Then store those as a full-size ~DXT1 or ~BC4.

The decompression shader code is very simple - sample both textures with the same U,V coords, scale&bias the luminance correction with globals fed in through pixel shader constants, and multiply with the result from the 565 channel.
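To make the data flow concrete, here is a CPU-side sketch of the correction-channel encode and the per-texel decode, skipping the actual 565 quantisation and ~DXT1/~BC4 packing. The Rec. 709 luminance weights, the divide-by-zero guard and all the names are my assumptions for illustration - the scheme itself only needs //some// greyscale conversion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Rgb { float r, g, b; };

// Assumed greyscale weights (Rec. 709).
float Luminance(const Rgb& c) { return 0.2126f * c.r + 0.7152f * c.g + 0.0722f * c.b; }

struct LumCorrection {
    std::vector<float> stored;  // per-texel values in 0..1, ready to compress
    float scale, bias;          // texture-wide, fed to the shader as constants
};

// `original` is the source image, `upsampled` is the low-res colour texture
// after bilinear magnification back to the same size.
LumCorrection EncodeCorrection(const std::vector<Rgb>& original,
                               const std::vector<Rgb>& upsampled)
{
    assert(!original.empty() && original.size() == upsampled.size());
    std::vector<float> ratio(original.size());
    for (std::size_t i = 0; i < original.size(); ++i) {
        // Guard against dividing by zero on black texels.
        const float denom = std::max(Luminance(upsampled[i]), 1.0f / 255.0f);
        ratio[i] = Luminance(original[i]) / denom;
    }
    const auto mm = std::minmax_element(ratio.begin(), ratio.end());
    LumCorrection out;
    out.bias  = *mm.first;
    out.scale = std::max(*mm.second - *mm.first, 1e-6f);
    out.stored.resize(ratio.size());
    for (std::size_t i = 0; i < ratio.size(); ++i)
        out.stored[i] = (ratio[i] - out.bias) / out.scale;
    return out;
}

// CPU equivalent of the decompression shader for one texel.
Rgb DecodeTexel(const Rgb& lowRes, float stored, float scale, float bias)
{
    const float k = stored * scale + bias;
    return { lowRes.r * k, lowRes.g * k, lowRes.b * k };
}
```

A simple min/max scale&bias like this is the crudest choice; since most ratios cluster around 1.0, something that clips outliers would probably spend the 4 bits better.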

The quality compared to standard ~DXT1 is pretty good. It takes 5 bits per texel (4 for the luminance correction channel, and 1 for the 565 texture, because it's shrunk by 16x). This is slightly larger than ~DXT1, but the visual quality seems to be very close, though I haven't done any PSNR studies. The average quality seems comparable, and the main difference is in highly-coloured areas. Here, ~DXT1 gives its distinctive block-edge errors, which are pretty objectionable. Using this technique instead blurs the colours out - essentially, you get the raw upsampled 565, with the luminance correction not doing all that much. While not significantly more correct than ~DXT1, it does not have any visible blocking, and so attracts the eye far less.

I have also tried only shrinking the 565 by 2x in each direction. This obviously gives better fidelity, but increases the cost from 5 bits/texel to 8 bits/texel.

In the future I'd like to try converting to a luminance+chrominance space (e.g. ~YCbCr or ~YCoCg) and then just storing the luminance raw in the ~DXT1, and the chrominance in one half of a 4444 texture (i.e. either the RG or the BA channels). That way you can share the 4444 between two different textures and amortise the cost. It brings the 4x4 downsample to 4.5 bits per texel, or the 2x2 downsample to 6 bits, both of which are pretty acceptable. I'm a little curious about the quantisation there - is 4 bits enough for chrominance? You don't see quantisation problems on the current 565 texture because the luminance scaling corrects for it, but this would be a somewhat different colourspace.
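The bits-per-texel sums in the last few paragraphs all follow the same pattern, so here they are as a tiny compile-time check. {{{BitsPerTexel}}} is an illustrative helper, nothing more.

```cpp
// Bits per full-resolution texel: a full-size correction channel plus a
// colour texture shrunk by `shrink` in each direction.
constexpr float BitsPerTexel(float correctionBits, float colourBits, int shrink)
{
    return correctionBits + colourBits / (float)(shrink * shrink);
}

// 4-bit correction + 16-bit 565 colour
static_assert(BitsPerTexel(4.0f, 16.0f, 4) == 5.0f, "4x4 shrink of 565");
static_assert(BitsPerTexel(4.0f, 16.0f, 2) == 8.0f, "2x2 shrink of 565");

// 4-bit luminance + half of a shared 4444 texture (8 bits of chrominance)
static_assert(BitsPerTexel(4.0f, 8.0f, 4) == 4.5f, "4x4 shrink of shared 4444");
static_assert(BitsPerTexel(4.0f, 8.0f, 2) == 6.0f, "2x2 shrink of shared 4444");
```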

I'd also like to do something smarter with the ~DXT1 texture. Currently, I just store the luminance as RGB like you'd expect. But ~DXT1 has 565 end points, so you don't get true greys. Also, you could do some clever maths with dot-products to get higher precision. However, the thing to remember is that we do a scale&bias on this value - most of the errors are very close to 1.0. So in fact I suspect any extra precision and non-greyscale effects are insignificant in practice.

This also makes me wish we had some texture formats that were less than 4 bits per texel, even at the cost of accuracy. Those would be ideal for storing the luminance correction term. I guess you could use 332 to store three different luminance channels (2.7 bits per texel), but the logistics of sharing textures gets tricky.

You could also pack three luminance channels into a ~DXT1 - the problem is then you have to compress 3D data into ~DXT1, and avoiding that was the whole point of this exercise! However, if you didn't mind the higher compression cost, it might work quite well. The thing to realise is that most 4x4 blocks of luminance correction have very little data - most of the time, the upsampled 565 is very close to correct. It's only a minority of blocks that have correction data. So if you choose your three textures well (an exercise left to the reader :-), you can end up with very few blocks that have data in more than one of the 3 channels, and so still get good quality. Cost would be very small - it's 4 bits per texel ~DXT1, shared between 3 textures, and then 1 bit per texel for 4x4 shrink 565, for a total of about 2.3 bits per texel (4/3 + 1)!
Re-reading the [[Visibility and priority tests for streaming assets]] entry I realise I didn't actually go into why the SVT/Megatexture stuff wasn't all you need for good streaming. It's sort-of obvious, but it's worth spelling out for clarity.

If you look through the list of the problems I faced and the ways I fixed them, many of them illustrate why the SVT/Rage systems are fine, but aren't sufficient. They tell you what is visible right now, but as-is they give you no advance warning, and drive heads take significant time to seek. You could add some heuristics to prefetch adjacent 64x64 blocks, but it seems both simpler and more flexible to use game-specific knowledge to do that prefetching rather than trying to reverse-engineer from the visibility data.

In practice what id actually do is to have a two-level system. First they use methods just like the ones I talked about to stream texture data off disk into a large cache in memory. The difference is that this data is not directly usable by the graphics card, it's in a ~JPEG-like format, and it's also multi-channel (they have the diffuse, specular and normal maps all squashed together in a single blob). I seem to recall also that the chunks they load off disk are larger, e.g. if the virtual texture system pages on 64x64 granularity, the disk streaming happens on 256x256 blocks (sorry, I forget the actual sizes, and Google isn't helping). This gives the compression more data to work with and increases compression ratios.

Then each frame they use the visibility data back from the card to decide which of these chunks need to be decompressed from the multi-channel DCT blocks to the multiple ~DXTn textures, and on the PC and ~PS3 these are then uploaded to video memory. This takes processor time, but it's relatively low-latency compared to a drive-head seek, so I'm not sure if they use any prefetching heuristics here at all. Using a custom compression format is a really interesting thing to do, and allows you to reduce drive bandwidth and maximise the amount of data you can store in available memory, which means your prefetching can do an even better job of coping with unpredictable actions.

Sean Barrett has an even more extreme version of compression. His presentation was also his demo, and each "slide" is simply a billboard in the world that the camera looks at, and then to do the demo he just pulls the camera back and shows you the rest of the world. The clever thing is the slides are not stored as bitmaps, they're just the actual text with markup. That is, when the SVT system says "I need the 64x64 chunk starting at 2048,192 in mipmap level 5" the system actually goes and renders the text vectors onto the bitmap right there and then, at whatever resolution it's asked for. It's difficult to beat that sort of compression ratio, and you could see it being used in real game examples for newspapers or shop signs or book spines in a library.

So the two systems (disk->sysmem and sysmem->vidmem) are largely orthogonal, and indeed you don't need the cunning visibility feedback or page-table-based virtual texturing methods to get some of the benefits. One thing I always wanted to try adding to the streaming system was this idea of the two-level cache. In Blade II we did do a quick'n'dirty zlib decompression of all data coming from the disk, but that happened the instant the data was loaded. What you'd want to do is only do that decompression when the texture was actually needed for rendering, but otherwise leave it in memory still compressed. ~DXTn will get around a 2:1 compression with zlib, so it's a worthwhile thing to do, and of course it's lossless. Mesh data compresses even better, and there's some very clever schemes for compressing things like index buffers. Another obvious trick is that normal maps are often rendered using ~DXT5 textures with the U channel in the green and the V channel in the alpha, with the red and blue channels unused (though this trick isn't necessary when you have the ~BCn formats available). Clearly the compressed version can discard those extra channels.

And you can go much further and start generating texture on the fly - procedural generation, aging effects, dynamic scratches, bullet holes, etc. I wrote an article in Game Programming Gems 3 called "Unique Textures" about this that I should probably webify one day.
(this is also posted on [[GameDevDaily|https://gamedevdaily.io/]])

Gamma encoding is a way to efficiently use the limited number of bits available in displays and buffers. For most monitors and image formats, we have 8 bits per channel. The naive way to distribute them would be to encode physical intensity (i.e. number-of-photons-per-second) linearly. However, the human eye responds to photons in a more logarithmic way, so a linear distribution gives too many shades in the brighter areas and not enough shades in the darker ones, causing visible banding.

If instead the 256 values are distributed according to a power law - typically somewhere in the range 2.0-2.5 - so that there are more shades assigned to darker areas than bright ones - the perceived intensity changes are more evenly distributed.

As it happens, this power law is closely matched to how the electron gun of a CRT responds to signals. This happy match between the CRT's mechanics and the eye's response made everyone very happy. The hardware was simple, and the distribution of precision matched human perception well. That reasoning has vanished now, but the legacy remains, and as legacies go it's actually a pretty good one.

''Gamma 2.2''

The simplest and most common form of power law suitable for encoding is gamma 2.2. That is, the relationship between photon count and encoding value (if they are both mapped to the 0-1 range) is:

photons = power(encoding, 2.2)
encoding = power(photons, 1/2.2)

It is really easy to get these switched around when doing conversions - I do it all the time. The way to help think about it is - what happens to the encoded value 0.5? We know that we want to give more encoding space to darker shades, so an encoded value of 0.5 should produce significantly fewer photons than half the full amount. And to check using the first equation: power(0.5, 2.2) = 0.22, which is indeed a lot less than half.

Going the other way, a photon count of half the maximum appears a lot brighter than half, so we want the encoding of that 0.5 to be high on the scale. And indeed using the second equation: power(0.5,1/2.2) = 0.73.
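As code, the two directions look like this - a minimal sketch with made-up function names, assuming both values are already in the 0-1 range:

```cpp
#include <cassert>
#include <cmath>

// Encoded value (0..1) to relative photon count (0..1).
float Gamma22ToPhotons(float encoding)
{
    assert(encoding >= 0.0f && encoding <= 1.0f);
    return std::pow(encoding, 2.2f);
}

// Relative photon count (0..1) to encoded value (0..1).
// Note the INVERSE power when going from linear to gamma.
float PhotonsToGamma22(float photons)
{
    assert(photons >= 0.0f && photons <= 1.0f);
    return std::pow(photons, 1.0f / 2.2f);
}
```

Feeding 0.5 through each gives roughly 0.22 and 0.73, matching the sanity checks above, which is a handy unit test for "did I get them the wrong way round again?"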

This has the (I think) counter-intuitive result that to go FROM linear TO gamma, you actually use the INVERSE of the "gamma power". So shouldn't it really be called "gamma 0.45" rather than "gamma 2.2"? Well, whatever, the convention is established, and we need to deal with it.

2.2 is certainly not the only sensible gamma value. People tune their monitors to all sorts of different gamma curves for various reasons. Apple Macs used 1.8 before 2009, and photographic film can follow all sorts of curves, many of them not simple power equations. However, the world has slowly settled towards something around 2.2 as a reasonable standard, but with a little twist.

''What is sRGB?''

sRGB is a slight tweaking of the simple gamma 2.2 curve. If you plot them both on a graph, they look almost identical. The formal equation for sRGB is:

{{{
float D3DX_FLOAT_to_SRGB(float val)
{ 
    if( val < 0.0031308f )
        val *= 12.92f;
    else
        val = 1.055f * pow(val,1.0f/2.4f) - 0.055f;
    return val;
}
}}}

(this code is taken from the ~DirectX utility file ~D3DX_DXGIFormatConvert.inl which used to be part of the ~DirectX SDK, but is now somewhat in limbo. But it should be in every DX coder's toolbox, so just search for it and download it!)

As you can see, at very low values, sRGB is linear, and at higher values it follows a pow(2.4) curve. However, the overall shape most closely follows a pow(2.2) curve. The reason for the complexity is that a simple power curve has a gradient of zero at the input/output value zero. This is undesirable for a variety of analytical reasons, and so sRGB uses a linear relationship to approach zero.
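Going the other way (sRGB back to linear) is the same two-part curve inverted. This is a sketch written to mirror the function above rather than a copy of anything in the ~D3DX file, so the name {{{SrgbToLinear}}} is mine. The 0.04045 threshold is just 12.92 * 0.0031308, i.e. where the linear segment ends on the encoded side.

```cpp
#include <cassert>
#include <cmath>

// sRGB-encoded value (0..1) to linear value (0..1).
float SrgbToLinear(float val)
{
    assert(val >= 0.0f && val <= 1.0f);
    if (val <= 0.04045f)
        return val / 12.92f;                          // linear segment near zero
    return std::pow((val + 0.055f) / 1.055f, 2.4f);   // power-curve segment
}
```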

(the broader sRGB standard also has a bunch of gamut and colour-transform specifications, but we'll ignore those and just focus on the gamma-curve part for now, since that is what concerns us for graphics rendering)

''How different is sRGB from gamma 2.2''

It is very tempting to ignore the two-part nature of sRGB and just use pow(2.2) as an approximation. When drawn on a graph, they really are very very close - here you can just about see the dotted red line of gamma 2.2 peeping out from behind the solid blue sRGB line:

[img[gamma22_vs_sRGB.png]]

However, when actually used to display images, the differences are apparent, especially around the darker regions, and your artists will not be happy if you display one format as the other! Here is an image of several colour ramps from a Shadertoy (https://www.shadertoy.com/view/lsd3zN) used to illustrate the precision of linear, sRGB, and gamma 2.2. Note this is simulating 6 bits per channel to highlight the precision differences. The three bars for each colour ramp are linear on the left, sRGB in the middle, and gamma 2.2 on the right. As you can see, although gamma 2.2 and sRGB are similar, they are certainly not the same.

[img[colour ramps.png]]

Don't skimp here - do it right. ~GPUs are impressively fast at maths these days, and the speed difference between proper sRGB and pow(2.2) is trivial in most cases. In most cases you will actually be using the built-in hardware support, making proper sRGB even cheaper to use.

''It's not a gamma curve, it's a compression format''

The important thing to remember when adopting sRGB, or indeed any gamma representation, is that it is not a sensible place to do maths in. You can't add sRGB numbers together, or blend them, or multiply them. Before you do any of that, you have to convert it into a linear space, and only then can you do sensible maths.

If it helps, you can think of sRGB as being an opaque compression format. You wouldn't try to add two ZIP files together, and you wouldn't try to multiply a ~CRC32 result by 2 and expect to get something useful, so don't do it with sRGB! The fact that you can get something kinda reasonable out is a red herring, and will lead you down the path of pain and deep deep bugs. Before doing any maths, you have to "decompress" from sRGB to linear, do the maths, and then "recompress" back.
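
As a rough illustration of "decompress, do the maths, recompress", here is a Python sketch that blends two sRGB values both ways. The helper names are hypothetical, and a real pipeline would let the hardware do the conversions rather than doing them by hand like this.

```python
def srgb_to_linear(s: float) -> float:
    """Standard sRGB decode ("decompress") for a value in [0, 1]."""
    if s <= 0.04045:
        return s / 12.92
    return ((s + 0.055) / 1.055) ** 2.4


def linear_to_srgb(l: float) -> float:
    """Standard sRGB encode ("recompress") for a value in [0, 1]."""
    if l <= 0.0031308:
        return l * 12.92
    return 1.055 * l ** (1.0 / 2.4) - 0.055


def blend_wrong(a_srgb: float, b_srgb: float, t: float) -> float:
    """Lerp the raw sRGB numbers directly. Looks plausible, but is not correct."""
    return a_srgb + (b_srgb - a_srgb) * t


def blend_right(a_srgb: float, b_srgb: float, t: float) -> float:
    """Decode to linear, lerp there, then encode the result back to sRGB."""
    a = srgb_to_linear(a_srgb)
    b = srgb_to_linear(b_srgb)
    return linear_to_srgb(a + (b - a) * t)


black, white = 0.0, 1.0
print("naive sRGB blend:   ", blend_wrong(black, white, 0.5))
print("linear-space blend: ", blend_right(black, white, 0.5))
```

A 50/50 mix of black and white comes out at 0.5 the naive way, but at about 0.735 when done in linear space. The naive answer is the "kinda reasonable" result that hides the bug.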

It is also important that you do not think of sRGB data as being "in gamma-space". That implies the shader has done some sort of gamma transform when using the data - it has not! The shader writes linear-space values, and the hardware "compresses" to sRGB. Later, when reading, the hardware will "decompress" from sRGB and present the shader with linear-space values. The fact that the sRGB "compression format" resembles the gamma 2.2 curve is not relevant when thinking about what the data means, only when thinking about its relative precision. For those of you familiar with the binary format of floating-point numbers, we do not think of them as being "gamma 2.0" - they are still just linear values, we just bear in mind that their precision characteristics are similar to those of a gamma 2.0 curve.

''Hardware support for sRGB on the desktop''

All remotely modern desktop ~GPUs have comprehensive and fast support for sRGB. It has been mandatory since ~DirectX10 in 2008, so there's been plenty of time for them to get this right.

When you bind an sRGB texture and sample from it, all the pixels are converted from sRGB to linear, and THEN filtering happens. This is the right way to do things! Filtering is a mathematical operation, so you can't do it to the raw sRGB values and expect to get the right result. Hardware does this correctly and very quickly these days.

Similarly, when rendering to an sRGB surface, the shader outputs standard linear data. If alpha-blending is enabled, the hardware reads the sRGB destination data, converts it to linear values, performs the alpha-blending in linear space, and then converts the result into sRGB data and writes it out.

To say it again for emphasis, at no time do we deal with "gamma-space" data - we can continue thinking of sRGB as an opaque compression format that the hardware deals with for us on both read and write.

''Gamma is not a new thing, in fact it's the old thing!''

It is important to note that gamma-space images are not a new thing. Until about 1990, all images were gamma space. Monitors were (and still are) gamma-space display devices, and so all images were implicitly gamma, and all image operations were as well. Even today all common image formats (GIF, JPEG, PNG, etc) are assumed to be in gamma space. This is typically 2.2, though some formats are so old they can't even specify it - the gamma curve is just assumed to be whatever your monitor is!

Academics certainly did know this was "wrong", but there were no available clock cycles or spare transistors to do it right, and there were plenty of other image problems to deal with first, like basic alpha blending and texture filtering. And since the input textures were in gamma space, and the output data was displayed on a gamma-space monitor, it all kinda matched up. The obvious problem came along with lighting. People tended to calculate lighting in linear space (which is fine), but then they would multiply gamma-space texture data with linear-space lighting data, and then show the result on a gamma-space monitor. This multiply is technically wrong - but until about 2002 nobody really cared enough to worry about it.

Once our rendering pipelines hit high enough quality, and hardware had enough gates to spare, gamma correction and sRGB were added to the hardware. However, it's important to understand that all along, these images had been sRGB (or at least in gamma space) - we had just never had the hardware support available, and so we ignored it. But the concept of gamma-space buffers has been there all along. In fact the NEW thing invented by the High Dynamic Range pipeline is the concept of buffers in linear space! This means that if you're starting to port an existing pipeline to HDR and you're trying to decide whether a given buffer should be declared as sRGB or linear, the answer in 90% of the cases is sRGB.

Technically, this also includes the final framebuffer. Given the choice between a linear format and an sRGB format, sRGB is a much much closer match to the response of a real monitor than linear. However, even better is if the app performs an explicit tone-mapping step and matches the gamma response of the monitor more precisely (since as mentioned, sRGB and gamma 2.2 are slightly different). As a result, and for backward-compatibility reasons, ~OSes expect apps to provide them with a buffer that is declared as a linear format buffer, but which will contain gamma-space data. This special case is extremely counter-intuitive to many people, and causes much confusion - they think that because this is a "linear" buffer format, that it does not contain gamma-space data.

I have not tried to create sRGB or higher-precision framebuffers to see what the OS does with them. One option is that it still assumes they are in the same gamma space as the monitor. Another would be to assume they really are in linear space (especially for floating-point data) and apply a gamma-2.2 ramp (or thereabouts) during processing. It would be an interesting experiment to do, just be aware that the results are not obvious.

''Oculus Rift support''

The Oculus Rift (desktop) support for sRGB buffers is somewhat different to the standard OS support. Unlike fullscreen display on a monitor, the application does not supply a framebuffer that is directly applied to the Head Mounted Display. Instead, the app supplies buffers that then have distortion, chromatic aberration, and a bunch of other post-processing operations applied to them by the SDK and Oculus' Display Service before they can be shown on the HMD. Oculus have been very careful with quality and calibration, and the Display Service knows a great deal about the HMD's characteristics, so we can apply a very precise gamma ramp as part of the distortion process.

Because the buffers supplied by the app must be filtered and blended as part of the distortion and chromatic aberration processing, and filtering is maths, it is important that the data in the buffers is in linear space, not gamma space, because you can only do maths correctly in linear space. Starting with version 0.7, the Oculus SDK expects the application to supply data in linear space, in whatever format the buffer is declared as. The Display Service will apply the required filtering and distortion, and then apply a precise gamma curve to match the actual response of the HMD. Because we know the precise characteristics of the display, we also play various tricks to extend its precision a little. We support most standard image formats, including float16, as input buffers - so don't be afraid to feed us your HDR data. 

This means the application does not need to know the exact gamma response of the HMD, which is excellent. But it does mean some applications will get a surprise. They are used to the OS accepting a buffer that is declared as a "linear" buffer, but then silently interpreting that data as gamma-space (whatever gamma the monitor is set to). But now, if the buffer is declared as a linear buffer, the Display Service takes the application literally and interprets the data as if it was indeed in linear space, producing an image that is typically far brighter than expected.

For an application that is fully gamma-aware and applying its own gamma curve as a post-process, the solution is simple - don't do that! Leave the data in linear space in the shader, do not apply a gamma ramp (tone-mapping and brightness correction are fine, just remove the linear-to-gamma part), write the data to an sRGB or HDR surface to preserve low-end precision, and the Display Service will do the rest. Again, it is important to remember that if using an sRGB surface, it is not "in gamma space" - it is still a surface with data in linear-space, it is just that its representation in memory looks very similar to a gamma curve.

For an application that is not gamma-aware, and that has been happily (if naively) using gamma data throughout its entire shader pipeline forever, the solution is only a tiny bit more complex. Here, the application uses an sRGB buffer, but creates a rendertarget view that pretends the buffer is linear, and uses that to write to it. This allows the shaders to continue to produce gamma-space data the way they always have. Because the rendertarget view is set to linear, no remapping is performed while writing to the surface, and gamma-space data goes directly to memory. But when the Display Service reads the data, it interprets it as sRGB data. Because sRGB is very close to a gamma 2.2 representation, this works out almost perfectly - the Display Service reads the gamma-space data, the hardware converts it to linear-space, and then filtering and blending happens as it should. This process is also explained in the Oculus SDK docs. It is a hack, but it works well, and is much simpler than converting the application to use a fully gamma-aware pipeline from start to end.
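
To get a feel for how close "almost perfectly" is, here is a Python sketch of what the hack does to a value: the shader encodes with a pure gamma 2.2 curve, and the reader decodes as sRGB. This assumes the app's gamma really is a plain pow(1/2.2), which is my simplification, and the function names are invented for the example.

```python
def srgb_to_linear(s: float) -> float:
    """Standard sRGB decode, as the hardware does on read."""
    if s <= 0.04045:
        return s / 12.92
    return ((s + 0.055) / 1.055) ** 2.4


def gamma22_encode(l: float) -> float:
    """What a gamma-unaware shader is assumed to be writing (pure power law)."""
    return l ** (1.0 / 2.2)


def hack_round_trip(intended_linear: float) -> float:
    """Linear value recovered when gamma 2.2 data is read back as if it were sRGB."""
    stored = gamma22_encode(intended_linear)   # written through the "linear" view
    return srgb_to_linear(stored)              # read back through the sRGB view


for intended in (0.001, 0.01, 0.1, 0.5, 1.0):
    got = hack_round_trip(intended)
    print(f"intended {intended:.4f} -> recovered {got:.4f}")
```

Mid-tones and highlights come back within a percent or two of what was intended. The very darkest values come back noticeably brighter, which is the same sRGB-versus-2.2 mismatch near black discussed earlier, so it is worth eyeballing your dark scenes if you take this route.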

''Alternatives to sRGB''

sRGB is a great format, but it's limited to 8 bits per channel. What if you want a little more precision? This subject could occupy a blog post all by itself (and might well do in the future!), but here's a brief list of interesting alternatives:

* float16. Simple, robust, and excellent API and platform support. Sadly it's significantly larger and so can cause performance problems.

* 10:10:10:2 linear. Looks promising, however note that it has less precision in the dark areas than sRGB! In fact because at the low end sRGB is a line with slope 1/12.92, it effectively has about 3-and-a-half-bits of extra precision, making it almost as good as 12-bit linear!

* 10:10:10:2 gamma 2.0. This uses the standard 10:10:10:2 format, but you apply the gamma 2.0 manually in the shader - squaring the data after reading, and taking the square root before writing it. This gives significantly higher precision than 8-bit sRGB throughout the range. However, because the square and square-root are done in the shader, texture filtering and alpha-blending will not work correctly. This may still be acceptable in some use cases.

* float11:float11:float10. Similar to float16 but smaller and less accurate. Note that unlike float16 it cannot store negative numbers. I have not used this myself, but it looks useful for the same sort of data as sRGB stores.

* Luma/chroma buffers. This is a really advanced topic with lots of variants. One example is transforming the data to ~YCgCo and storing Y at higher precision than Cg and Co. Again, filtering and alpha blending may not work correctly without some effort.
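
The precision claims for the 10:10:10:2 options above can be checked with a few lines of arithmetic. This Python sketch measures the size of one code step in linear terms, both just above black and just below white. The variable names are mine, and it assumes unsigned-normalised channels (8-bit as 0 to 255, 10-bit as 0 to 1023).

```python
import math


def srgb_to_linear(s: float) -> float:
    """Standard sRGB decode for a value in [0, 1]."""
    if s <= 0.04045:
        return s / 12.92
    return ((s + 0.055) / 1.055) ** 2.4


# Size of the first step above black, measured in linear space.
srgb8_dark_step = srgb_to_linear(1 / 255)      # 8-bit sRGB
linear10_dark_step = 1 / 1023                  # 10-bit linear
linear12_dark_step = 1 / 4095                  # 12-bit linear, for comparison
gamma2_10_dark_step = (1 / 1023) ** 2          # 10-bit with manual gamma 2.0

# How many bits of linear storage would give the same dark-end step as 8-bit sRGB.
srgb8_effective_bits = math.log2(1 / srgb8_dark_step)

# Size of the last step below white, measured in linear space.
srgb8_bright_step = 1.0 - srgb_to_linear(254 / 255)
gamma2_10_bright_step = 1.0 - (1022 / 1023) ** 2

print(f"dark step:   sRGB8 {srgb8_dark_step:.2e}, linear10 {linear10_dark_step:.2e}, "
      f"linear12 {linear12_dark_step:.2e}, gamma2.0/10 {gamma2_10_dark_step:.2e}")
print(f"sRGB8 is worth about {srgb8_effective_bits:.2f} bits of linear near black")
print(f"bright step: sRGB8 {srgb8_bright_step:.2e}, gamma2.0/10 {gamma2_10_bright_step:.2e}")
```

Near black, 8-bit sRGB has a finer step than 10-bit linear and is worth a little under 12 bits of linear precision, while 10-bit gamma 2.0 has finer steps than 8-bit sRGB at both ends of the range.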

''Cross-platform support''

On the desktop it's pretty simple. Although sRGB support was introduced in previous versions, ~DirectX10 finally required full, correct, fast support. As a result all useful graphics cards shipped since 2008 (in both PC and Mac) have included excellent sRGB support and for all real-world purposes you can regard support as being free.

I don't know the details of Nintendo's hardware, but ~PS4 and Xbox One have the same high-quality sRGB support as their PC counterparts. ~PS3 does sRGB conversion before alpha-blending, so the blending is done in gamma space, which is not quite right. And ~Xbox360 was notorious for having a really terrible approximation of sRGB - while better than linear, it was a totally different-shaped response to actual sRGB, and required special authoring steps to avoid significant artefacts.

On mobile, John Carmack informs me that "Adreno and Mali are both very high quality. There is a small perf penalty on the Adreno Note 4, none on newer ones." Unlike ~DirectX10, mobile parts are permitted to perform sRGB conversion after texture filtering, rather than on each raw texel before filtering. This is much cheaper in hardware, but can cause slightly incorrect results. It is unclear which parts, if any, actually take this shortcut, but it's something to watch out for. Also note that the Oculus mobile SDK handles sRGB slightly differently than on desktop, so check the docs for details.

''Simple rules to follow when making a pipeline sRGB-correct''

* Any image format you can display on a monitor without processing (e.g. in ~MSPaint or a browser) is almost certainly in either gamma 2.2 or sRGB space. This means ~GIFs, ~JPEGs, ~PNGs, ~TGAs, ~BMPs, etc. The same is true of anything that an artist made, coming out of a paint package.

* Any time you are storing "thing that looks like a picture" in 8 bits per channel, use sRGB. It will give you much less banding in the dark areas. But always use the hardware to "decompress" to linear before doing anything maths-like with the data (e.g. in a shader).

* Any time you are storing "thing that is mathy" e.g. lookup tables, normal maps, material ~IDs, roughness maps, just use linear formats. It is rare that sRGB compression will do what you want without deeper thought.

* Lightmaps and shadowmaps are an interesting case. Are they "like a picture" or not? Opinions differ, and some experimentation may be required. Similarly specular maps - sRGB may or may not do what you want depending on context and content. Keep your pipeline flexible, and try it both ways.

* Floating-point formats are conceptually linear, and no conversion is needed for them, you can just use the values in maths immediately. However, the way floats are encoded gives you many of the precision advantages of gamma 2.0 format, so you can safely store "things that look like pictures" in floating point formats and you will get higher precision in darker areas.

''In conclusion''

Hopefully this has given you an overview of the occasionally confusing world of sRGB and gamma-correct rendering. Don't worry - I've been doing it a while and I still get confused at times. Once you know the right lingo, you will be forever correcting others and being a real pedant about it. But the persistence pays off in the end by having robust, controllable colour response curves that don't suddenly do strange things in very bright or dark areas. In VR, when used with a well-calibrated HMD and display software, carefully controlling the colour spaces of your data and pipeline can give the user more convincing and realistic experiences.
Louise Forsyth (nee Rutter) is my //darling// wife (dear, please stop peering over my shoulder). We're coming up to our 12th anniversary together, and just past our 2nd year married. Living in sin was such fun, but it's all over now. Conventional at last. She's meant to be a vet, but the US authoritays don't realise that UK cats and dogs are uncannily similar to US cats and dogs and want her to pass some more exams. When my mum told me that a Cambridge degree was the key that unlocked the world, I believed her. They fuck you up, your mum and dad :-)

She has a [[hiking blog|http://eelpi.livejournal.com/]] which has some gorgeous pictures of the surrounding countryside. Volcanos rock. Quite literally.
That's me - Tom Forsyth. I'm now pretty much permanently TomF though - even when I write emails to people like my parents or my wife. It started at MuckyFoot where I was the second Tom to join, and Tom Ireland had already nabbed tom@muckyfoot.com, so I had to make do with tomf@muckyfoot.com. However, Tom Ireland would get a constant stream of mail meant for me, so I started to make a point of always signing myself TomF in the hope that people would remember to add the "F". It only partly worked, but the habit's kinda stuck now. Of course then I worked with Tom Fletcher at Intel, so that put a spanner in the works. But there's no other TomF's at Oculus, so we're back to normal.

The main URL for my site is http://eelpi.gotdns.org/  When you go there, it redirects to whatever ISP I am paying money to this year, so don't use whatever address you see in your browser bar - stick to this one, coz it will move around as I do.

More:
* EmailMe.
* TweetMe.
* My surname has NoE.
* Where I work: [[OculusVR|http://www.oculusvr.com/]].
* What I work on: [[Virtual reality hardware and software|http://www.oculusvr.com/blog/team-fortress-2-in-the-oculus-rift/]]

Disambiguation. These are me:
* [[@tom_forsyth|http://twitter.com/tom_forsyth]]
* The Tom Forsyth who is on various game-related programming mailing lists.
* The Tom Forsyth who has written and edited programming book articles.
* The Tom Forsyth who has talked at GDC.
* The Tom Forsyth who worked at [[MuckyFoot]], [[RadGameTools]], [[Intel|http://www.intel.com]] and [[Valve|http://www.valvesoftware.com/]]
* The Tom Forsyth who worked on [[Urban Chaos]], [[StarTopia]], [[Blade II]], [[Granny3D]], [[Larrabee]] and [[Team Fortress 2 VR|http://www.oculusvr.com/blog/team-fortress-2-in-the-oculus-rift/]]
* The Tom Forsyth or Andrew T Forsyth on various patents to do with processor tech (through Larrabee).
* The Tom Forsyth who was a Microsoft MVP.

These are NOT me:
* The Tom Forsyth who works/worked for Nokia.
* The [[Tom Forsythe|http://www.tomforsythe.com/]] who takes photos of naked Barbie dolls.
* The [[Thomas Forsyth|http://www.thomasforsyth.co.uk]] who makes furniture.
* The [[Tom Forsyth|http://en.wikipedia.org/wiki/Tom_Forsyth]] who scored the winning goal in the 1973 Scottish Cup Final.
* I am not knowingly related to these or any other famous Forsyth or Forsythe (Frederick, Bruce, John, etc).
AKA TomF. http://eelpi.gotdns.org/
I was talking to someone about simple in-game lighting models the other day and we realised we had totally different terminology for a bunch of lighting tricks that we'd each independently invented. With my instinctive loathing for reinventing the wheel, I wondered why this had happened. And the reason is that some lighting tricks are so simple they don't warrant a proper paper or a book article or a GDC talk, so they never get passed along to others. So I thought I'd write up a light type I've been using for a while that I call a "trilight" that I've not seen anyone else talk about. It's very simple, and I'm sure others have invented it too, but at least now I'm happy I've documented it. It's on my [[papers|http://www.eelpi.gotdns.org/papers/papers.html]] page, and it's got a simple demo to go with it.

By the way, it's really satisfying to write a small demo and a short paper and do it all start to finish in half a day for a change. Normally I tackle big problems like shadowing, which take months to code and lots of explanation. Excellent way to spend a Sunday afternoon - more people should try it.
I am [[@tom_forsyth|https://twitter.com/tom_forsyth]] - tweet me harder.
Supposedly from the very famous "Four Yorkshiremen Sketch" from Monty Python. Reproduced here in its entirety just because it makes for easier cut'n'pasting. TheWife will kill you if you describe her as a Yorkshireman, because Yorkshire is on the ''OTHER SIDE OF THE FUCKING HILL''. She, as any well-brought-up midlander will be able to instantly tell from her authentic Estuary/Fenland accent, is from Lancashire. But that's an amusing anecdote for another time. Meanwhile, pop the stack and on with the primary anecdote:

The Players:
    Michael Palin - First Yorkshireman;
    Graham Chapman - Second Yorkshireman;
    Terry Jones - Third Yorkshireman;
    Eric Idle - Fourth Yorkshireman;

The Scene:
    Four well-dressed men are sitting together at a vacation resort.
    'Farewell to Thee' is played in the background on Hawaiian guitar. 

FIRST YORKSHIREMAN:
    Aye, very passable, that, very passable bit of risotto.
SECOND YORKSHIREMAN:
    Nothing like a good glass of Château de Chasselas, eh, Josiah?
THIRD YORKSHIREMAN:
    You're right there, Obadiah.
FOURTH YORKSHIREMAN:
    Who'd have thought thirty year ago we'd all be sittin' here drinking Château de Chasselas, eh?
FIRST YORKSHIREMAN:
    In them days we was glad to have the price of a cup o' tea.
SECOND YORKSHIREMAN:
    A cup o' cold tea.
FOURTH YORKSHIREMAN:
    Without milk or sugar.
THIRD YORKSHIREMAN:
    Or tea.
FIRST YORKSHIREMAN:
    In a cracked cup, an' all.
FOURTH YORKSHIREMAN:
    Oh, we never had a cup. We used to have to drink out of a rolled up newspaper.
SECOND YORKSHIREMAN:
    The best we could manage was to suck on a piece of damp cloth.
THIRD YORKSHIREMAN:
    But you know, we were happy in those days, though we were poor.
FIRST YORKSHIREMAN:
    Because we were poor. My old Dad used to say to me, "Money doesn't buy you happiness, son".
FOURTH YORKSHIREMAN:
    Aye, 'e was right.
FIRST YORKSHIREMAN:
    Aye, 'e was.
FOURTH YORKSHIREMAN:
    I was happier then and I had nothin'. We used to live in this tiny old house with great big holes in the roof.
SECOND YORKSHIREMAN:
    House! You were lucky to live in a house! We used to live in one room, all twenty-six of us, no furniture, 'alf the floor was missing, and we were all 'uddled together in one corner for fear of falling.
THIRD YORKSHIREMAN:
    Eh, you were lucky to have a room! We used to have to live in t' corridor!
FIRST YORKSHIREMAN:
    Oh, we used to dream of livin' in a corridor! Would ha' been a palace to us. We used to live in an old water tank on a rubbish tip. We got woke up every morning by having a load of rotting fish dumped all over us! House? Huh.
FOURTH YORKSHIREMAN:
    Well, when I say 'house' it was only a hole in the ground covered by a sheet of tarpaulin, but it was a house to us.
SECOND YORKSHIREMAN:
    We were evicted from our 'ole in the ground; we 'ad to go and live in a lake.
THIRD YORKSHIREMAN:
    You were lucky to have a lake! There were a hundred and fifty of us living in t' shoebox in t' middle o' road.
FIRST YORKSHIREMAN:
    Cardboard box?
THIRD YORKSHIREMAN:
    Aye.
FIRST YORKSHIREMAN:
    You were lucky. We lived for three months in a paper bag in a septic tank. We used to have to get up at six in the morning, clean the paper bag, eat a crust of stale bread, go to work down t' mill, fourteen hours a day, week-in week-out, for sixpence a week, and when we got home our Dad would thrash us to sleep wi' his belt.
SECOND YORKSHIREMAN:
    Luxury. We used to have to get out of the lake at six o'clock in the morning, clean the lake, eat a handful of 'ot gravel, work twenty hour day at mill for tuppence a month, come home, and Dad would thrash us to sleep with a broken bottle, if we were lucky!
THIRD YORKSHIREMAN:
    Well, of course, we had it tough. We used to 'ave to get up out of shoebox at twelve o'clock at night and lick road clean wit' tongue. We had two bits of cold gravel, worked twenty-four hours a day at mill for sixpence every four years, and when we got home our Dad would slice us in two wit' bread knife.
FOURTH YORKSHIREMAN:
    Right. I had to get up in the morning at ten o'clock at night half an hour before I went to bed, drink a cup of sulphuric acid, work twenty-nine hours a day down mill, and pay mill owner for permission to come to work, and when we got home, our Dad and our mother would kill us and dance about on our graves singing Hallelujah.
FIRST YORKSHIREMAN:
    And you try and tell the young people of today that ..... they won't believe you.
ALL:
    They won't!


[[Geek culture|http://apple.slashdot.org/comments.pl?sid=138720&cid=11609478]] - fantastic!

The attentive will have noticed that the phrase "uphill both ways" does not in fact appear in this sketch. That's because a fairly cursory bit of research will reveal that it is from a fairly similar but entirely unrelated Bill Cosby sketch. Bill Cosby is not a native Yorkshireman. Not a lot of people know that.
Released on PC, ~PS1, Dreamcast. I joined [[MuckyFoot]] just before the PC version was finished, so I didn't have much to do with the game itself, but it's got a lot of neat stuff in it - I enjoyed playing it. I did the Dreamcast port after the PC version had shipped - just me and Karl Zielinski testing in nine months start to finish. The Dreamcast was a really nice machine to work with - it just did what it was meant to without any fuss - a big contrast to the drama the ~PS1 version was going through. It's a shame the DC died - it was a nice elegant bit of kit.
The [[Utah Teapotahedron|http://www.ics.uci.edu/~arvo/images.html]] is an awesome demo model. It's easily recognisable, so it's easy to tell when you've screwed up a scaling or are clamping N.L the wrong way or managed to get your backface culling wrong. It's got insta-create functions in 3DSMax and D3DX, and the spline data is easily available (see [[Steve Baker's excellent homage page|http://www.sjbaker.org/teapot/]]). It has areas of all types of curvature, it can project shadows onto itself really nicely (handle, spout, lid handle). And it's easy to throw a texture map onto it - apart from the pole on the lid, it's easy to parameterise.

Also, it helps a pet gripe of mine - algorithms that require non-intersecting watertight manifolds. The teapot is not closed and it self-intersects (the handle punches through the side). So it's really useful for pounding on algorithms that don't like either of those. Getting artists to generate watertight manifolds is a horrible job - please could all the researchers out there stop developing algorithms that require it? If your algorithm doesn't even handle the sixth platonic solid, it's not much use to anyone, is it?

Frankly, the only reason [[Lenna|http://www.lenna.org/]] is a better icon for computer graphics is that she's been around longer, and she's naked.
I keep having this stupid conversation with people. It goes a little something like this:

Them: I need to get a new card - one with dual-DVI. Any ideas?
Me: You really need dual DVI?
Them: Yeah, VGA's shit.
Me: VGA's actually pretty good you know - are you running at some sort of crazy res?
Them: Yeah - 1600x1200 - it's a blurry mess.
Me: That's not crazy at all. It's well within VGA's capabilities.
Them: No, analogue is crap for anything like that. You have to go digital. Especially if you're reading text.
Me: Dude, at home I run an Iiyama 21-inch CRT at 1600x1200 at 85Hz on a VGA cable. I can write code all day with black text on white backgrounds. At work I run my second Samsung 213T off a VGA cable as well, and that's the screen I use for email - black on white again. They're both crystal-sharp.
Them: Rubbish. I just tried it myself - it's an unreadable blurry mess at even 60Hz.
Me: Are you by any chance ... using an nVidia graphics card?
Them: Sure, but what's that got to do with it? Have you got some ninja bastard card?
Me: An elderly and perfectly standard Radeon 9700.
Them: I've got a 7800 GT - it should kick the shit out of that.
Me: Yes, at shuffling pixels. But it's got an nVidia RAMDAC. Which is a large steaming pile of poo.

Seriously - what the hell is up with nVidia and their RAMDACs? They've been shit right from day one in the NV1, they were shit when I worked at 3Dlabs and the $50 Permedia2 had an infinitely superior display quality to their top-end GeForce2s, and they've continued that grand tradition right up to current cards. That was acceptable when a RivaTNT cost $50, but now they're asking $1000 for an SLI rig. My boss was trying to get two monitors hooked up to a fancy nVidia card that only had one DVI port on it, and whichever monitor he plugged into the VGA port was ghosting like crazy. Swap the card out for a cheap no-frills Radeon X300 and hey - lovely picture on both.

Now you're going to think I'm an ATI fanboi. And I am, because I like elegant orthogonal hardware. But I'm not saying ATI RAMDACs are great - it's just that they don't suck. Matrox, 3Dlabs and Intel all have decent RAMDACs. Even the S3 and PowerVR zombies have better RAMDACs. Beaten by S3! That's absurd.

For example, a colleague had a high-spec "desktop replacement" laptop with an nVidia chipset of some sort that he could never get to drive his cheap CRT with half-readable text. Naturally he blamed the monitor. He's recently replaced it with a new Dell 700m, and it drives the CRT wonderfully. This is a $5 Intel graphics chip in a laptop! It's totally worthless as a 3D card, but even it does orders of magnitude better than the nVidia cards at running a CRT.

The one time nVidia cards have decent RAMDACs is when the RAMDAC is made by someone else. Some of their "multi media" cards with the fancy TV in/out stuff have a nice external RAMDAC made by someone else, and apparently (never tried them myself) they work just fine. I'm all for new tech, but we've all been bounced into switching to DVI for such bogus reasons - monitor sizes just aren't growing that fast.

So if someone tells you that old steam-powered analogue VGA is totally obsolete because DVI quality is just sooooo much better, ask them if they've got an nVidia graphics card.
I've added a version of my VIPM comparison article to my website in the [[papers section|http://www.eelpi.gotdns.org/papers/papers.html]] down the bottom. It's always nice to have articles "alive" and searchable by engines. I added some hindsights about the process of integrating VIPM into Blade, since the article was written before that had been done.
*** Article under construction ***


I've noticed in a lot of discussion the trouble with naming these various technologies. Here's the words we in the Wearable Computing group use:

~Head-Mounted Display (HMD): general term for any sort of display that is fairly rigidly mounted on your head. May be one eye or both eyes, and use a wide variety of display technologies. Usually has some form of orientation and/or position tracking so that it knows where you're looking. Without the tracking it's really just a "private TV" - let's ignore those for the moment.

Virtual Reality (VR): You are immersed in the virtual world and you really don't want to see the real world - it would be distracting. The exception to this is what we call "not being blind" - if the phone rings, or you want to pause to drink your cup of coffee or the cat suddenly jumps into your lap - then you want some instant and easy way to be able to see the world, but you probably don't need to also see the virtual one, so something as crude as a flip-up visor (as on the ancient Forte ~VFX1) would work fine.

Augmented Reality (AR): Like VR, you can see a virtual world in a head-mounted display, but the displays are constructed so that they let in light from the real world as well - usually through half-silvered mirrors or similar. This allows virtual information or objects to be displayed on top of the real world. As [[Michael Abrash said in his blog|http://blogs.valvesoftware.com/abrash/why-you-wont-see-hard-ar-anytime-soon/]], the biggest problem here is that you cannot occlude the real world - there's no opacity controls.

Mediated Reality (MR): Similar to AR, except that instead of real-world photons travelling direct to your eyeball, a camera turns them into a video stream, the virtual world gets rendered over the top, and the result displayed to your eyeballs. You're still wearing an HMD and immersed in the world. This is significantly simpler than AR because the registration can be better and you can do opacity, but the question is whether the extra latency and resolution loss of going through the camera->display path is acceptable for everyday use. I wouldn't want to cross the street wearing one, let alone drive a car. However this is an obvious way to modify VR to "not being blind", since it's fairly easy to slap a camera onto a VR headset.

Augmented Video (AV): Not a standard term, but it's the one we like. This is like Mediated Reality in that you have a video feed and you add something over the top and then show it to the user, but the difference is that it's not rigidly mounted to your head, and it may only be showing a portion of the view, not all of it. There's a ton of this happening right now on smart phones, and it's really cool. Buuuuut... I'm going to be a curmudgeon and complain and say it's not "real VR" - it's orders of magnitude easier than the immersive definition of VR above. So calling it "VR" gives people false hopes - "hey we have VR on an iPhone - can't be long before it's everywhere". Maybe in a few years we'll have to give up and call this VR and the definition above will need a new name - Immersive Virtual Reality maybe?
Woohoo! Finally got off my arse and wrote a paper without some editor nagging me that it was two weeks overdue!

In my day job doing [[Granny3D]], we have a mesh preprocessor, and I was looking for a vertex-cache optimiser to add to it. But the only ones I knew of were either too complex for me to understand, too special-case, or proprietary. I'd written a vertex-cache optimiser thing with lookahead before, and it was mind-numbingly slow. Truly, awfully slow. Even with a lookahead of only six triangles, it was pathetically slow (I tried seven - it didn't finish after three hours running time). So I thought well, I'm not doing one of those again.

So I tried just writing a no-lookahead, no-backtrack one, because anything is better than [[a slap in the face with an angry herring|Strippers]], and I basically made up some heuristics that I thought would work. And it behaved scarily well. So I tweaked it a bit, and it went better. And then I wrote a thing to anneal the "magic constants", and it went really really well. So well, it got within ten percent of the optimal result. Then I compared it to the variety of scary (but excellent) papers I'd read and it was really close to them. I haven't done an apples-to-apples comparison yet, but the eyeballed mammoths-to-heffalumps comparo looked rather promising. And it's orders of magnitude faster, and way easier to understand. And then [[Eric Haines|http://www.acm.org/tog/editors/erich/]] asked a question about this on the ~GDAlgo list, and that was too much for me to stand, so I wrote the thing up and put it [[here|http://www.eelpi.gotdns.org/papers/fast_vert_cache_opt.html]]. I intend to keep it relatively "live", so if you have any feedback, fire away.
When streaming data from disk, it's important to figure out two things that are related but slightly different - which assets you'd like to have loaded, and at what priority. Priorities are used to decide which assets get loaded when memory is short and you can't load everything, which ones to throw away to free up space, and they're also used to decide what order to load things in when you don't have enough disk bandwidth. And as we all know, you never have enough memory or disk bandwidth.

For textures with mipmaps, determining which mipmap you need is reasonably efficient to do as an approximation. I cover that in the blog entry [[Knowing which mipmap levels are needed]]. You could also do it by demand-paging with virtual textures, for example Sean's [[Sparse Virtual Textures|http://silverspaceship.com/src/svt/]] or id's Rage engine ("megatextures"), but there's problems with those which I will get to at the end.

You can do a similar sort of cheap approximation with mesh ~LODs. If you use progressive meshes the collapse error metric gives you an error in real-world distances. If you use artist-generated ~LODs they usually specify the screen height in pixels that each model is designed to be viewed at.

Sound ~LODs are trickier, as the only thing that fades with distance is the volume, which does not correspond to quality in an obvious way. There's a minimum-volume cutoff, but that's not really the same thing. In general you will need some sort of "obnoxiousness level" specified by the sound folks for sound ~LODs. Only having 2 footsteps loaded instead of 10 may be a low-obnoxious change, but having 2 enemy grunts instead of 10 will be significantly more obnoxious.

After much fiddling, I decided to have two different priorities for each asset.

One is the priority for the /currently visible/ LOD (whether or not it's loaded!) - this is a measure of how bad stuff will look right now if you fail to load this. So it's something like a pixel-error metric. However those are tricky to measure and ballpark, so a more practical unit is "meters away from the standard game camera". It's much easier to empirically tweak those, because you just display both ~LODs side by side and move the camera to the point where you don't really notice the difference in quality. Biasing textures for faces and logos up, and textures for dirt and concrete down is a valid thing to do and works well. If I were as smart as [[Charles|http://cbloomrants.blogspot.com/]] I'd automate it and find the PSNR difference between adjacent mipmap levels and use that to bias the priority.

The other priority is the one for the next highest LOD to the visible one - the one you don't yet need. This one can be expressed in different ways but for me it's "how many seconds away is the player from noticing this is missing" (or rather, the inverse - and then multiply by the currently-visible LOD's priority of course). If you have everything that is actually visible loaded, you then want to prefetch assets to fill the available free space. The stuff you want to load first is the stuff that the player will be able to see soonest. Normally you get this time by dividing the distance of the object from the player by the speed the player can move at. However, if there's a locked door in the way, and the unlock-door animation takes two seconds to play, you can add that on to the simple distance measurement (as long as the door is still locked). Until the player is actually at the door to start unlocking it (or maybe one second away from the door), there's no point loading anything on the other side, no matter how close it is.

So each asset has two entries in the priority queue - one that answers the question "If I'm running out of memory, what should I unload to make room." and the other answers the question "I have spare space - what shall I fill it with". The two are slightly different.
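
Here is a minimal sketch of those two priorities. Everything in it is an assumption for illustration: the names ({{{AssetView}}}, {{{VisiblePriority}}}, {{{PrefetchPriority}}}), the choice of "tuned switch distance divided by actual distance" as the visible-LOD metric, and the clamping constants. It is not the code from any shipped engine.

```cpp
#include <algorithm>

// Hypothetical per-asset inputs; none of these names come from a real engine.
struct AssetView
{
    float distanceMeters;     // current distance from the player/camera to the asset
    float lodSwitchMeters;    // hand-tuned distance at which the LOD difference stops being noticeable
    float lockedDoorSeconds;  // extra delay from still-locked doors in the way (0 if none)
};

// "How bad does it look right now if this LOD is missing?"
// Assumption: the further inside the tuned switch distance you are, the higher the priority.
inline float VisiblePriority(const AssetView &a)
{
    const float d = std::max(a.distanceMeters, 0.01f);  // avoid dividing by zero
    return a.lodSwitchMeters / d;
}

// "How soon will the player notice the next LOD up is missing?"
// Inverse of the time-to-reach, scaled by the visible-LOD priority.
inline float PrefetchPriority(const AssetView &a, float playerSpeedMetersPerSec)
{
    const float speed = std::max(playerSpeedMetersPerSec, 0.01f);
    const float secondsAway = a.distanceMeters / speed + a.lockedDoorSeconds;
    return VisiblePriority(a) / std::max(secondsAway, 0.01f);
}
```

The first value would order the "what do I unload" queue and the second the "what do I fill spare space with" queue. In practice each one would pick up its own set of tweak knobs.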

Note that for textures you can ''prove'' you do not need a certain mipmap level on a given frame - there is literally no point loading it, the graphics card will not touch that data. For mesh LOD, you can always theoretically see the improvement by using the higher LOD. But the "seconds before you can see the difference" blurs that distinction anyway. In practice every type of priority gets a bunch of tweak knobs applied to it and you'll spend a while tweaking them until "it looks right".

The other interesting question is what you measure distance in. Simplest is linear distance, but that ignores visibility - if something is on the other side of a wall, it'll get loaded anyway, which is suboptimal. But at the other end of the scale are the virtual texture systems. In theory they're great - they load only the parts of textures that are literally visible. Unfortunately, streaming off disk needs a few seconds of prefetch, and they're useless at providing that data. So they're great at low-latency stuff like decompressing ~JPEG->~DXTn and uploading to a video card, but not so good for deciding what to load off DVD.


A bunch of things I discovered along the way, and how I solved them:

''1. The player is camping an exit. But a baddie has been smarter and sneaks up behind them. The player hears their footsteps, turns around and... attack of the blurry low-poly monster!''

Don't use the visible frustum to cull, just use omnidirectional distance (or distance-to-reach) to determine what ~LODs you need. Almost all game types can spin the camera around in less time than the drive head can seek.

The exception to this is textures that should be loaded but aren't. In this case, the streaming system has fallen behind or is running short of memory, you don't/can't have what you really want, and given the choice of loading texture A that is visible right now and texture B that is just as far away but isn't visible right now, load texture A first. This is one reason why I separate priorities into "what do I need now" and "what might I need in the future" - you bias the first by frustum, but not the second.

''2. The player is running down a corridor, turns sharp left, and even stuff nearby is blurry.''

Using a PVS or portal system is great, but you need to modify the traversal when using it for streaming visibility. The way I did it was to first flood-fill out from the player by the time-to-get-there metric (e.g. 3 seconds of travel time, unlocking doors, etc), completely ignoring line-of-sight. Then line-of-sight check through each of the portals in the enlarged visible set, but again ignoring the view frustum (see problem #1). This means you pre-load visible things on the other side of currently-closed doors, around corners the player is close to, etc.
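
The first half of that traversal (the flood-fill by time-to-get-there) could be sketched roughly like this. The {{{Room}}} and {{{Portal}}} structures and the function name are hypothetical, and I've used a plain shortest-path search as a stand-in for whatever traversal your engine already has. The line-of-sight pass through the portals would then run over every room this returns as reachable.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical room/portal graph; not any particular engine's data structures.
struct Portal
{
    int   toRoom;          // index of the room on the other side
    float travelSeconds;   // time for the player to get to and through this portal
    float unlockSeconds;   // extra time if the door is currently locked (0 otherwise)
};

struct Room
{
    std::vector<Portal> portals;
};

// Returns, for each room, the time to reach it from startRoom, or -1 if it is
// further away than budgetSeconds. Line-of-sight and frustum are deliberately ignored.
inline std::vector<float> FloodFillByTravelTime(const std::vector<Room> &rooms,
                                                int startRoom, float budgetSeconds)
{
    std::vector<float> best(rooms.size(), -1.0f);
    typedef std::pair<float, int> Entry;   // (seconds to reach, room index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > open;
    open.push(Entry(0.0f, startRoom));

    while (!open.empty())
    {
        const Entry e = open.top();
        open.pop();
        if (best[e.second] >= 0.0f)
            continue;                      // already reached by a faster route
        best[e.second] = e.first;

        for (const Portal &p : rooms[e.second].portals)
        {
            const float t = e.first + p.travelSeconds + p.unlockSeconds;
            if (t <= budgetSeconds && best[p.toRoom] < 0.0f)
                open.push(Entry(t, p.toRoom));
        }
    }
    return best;
}
```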

''3. The player blows a door open with a rocket launcher, and...''

If anybody has property-destroying weapons readied, assume destructible things aren't there when doing visibility traverses.

''4. The player shoots a car, the car seems to explode, but for a few seconds either the car looks undamaged, or the car simply isn't there.''

You need to "link" the destroyed-car-mesh (and textures) with the intact car mesh. If one is visible, make sure you assign a similar priority to the other. You could make this conditional on someone having a heavy weapon equipped if you were feeling keen.

''5. The player lobs a grenade at something, and the explosion is blurry. The explosion texture loads just about the time the explosion is fading, then it immediately gets unloaded again - doh!''

The moment someone throws a grenade, start loading the explosion textures because you have about three seconds before you'll need them. Similarly, when someone pulls out any weapon, make sure you preload all the smoke & muzzle flash stuff associated with that weapon.

''6. The HUD is sometimes blurry, e.g. the pause menu, the rarely-used "super power up" effect, etc.''

Never unload any element of the in-game HUD. I tried a bunch of ways to stream the HUD, because it can take up quite a lot of texture space, but they all sucked because it's so incredibly obvious when you get it wrong. Just don't bother.

''7. A cutscene where two characters are talking by radio, the camera cutting back and forth between them, every time it cuts the textures are blurry.''

This is really difficult, as is any game with teleportation or similarly rapid transport. You need to be able to throw multiple cameras into a scene and have them all do vis & priority checks, and take the highest priority of any of them. For cutscenes you need to have a camera playing the cutscene several seconds ahead of where the player is seeing it. Cutscene systems that work by modifying the game world will really hate having two cutscenes playing at the same time. The right answer is "don't do that" - make sure you have the ability to have multiple worlds with a cutscene running in each. But of course that's frequently absurdly impractical and would take major re-engineering.
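
The "take the highest priority of any of them" part is the easy bit, and might look something like the sketch below. The per-camera metric here (nearer means higher) is just a placeholder assumption, and all the names are made up. The real per-camera check would be your full vis & priority pass.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// Placeholder per-camera priority: nearer means higher. A real system would
// run its full visibility and priority checks here instead.
inline float PriorityFromCamera(const Vec3 &camera, const Vec3 &asset)
{
    const float dx = asset.x - camera.x;
    const float dy = asset.y - camera.y;
    const float dz = asset.z - camera.z;
    const float distance = std::sqrt(dx * dx + dy * dy + dz * dz);
    return 1.0f / (1.0f + distance);
}

// An asset's priority is the highest priority any camera gives it.
// Returns 0 if there are no cameras.
inline float PriorityFromAllCameras(const std::vector<Vec3> &cameras, const Vec3 &asset)
{
    float best = 0.0f;
    for (const Vec3 &camera : cameras)
        best = std::max(best, PriorityFromCamera(camera, asset));
    return best;
}
```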

The way I did it on Blade II was pretty hacky, but did work. I had a compile-time switch that recorded the camera position when cutscenes were playing and streamed them to a file. Then with a second compile-time switch it would play back that file with a three second lead time, doing vis & priority checks from that prescient camera. This gets the scenery right, but for example if a car is spawned and speeds towards the stationary camera it won't work (because the camera is in the right place in the future, but the car isn't). So this second mode remembers the filenames of what didn't get streamed in time and their max ~LODs at each point in time, and dumps that to a second file. Merge the two files and then in the normal build you play back that file three seconds ahead. Obviously you have to redo this process every time assets or cut-scenes change. You do still need to record the camera, because you may have ~NPCs or environment that can change dynamically (e.g. looking at the devastation the player has wrought behind them). This is a horrible hack, but it worked well enough to ship!
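
For the playback half of that hack, the lookup into the recorded camera track could be as simple as the sketch below. This is not the Blade II code. The {{{CameraSample}}} layout and the function name are invented for illustration, and it assumes the track is sorted by time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical recorded camera sample; the track is assumed to be sorted by time.
struct CameraSample
{
    float timeSeconds;
    float x, y, z;
};

// Index of the latest sample at or before (now + lead), clamped to the final
// sample so the prescient camera just parks at the end of the cutscene.
inline std::size_t PrescientSampleIndex(const std::vector<CameraSample> &track,
                                        float nowSeconds, float leadSeconds)
{
    assert(!track.empty());
    const float target = nowSeconds + leadSeconds;
    std::size_t index = 0;
    while (index + 1 < track.size() && track[index + 1].timeSeconds <= target)
        ++index;
    return index;
}
```

You would call this each frame with a lead of three seconds and run the vis & priority checks from the returned sample's position.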
It's a Wiki. Sort of. A Wiki is a collaborative document with really simple editing and hyperlinking so that even idiots like me can create documents with lots of confusing links to random pages. The simplest way to make a new link is to WriteInCamelCapsLikeThis, which is why some of the words seem really stupid. Just learn to live with it - the less time I have to spend thinking about formatting, the more likely I am to deign to write down my peals of wisdom, and the more likely you are to be raised to enlightenment. You lucky people.

Quick guide:
* click on blue links to open new topics (technically called "tiddlers").
* hover over open topics to get a little toolbar in the top right with a list of actions.
* click on "close" to close the current topic.
* click on "close others" to close everything but this topic.
* some pages have tags which categorise them. This one has the tag "help". You can select the "Tags" field on the far right and see a list of all the tags, and browse the topics that use those tags. Handy if all you want is tech stuff and not a lot of the random ramblings that go with them.
* experiment - there's no way you can harm this Wiki by playing - even if it looks like you can (because you can't save your changes).

Note it's only sort-of a Wiki, because although it looks like you can edit it, you can't save your edits. It lives on my server, so only I can edit it, so ner! Obviously you can save them to your hard-drive, but that's unlikely to be that interesting - nobody else will see them. If you want to play with real Wikis, there's the original WikiWikiWeb (http://c2.com/cgi/wiki) from which all Wikis are ultimately descended, and probably the most famous is the Wikipedia (http://en.wikipedia.org/wiki/Main_Page), which is an Encyclopedia that is a Wiki - created for geeks by geeks, and thus contains far more about Pokemon characters (http://en.wikipedia.org/wiki/Jigglypuff) than it does on ball gowns (http://en.wikipedia.org/wiki/Ballgown).

For the technically-minded, this particular breed of Wiki is a TiddlyWiki, and you can read all about it at http://www.tiddlywiki.com/
Every month or so, someone will ask me what happened to [[Larrabee]] and why it failed so badly. And I then try to explain to them that not only didn't it fail, it was a pretty huge success. And they are understandably very puzzled by this, because in the public consciousness Larrabee was like the Itanic and the SPU rolled into one, wasn't it? Well, not quite. So rather than explain it in person a whole bunch more times, I thought I should write it down.

This is not a history, and I'm going to skip a TON of details for brevity. One day I'll write the whole story down, because it's a pretty decent escapade with lots of fun characters. But not today. Today you just get the very start and the very end.

When I say "Larrabee" I mean all of Knights, all of MIC, all of Xeon Phi, all of the "Isle" cards - they're all exactly the same chip and the same people and the same software effort. Marketing seemed to dream up a new codeword every week, but there were only ever three chips:
* Knights Ferry / Aubrey Isle / ~LRB1 - mostly a prototype, had some performance gotchas, but did work, and shipped to partners.
* Knights Corner / Xeon Phi / ~LRB2 - the thing we actually shipped in bulk.
* Knights Landing - the new version that is shipping any day now (mid 2016).

That's it. There were some other codenames I've forgotten over the years, but they're all of one of the above chips. Behind all the marketing smoke and mirrors there were only three chips ever made (so far), and only four planned in total (we had a thing called ~LRB3 planned between KNC and KNL for a while). All of them are "Larrabee", whether they do graphics or not.

When Larrabee was originally conceived back in about 2005, it was called "SMAC", and its original goals were, from most to least important:

1. Make the most powerful flops-per-watt machine for real-world workloads using a huge array of simple cores, on systems and boards that could be built into bazillo-core supercomputers.

2. Make it from x86 cores. That means memory coherency, store ordering, memory protection, real ~OSes, no ugly scratchpads, it runs legacy code, and so on. No funky ~DSPs or windowed register files or wacky programming models allowed. Do not build another Itanium or SPU!

3. Make it soon. That means keeping it simple.

4. Support the emerging GPGPU market with that same chip. Intel were absolutely not going to build a 150W ~PCIe card version of their embedded graphics chip (known as "Gen"), so we had to cover those programming models. As a bonus, run normal graphics well.

5. Add as little graphics-specific hardware as you can get away with.

That ordering is important - in terms of engineering and focus, Larrabee was never primarily a graphics card. If Intel had wanted a kick-ass graphics card, they already had a very good graphics team ''begging'' to be allowed to build a nice big fat hot discrete GPU - and the Gen architecture is such that they'd build a great one, too. But Intel management didn't want one, and still doesn't. But if we were going to build Larrabee anyway, they wanted us to cover that market as well.

Now remember this was around 2005 - just as GPGPU was starting to show that it could get interesting wins on real HPC workloads. But it was before the plethora of "kernel" style coding languages had spread to "normal" languages like C and FORTRAN (yes, big computing people still use FORTRAN, get over it). Intel was worried its existing Xeon CPU line just wasn't cutting it against the emerging threat from things like GPGPU, the Sony CELL/SPU, Sun's Niagara, and other radical massively-parallel architectures. They needed something that was ~CPU-like to program, but ~GPU-like in number crunching power.

Over the years the external message got polluted and confused - sometimes intentionally - and I admit I played my own part in that. An awful lot of "highly speculative marketing projections" (i.e. bullshit) was issued as firm proclamations, and people talked about things that were 20 years off as if they were already in the chip/SW. It didn't help that marketing wanted to keep Larrabee (the graphics side) and MIC/Knights/Xeon Phi (the HPC side) separate in the public consciousness, even though they were the exact same chip on very nearly the exact same board. As I recall the only physical difference was that one of them didn't have a DVI connector soldered onto it!

Behind all that marketing, the design of Larrabee was of a CPU with a very wide SIMD unit, designed above all to be a real grown-up CPU - coherent caches, well-ordered memory rules, good memory protection, true multitasking, real threads, runs Linux/~FreeBSD, etc. Larrabee, in the form of KNC, went on to become the fastest supercomputer in the world for a couple of years, and it's still making a ton of money for Intel in the HPC market that it was designed for, fighting very nicely against the ~GPUs and other custom architectures. Its successor, KNL, is just being released right now (mid 2016) and should do very nicely in that space too. Remember - KNC is literally the same chip as ~LRB2. It has texture samplers and a video out port sitting on the die. They don't test them or turn them on or expose them to software, but they're still there - it's still a graphics-capable part.

So how did we do on those original goals? Did we fail? Let's review:

** 1. Make the most powerful flops-per-watt machine.

''SUCCESS!'' Fastest supercomputer in the world, and powers a whole bunch of the others in the top 10. Big win, covered a very vulnerable market for Intel, made a lot of money and good press.

** 2. Make it from x86 cores.

''SUCCESS!'' The only compromise to full x86 compatibility was KNF/KNC didn't have SSE because it would have been too much work to build & validate those units (remember we started with the original Pentium chip). KNL adds SSE legacy instructions back in, and so really truly is a proper x86 core. In fact it's so x86 that x86 grew to include it - the new ~AVX512 instruction set is Larrabee's instruction set with some encoding changes and a few tweaks.

** 3. Make it soon. That means keeping it simple.

''NOT BAD.'' It's not quite as simple as slapping a bunch of Pentiums on a chip of course, but it wasn't infeasibly complex either. We did slip a year for various reasons, and then KNF's performance was badly hurt by a bunch of bugs, but these things happen. KNC was almost on time and turned out great.

** 4. Support the emerging GPGPU market. As a bonus, run normal graphics well.

PRIMARY GOAL: ''VIRTUAL SUCCESS!'' It would have been a real success if it had ever shipped. Larrabee ran Compute Shaders and ~OpenCL very well - in many cases better (in flops/watt) than rival ~GPUs, and because it ran the other graphical bits of ~DirectX and ~OpenGL pretty well, if you were using graphics ~APIs mainly for compute, it was a compelling package. Unfortunately when the "we don't do graphics anymore" orders came down from on high, all that got thrown in the trash with the rest. It did also kickstart the development of a host of ~GPGPU-like programming models such as ISPC and CILK Plus, and those survive and are doing well.

BONUS GOAL: ''close, but no cigar.'' I'll talk about this more below.

** 5. Add as little graphics-specific hardware as you can get away with.

''MODERATE SUCCESS:'' The only dedicated graphics hardware was the texture units (and we took them almost totally from Gen, so we didn't have to reinvent that wheel). Eyeballing die photos, they took up about 10% of the area, which certainly isn't small, but isn't crazy-huge either. When they're not being used they power down, so they're not a power drain unless you're using them, in which case of course they're massively better than doing texture sampling with a core. They were such a small part that nobody bothered to make a new version of KNC without them - they still sit there today, visible on the die photograph as 8 things that look like cores but are slightly thinner. Truth be told, if KNC had been a properly graphics-focused part we would have had 16 of the things, but it wasn't so we had to make do with 8.


In total I make that three wins, one acceptable, and a loss. By that score Larrabee hardly "failed". We made Intel a bunch of money, we're very much one of the top dogs in the HPC and supercomputer market, and the "big cores" have adopted our instruction set.


So let's talk about the elephant in the room - graphics. Yes, at that we did fail. And we failed mainly for reasons of time and politics. And even then we didn't fail by nearly as much as people think. Because we were never allowed to ship it, people just saw a giant crater, but in fact Larrabee did run graphics, and it ran it surprisingly well. Larrabee emulated a fully ~DirectX11 and ~OpenGL4.x compliant graphics card - by which I mean it was a ~PCIe card, you plugged it into your machine, you plugged the monitor into the back, you installed the standard Windows driver, and... it was a graphics card. There were no other graphics cards in the system. It had the full ~DX11 feature set, and there were over 300 titles running perfectly - you download the game from Steam and they Just Work - they totally think it's a graphics card! But it's still actually running ~FreeBSD on that card, and under ~FreeBSD it's just running an x86 program called ~DirectXGfx (248 threads of it). And it shares a file system with the host and you can telnet into it and give it other work to do and steal cores from your own graphics system - it was mind-bending! And because it was software, it could evolve - Larrabee was the first fully ~DirectX11-compatible card Intel had, because unlike Gen we didn't have to make a new chip when Microsoft released a new spec. It was also the fastest graphics card Intel had - possibly still is. Of course that's a totally unfair comparison because Gen (the integrated Intel gfx processor) has far less power and area budget. But that should still tell you that Larrabee ran graphics at perfectly respectable speeds. I got very good at ~Dirt3 on Larrabee.

Of course, this was just the very first properly working chip (KNF had all sorts of problems, so KNC was the first shippable one) and the software was very young. No, it wasn't competitive with the fastest ~GPUs on the market at the time, unless you chose the workload very carefully (it was excellent at running Compute Shaders). If we'd had more time to tune the software, it would have got a lot closer. And the next rev of the chip would have closed the gap further. It would have been a very strong chip in the high-end visualization world, where tiny triangles, super-short lines and massive data sets are the main workloads - all things Larrabee was great at. But we never got the time or the political will to get there, and so the graphics side was very publicly cancelled. And then a year later it was very publicly cancelled again because they hadn't actually done it the first time - people kept working on it. And then a third time in secret a bit later, because they didn't want anybody to know people were ''still'' working on graphics on LRB.

So - am I sad Larrabee as a graphics device failed? Well, yes and no.

I have always been a graphics programmer. The whole "CPU architect" thing was a total accident along the way. As such, I am sad that we didn't get to show what KNC could do as a really amazing emulation of a GPU. But I am still hugely proud of what the team managed to do with KNC given the brief of first and foremost being a real highly-coherent legacy-compatible CPU running a grown-up OS.

Do I think we could have done more graphics with KNC? In terms of hardware, honestly not really. There's nothing I sit here and think "if only we'd done X, it all would have been saved". KNC was a hell of a chip - still is - I don't see how we could have squeezed much more out of the time and effort budgets we had - we already had some features that were kicked in the arse until they squeezed in under the wire (thanks George!). Having 16 texture samplers instead of just 8 would have helped, but probably not enough to prevent us getting cancelled. We could certainly have done with another year of tuning the software. We could have also done with a single longer-term software effort, rather than 4-5 different short-term ones that the random whims of upper management forced upon us (that's a whole long story all by itself). I could also wish for a graphics-capable version of KNL (i.e. add the texture units back in), but honestly just imagine what people could have done in the five years since KNC shipped if they had allowed normal people to access the existing texture units, and write their own graphics pipelines - it would have been amazing! So my biggest regret is that we were forbidden from exposing the goodness that already existed - that's the thing that really annoys me about the whole process.

Even then, it's hard for me to be too morose when so many of my concepts, instructions and architecture features are now part of the broader x86 ecosystem in the form of ~AVX512. I'm not sure any other single person has added so many instructions to the most popular instruction set in the world. It's a perfectly adequate legacy for an old poly-pusher! And now I'm doing that VR thing instead. Maybe it all worked out for the best.
''[edit: this post doesn't mention the C++11 [[enum class |http://www.cprogramming.com/c++11/c++11-nullptr-strongly-typed-enum-class.html]] construct simply because I wasn't aware of it at the time. It seems most programmers still don't know about it. It doesn't solve all the problems discussed here, but it helps with some.]''

A bunch of discussions came up on Twitter recently, and they were useful and interesting, but Twitter is a terrible place to archive things or get any subtlety, so I've put it here.

So there's something concrete to talk about, here's a starting example of an enum. Some people prefer to use {{{ALL_CAPS}}} for their enums because in C they were basically just #defines. But we can drop that practice now - they're not quite so mad in C++.

{{{
enum TrainState
{
  Stuck,
  Going,
  Waiting,
  Stopped,
};
}}}

C++ enums are a bit half-baked. They are slightly better than C ones, but they still have a bunch of problems. The first is that they all get added to the enclosing namespace instead of being in their own. So in the above example, {{{Stuck}}} is now a symbol in the main scope, whereas you'd really want to have to type {{{TrainState::Stuck}}} to use it. That means you can't have two enums with the value {{{On}}}, because they'll clash, which is awful. The solution is to prefix each one with the name of the enum, which also helps things like Intellisense/~VisualAssist when you can't remember exactly what values the enum can take.

The other annoying thing is you can't ask how many items are in the enum, so you're stuck if you want an array with one value per entry, or to iterate through them (e.g. when the user presses a key, you want the next one, but you also want to wrap to 0 when they get to the end). The solution I usually use is to add a special last value called {{{count}}}. Note I use lower case instead of ~CamelCaps to hint that it's "special" and not just another one of the enums. I've also seen people put it in upper case to really highlight that it's special. Also, add a comment for people who haven't seen this trick before.

So now we're getting somewhere:

{{{
enum TrainState
{
  TrainState_Stuck,
  TrainState_Going,
  TrainState_Waiting,
  TrainState_Stopped,

  TrainState_count         // Always last, indicates how many items there are.
};

// and use the count like so:
char *pTrainStateReadableNames[TrainState_count];
for ( int i = 0; i < TrainState_count; i++ )
{
  ...etc...
}
}}}
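
The "press a key, get the next one, wrap to 0 at the end" case falls out of the count value as well. A small sketch, where the helper name {{{NextTrainState}}} is made up for the example:

```cpp
enum TrainState
{
  TrainState_Stuck,
  TrainState_Going,
  TrainState_Waiting,
  TrainState_Stopped,

  TrainState_count         // Always last, indicates how many items there are.
};

// Step to the next state, wrapping back to the first after the last real one.
// Relies on the values being contiguous and starting at zero.
inline TrainState NextTrainState ( TrainState s )
{
  return static_cast<TrainState> ( ( static_cast<int> ( s ) + 1 ) % TrainState_count );
}
```

The cast back from {{{int}}} is needed because C++ won't implicitly convert an integer to an enum.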

As you can see, I like to separate the type from the name of the enum with an underscore to show that they're separate parts, even though mixing ~CamelCase with underscores seems a bit goofy at first. Instead, some folks like to wrap their enums in a {{{struct}}} or {{{namespace}}} with the name:

{{{
namespace TrainState { enum
{
  Stuck,
  Going,
  Waiting,
  Stopped,

  count         // Always last, indicates how many items there are.
};}

// and use the count like so:
char *pTrainStateReadableNames[TrainState::count];
for ( int i = 0; i < TrainState::count; i++ )
{
  ...etc...
}
}}}

...but at this point I know so many people that have been doing it with the underscores so long (including before C++) that I think it might be less of a surprise to just leave it that way - the appearance and typing cost are nearly identical.
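
For comparison, here is roughly what the C++11 {{{enum class}}} from the edit note at the top looks like on the same example. It needs a C++11 compiler, and the {{{Switch}}}/{{{Lamp}}} enums and {{{kTrainStateCount}}} are invented for illustration. It gives you the scoping for free, but you still need the count trick, and you have to cast to get an integer out.

```cpp
#include <cstddef>

enum class TrainState
{
  Stuck,
  Going,
  Waiting,
  Stopped,

  count         // Always last, indicates how many items there are.
};

// Two scoped enums can both have an "On" without clashing.
enum class Switch { Off, On };
enum class Lamp   { Off, On };

// Scoped enums don't convert to integers implicitly, so the count needs a cast.
constexpr std::size_t kTrainStateCount = static_cast<std::size_t> ( TrainState::count );

const char *pTrainStateReadableNames[kTrainStateCount] =
{
  "Stuck", "Going", "Waiting", "Stopped"
};
```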

Incidentally, lots of irritation from fellow coders about people who use "last" rather than "count". Very bad - "last" implies it is the last valid item and that the count is that plus one, or that you need to use a less-or-equals test. I don't like "end" much either, but STL has taught us that "end" is actually one-off-the-end, so it's a somewhat familiar paradigm. What we really need is a word that is one past the end, i.e. the opposite to "penultimate". But I suspect putting {{{TrainState_postultimate}}} in a codebase is going to cause griping.

Some people have also said they use {{{_start}}} and {{{_end}}} for enums that don't start at zero, or for sub-ranges. I've also seen {{{_begin}}} and {{{_end}}} to match the STL pattern. In general I think this use of enums starts to get really icky and dangerous, but in some cases maybe it's necessary. So something like this:

{{{
enum PlayerColour
{
  // Referee isn't on any team.
  PlayerColour_Referee,

  // Red team.
  PlayerColour_Red_start,    // Always first red
  PlayerColour_Red_BrightRed = PlayerColour_Red_start,
  PlayerColour_Red_Orange,
  PlayerColour_Red_Pink,
  PlayerColour_Red_end,      // Always last red (not a real colour)

  // Blue team.
  PlayerColour_Blue_start,    // Always first blue
  PlayerColour_Blue_Navy = PlayerColour_Blue_start,
  PlayerColour_Blue_Cyan,
  PlayerColour_Blue_Sapphire,
  PlayerColour_Blue_end,      // Always last blue (not a real colour)
};
}}}

It should be obvious why this gets icky - the ordering is now super important, and there's a bunch of = signs, so you have to be careful when changing the first element in each block, and there's an unused entry for {{{PlayerColour_Red_end}}}. You could fix this by doing {{{PlayerColour_Blue_start = PlayerColour_Red_end}}}, but if you then added a green team in between it would be a disaster. This use case is "extreme enums" and needs a lot of care.
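
If you do end up with ranges like this, one way to take some of the sting out is to treat each block as a half-open range and let the compiler check the ordering for you. This is only a rough sketch: the helper names {{{IsRedTeam}}} and {{{IsBlueTeam}}} are made up for illustration, and {{{static_assert}}} assumes you have C++11 or later.

```cpp
#include <cassert>

enum PlayerColour
{
  // Referee isn't on any team.
  PlayerColour_Referee,

  // Red team.
  PlayerColour_Red_start,
  PlayerColour_Red_BrightRed = PlayerColour_Red_start,
  PlayerColour_Red_Orange,
  PlayerColour_Red_Pink,
  PlayerColour_Red_end,      // One past the last red (not a real colour)

  // Blue team.
  PlayerColour_Blue_start,
  PlayerColour_Blue_Navy = PlayerColour_Blue_start,
  PlayerColour_Blue_Cyan,
  PlayerColour_Blue_Sapphire,
  PlayerColour_Blue_end,     // One past the last blue (not a real colour)
};

// Compile-time guards: fail the build if a range is empty or two ranges overlap.
static_assert ( PlayerColour_Red_start < PlayerColour_Red_end, "red range is empty" );
static_assert ( PlayerColour_Red_end <= PlayerColour_Blue_start, "red and blue ranges overlap" );
static_assert ( PlayerColour_Blue_start < PlayerColour_Blue_end, "blue range is empty" );

// Hypothetical helpers (not from any library): each team is the half-open range [start, end).
constexpr bool IsRedTeam ( PlayerColour colour )
{
  return ( colour >= PlayerColour_Red_start ) && ( colour < PlayerColour_Red_end );
}

constexpr bool IsBlueTeam ( PlayerColour colour )
{
  return ( colour >= PlayerColour_Blue_start ) && ( colour < PlayerColour_Blue_end );
}
```

The guards won't stop every mistake, but if somebody adds a green team later, a matching pair of {{{static_assert}}} lines for green is a cheap way to notice when blocks end up overlapping.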

A neat trick for making printable names for enums - mainly for debugging - was shown to me by [[Charles Bloom|http://cbloomrants.blogspot.com/]]: create both the enum and the list of enum strings at the same time. It's a bit ugly unfortunately, but it's not a complete disaster in terms of maintainability:

{{{
#define TrainState_XXYYTable           \
	XX(TrainState_Stuck) YY(=0),           \
	XX(TrainState_Going),                  \
	XX(TrainState_Waiting),                \
	XX(TrainState_Stopped),                \
	XX(TrainState_count)

#define XX(x) x
#define YY(y) y
enum TrainState
{
    TrainState_XXYYTable
};
#undef XX
#undef YY

#define XX(x) #x
#define YY(y)
const char *TrainState_Name[] =
{
    TrainState_XXYYTable
};
#undef XX
#undef YY
}}}

...and then you can do things like

{{{
printf ( "Movement type %s", TrainState_Name[Train->Movement] );
}}}

As mentioned in the article [[Saving, loading, replaying and debugging]], this is also useful for loading old savegames, when you may have changed the enum ordering in the meantime, or removed items. Matching with the strings allows you to cope with that - or at least to highlight the fact that it did change, and figure out what to do about it.
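
For the savegame case, the matching might look something like the sketch below. It repeats the table from above so that it stands alone. The lookup function {{{TrainState_FromName}}} is a name I've invented here, and returning {{{TrainState_count}}} for "not found" is just one possible convention - use whatever your loader wants.

```cpp
#include <cassert>
#include <cstring>

#define TrainState_XXYYTable           \
	XX(TrainState_Stuck) YY(=0),           \
	XX(TrainState_Going),                  \
	XX(TrainState_Waiting),                \
	XX(TrainState_Stopped),                \
	XX(TrainState_count)

// First expansion: the enum itself.
#define XX(x) x
#define YY(y) y
enum TrainState
{
    TrainState_XXYYTable
};
#undef XX
#undef YY

// Second expansion: the matching table of names.
#define XX(x) #x
#define YY(y)
const char *TrainState_Name[] =
{
    TrainState_XXYYTable
};
#undef XX
#undef YY

// Hypothetical helper: map a name read from an old savegame back to the
// current enum value. Returns TrainState_count if the name is not a real
// state any more, so the caller can decide what to do about it.
TrainState TrainState_FromName ( const char *pName )
{
  for ( int i = 0; i < TrainState_count; i++ )
  {
    if ( std::strcmp ( pName, TrainState_Name[i] ) == 0 )
    {
      return static_cast<TrainState> ( i );
    }
  }
  return TrainState_count;
}
```

A loader would then save {{{TrainState_Name[train->state]}}} as text, and on load treat a result of {{{TrainState_count}}} as "this savegame mentions a state that no longer exists".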

Another trick I've seen used is to reserve the first item as an error value that helps spot uninitialised data. Some people object to zero being an error, but it may help in some cases.

{{{
enum TrainState
{
  TrainState_error,          // Always first, may indicate uninitialised data

  TrainState_Stuck,
  TrainState_Going,
  TrainState_Waiting,
  TrainState_Stopped,

  TrainState_count         // Always last, indicates how many items there are.
};
}}}
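
To show why zero is the handy choice, here's a small sketch. The {{{Train}}} struct and both helper functions are invented for the example. Note that it only catches memory that happens to be zero-filled - genuinely random garbage can still look like a valid state.

```cpp
#include <cassert>
#include <cstring>

enum TrainState
{
  TrainState_error,          // Always first, may indicate uninitialised data

  TrainState_Stuck,
  TrainState_Going,
  TrainState_Waiting,
  TrainState_Stopped,

  TrainState_count           // Always last, indicates how many items there are.
};

// Hypothetical game object, just enough to carry a state.
struct Train
{
  TrainState state;
  float      speed;
};

// Hypothetical helper: stands in for a zero-filled allocation that nobody
// has got round to initialising properly yet.
Train MakeZeroedTrain()
{
  Train train;
  std::memset ( &train, 0, sizeof ( train ) );
  return train;
}

// Hypothetical helper: true only for the real states, so both the error
// value and the count marker are rejected.
bool TrainHasValidState ( const Train &train )
{
  return ( train.state > TrainState_error ) && ( train.state < TrainState_count );
}
```

A zeroed {{{Train}}} comes out as {{{TrainState_error}}} rather than silently claiming to be {{{TrainState_Stuck}}}, which is the whole point of burning the first slot.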

I am fairly religious about asserting on default clauses, which makes this even more likely to spot uninitialised data, and indeed rather than this pattern:

{{{
switch ( train->state )
{
case TrainState_Stuck:
  ...do something interesting about stuck trains...
  break;
default:
  // Nothing to do.
  break;
}
}}}

...I prefer to explicitly list all of them every time:

{{{
switch ( train->state )
{
case TrainState_Stuck:
  ...do something interesting about stuck trains...
  break;
case TrainState_Going:
case TrainState_Waiting:
case TrainState_Stopped:
  // Nothing to do.
  break;
default: ASSERT ( false ); break;
}
}}}

The other benefit of this second method is when you add a new enumeration, you can search your codebase for any of the existing enums and see all the places you need to add a clause for the new one - even if it's just adding it to the "do nothing" part.
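
Here's a compilable version of that second pattern, as a sketch only. The function {{{TrainNeedsRescue}}} is made up for illustration, and {{{ASSERT}}} is defined as a stand-in for whatever assert macro your codebase really uses.

```cpp
#include <cassert>

// Stand-in for the codebase's own assert macro (an assumption for this sketch).
#define ASSERT(x) assert ( x )

enum TrainState
{
  TrainState_error,          // Always first, may indicate uninitialised data

  TrainState_Stuck,
  TrainState_Going,
  TrainState_Waiting,
  TrainState_Stopped,

  TrainState_count           // Always last, indicates how many items there are.
};

// Hypothetical example function: every real state is listed explicitly,
// and anything else (error value, count marker, garbage) hits the assert.
bool TrainNeedsRescue ( TrainState state )
{
  switch ( state )
  {
  case TrainState_Stuck:
    return true;
  case TrainState_Going:
  case TrainState_Waiting:
  case TrainState_Stopped:
    // Nothing to do.
    return false;
  case TrainState_error:
  case TrainState_count:
  default:
    ASSERT ( false );
    return false;
  }
}
```

If you use GCC or Clang, the {{{-Wswitch-enum}}} warning will also flag a switch that is missing an enumerator even when there is a {{{default}}} clause, which gives you a second net besides searching the codebase.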

[[Morgan McGuire has another way of doing smart enums|http://casual-effects.blogspot.com/2013/04/a-really-smart-enum-in-c.html]] which looks interesting.

Any other enum tricks, let me know and I might add them.
I didn't say the pages contained anything useful. Move your mouse over this text - see how there's a line of text starting with "close" to the upper left? Click the word.
