It's surprisingly hard to give a good answer (the question was raised in this article). It depends on how you count, and the details are interesting (to me anyway).
To not leave you hanging: Intel has an official x86 encoder/decoder library called XED. According to Intel's XED, as of this writing, there are 1503 defined x86 instructions ("iclasses" in XED lingo), from AAA to XTEST (this includes AMD-specific extensions too, by the way). Straightforward, right?
Well, it depends on what you wanted to count. For example, as per XED, ADD and LOCK ADD are different "instruction classes". Many assembly programmers would consider LOCK a prefix and LOCK ADD an addition with said prefix, not a distinct instruction, but XED disagrees. And in fact, for the purposes of execution, so do current x86s. An atomic add does very different things from a regular add. The prefix thing crops up elsewhere: is, say, MOVSD (copy a single 32-bit word) a different "instruction class" from REP MOVSD (block copy several 32-bit words)? XED says yes. But it doesn't handle all prefixes this way in all contexts. For example, the operand-size prefix (0x66) turns integer instructions operating on 32-bit registers into the equivalent instruction operating on their lower 16-bit halves, but unlike with the REP or LOCK prefixes, XED does not count these as separate instruction classes. If you disagree about any of these choices, your count will come out different.
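To make the prefix question concrete, here is a small Python sketch with three encodings (64-bit mode) that differ only in a leading prefix byte. The `split_prefixes` helper is made up for illustration and only knows four of the legacy prefixes; it is not how XED or any real decoder is structured.

```python
# Three encodings that share the same opcode (01) and ModRM byte (18)
# and differ only in a leading prefix byte.
ENCODINGS = {
    "add [rax], ebx":      bytes([0x01, 0x18]),
    "lock add [rax], ebx": bytes([0xF0, 0x01, 0x18]),
    "add [rax], bx":       bytes([0x66, 0x01, 0x18]),
}

# Only a subset of the legacy prefixes; segment overrides and the
# address-size prefix are left out to keep the sketch short.
LEGACY_PREFIXES = {0xF0: "LOCK", 0xF2: "REPNE", 0xF3: "REP", 0x66: "OPSIZE"}


def split_prefixes(code: bytes):
    """Return (list of prefix names, remaining opcode bytes)."""
    names = []
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        names.append(LEGACY_PREFIXES[code[i]])
        i += 1
    return names, code[i:]


for text, code in ENCODINGS.items():
    print(text, split_prefixes(code))
```

All three come out with the same two trailing bytes. Whether that makes them one instruction or three is exactly the counting question.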
Mnemonics
It all depends on how precisely we define an instruction. Is it something with a distinct mnemonic? Let's first look at what the article I quoted above says is by far the most common x86 instruction, at 33% of the total sample set: MOV. So let's look up MOV in the Intel Architecture manuals. And... there are 3 different top-level entries? "MOV - Move", "MOV - Move to/from Control Registers", "MOV - Move to/from Debug Registers". The latter two are sufficiently distinct from "regular" MOV to rate their own documentation pages, they have completely different instruction encodings (not even in the same encoding block as regular MOV), and they're privileged instructions, meaning lowly user-mode code isn't even allowed to execute them. Consequently they're also extremely rare, and are likely to account for approximately 0% of the test sample. And, sure enough, XED counts them as separate instruction classes (MOV_CR and MOV_DR).
So these instructions may be called MOV, but they're weird, special snowflakes, and from the processor's point of view they're entirely different instructions in a different part of the encoding space and with different rules. Calling them MOV is essentially nothing but syntactic sugar in the official Intel assembly language.
And on the subject of syntactic sugar: some mnemonics are just aliases. For example, SAL (shift arithmetic left) is a long-standing alias for SHL (shift left). Both are just bit shifts; there is no distinction between "arithmetic" and "logical" left shifts like there is between arithmetic and logical right shifts, but the Intel manuals list SAL (with an encoding that happens to be the same as SHL) and all x86 assemblers I've ever used accept it. Hilariously, in official Intel syntax, we're simultaneously miscounting in the other direction, since at least two mnemonics got assigned twice: we already saw the "copy" variant of MOVSD (which has no explicit operands), but there's also MOVSD as in "move scalar double" (which always has two explicit operands), which is an entirely different instruction (XED calls it MOVSD_XMM to disambiguate, and the same problem happens with CMPSD).
There are also SSE compares like CMPSD (the two-operand one!) and CMPPS. XED counts these as one instruction each, but they have an 8-bit immediate constant byte that specifies what type of comparison to perform. Disassemblers usually won't produce the hard-to-read CMPSD xmm0, xmm1, 2; they'll disassemble that instruction as the pseudo-instruction CMPLESD (compare scalar doubles for less-than-or-equal) instead. So is CMPSD one instruction (just the base opcode with an immediate operand), is it 8 (for the 8 different standard compare modes), or something else?
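The mapping from the immediate byte to the pseudo-instruction names is mechanical. A minimal Python sketch, assuming only the 8 standard SSE compare predicates (the `pseudo_op` helper is a made-up name for illustration):

```python
# The 8 standard SSE compare predicates, indexed by the immediate byte.
PREDICATES = ["EQ", "LT", "LE", "UNORD", "NEQ", "NLT", "NLE", "ORD"]


def pseudo_op(base: str, imm8: int) -> str:
    """Turn e.g. ("CMPSD", 2) into the pseudo-instruction name "CMPLESD"."""
    if not base.startswith("CMP") or len(base) != 5:
        raise ValueError("expected a mnemonic like CMPSD or CMPPS")
    if not 0 <= imm8 < len(PREDICATES):
        raise ValueError("only the 8 standard predicates are handled here")
    # The predicate name goes between "CMP" and the data type suffix.
    return "CMP" + PREDICATES[imm8] + base[3:]


print(pseudo_op("CMPSD", 2))  # CMPLESD
print(pseudo_op("CMPPS", 0))  # CMPEQPS
```

So a disassembler that prints pseudo-instructions effectively turns each of these base mnemonics into 8 names, without any change in the underlying opcode.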
This is getting messy. AT&T syntax to the rescue? Well, it solves some of our problems but also introduces new ones. For example, AT&T adds suffixes to the mnemonics to distinguish different operation widths. What Intel calls just ADD turns into ADDB (8-bit bytes), ADDW (16-bit "words"), ADDL (32-bit "long words") and ADDQ (64-bit "quadwords") in x86-64 AT&T syntax. Do we count these as separate? As per Intel syntax, no. As per XED instruction classes, also no. But maybe we consider these distinct enough to count separately after all? Or maybe we decide that if our definition depends on the choice of assembly syntax, of which there are several, then maybe it's not a very natural one. What does the machine do?
Instruction bytes
Note I haven't specified what part of the machine yet. This is thorny too. We'll get there in a bit.
But first, instruction bytes. Let's look at the aforementioned manual entry for real now: "MOV - Move". If you check that page out in the current Intel Architecture Software Developer's Manual, you'll find it lists no fewer than thirty-four encodings (not all of them distinct; I'll get to that). Some of these are more special, privileged operations with special encodings (namely, moves to and from segment registers). This time, XED doesn't seem to consider segment register loads and stores to be special and lumps them into plain old MOV, but I consider them distinct, and the machine considers them distinct enough to give them a special opcode byte in the encoding that's not used for anything else, so let's call those distinct.
That leaves us with 30 "regular" moves. Which are... somewhat irregular: 10 of them are doing their own thing and involve moves between memory and different parts of the RAX (in 64-bit mode) register, all with a special absolute addressing mode ("moffs") that shows up in these instructions and, to my knowledge, nowhere else. These instructions exist, and again, pretty much nothing uses them. They were useful on occasion in 16-bit mode but not anymore.
This specialness of the accumulator register is a recurring theme in x86. "op (AL/AX/EAX/RAX), something" has its own encoding (usually smaller) and various quirks for a lot of the instructions that go back to the 8086 days. So even though an assembly programmer might consider, say, TEST ebx, 128 and TEST eax, 128 the same instruction (and the XED instruction class list agrees here!), these have different opcodes and different sizes. So a lot of things that look the same in an assembly listing are actually distinct for this fairly random reason. Keep that in mind. But back to our MOV!
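To see the two TEST encodings side by side, here is a small Python sketch of an encoder for just this one instruction shape (TEST of a 32-bit register against a 32-bit immediate). The function name is made up, and picking the short accumulator form whenever possible is an assumption about typical assembler behavior, not a rule of the architecture; the long form is valid for EAX too.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_test_r32_imm32(reg: str, imm: int) -> bytes:
    """Encode TEST r32, imm32, preferring the short accumulator form."""
    n = REG32.index(reg)
    imm_bytes = struct.pack("<I", imm)  # 32-bit little-endian immediate
    if n == 0:
        # Special accumulator encoding: opcode A9, no ModRM byte.
        return bytes([0xA9]) + imm_bytes
    # General encoding: opcode F7 /0 with a register-direct ModRM byte
    # (mod=11, reg field=0 as opcode extension, rm=register number).
    modrm = 0xC0 | (0 << 3) | n
    return bytes([0xF7, modrm]) + imm_bytes


print(encode_test_r32_imm32("eax", 128).hex(" "))  # a9 80 00 00 00
print(encode_test_r32_imm32("ebx", 128).hex(" "))  # f7 c3 80 00 00 00
```

Same mnemonic, same XED instruction class, but a different opcode byte and a different length.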
The remaining 20 listed MOV variants fall into four distinct categories, each of which has 5 entries. These four categories are:
- "Load-ish": move from memory or another same-sized register to an 8/16/32/64-bit register.
- "Store-ish": move from an 8/16/32/64-bit register to either another register of the same size, or memory.
- "Load-immediate-ish": load an integer constant into an 8/16/32/64-bit register.
- "Store-immediate-ish": store an integer constant to either an 8/16/32/64-bit memory location, or a register.
All processors have some equivalent of the first three (the "store immediate" exists in some CPU architectures, but there are also many that don't have it). Load/store architectures generally have explicit load and store instructions (hence the name), and everyone has some way to load immediates (large immediate constants often require multiple instructions, but not on x86) and to move the content of one register to another. (Though the latter is not always a dedicated instruction.) So other than the fact that our "load-ish" and "store-ish" instructions also support "storing to" and "loading from" a register (in particular, there are two distinct ways to encode register-register MOVs), this is not that remarkable. It does explain why MOVs are so common in x86 code: "load", "store" and "load immediate" in particular are all very common instructions, and MOV subsumes all of them, so of course you see plenty of them.
Anyway, we have four operand sizes, and four categories. So why are there five listed encodings per category? Okay, so this is a bit awkward. x86-64 has 16 general-purpose registers. You can access them as 16 full 64-bit registers. For all 16 registers, you can read from (or write to) their low 32-bit halves. Writing to the low 32-bit half zero-extends (i.e. it sets the high half to zero). For all 16 registers, you can read from (or write to) their low 16-bit quarter. Writing to the low 16-bit quarter of a register does not zero-extend; the remaining bits of the register are preserved, because that's what 32-bit code used to do and AMD decided to preserve that behavior when they specced 64-bit x86 for some reason. And for all 16 registers, you can read from (or write to) their low 8-bit eighth (the lowest byte). Writing the low byte again preserves all the higher bytes, because that's what 32-bit mode did. With me so far? Great. Because now is when it gets weird. In 16-bit and 32-bit mode, you can also access bits 8 through 15 of the A, B, C and D registers as AH, BH, CH and DH. And x86-64 mode still lets you do that! But due to a quirk of the encoding, that works only if there's no REX prefix (which is the prefix that is used to extend the addressable register count from 8 to 16) on the instruction.
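The partial-register write rules are easier to follow with numbers. A minimal Python model of a single 64-bit register, with made-up helper names, just to illustrate the zero-extend versus preserve behavior described above:

```python
MASK64 = (1 << 64) - 1


def write_reg(old: int, value: int, bits: int) -> int:
    """64-bit register contents after writing `value` to its low `bits` bits."""
    if bits == 64:
        return value & MASK64
    if bits == 32:
        # 32-bit writes zero-extend: the upper half is cleared.
        return value & 0xFFFFFFFF
    if bits in (16, 8):
        # 16-bit and 8-bit writes preserve all the higher bits.
        mask = (1 << bits) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("bits must be 8, 16, 32 or 64")


def write_high8(old: int, value: int) -> int:
    """Write to bits 8..15 (AH/BH/CH/DH style), preserving everything else."""
    return (old & ~0xFF00 & MASK64) | ((value & 0xFF) << 8)


old = 0x1122334455667788
print(hex(write_reg(old, 0xAABBCCDD, 32)))  # 0xaabbccdd
print(hex(write_reg(old, 0xBEEF, 16)))      # 0x112233445566beef
print(hex(write_reg(old, 0x99, 8)))         # 0x1122334455667799
print(hex(write_high8(old, 0x99)))          # 0x1122334455669988
```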
So x86-64 actually has a total of 20 addressable 8-bit registers, in 3 disjoint sets: AL through DL, which can be used in any encoding. AH through DH, which can only be accessed if no REX prefix is present on the instruction. And the low 8 bits of the remaining 12 registers, which can only be accessed if a REX prefix is present.
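Here is that register-numbering quirk as a small Python lookup. The `reg8_name` helper is a made-up name; `index` stands for the 4-bit register number after the REX extension bit has been folded in, and the function only models the naming, nothing else.

```python
LOW = ["AL", "CL", "DL", "BL"]        # numbers 0..3, with or without REX
HIGH = ["AH", "CH", "DH", "BH"]       # numbers 4..7, only without REX
REX_ONLY = ["SPL", "BPL", "SIL", "DIL"] + [f"R{n}B" for n in range(8, 16)]


def reg8_name(index: int, has_rex: bool) -> str:
    """Name of the 8-bit register selected by `index` (0..15)."""
    if not 0 <= index <= 15:
        raise ValueError("register index must be 0..15")
    if index < 4:
        return LOW[index]
    if not has_rex:
        if index < 8:
            return HIGH[index - 4]
        raise ValueError("registers 8..15 need a REX prefix")
    return REX_ONLY[index - 4]


without_rex = {reg8_name(i, False) for i in range(8)}
with_rex = {reg8_name(i, True) for i in range(16)}
print(len(without_rex | with_rex))  # 20
```

Register numbers 4 through 7 are the ones that change meaning: AH/CH/DH/BH without REX, SPL/BPL/SIL/DIL with it.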
This quirk is why Intel lists all 8-bit variants twice: once without REX and once with REX, because they can access slightly different parts of the register space! Alright, but surely, other than that, we must have 4 different opcodes, right? One each for move byte, word, doubleword, quadword?
Nope. Of course not. In fact, in each of these categories, there are two different opcode bytes: one used for 8-bit accesses, and one for "larger than 8-bit". This dates back to the 8086, which was a 16-bit machine: "8-bit" and "16-bit" was all the distinction needed. Then the 386 came along and needed a way to encode 32-bit destinations, and we got the already mentioned operand size prefix byte. In 32-bit mode (handwaving here, the details are a bit more complicated), the instructions that used to mean 16-bit now default to 32-bit, and getting actual 16-bit instructions requires an operand size prefix. And I already mentioned that 64-bit mode added its own set of prefixes (REX), and this REX prefix is used to upgrade the now default-32-bit "word" instructions to 64-bit width.
So even though Intel lists 5 different encodings of the instructions in each group, all of which have somewhat different semantics, there are only 2 opcodes each associated to them: "8-bit" or "not 8-bit". The rest is handled via prefix bytes. And as we (now) know, there are lots of different types of MOVs that do very different things, all of which fall under the same XED "instruction class".
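As a sketch of how "2 opcodes plus prefixes" turns into four operand sizes, here is the selection logic in Python for 64-bit mode. This is the common integer case only, for opcode pairs like 88/89 where the low opcode bit selects "8-bit" versus "not 8-bit"; instructions whose default operand size is 64 bits in 64-bit mode (such as PUSH and POP) follow different rules and are not modeled. The function name is made up.

```python
def operand_size(opcode: int, has_66: bool, rex_w: bool) -> int:
    """Operand size in bits for a byte/non-byte opcode pair in 64-bit mode."""
    w_bit = opcode & 1          # 0 selects the 8-bit opcode of the pair
    if w_bit == 0:
        return 8
    if rex_w:
        return 64               # REX.W wins over the 0x66 prefix
    if has_66:
        return 16               # operand-size prefix: 32 -> 16
    return 32                   # default for the "not 8-bit" opcode


print(operand_size(0x88, False, False))  # 8   (88 /r)
print(operand_size(0x89, False, False))  # 32  (89 /r)
print(operand_size(0x89, True, False))   # 16  (66 89 /r)
print(operand_size(0x89, False, True))   # 64  (48 89 /r)
```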
Maybe instruction classes is the wrong metric to use? XED has another, finer-grained thing called "iforms" that considers the different subtypes of instructions separately. For example, for the just-discussed MOV, we get this list:
XED_IFORM_MOV_AL_MEMb=804,
XED_IFORM_MOV_GPR8_GPR8_88=805,
XED_IFORM_MOV_GPR8_GPR8_8A=806,
XED_IFORM_MOV_GPR8_IMMb_C6r0=807,
XED_IFORM_MOV_GPR8_IMMb_D0=808,
XED_IFORM_MOV_GPR8_MEMb=809,
XED_IFORM_MOV_GPRv_GPRv_89=810,
XED_IFORM_MOV_GPRv_GPRv_8B=811,
XED_IFORM_MOV_GPRv_IMMv=812,
XED_IFORM_MOV_GPRv_IMMz=813,
XED_IFORM_MOV_GPRv_MEMv=814,
XED_IFORM_MOV_GPRv_SEG=815,
XED_IFORM_MOV_MEMb_AL=816,
XED_IFORM_MOV_MEMb_GPR8=817,
XED_IFORM_MOV_MEMb_IMMb=818,
XED_IFORM_MOV_MEMv_GPRv=819,
XED_IFORM_MOV_MEMv_IMMz=820,
XED_IFORM_MOV_MEMv_OrAX=821,
XED_IFORM_MOV_MEMw_SEG=822,
XED_IFORM_MOV_OrAX_MEMv=823,
XED_IFORM_MOV_SEG_GPR16=824,
XED_IFORM_MOV_SEG_MEMw=825,
As you can see, that list basically matches the way the instruction encoding works, where 8-bit anything is considered a separate instruction, but size overrides by way of prefixes are not. So that's basically the rule for XED iforms: if it's a separate instruction (or a separate encoding), it gets a new iform. But just modifying the size of an existing instruction (for example, widening MMX instructions to SSE, or changing the size of a MOV via prefix bytes) doesn't.
So how many x86 instructions are there if we count distinct iforms as distinct? Turns out, an even 6000. Is that all of them? No. There are some undocumented instructions that XED doesn't include (in addition to the several formerly undocumented instructions that Intel at some point just decided to make official). If you look at the Intel manuals, you'll find the curious "UD2", the defined "Undefined instruction" which is architecturally guaranteed to produce an "invalid opcode" exception. As the name suggests, it's not the first of its kind. Its older colleague "UD1" half-exists, but not officially so, since the semantics of UD1 are exactly the same as if it was never defined to begin with. Does a non-instruction that is non-defined and unofficially guaranteed to non-execute exactly as if it had never been in the instruction set to begin with count as an x86 instruction? For that matter, does UD2 itself, the defined undefined instruction, count as an instruction?
Instruction decoders
But back to those iforms: 6000 instructions, huh? And these must all be handled in the decoder? That must be terrible.
Well, no. Not really. I mean, it's not pleasant, but it's not the end of the world.
First off, let's talk about how x86 is decoded in the first place: all x86 CPUs you're likely to interact with can decode (and execute) multiple instructions per cycle. Think about what that means: we have an (aggressively!) variable-length encoding, and we're continually fetching instructions. These chips can decode (given the right code) 4 instructions per clock cycle. How does that work? They're variable-length! We may know where the first instruction we're looking at in this cycle starts, but how does the CPU know where to start decoding the second, third, and fourth instructions? That's straightforward when your instructions are fixed-size, but for x86 they are most certainly not. And we do need to decide this quickly (within a single cycle), because if we take longer, we don't know where the last instruction in our current "bundle" ends, and we don't know where to resume decoding in the next cycle!
You do not have enough time in a 4GHz clock cycle (all 0.25ns of it) to fully decode 4 x86 instructions. For that matter, you don't even have close to enough time to "fully decode" (what exactly that means is fuzzy, and I won't try to make it precise here) one. There are two basic ways to proceed: the first is simply, don't do that! Try to avoid it at all cost. Keep extra predecoding information (such as marking the locations where instructions start) in your instruction cache, or keep a separate decoded cache altogether, like Intel's uop caches. This works, but it doesn't help you the first time round when you're running code that isn't currently cached.
Which brings us to option two: deal with it. And the way to do it is pretty much brute force. Keep a queue of upcoming instruction bytes (this ties in with branch target prediction and other things). As long as there's enough space in there, you just keep fetching another 16 (or whatever) instruction bytes and throw them into the queue.
Then, for every single byte position in that queue, you pretend that an x86 instruction starts at that byte, and determine how long it is. Just the length. No need to know what the instruction is. No need to know what the operands are, or where the bytes denoting these operands are stored, or whether it's an invalid encoding, or if it's a privileged instruction that we're not allowed to execute. None of that matters at this stage. We just want to know "supposing that this is a valid instruction, what is its length?". But if we add 16 bytes per cycle to the queue, we need 16 of these predecoders in parallel to make sure that we keep up and get an instruction length for every single possible starting location. We can pipeline these predecoders over multiple cycles if necessary; we just keep fetching ahead.
Once our queue is sufficiently full and we know that size estimate for every single location in it, then we decide where the instruction boundaries are. That's the stage that keeps track. It grabs 16 queue entries (or whatever) starting at the location for the current instruction, and then it just needs to "switch through". "First instruction says size starting from there is 5 bytes, okay; that means second instruction is at byte 5, and the queue entry says that one's 3 bytes; okay, third instruction starts at byte 8, 6 bytes". No computation in that stage, just "table lookups" in the small size table we just spent a few cycles computing.
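The two stages can be sketched in a few lines of Python. Real x86 length decoding is far too involved for a short example, so this uses a deliberately made-up toy encoding (length = low 3 bits of the first byte, plus 1); `toy_length`, `predecode` and `find_boundaries` are all hypothetical names, and the sequential loops stand in for what hardware would do in parallel.

```python
def toy_length(queue: bytes, pos: int) -> int:
    """HYPOTHETICAL toy encoding, not x86: length is (first byte & 7) + 1."""
    return (queue[pos] & 7) + 1


def predecode(queue: bytes) -> list:
    """Stage 1: one length guess per byte position, as if an instruction
    started at every single byte (hardware does these in parallel)."""
    return [toy_length(queue, pos) for pos in range(len(queue))]


def find_boundaries(lengths: list, start: int, max_insns: int = 4):
    """Stage 2: chain through the precomputed lengths; lookups only.
    Returns (instruction start offsets, offset to resume at next cycle)."""
    starts = []
    pos = start
    while (len(starts) < max_insns and pos < len(lengths)
           and pos + lengths[pos] <= len(lengths)):
        starts.append(pos)
        pos += lengths[pos]
    return starts, pos


# 16 queue bytes laid out so the instruction lengths are 5, 3, 6 and 2.
queue = bytes([0x04, 0, 0, 0, 0,
               0x02, 0, 0,
               0x05, 0, 0, 0, 0, 0,
               0x01, 0])
lengths = predecode(queue)
print(find_boundaries(lengths, 0))  # ([0, 5, 8, 14], 16)
```

Most of the entries in `lengths` are never used, since they describe positions in the middle of an instruction. That waste is the price of not knowing the boundaries up front.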
That's one way to do it. As said, very much brute force, but it works. However, if you need 16 predecoders (as you do to sustain a fetch rate of 16 bytes/cycle), then you really want these to be as dumb and simple as you can possibly get away with. These things most certainly don't care about 6000 different iforms. They just squint at the instruction just enough to figure out the size, and leave the rest for later.
Luckily, if you look at the actual opcode map, you'll see that this is not all that bad. There are large groups of opcodes that all have basically the same size and operands, just with different operations, which we don't care about at this stage at all.
And this kind of pattern exists pretty much everywhere. For example, look at that conspicuous, regular block of integer ALU instructions near the top of the opcode map. These all look (and work) pretty similar to the CPU. Most of them have essentially the same encodings (except for a few opcode bits that are different) and the same operand patterns. In fact, the decoder really doesn't care whether it's an OR, an ADD, a CMP, or an XOR. To an assembly-language programmer, a compiler, or a disassembler, these are very different instructions. To the CPU instruction decoder, these are all pretty much the same instruction: "ALU something-or-other mumble-mumble don't care". Which one of these gets performed will only be decided way later (and probably only after that operation makes it to the ALU itself). What the decoder cares about is whether it's an ALU instruction with an immediate operand, or if it has a memory operand, and what that memory operand looks like. And the instructions are conveniently organized in groups where the answers to these questions are always the same. With plenty of exceptions of course, because this is still x86, but evidently it can be made to work.
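That regular ALU block is the one-byte opcodes 00 through 3F: bits 5..3 pick the operation and bits 2..0 pick the operand form, with the last two slots of each row of 8 used for unrelated things. A small Python sketch of that split (the function name is made up, and the operand-form strings follow the usual opcode map shorthand):

```python
ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]
ALU_FORMS = ["Eb, Gb", "Ev, Gv", "Gb, Eb", "Gv, Ev", "AL, Ib", "rAX, Iz"]


def decode_alu(opcode: int):
    """Split a one-byte opcode in the 00..3F block into (operation,
    operand form, has_modrm). Returns None for non-ALU slots."""
    if not 0 <= opcode <= 0x3F:
        return None
    form = opcode & 7
    if form > 5:
        return None  # slots 6 and 7 of each row hold other things
    operation = ALU_OPS[opcode >> 3]
    # Forms 0..3 take a ModRM byte; forms 4 and 5 are accumulator + immediate.
    return operation, ALU_FORMS[form], form < 4


for op in (0x01, 0x09, 0x31, 0x3D):
    print(hex(op), decode_alu(op))
```

For figuring out the instruction length, only the low 3 bits matter here. The operation bits can ride along untouched until much later.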
Further down the pipe
Instructions really don't get decoded all at once, in one big "switch statement", and after that they go to disjoint parts of the chip never to meet again. That's not how these things are built. There's plenty of similarity between different instructions, and the "understanding" of what an instruction does is distributed, not centralized.
For example, for the purposes of most of the instruction decoder, the SSE/SSE2 instructions ADDPS, SUBPS, MULSD and DIVPD are all pretty much the same thing. They're FP ALU instructions, they accept the same types of operands, all of which are in the same place.
Some of these instructions are so similar that they're almost certain to never fully get "decoded". For example, for IEEE floats, a subtraction is literally just an addition where the sign bit of the second operand is flipped. If you look at the opcode table, the difference between the encoding for ADDPS and SUBPS is precisely one flipped bit: that bit is clear for ADDPS and set for SUBPS. Literally all you need to do to support both instructions is to decode them the same, make sure to grab that one bit from the instruction, and then feed it (along with the original second operand sign bit) into a single XOR gate in front of the FP adder. That's it. You now support both floating point addition and subtraction.
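Here is the XOR-gate trick as a Python sketch on a single scalar lane. The opcode bytes are the second byte of the 0F 58 (ADDPS) and 0F 5C (SUBPS) encodings; the function name is made up, and the final addition is done in Python's double precision rather than rounded to single precision, so this only models the sign flip, not the actual FP adder.

```python
import struct

ADDPS_OPCODE = 0x58  # second opcode byte of 0F 58
SUBPS_OPCODE = 0x5C  # second opcode byte of 0F 5C; differs in bit 2 only


def fp_add_or_sub(a: float, b: float, opcode: int) -> float:
    """One adder for both operations: XOR the 'subtract' opcode bit into
    the sign bit of the second operand, then always add."""
    if opcode not in (ADDPS_OPCODE, SUBPS_OPCODE):
        raise ValueError("only the ADDPS/SUBPS opcode bytes are modeled")
    is_sub = (opcode >> 2) & 1
    # Reinterpret b as its 32-bit IEEE single-precision bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", b))[0]
    bits ^= is_sub << 31  # the single XOR gate in front of the adder
    b_effective = struct.unpack("<f", struct.pack("<I", bits))[0]
    return a + b_effective


print(ADDPS_OPCODE ^ SUBPS_OPCODE)             # 4, i.e. exactly one bit
print(fp_add_or_sub(1.5, 0.25, ADDPS_OPCODE))  # 1.75
print(fp_add_or_sub(1.5, 0.25, SUBPS_OPCODE))  # 1.25
```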
Some of these differences matter more. For example, FP multiplies go to a different functional unit than FP adds, and they have a different latency. So the data needs a different routing, and the schedulers need to know about the different latency. If there's a memory load involved, then the load unit needs to know what size of access, and what part of the operand bypass network to send the results to (integer, float/SIMD?). So there's a bunch of control signals computed eventually that express the differences between all these instructions. But a lot of it happens really late. There's certainly no big monolithic 6000-case "switch statement" anywhere.
And then there are further differences still. For example, MOV elimination. Many x86s can in many cases avoid real execution of register-register MOVs altogether. They just resolve it as part of their register renaming. Likewise, zeroing a register by XORing it with itself (something the author of the original article I linked to mentions) gets resolved by renaming that register to point to a hard-wired zero register and likewise doesn't actually take any execution resources (even though it still needs to decode).
Where does that fit in our taxonomy? MOV rax, rbx will most often take 0 cycles, but sometimes take 1 cycle due to various reasons. Does the 0-cycle version, when it happens, count as a special instruction? Is XOR rax, rax (which goes down the magic implicit zeroing path and takes 0 cycles to execute) a different instruction from XOR rax, rcx, which is encoded essentially the same way? These two instructions differ by exactly 1 bit in both the assembly-language source file and the assembled object code, yet execute in drastically different ways and with different latencies. Should that make them a separate instruction or not? The most useful answer really depends on what part of the pipeline you're interested in. If you're designing a CPU core, they pretty much are separate instructions. If you're writing a disassembler, they are not.
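The "exactly 1 bit" claim is easy to check on the object code, and the test for the zeroing idiom is a simple field comparison. A Python sketch, handling only the register-direct form of opcode 31 with an optional REX prefix; the helper names are made up, and this says nothing about how any actual core's renamer detects the idiom.

```python
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # REX.W, XOR r/m64 r64, ModRM C0
XOR_RAX_RCX = bytes([0x48, 0x31, 0xC8])  # same, ModRM C8


def bit_difference(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def is_zeroing_xor(code: bytes) -> bool:
    """True for register-direct XOR (opcode 31) where source == destination."""
    rex = 0
    if code and 0x40 <= code[0] <= 0x4F:
        rex, code = code[0], code[1:]
    if len(code) != 2 or code[0] != 0x31:
        return False
    modrm = code[1]
    if modrm >> 6 != 3:  # mod must be 11: register operand, not memory
        return False
    reg = ((modrm >> 3) & 7) | (((rex >> 2) & 1) << 3)  # REX.R extends reg
    rm = (modrm & 7) | ((rex & 1) << 3)                 # REX.B extends rm
    return reg == rm


print(bit_difference(XOR_RAX_RAX, XOR_RAX_RCX))  # 1
print(is_zeroing_xor(XOR_RAX_RAX))               # True
print(is_zeroing_xor(XOR_RAX_RCX))               # False
```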
In conclusion...
So, is there a point to all this? I wrote it because I think it's fun, but is there something to learn here?
I think so. It makes a wonderful example for a general phenomenon I've encountered in a lot of different situations: questions to which a ballpark answer is fairly easy to give, but that keep getting gnarlier the more you try to pin them down. It's essentially an instance of the "coastline paradox": the closer you look, the more detail you see, and the more the answer changes.
Suppose I ask you "where am I?", and I'm okay with getting an answer that's within about 10 meters or so. If you have a handheld GPS unit, you can just hand it to me, and if I look at the display I'll get an answer. If I ask "where am I, down to the millimeter?", things get a lot more complicated. Specifying the position of a person down to a meter or so makes sense, but specifying it down to a millimeter does not. Position of what exactly? My center of gravity? The position of my center of gravity, projected onto the ground? The position of the tip of my nose? The center point of the hangnail on my left pinky? You can't answer that question precisely when the uncertainty inherent in the question is so much larger than the level of precision you're aiming at.
And by the way, I used x86 as an example here, but don't believe for a second the same thing doesn't apply to, say, the ARM chip in your phone. Modern ARM chips support multiple encodings and also rank over 1000 instructions if you count them at the same level of granularity as XED's "iforms". In fact it's pretty easy to get high instruction counts on any architecture as soon as any kind of vector/SIMD instruction set is involved, since most of them basically come in the form of "instantiate these 40 instructions for 10 different data types" (with lots of special magic that is either typeless or only works on certain types, of course). And yeah, x86 has plenty of historical warts in its encoding, but so does ARM, many of them on display in the current generation of chips, where chip makers have the pleasurable task of designing 3 distinct instruction decoders: old-school 32-bit ARM or "A32", the more compact but variable-size Thumb-2 or "T32", and the fixed-size-again 64-bit "A64" encoding.