A-LANG: WAYS TO IMPLEMENT COMPUTER LANGUAGES ON 6502s David A. Wheeler Here are some of ideas on how to create efficient and easy-to-use programming languages for 6502-based computers (such as the Apple II, Commodore 64/128, Atari 400/800, and the original Atari game machine). I haven't tried to implement these ideas at length, since I haven't directly used Apple II's (or any other 6502-based system) for some time (though there was a time I could rebuild an Apple II from memory!). For me, this is simply an intellectual exercise. I did create a partial prototype of some of these ideas, particularly some of the linking techniques. Still, this was an old problem that a friend of mine (Bob Pew) & I often hashed out. Bob passed away just when I completed writing the first draft of these ideas, and I was about to call him about my new ideas for this old puzzle we'd often chatted about. I miss him; perhaps this little file will serve as a small memorial to him. Perhaps some reader of this file actually DOES decide to do this. If you do: (a) are you SURE you don't have better things to do? (b) please let me know, I'd love to hear about it. Who knows; maybe the ideas of an "optimizing assembler" and "compiling linker" can be taken to other systems. It makes some sense, especially for systems where compilers are quickly written & can generate assembly, so that poor compiler code can be made a little better. It's possible these ideas have been created before, especially on far older systems. These are essentially notes, and are a little rough in places. There are probably a few minor bugs in various places. Sorry about that, but I think you'll still find them useful. --- David A. Wheeler File originally created 1994-04-11 This version (slightly revised) 2009-07-18 ================================================================ BACKGROUND/THE PROBLEM The 6502 is a weird beast with a very irregular instruction set. Bob Pew & I wanted a language that would be (a) faster to program in (than assembly) but be (b) fast when running. It needed to be pretty efficient. This turns out to be an intellectually challenging problem; below are (what I think are) some solutions. There are many odd things about the 6502 that make it hard to do this: + It's REALLY an 8-bit processor. It doesn't even have 16-bit data registers or instructions (many other 8-bit chips, like the Z80, at least had these). That means that, where possible, use 8-bit manipulations instead of 16 or 32 bit data types. + Its instruction set is rather irregular and has a few odd addressing schemes (esp. "zero page"). + Its built-in stack (for subroutines) is only 256 bytes long - possibly enough for storing the return address, but probably not big enough for storing data as well. + ANY kind of stack access takes more time and uses up a (precious) register... which is why assembly language programs tended to pass values via fixed address locations. + It has a tiny amount of addressable memory, so for "big" programs, you need to conserve space. A key to using the 6502 is to keep as much as possible in the "zero page." The zero page is the first 256 bytes of memory; the 6502 has a large set of special operations that use this area of memory, in many ways like an extended register set. Loads and stores are faster to the zero page, and some operations (such as indirect addressing) REQUIRE use of the zero page. ======================================================== THE SOLUTIONS I have developed a few new (to my knowledge) solutions to this problem: 1. Statically assign parameter and local variables in a special way. To do this, divide the work of compilation & linking in an unusual way (the linker does a ton of "compilation" work) to efficiently use 6502 resources, and in particular assign all local variables and parameters to static addresses. Global analysis is then used to assign addresses so that, in most cases, all copying between locations (e.g., between the addresses for parameters and addresses for local variables) are essentially optimized away. This could probably be best utilized by creating a different language, so it then discusses a potential ASM+ & Simple, a pair of languages that squeeze as much efficiency out of the 6502 as reasonably possible. However, the approach could be used by C, etc., especially if the programmer is willing to accept certain conventions and extensions (e.g., extending C to support pass-by-reference like C++ and then writing 6502 code that uses that extension in most cases). This approach can support recursion - even automatically - but there is a cost where the recursion occurs. 2. Fixed location data stack. This would be easier to implement & would support recursion more easily. This approach wouldn't be as fast or small as the previous approach, but it would support a more traditional approach (linking could just link, compiling could just compile). This approach would be particularly good for FORTH-based languages. You could extend this with a frame pointer (one byte?) and add easier support for traditional languages. [ Dan of opendtv (at yahoo dot com) has noted another way:] 3. "Use fixed zero page locations for locals and parameter passing, simulating a conventional CPU with a large register file. This eliminates most stack operations, except for saving/restoring the zero page "registers" at function entry/exit (leaf functions can use locations that don't need to be saved/restored). To reduce code size, the save/restore should be done with a JSR to a subroutine. This is really almost the same idea as #1, except that "register" assignment is done locally instead of globally. Global allocation would be nice, but it isn't necessary. The cc65 compiler partially implements this concept by allowing locals (but not parameters) to be explicitly assigned to zero page by using the "register" keyword. It's good for pointers since they can be used directly." Let me add a few comments on #3: Even non-leaf functions can avoid saving/restoring in many cases; if their zero page assignments don't overlap with another function's (including themselves), they don't need to save/restore. The notion here is that a call and return are slightly more expensive, because they must save/restore the input parameters and local variables, but then all accesses are way faster because they are at constant addresses in zero page (which is the best possible place to be). The save/restore can store data in a full data stack, so this would permit indefinite recursion. One oddity: The "obvious" implementation doesn't permit passing on addresses of local variables, because that will change once it's put the save/restore stack. A solution is to put such variables directly on the save/restore stack, which you could also do for larger local arrays; the local zpage values could then store its address on the save/restore stack, and you'd use special commands to access them. Yes, call/return takes longer due to completely copying the parameters and local temporaries, but that can be done in a tight loop at top speed; all other operations to the current frame happen using constant-location zpage accesses, which are blazingly fast, and that is likely to more than compensate. It's hard to argue that these are "always better", but I suspect they're better in at least some circumstances than other approaches. Stack-based approaches that use a zero page two-byte pointer to an arbitrary stack are flexible, but SLOW on a 6502 and create large code sizes. Anyway, if you're implementing these kinds of systems, you might want to think about these ideas. There's also the issue of whether to generate straight 6502 code, or use some sort of virtual machine... and if the latter, what to do. The 6502 has little memory, so some sort of VM that can do operations like 16-bit operations more easily has a lot of advantages. So further down, I then discuss some of the ways to implement virtual machines (like Sweet-16, p-code, threading, etc.); if you generate straight 6502 code, it's easy to run out of space in 64K, so often you want to use a virtual system simply to save space. The assembler syntax I've used below is S-C assembler like; in particular, I presume that the assembler will choose zero page access when it can. ======================================================== SOLUTION 1: Overloaded static parameter/local addresses - ASM+ and Simple. This approach creates overloaded static addresses for parameters and local variables, in a special way that maximizes use of the zero page. It's probably easiest if specialized languages are designed to exploit this property, so I also discuss 2 specialized languages to do so. The approach below uses the zero page as much as possible, overloading each zero page location with many different variable values (as long as they aren't used simultaneously). In fact, it's difficult to believe that normal humans could overload its usage so well, so this approach has the potential to be in some ways more efficient than humans. This approach doesn't _require_ specialized language, but it might be easier and more flexible that way, and originally I had in mind the idea of creating a high-level language specifically to be easier to efficiently implement on a 6502 (of which this idea would be part). Two specialized computer langauges are discussed here: ASM+: A macro assembler with 2 peephole optimization stages and a special macro language with common operations (increment integer, etc). It can be used as a "smart" assembly language directly, or as an intermediate compilation language generated by a "Simple" compiler. Simple: A higher-level language that would be C (and Ada) like, but would pass by reference (like Fortran) instead of by value (like C). Why reference-passing, when the 6502 poorly supports arbitrary pointers and copies? The answer is that, with this new overloading approach, reference passing could often be implemented by a copy that never actually happens (via address overloading). Since the 6502 doesn't have a usable data stack, this approach is more efficient on this platform (in the implementation, the "reference" is implemented not necessarily by copying values, but by having exactly the same address assigned to the parameter and the variable calling it, creating "zero-copy" calls where possible). Recursion could still be done, though recursive calls could be expensive when executed. By using Ada-like parameter passing (noting IN, OUT, and IN OUT) passing data is MUCH more efficient than for C, because most "copies" could be eliminated. Creating code when everything was implemented would go through the following steps (with product types in square brackets): [Simple Code as ASCII text, created by a programmer] -> Simple compiler. -> [ASM+ Code as ASCII text] -> Special preprocessor and/or done as part of the "linker" step -> [List of procedures & what they call; also input/output params, globals, and a list of variable sizes which need allocation] -> (At this step "compilation" is done. Now to link:) Run "linker" across all ASM+ code - first determine what procedures call what & then do a topological sort. Then determine address locations of all variables (including variables used to pass data between procedures), trying to use zero page as much as possible. Generate a file of all address locations. I've written a quickie "demo" program that handles the topological sort and shows that you really can stick in a lot of variables into zero page if you do this (it also generates a huge batch of EQU's that a simple assembler could use to support this idea). Note that you also want to "prefer" a memory allocation that eliminates copying memory for parameter passing; this is possible by judiciously "overlapping" memory locations from different calls. Pass one of ASM+ assembler. Transforms "early" macro expansions, then does a peephole optimization pass on the assembly-with-some-macros. Pass two of ASM+ assembler. Transform all normal macro expansions into 6502 code, then do another peephole optimization. [Executable Code] "Compiling" a Simple or ASM+ file is relatively cheap. Linking on larger systems would get expensive, since there would be some global analysis followed by assembly of everything. This is necessary to maximize zero page usage. The ASM+ assembler could fit in the restricted 6502 memory. The assembler would need to keep the macros, global definitions, jump addresses of procedures, and addresses for parameters in memory at all times; for each procedure definitions could be kept & then dropped. Code could come in & out via the disk. Thus, this could be self-hosting. The procedure call system should be by-reference, since this is cheaper on a system without a stack and where "copies" aren't really copied. By-value can be supported too, but the goal is to minimize actual copies. Values are simply written to fixed locations (preferably in the zero page) and the procedures are then JSRed. Return values are handled the same way. Recursive calls can be handled, though expensively: copy the values that might be changed onto a stack and/or malloced area, call the recursive call, then pop the values off the stack. This can be noticed at link time; by simply following the call tree you can see which calls "loop back", and wrap those calls with the commands to save & restore the variables covered by everything memory location possibly used "between". Implementation order: naturally not everything need be built at once. The ASM+ assembler could start out as a stock macro assembler, and just create macros for it in the stock macro language. The macro language would need the ability to do .IF's during expansion, since it would need to pick from a number of expansions depending on whether or not certain values were contant, in the zero page, etc. Thus only the "linker" would need to be built, which would simply examine for special macros that define procedures, etc, and generate addresses for everything. The peephole optimizers could be built later. Bob originally wanted an ASM+-like language, and he always liked Fortran-- I think he would have liked a Fortran-like parameter system! Here's a sample ASM+ syntax: "I" is used for (2-byte) integers, "C" for (1-byte) characters, "L" is used for (4-byte) longs, "P" for (2-byte) pointers, "R" for records of other sizes. Results are always the left-hand-side argument (so that later Simple operations will look similar). I++ A increment integer A. INC A ; BNE .1 ; INC A+1 ; .1 C++ A increment character A. INC A I-- C-- I:=I A,B copy B into A. C:=C A,B copy B into A. I:=C A,B copy B into A. I+=I A,B C+=C A,B I+=C A,B also -=, *=, /* I:=I+I A,B,C also -, *, /, and for C as well. MEMCOPY IFI==0 A, BR IFI==I A, B, BR also !=, >, >=, JMP X Two peephole steps could be used - one to see "wider" areas of code (collections of copies, etc.), and the second one to cover special-case asm instructions, etc. As a quick-and-dirty implementation, lex/flex or a regular expression engine could be used to implement the peephole optimizations. Now for the "Simple" language. It is to generate the ASM+ language. It's designed to be simple to parse (probably by recursive descent or LL(1), since they're easy to build & small). It would be easier to create if, at first, the language were VERY restricted. The syntax I've put down is very C-like, but with some Ada & Pascal influences. Ideally, with the "right" definition files or simple text preprocessing, it should compile on a stock compiler on other machines. The syntax should also fit within the target machine limits. For example, old Apple II's can't generate curly brackets ({}) or underscores (_). I would substitute the dollar sign ($) for _, and use [] instead of {} - again, to support self-hosting. As far as parsing it goes, recursive descent is often easy to understand, but I suspect an LL(1) parser would be better (see Fisher, page 120). Eventually it'd be good if Simple were written in itself, and using an LL(1) parser driver would mean that recursion wouldn't be necessary (which would be an expensive operation in Simple). Also, a table-driven parser would be much smaller (inserting comparisons everywhere would make the compiler large) & according to Fisher they tend to be faster too (that's surprising). The table could be used to at least list what are legal tokens where an error occurs, and for LL(1) it's easy to add an automatic error repairer. Down the road, it'd be better if there was a way to specify that some parameters get passed in a register, and/or fix certain addresses. This could be used to reduce link-time, -- precompile large modules, and could be used to create really nice interfaces to underlying OSs. While it might be difficult to generate code that used the params in registers, it'd be useful to low-level routines -- allow direct calls to assembly-level procedures. File = External_Statements* External_Statements = Global_Declaration | Subprogram_Definition Global_Declaration = ["private"] variable_definition Subprogram_Definition = ( "function" | "procedure" ) identifier [ "(" argument_list ")" ] Here's a sample Simple program: private char current_x, current_y; procedure move$cursor(in char x, in char y) [ int b = 0; // Move the cursor to the new position x, y. x++; y += current_y; b += 0x1fff; ]; Which would generate the following ASM+: MODULE LOCAL$FILE PRIVATE-BYTE LOCAL$FILE.CURRENT_X PRIVATE-BYTE LOCAL$FILE.CURRENT_Y PROC MOVE$CURSOR IN X,1 ; second parameter is size IN Y,1 ALLOC B,2 ; so far we've just been allocating space. Here's executable code! IZERO B C++ X C+=C Y, CURRENT_Y I++ B ; The following should probably deallocate storage in the assembler so ; that entire large programs can be assembled in one pass. ENDPROC MOVE$CURSOR ENDMODULE LOCAL$FILE Which in the end would probably generate something like the following: ; Example of Link-generated allocation of storage. MOVE$CURSOR.X EQU $10 MOVE$CURSOR.Y EQU $11 MOVE$CURSOR.B EQU $12 ; for low byte; also covers $13. LOCAL$FILE.CURRENT_X EQU $900 LOCAL$FILE.CURRENT_Y EQU $901 ; Example of code. MOVE$CURSOR LDA #0 ; zero STA $12 STA $13 INC $10 ; ++ for character. CLC ; += for character. LDA $11 ADC $901 STA $901 LDA $12 ; B += 0x1fff ADC #$ff STA $12 LDA $13 ADC #$1f STA $13 There's a lot of allocation operations in the intermediary code, but that makes sense.. the key to using the 6502 is to allocate scarce zero page resources, and the key to making a faster assembler is to allocate & deallocate procedure space so all procedures can be assembled at one time. There needs to be a keyword that hints "put this in the zero page." The C "register" keyword isn't quite right, since it's perfectly reasonable to take the address of such a variable. Perhaps the right keyword is "zpage". Probably recursive calls should be marked as "recurse" to note that they're expensive and that's okay, though maybe not - the compiler in particular might need to recurse. Ideally, the compiler/linker combo could figure this all out automatically (this is quite practical, since the linker has to determine the call tree any for allocation purposes). For self-hosting interactive use, it might be necessary to require the marker (since it's hard to do otherwise). Another option would be to require that to do recursive calls, callers must explicitly call functions that push/pop the storage area of that callee. E.G.: FPUSH(FUNCTION_NAME); // Push to a data stack the FUNCTION_NAME's locals ... now you can directly copy to the areas representing function parameters FUNCTION_NAME(...); FPOP(FUNCTION_NAME); The FPUSH would store the local variables, parameter area, etc., for FUNCTION_NAME, and FPOP would reverse it. A language that did this could be easily implemented and self-hosting, because you're depending on the human to figure it out. That's annoying if you recurse *often*, but if recursion is relatively uncommon, then that's not too bad. Procedures-in-procedure support could be added later. That's easy as long as they don't recurse - simply reference the address of the object. Various other words as compiler hints (like "register" in C) would be helpful, esp. for the loops (so that the X or Y register can be used as a loop register where appropriate). Such keywords probably include ZPAGE, REG{A,X,Y}. Perhaps FOR REGX I := 10 DOWNTO 0 {by 2} DO ... ENDFOR; could become LDX #$0a ; if constant start, don't need to check here! .1 ... DEX BPL .1 Probably it should support modules (like C, based on files), header files (maybe add a C-like macro expander?), etc. Note one odd thing: in a big program, it might be efficient if the tightest loops are in separate subroutines called by others; that way, since they're lower in the tree, they're more likely to get zero pages assigned to them. CC65, a freeware C compiler for 6502s, has a "-static-locals" option. This option lets locals be in static locations, and they correctly note that this produces faster code but doesn't permit it to be re-entrant. However, it doesn't appear that CC65 takes advantage of allowing function parameters to also be passed statically, nor does it exploit the approach discussed here to allocate so that copying (e.g., to parameters) is only notional and doesn't actually occur in the code itself. CC65 appears to be open source software, and is available at: http://www.cc65.org CC65 normally takes a different approach (based on the older SmallC); it uses the A and X registers as a register, and pushes parameters on a small stack through multiple JSR's (that are simple to implement as a compiler, but not really very efficient). Then a small text editor could be built, both to use these things (eventually the Simple compiler, linker, and ASM+ assembler should be self-hosted) and to show how to use these things. It would probably have a buffer module (for storing the text) and a display module (to take what's in the buffer & display it). If text has been modified, move it to a line buffer internally so that inserts & deletes don't move everything. Allow cursor motions keys like wordstar (control ESDX), delete previous letter, move around like emacs, find (control-F?), and ESC back to a menu for saving, loading, cataloging, help. Two option sets: one for normal text vs. one for programs, and a general host-based one (shift key mod, etc). The key here is to have a basic, easy-to-use text editor; one didn't come with the Apples, oddly enough. ==================================================== SOLUTION 2: FIXED LOCATION DATA STACK This approach depends on fixing the location of the data stack, but spreads the lower and upper bytes to different locations (possibly different pages) instead of having the bytes be adjacent in the memory space. It would work nicely for Forth, C, etc. The trick is that the upper & lower bytes of multibyte values aren't stored contiguously in memory, but are instead "spread" into parallel pages; this trick increases the available space significantly, makes stack movement code faster, and increases safety. IE, you have "all data stack low bytes" continguous, and "all data stack high bytes" continguous if you allow 2-bytes of data on your stack. Though it's the normal 6502 convention, there's no REQUIREMENT to store 16-bit values in the order of "lower 8 bits", followed immediately by "upper 8 bits". The upper & lower bytes could be separated by a standard distance (e.g., 256 bytes) instead, and that might have some advantages. If you create fixed positions for the array of "lower bytes" and the "upper bytes", as long as there aren't more than 256 elements, access is easy. Simply set the X or Y register to the index, then use the "address,X" or "address,Y" access mode to get or store the lower & upper byte of whatever you want. Of course, the address+X shouldn't cross pages, since this would slow down access on an already slow machine. Thus, the X or Y register becomes the stack pointer, into a stack that might use 2 pages. For data stacks (necessary for Forth, C, etc), this works out particularly well. This means that if you're willing to live with a 256-element data stack (not completely unreasonable), you can have a pretty clean & quick interface. Moving up & down the stack involves a single increment or decrement to the X or Y register, the cheapest possible operation. You can also cheaply access operations "inside" the stack by selecting precomputed offsets, again making access MUCH easier. I'd earlier toyed with a 128-element data stack, but this "spreading" approach is easier, and allows twice as many data elements! Pushing and popping are cheaper (one inc/dec instead of two, without lots of carry checks). This also means that char-vs-int errors aren't deadly (as they are if chars push 1 byte and ints push 2). Just use the X or Y register as the data stack pointer, and have two pages set aside for data (one for low, one for high). Even better, you could have 4 pages, and long ints can be passed around without worrying about incorrectly taking them off the stack as the wrong size! Type conversions between char, int, and long become no-ops, eliminating a set of dangerous mistakes. If you want the approach but can't live with only 256 data elements, you could check on each stack push & pop, and move (most of the) data elements when the stack got full or empty. This is slower, obviously. You do a check less often (e.g., on each call/return to a new function). Conversely, if you're willing to accept a far smaller data stack, the results would be ESPECIALLY efficient if you slap the stack into the zero page. If you normally have to deal with 16-bit integers, this won't waste any more space, but it saves all the incrementing/decrementing you have to do to fix up stacks, AND if there's an error in stack manipulations (a serious problem in Forth) you'll have far less damage. With a split stack, you coudl even put the low byte in zpage, and the high byte elsewhere... if you often use 8-bit values (e.g., as truth values or characters) then that could be quite useful. The original fig-FORTH did this, and was remarkably fast. You could even combine techniques: use the zero page for the data stack, with a standard offset. If the stack overflows or underflows you copy to/from elsewhere. Having a zero-page stack, accessed this way, with overflows might give really good performance for a data stack-based system. One problem is that the stack can't handle many large data elements (no local variables with 200 bytes each); I've programmed on such systems before, and as long as there's a heap allocation system that's not a serious problem. The implementation could even put variables beyond a certain size on the heap automatically. One design decision: do you keep an X or Y register set nearly all the time as the stack pointer? Almost certainly yes, since this would be heavily used. If so, do you use the X or Y register, and which direction do you grow (up or down)? I'm not _certain_ if the X or Y register is best, nor if growing up or down is best. At first I used X, then someone suggested I use Y, but after examination I think X would be better. There are some accesses using indirect pointers that are useful and require the Y register. Even more importantly, many instructions such as the incrementing and decrementing instructions can only be used directly if X is used as the stack pointer; with Y it's more complicated. (Brad Rodriguez came to the same conclusion - that X would be better - through the same method - analyzing the address modes available. See: http://www.bradrodriguez.com/papers/moving1.htm ). Here's a sample. For the stack, the X-reg points to the top of stack, growing downwards in memory. Thus X starts large & get smaller as more items get put onto it. If it becomes 0, the stack is full. On decrementing X, you can check to see if it's become negative, or to be more paranoid you could declare an error if it's zero (X=0). Set X=$FE (say) as an "empty" value. To look "inside" the stack, simply use addresses with assembly-time increments (i.e. no runtime cost!). ; perhaps alias "LOWSTACK+1" as "SECOND$PARAMETER$LOWBYTE", ; "LOWSTACK+2" as "THIRD$PARAMETER$LOWBYTE", etc. ; LOWSTACK and HIGHSTACK could be in the zero page, giving less data space ; but more speed and smaller code -- or they could be at the beginning ; of two different pages, giving 256 possible data values. ; Note that the assembly language given here can assemble either way, ; so that could be an easy configuraion option. ADD-INTEGERS ; pop & add top two integers on the stack. ; pushing result onto stack. CLC LDA LOWSTACK,X ; add low bytes. ADC LOWSTACK+1,X ; this is how to look at the "one inside". STA LOWSTACK+1,X ; replace 2nd parameter (which will be the return value) LDA HIGHSTACK,X ; add high bytes. ADC HIGHSTACK+1,X STA HIGHSTACK+1,X INX ; pop the stack (once), leaving result on stack. RTS DUP-INTEGER DEX ; push the stack, making room for the new value. ; Always make room for a value first by pushing first; ; that way, the system could support ; interrupts which use this stack too. LDA LOWSTACK+1,X STA LOWSTACK,X LDA HIGHSTACK+1,X STA HIGHSTACK,X RTS If the stack were always pushed before data was stored on it, interrupt handling could be well-supported. This would make it easily possible to create time-sharing on a 6502 (a frightening thought!). Obviously, stack-checking code could be inserted too, at some cost in code size and performance. Let's compare this to the the FIG Forth implementation available at: http://www.6502.org/crossdev/lang/index.htm Here's its code for PLUS (+): CLC LDA 0,X ADC 2,X STA 2,X LDA 1,X ADC 3,X STA 3,X INX INX JMP NEXT That's not too bad. It does use the zero page, making it more efficient (it's only 15 bytes). It has to do a some work to manipulate the stack. Forth cells on 6502s are essentially universally 2 bytes long, and that's fine. But because it puts the high byte right after the low byte (as is traditionally done), it has to do more fiddling with the X pointer (e.g., every time it adds or drops something from the data stack, a key operation, it has to do two increments or two decrements of a register instead of just one). Access to the top-of-stack requires indexed addressing mode, which as we'll discuss is unnecessary. Even more importantly, if the programmer makes a mistake, it's curtains for the data. In contrast, having separate high and low stacks encourages the use of single-byte data (something the 6502 is FAR more efficient at), since it no longer is as dangerous to use in a Forth system. So, you could have Forth words that specifically operate on 8-bit values, which could be implemented in-line (since they're fast). For example, if truth values are always considered to be 8 bits, then IF and so on only need to look at the low byte, and tests only need to set the low byte... which, given the many tests a real program needs to perform and then check, could be a real savings. The FIG Forth implementation is a word-based intepretive system, which is slower than a JSR-based implementation (where the Forth words are compiled into JSR WORD everywhere)... I'm partial to the JSR design due to its speed, but the approach described in this paper will work with either. So, how do you implement stack-based approaches? Some Forth implementations store the top of the data stack in a register, and NOT in the stack itself. That's definitely worth thinking about for ANY system that has a data stack. Unfortunately, the 6502's puny registers make this a little tricky to do with its built-in registers. (But zpage can be used as a register; more about that in a moment.) Here's a much LESS efficient way to access the stack, by a real program. Mini-QForth works by calling a routine that pops the top-of-stack into the register pair A and Y, and then another routine that pushes the pair onto the stack. This uses up 2 of the 3 registers, and creates lots of extra work. For example, Here's the mini-QForth code that implements "Add the top 2 numbers": ADD JSR POPDATA STY TEMP STX TEMP+1 JSR POPDATA TYA CLC ADC TEMP TAY TXA ADC TEMP+1 TAX JMP PUSHDATA And this has to call PUSHDATA and POPDATA, which are: PUSHDATA TXA LDX DATSTACK STA DATAAREA,X DEX TYA STA DATAAREA,X DEX STX DATSTACK INC DATITEMS BEQ :ERROR RTS POPDATA LDA DATITEMS BEQ :ERROR DEC DATITEMS LDX DATSTACK INX LDA DATAAREA,X TAY INX STX DATSTACK LDA DATAAREA,X TAX RTS That is a LOT of code. It's WAY more slower and more complicated than Fig-Forth, for example. This is terribly inefficient, and I don't see the upside of it. A very interesting alternative is to consider using 2 fixed zero-page locations as the top of stack value (not a pointer, but the actual value itself). After all, using a register as the "top of stack" value is a common optimization on other processors, and accessing a fixed location in zpage is about the fastest thing a 6502 can do unless the data fits in its puny registers. There's a trade: * Operations that don't move the stack up or down are faster (because you can use direct zpage access to the TOS), and you can directly use the TOS as a pointer (if it's in the moving stack you'd have to copy it to another zpage location to use it) * On the other hand, any implied push/pop requires copying the TOS value to/from the "traditional" stack. In such a case, a "+" operator that added the top two items on the stack, putting the result onto the stack, would look like this: CLC LDA LTOS ; two bytes; constant address, so VERY fast. ADC LSTACK,X ; two bytes if stack on zpage STA LTOS LDA HSTACK,X ADC HTOS STA HSTACK,X INX ; pop stack (split stack means we need only one) This is only one more byte long than the code for loading/storing from constant zero page addresses (the one INX), and only 3 more cycles in time (for the 3 instructions that use the X index). Even more remarkably, it only takes 1 more cycle of time than when adding a value in a constant address in zero page with a value in a constant address outside of zero page (because both "zero page,X" and "Absolute" addressing modes take 4 cycles). Storing the top-of-stack (TOS) value in the zpage is ESPECIALLY nifty for the case where a function takes one argument, and returns one argument... in that case, the caller will need to push the "old" one, but setting the new TOS, and getting the return value, all involve writes to constant zpage locations.. much faster. Here's the ENTIRE code for increment TOS (Forth's 1+): CLC ; 1 byte, 2 cycles INC LTOS ; 2 bytes, 5 cycles BNE .1 ; 2 bytes, 2 cycles if not taken (3 if taken, 4 if cross page) INC HTOS ; 2 bytes, 5 cycles .1 ; 7 bytes total, 15 cycles in most cases (most of the time branch is taken) This is ESPECIALLY nifty because you can then store LTOS directly before HTOS in zero page - which means that you can then use the TOS _directly_ as a pointer. A problem with the "split Low/High" stack is that you have to copy it elsewhere to use as a pointer, but if it's already there in the TOS, then there's no copying required at all.. you just use it! Let's illustrate that; here's an operation that uses the top-of-stack as the address, replacing the top of stack with the 16-bit value at that address. This is Forth's "@" word, aka "fetch", with format ( a-addr -- x ); it's also the equivalent of C's (*p) when used as an rvalue (e.g., x = *p): ; TEMP should be in zpage. LDY #0 LDA (LTOS),Y STA TEMP ; don't overwrite address until we get both bytes INY LDA (LTOS),Y STA HTOS LDA TEMP STA LTOS What about storing values? Forth's "!" word (aka "store") has the format ( x a-addr -- ); in C, this would be *p used as an lvalue (e.g., *p = x): LDY #0 LDA LSTACK,X STA (LTOS),Y INY LDA HSTACK,X STA (LTOS),Y INX ; Pop one item from stack. INX ; pop second item from stack. (You could insert an INX before this sequence; then you can choose if you want to drop the item being written or not. C in particular lets you do x = y = z = expr, so having a "doesn't throw it away" version is useful. Or, you could require the use of a separate "pop" operation, and NEVER throw expressions away. Or, always pop away, and just use DUP if you need it.) If you're implementing a Forth, you really need to look here: http://www.bradrodriguez.com/papers/moving1.htm The usual FORTH implementation is threaded, and need to implement: * NEXT: Find and run the next "instruction". * DOCOLON or ENTER: The "instruction" at the beginning of a function which stores the IP on the return stack (so that the function can return later) * EXIT (;S): The "instruction" at the end of a function which returns. Indirect Threaded Code (ITC) is the traditional Forth approach. NEXT: (IP) -> W fetch memory pointed by IP into "W" register ...W now holds address of the Code Field IP+2 -> IP advance IP, just like a program counter (assuming 2-byte addresses in the thread) (W) -> X fetch memory pointed by W into "X" register ...X now holds address of the machine code JP (X) jump to the address in the X register DOCOLON: (aka ENTER) PUSH IP onto the "return address stack" W+2 -> IP W still points to the Code Field, so W+2 is the address of the Body! (Assuming a 2-byte address -- other Forths may be different.) JUMP to interpreter ("NEXT") EXIT: POP IP from the "return address stack" JUMP to interpreter ("NEXT") In direct-threaded, you change NEXT to: (IP) -> W fetch memory pointed by IP into "W" register IP+2 -> IP advance IP (assuming 2-byte addresses) JP (W) jump to the address in the W register and ENTER is invoked via a "JMP" in front of its address. (If you want direct 6502 code, you'd do that instead). Which would be faster on a 6502. Implementing a C compiler this way would produce less efficient code that a "Simple" compiler would, in general, since reading or writing many values requires an indirection using a register (,X or ,Y) instead of a direct write to memory, and would mainly use non-zero-page (instead of zero page). Thus, it'd take more bytes of code, and it'd be slower to execute. However, recursive code would be much easier to handle - Simple would have to work hard to do it. Also, implementing this scheme is _extremely_ easy, while doing "ASM+" right would take a LOT more work. Having the TOS be a pair of bytes in a FIXED zero page location makes this approach lots more efficient, thankfully. If I were building a "language shop", this would be a good way to add a Forth & a C compiler to the "language collection." Programming languages which are often implemented using a stack-based language would fit this model easily too (Java, C#, Python, Perl, PHP, Pascal). The next trick would be to figure out how to make sure each can call the others. Probably not too bad via some special "external language call" interface, at least no worse than other approaches. Forth in particular would work nicely in this scheme. This brings us to the idea of frame pointers. It'd be nice to be able to access the calling parameters and local variables of a more traditional language (C, Ada, Pascal, etc.). The traditional method of letting stacks grow/shrink, yet getting access to fixed locations, is by using a frame pointer. Forth doesn't normally provide frame pointers, and has no idea how many data values are pushed/pulled onto the data stack by each function, which forces the human to do all the work to figure out "where they are" (ugh!). This is one of the reasons Forth can be tricky to use (LOCAL helps, but adds inefficiencies). Now, if you know how many words a Forth call consumes and produces on the stack, you don't need a frame pointer... you could just insert calls that "dup 3 down" or whatever, as long as the system can track how many items are on the stack (where branches merge, like the end of an "IF" statement, there must be the same number). So, you can have frame pointers without their overhead, just insert the constant offset to the "current frame pointer". For functions that return a value, you can insert the "return value" on the stack before everything else (this is "caller result allocation", per Baker's "CONS should not CONS its arguments"), and if there are any OUT or INOUT parameters, put them on the stack FIRST... that way, you can simply drop all the other parameters on a return. The "obvious" implementation of this means you must recompile everything that uses a function if the number changes, which is terrible for interactivity. For interactive use this could be stored specially and loaded by callers, restoring interactivity. If the parameters are all in one page (or split high/low between two pages), then the frame pointer is only one byte long; that could easily be pushed to the return stack as part of the call, and pulled on the return. If the push and pull was done by the callee, then this could be done optionally only by the functions where it helped. For C, you could implement this via a specialized language (like Sweet-16). But instead of fixed register locations, you could have loads/stores be "relative to current stack" positions, to permit easy access to parameters, temporaries, and return values. Other operations could always work on the "top of the stack". Callers would place the operations on the stack and call the function; the callee would push to the stack for each temporary, and then start working. E.G.: function int p1(): return p2(3,4) + 5 function int p2(int x, int y): return x + y + 22 Converts to: ; Compute right-hand-side. First, we must call p2, so push its arguments P1: PUSH #3 PUSH #4 CALL P2 PUSH #5 ADD ; Adds top two items on the stack RETURN P2: ; GOOD generated code: ADD PUSH #22 ADD RETURN P2-NAIVE: ; Here's naive code; if the compiler can't figure out the optimization, ; here is code that handles the general case that is easily generated: DUP-1 ; duplicate the item one-less-than top-of-stack. Compiler tracks frame DUP-1 ; copy in "y", which is now one less than top-of-stack. ADD PUSH #22 ADD ; At this point, the TOS has the result, we want to save it to where ; the top of stack WILL be, which is the beginning of the set of in params. STORE-1 ; Store one into the stack DROP RETURN Ah, but how can we have such optimizations? The key here is that x and y as used here are the LAST times these are used in the function/procedure. If these are the last times x and y are used, we don't need to duplicate them for future use!! This means that a sequence of "DUP-1 DUP-1" for values we'll never use again can be completely eliminated; note that the frame pointer will now have a different result. The ADD will happen as usual, as will the PUSH #22 and ADD. Then when we want to STORE it, we can trivially notice that the "store" is unnecessary - the value is already at the right place. Then we return. Hmm, how can we do that optimization? We'll want to convert to an intermediate format so we can do these optimizations, which is a non-problem when compiling on a larger system, but can be an issue when self-hosting. Is there a simple intermediate format we can convert to? I'm sure there is, here's one idea. Imagine that we first convert this to: ; return x + y + 22 P2: LOADP FIRST-PARAMETER LOADP SECOND-PARAMETER ADD LOAD #22 ADD STOREP RESULT ; equal position as FIRST-PARAMETER RETURN And as we do this, we note where the LAST use of each parameter is. (In this case, both LOADP's are the last time). Then, for each sequence of LOADP's where this is the last use, we determine if the stack is already set up that way... and if it is, the "LOADP" just turns into "empty". Similarly, for each STOREP, we determine if it's already there, and if so, it becomes a no-op. If this is a pain, we could decide to only to try to do this in "returns" or the last statement, since that's where they are more likely. A quickee tail-recursion optimization would be to note that the function is calling itself, copy the parameters down to the positions of the function, and then jump. (It's not Scheme, but hey.) There must be SOME literature on this. I've found: * http://www.ece.cmu.edu/~koopman/stack_compiler/stack_co.html "A Preliminary Exploration of Optimized Stack Code Generation" by Philip J. Koopman, Jr, Journal of Forth Applications and Research, 1994, 6(3) pp. 241-251. * http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1750 "Reducing Loads and Stores in Stack Architectures" by Thomas Vandrunen , Antony L. Hosking , Jens Palsberg (2000); this one even provides a set of useful optimizations patterns, along with proofs they are correct. This one's been further proved here in http://www.cs.york.ac.uk/ftpdir/reports/2006/YCSTR/03/YCSTR-2006-03.pdf "Verification of an Optimisation Algorithm of Stack Machine in Functional Programming Languages" by Guanhua He. * http://www.ece.cmu.edu/~koopman/stack_computers/index.html "Stack Computers: the new wave" by Philip J. Koopman, Jr. This one is an overall discussion, but no specific algorithm for this case. * http://portal.acm.org/citation.cfm?id=321901 "The Generation of Optimal Code for Stack Machines" by J. L. Bruno and T. Lassagne Journal of the ACM (JACM), Volume 22 , Issue 3 (July 1975) Pages: 382 - 396, ISSN:0004-5411. Starts with a full tree, and then generates stack machine code. Note: Supporting "in", "inout", and "out" (a la Ada) would be really helpful. The 6502 is awful at pointer manipulations in general, but indexes aren't too bad, so if you can return multiple values from a function (e.g., inout and out) you can be far more efficient. { There are many other possible optimizations, of course. QForth uses hashes to reduce identifier search time, see: http://forum.6502.org/viewtopic.php?t=555 } Of course, there are lots of ways to mix "fixed location" and "stack-based" approaches. Every stack access on a 6502 is a major chore, especially if full 16-bit addresses are permitted. For example, on a subroutine call, you could have fixed locations in the zero page that are used for the current subroutine's arguments and local parameters, and then use a stack for temporaries to do calculations (e.g., "add" would add the top two numbers on the stack). On a call, you'd save the zero page addresses you're going to use to store the arguments, set them (e.g., from the stack), and do a call. The callee would save any additional zpage needed for local variables. On return, the zero page values would be restored so that the caller gets their state back, and you can put results (if any) on the stack. Obviously, this involves copying variables around, but the point is that you hope to get "paid back" by having simple fixed zpage addresses for arguments and local variables... which might very well be true. The two different approaches need not be completely independent. For Simple, using the basic idea implies that integer arrays (up to 256 elements) could be stored as an array of low bytes, then an array of high bytes, making access easy. In fact, it seems like 2 macro languages could be developed - the ASM+ one above, based on fixed addresses, and a stack-based macro language, and lots of reuse could occur. The underlying assembler & peephole optmization system could be the same, and the same 2nd-stage peephole optimizations might be used for both (though some optimizations would only be useful in one or the other). Then you could allow something of the form: PROC HELLO ; proc hello(in int c, out char d) TEMPS 2 ; use this so can set up space for temporaries. S-IN-I C ; make sure these are in order. S-OUTI D ; the macros remember order to make declaration easy TEMP-I E ; int e; TEMP-I F ; int f; SZEROI F ; f := 0; SI+ E, C, F ; E := C + F; SI:= D, E ; D := E; SI++ E; ; I++; ENDPROC ; pop off temporaries, deallocate in assembler. Would generate: HELLO DEX ; space for E ; stack overflow check if desired. DEX ; space for F ; stack overflow check if desired. ; Now stack looks like: (top/low mem) F E D C (bottom/higher mem) LDA #0 ; F := 0 STA LSTACK,X STA HSTACK,X CLC ; E := C + F LDA LSTACK+3,X ADC LSTACK,X STA LSTACK+1,X LDA HSTACK+3,X ADC HSTACK,X STA HSTACK+1,X LDA LSTACK+1,X ; D := E STA LSTACK+2,X LDA HSTACK+1,X STA HSTACK+2,X ; Now need to implement E++. Since ; INC LSTACK+1, Y is not a legal addressing mode, ; this is a good argument for using X and not Y. LDA LSTACK+1,X ; E++ CLC ADC #1 STA LSTACK+1,X LDA HSTACK+1,X ADC #0 STA HSTACK+1,X .1 INX INX ; pop temporaries. Should params be popped by caller? RTS A few notes regarding incrementing - Geoff Weiss noted that if you find that you rarely have to increment the high byte, and want to get better performace from incrementing 8-bit values and add a penalty to the 16-bit operation, use this code: LDA LSTACK+1,X CLC ADC #1 STA LSTACK+1,X BCC .1 LDA HSTACK+1,X ADC #0 STA HSTACK+1,X .1 Again, note that implementing "E++" can't be done with the following code, since it doesn't use a legal addressing mode: INC LSTACK+1,Y ; E++. This addressing mode is not legal. ; Should use X instead of Y. BNE .1 INC HSTACK+1,Y COMPARING THE TWO APPROACHES Approach #1 takes more work to implement, has the linker do more work, and doesn't handle recursion as easily, but it has lots of performance advantages. Approach #2 is probably much easier to implement, but it has its weaknesses. Let's examine adding two numbers (ignoring any return) to compare them. Here's approach #1, adding A and B and placing them in C (to keep things consistent, I'll only use normal adding and not the adding that speeds low-only-byte adding while making higher-byte adding more expensive and adding code): ADD-INTEGERS CLC LDA A$LOW ADC B$LOW STA C$LOW LDA A$HIGH ADC B$HIGH STA C$HIGH Here's approach #2: ADD-INTEGERS ; pop & add top two integers on the stack. ; pushing result onto stack. CLC LDA LOWSTACK,X ; add low bytes. ADC LOWSTACK+1,X ; this is how to look at the "one inside". STA LOWSTACK+1,X ; replace 2nd parameter (which will be the return value) LDA HIGHSTACK,X ; add high bytes. ADC HIGHSTACK+1,X STA HIGHSTACK+1,X INX ; pop the stack (once), leaving result on stack. Pulling up the 6502 opcodes (e.g., at http://www.6502.org/tutorials/6502opcodes.htm), we can compare the approaches. Approach #1 - all zero page: 13 bytes of code, 19 cycles to run no zero page: 19 bytes of code, 25 cycles to run Approach #2 - all zero page: 14 bytes of code, 26 cycles to run no zero page: 20 bytes of code, 28 cycles to run Approach #2 is only one byte longer in code because it has to manipulate the stack at the end. Approach #1 might have to do copying, which would take more effort and make it much slower than #2, but if the linkage overlap idea works well that copying is rare. Approach #2 is always slower than #1; even when approach #1 can't use a zero page, it's faster than approach #2 when it's using the zero page. Is that a problem? Well, it depends. Approach #2 is more traditional. Notes on creating other languages: A finalization system would be useful, since a number of allocation requests could be to compensate for the inability of the system to have large local variables. Heck, an automatic garbage collector might be fine - mark & sweep collectors wouldn't have much to do, since there's not much memory to sweep. Initialization and adjustment (like Ada 95) would be useful in eliminating common errors. Inheritance can be implemented as with C++ or Ada 95. Dispatching would be little expensive, though; each indirection would require copying an address into the zero page and using it. A completely controlled, unbounded-length string type would be very handy; it would make text manipulation a lot easier. Obviously, for real efficiency this would need to be combined with many other techniques. For example, using registers is even faster, so using A,X, and/or Y to pass some data is even faster in most cases (no saving out and writing back). However, there are only 3 8-bit registers, and they are needed for other things too, so while using them in part for parameter passing is a good idea, it's limited in its utility. PHA and PLA and the zero page can be used to briefly store and retrieve registers, and there are lots of clever techniques for performance aids too. It's critical to use 8-bit (char) values when you can, e.g., for booleans, because the 6502 is FAR more efficient at them. Even on the rare cases where Approach #1 has to do a copy, it's far less. Approach #2 even encourages this, because an error in typing is less catestrophic. APPROACH #3 This approach has all parameters and local variables in fixed zpage locations, so inside a function/procedure, it's just as fast. Since you don't HAVE to do analysis across the whole program, this can be easily self-hosting.... each input parameter, and each local (temporary) variable, is just assigned the next zpage location as you compile. You should probably have a separate zpage byte pair at a fixed location, call it "RESULT", to hold any function return values; that way, it's easy to access the result after function completion. You should store values in LOW, HIGH order (as traditional), so they can be used as pointers. Let's call the starting position in zpage FRAME. Each call becomes the following: * Save the locations used by the callee to saved-stack. If the callee has no parameters, and no local variables, this does nothing. * Set the input parameters. These are shared by the caller, so be sure to not overwrite ones that will need to be read from while creating the call. If simplicity of the compiler is more important than efficient code, you could ONLY reference the save-stack when accessing current values, and then write the new parameters. * JSR to the new routine. Each return becomes: * If a function, save to RESULT * Restore from saved-stack * RTS So for example: int x(a,b) { int c; // This is a local variable. Remember that used space. a = 0; b = 4; a++; c = a + b; c = y(c,b); return c; } int y(f,g) { int q = f + g; return q; } Becomes: FRAME .EQU $00 ; Or somewhere on zpage PARM1 .EQU FRAME PARM1H .EQU PARM1+1 PARM2 .EQU FRAME+2 PARM2 .EQU PARM2+1 PARM3 .EQU FRAME+4 PARM3 .EQU PARM3+1 FUNCTIONX: LDA #0 ; a=0 STA PARM1 STA PARM1H STA PARM2H ; b=4 LDA #4 STA PARM2 ; okay, this reordering wouldn't be done by a naive impl. INC PARM1 ; a++; BNE .1 INC PARM1H .1 CLC ; c = a + b LDA PARM1 ADC PARM2 STA PARM1 LDA PARM1H ADC PARM2H STA PARM1H ; LDY #FUNCTIONY.SAVELEN ; save results to prepare for the call to Y JSR SAVEFRAME LDA PARM3 ; y.f = x.c STA PARM1 LDA PARM3H STA PARM1H ; "y.g = x.b" is a copy to its own position - compiler should notice this! ; LDA PARM2 ; STA PARM2 ; LDA PARM2H ; STA PARM2H JSR FUNCTIONY ; Compiler should notice "copy result to result" is a no-op. LDY #FUNCTIONX.SAVELEN JMP RESTOREFRAME FUNCTIONX.SAVELEN .EQU $4+$2 ; bytes for parameters + bytes for locals ; int y(f,g) { ; int q = f + g; ; return q; FUNCTIONY: CLC ; q = f + g LDA PARM1 ADC PARM2 STA PARM3 LDA PARM1H ADC PARM2H STA PARM3H ; store q into result LDA PARM3 STA RESULT LDA PARM3H STA RESULTH LDY #FUNCTIONY.SAVELEN JMP RESTOREFRAME ; This is JSR RESTOREFRAME followed by RTS. FUNCTIONY.SAVELEN .EQU $4+$2 ; bytes for parametesr + bytes for locals. We need SAVEFRAME and RESTOREFRAME, so let's do this: FRAMEPTR .EQU $FE ; Point to beginning of stored frames. SAVEFRAME: ; Store Y items onto the saved frames. We'll grow down. ; First, make room for the new values STY TEMP CLC LDA FRAMEPTR SBC TEMP STA FRAMEPTR LDA FRAMEPTR+1 SBC #0 STA FRAMEPTR+1 ; TODO: Could check if we've run out of room. ; Now, copy the values into the new area: .1 LDA FRAME,Y STA (FRAMEPTR),Y DEY BVS .1 RTS RESTOREFRAME: STY TEMP .1 LDA FRAME,Y STA (FRAMEPTR),Y DEY BVS .1 ; Now give up the area CLC LDA FRAMEPTR ADC TEMP STA FRAMEPTR LDA FRAMEPTR+1 ADC #0 STA FRAMEPTR+1 RTS If a function has no parameters, and doesn't have local variables, you can skip the copying. The stack that stores frames - call it the "framestack" - can be used to store arrays (including large arrays) on the frame, essentially an alloca()-like approach. E.G., to support: int f() { char buffer[1024]; } the start of f() can increase the framespace by 1024, and store the resulting start-of-frame (buffer's start) in a local variable on zpage. Now accessing buffer is pretty easy. On return, deallocate this. You can even handle displays (as required by Pascal) - just store the back reference in a local parameter. You could even implement a stack-based system this way; the frame of parameters could implemented this way, and the stack (e.g., for temporaries or parameter passing) could be built separately. VIRTUAL MACHINE INSTRUCTIONS There's only 64K, so if the program is longer, you may want to implement an instruction set that takes less space and generate that (except for the most time-critical components). For the virtual machine to run FAST on 6502, it should be designed to be fast to implement a 6502. You MUST look at Woz's Sweet-16 for an example: http://www.6502.org/source/interpreters/sweet16.htm which is amazingly short. You could easily switch in and out of use of sweet-16, and in addition, you could even JSR specific Sweet-16 instructions straight from 6502 code if you just want to use 1-2 operations... making it very easy to intermix them. This has 16 registers (0-15); R0 is the accumulator. These are its opcodes: 00 RTN Return to 6502 code. 0l ea BR addr Unconditional Branch. 02 ea BNC addr Branch if Carry=0. 03 ea BC addr Branch if Carry=1. 04 ea BP addr Branch if last result positive. 0S ea BM addr Branch if last result negative. 06 ea BZ addr Branch if last result zero. 07 ea BNZ addr Branch if last result non-zero. 08 ea BM1 addr Branch if last result = -1. 09 ea BNM1 addr Branch if last result not -1. 0A BK Execute 6502 BRK instruction. 0B RS Return from SWEET-16 subroutine. 0C ea BS addr Call SWEET-16 subroutine. Register Opcodes: The SET opcode uses three bytes, to load a 16-bit immediate value into a register. All the rest of the register opcodes only use one byte. ("MA" = memory address) 1n lo hi SET n,value Rn
↧