home   |   stacks treatise index   |   1. Intro: stack basics   |   2. subroutine return addresses & nesting   |   3. interrupts   |   4. virtual stacks   |   5. stack addressing   |   6. passing parameters   |   7. inlined data   |   8. RPN operations   |   9. RPN efficiency   |   10. 65c02 added instructions   |   11. synth instructions w/ RTS/RTI/JSR   |   12. where-am-I routines   |   13. synthesizing 65816 stack instructions   |   14. local variables, environments   |   15. recursion   |   16. enough stack space?   |   17. forming program structures   |   18. stack potpourri   |   19. further reading   |   A: StackOps.ASM   |   B: 816StackOps.ASM   |   Appendix C


6502 STACKS TREATISE


Stack addressing, both hardware and virtual, plus tricks
(there will be more tricks in later sections)

We have discussed four types of stacks:

  1. hardware (in page 1 on 6502), the one always used for subroutine and interrupt return addresses (although its use is not limited to those)
  2. main data stack (residing in page 0 on 6502, partly because of the extra addressing modes available there)
  3. floating-point, high-precision, complex, ...?..., similar to #2 above but typically for larger cells of say four bytes or more, and, since it doesn't need the added addressing modes afforded by ZP, might be anywhere in RAM, not necessarily in page 0 or 1
  4. strings & other random-length items


Dealing with the hardware stack, let's first get into a few tricks to make things interesting before moving on.


#1. If you have a JSR instruction immediately followed by RTS (ie, the JSR instruction is at the end of a subroutine, calling another subroutine), you can usually replace the pair with JMP, and the subroutine's RTS instruction will serve to return from both, saving a byte and nine clock cycles' time.  Add the comment as follows to explain what you did:


        JMP  <subroutine_name>   ; (JSR, RTS)

(This is in the program tips page of the 6502 primer.)  The only time you wouldn't be able to do this is if you pass parameters to the subroutine via the hardware stack, as the subroutine would expect the return address put there by the extra JSR to be on top.  In that case, using the JMP instead of JSR would mean the input parameters would not be at the depth where the subroutine expects to find them.  (This is another case where a separate data stack proves helpful.)


#2. You can temporarily store values without using variables, by using PHA/PHX/PHY...PLA/PLX/PLY.  It's slighty faster than STA/STX/STY abs...LDA/LDX/LDY abs (7 clocks versus 8), and PHA/PHX/PHY...PLA/PLX/PLY take only one byte of program memory each, compared to two for ST_ ZP and LD_ ZP and three for ST_ abs and LD_ abs.  A disadvantage is that you cannot directly pull and add the top stack value to what's in the accumulator for example.  (More on stack addressing in a minute.)

Note that if you're temporarily storing values and not particularly saving registers like you would in an ISR which needs to return them to their previous state before giving control back to the main program, you can PHA...PLY for example.  The stack cell doesn't care where it got its contents or where they get read into, so it is not necessary to match up PHA to PLA, PHX to PLX, PHY to PLY, or PHP to PLP.  You can mix and cross them if it suits the program.


#3. If you need a delay or other counting loop, it is common to use a processor register and decrement it and drop through when it becomes 0 or negative.  If you need to loop more than 255 times, it is common to use a second processor register; but what if you can't afford it?  The next common thing is to decrement a variable.  Again though, one use for the stack is to reduce the need for variables, and to reduce the risk that one routine will overwrite a variable that will still be needed by another pending routine.  So use the stack this way:


FOOBAR:                         ; Enter the subroutine with the number of times to do the outer loop in X.
           PHX                  ; Start by pushing the number of outer loops left to do.
           <other instructions> ; These can use all the processor registers.
           LDX  #8              ; X is available as a counter for an inner loop here.
 1$:          <inner-loop instructions>
              DEX
           BNE  1$
           <more instructions>  ; These can use all the processor registers.
           PLX                  ; Now get the outer-loop counter
           DEX                  ; and decrement and test it.
        BNE  FOOBAR             ; When the BNE drops through, the counter on the stack has already been pulled,
        RTS                     ; so the stack is cleaned up and the RTS gets the correct return address.
 ;------------------

Each time the outer loop runs, the inner loop will go eight times.  This could be further nested if necessary of course, again without needing more processor registers or RAM variables.

I don't particularly recommend using this for long delay loops though.  For those, you probably should use a real-time clock (RTC) of some sort so that the computer can do something else useful during the delay.  Then each time you come to the part of the program that needed the delay, it compares the current time to the target time, and if you haven't reached it yet, it exits and lets you do other useful things instead of twiddling its thumbs in a delay loop.  For more detail, see my article on simple methods to do multitasking in systems that don't have the resources for a multitasking OS, or where hard realtime requirements may rule one out anyway.


#4. You can use the hardware stack as a pointer for an indirect jump, instead of using a pair of ZP bytes:


        LDA  ____     ; (high byte)  (or it could be that you calculate the target address here.)
        PHA
        LDA  ____     ; (low byte)   Remember RTS requires the 16-bit addr to be the target minus 1.
        PHA
        RTS           ; (Not including the LDA's, this takes 3 bytes and 12 clocks.)

instead of:

        LDA  ____     ; (low byte)   (or it could be that you calculate the target address here.)
        STA  temp
        LDA  ____     ; (high byte)
        STA  temp+1
        JMP  (temp)   ; Not including the LDA's, this takes:
                      ; NMOS, addr in ZP:  7 bytes, 11 clocks.   NMOS does it wrong if addr contains $xxFF
                      ;       addr abs:    9 bytes, 13 clocks.
                      ; CMOS, addr in ZP:  7 bytes, 12 clocks.
                      ;       addr abs:    9 bytes, 14 clocks.

So if you stay away from the NMOS bug, the stack method is not only always fewer bytes, but just as fast or faster, and of course needs no variables.

Before the 65c02 came along with its JMP (abs,X) instruction, a way to do an indirect jump using a jump table with the NMOS 6502 was:


                         ; Start with function (an even number) in X.
        LDA  TABLE+1,X   ; Read high address byte from the actual table, and
        PHA              ; push it.  Low byte comes next, below.
        LDA  TABLE,X     ; Be sure to make the table reflect start addresses
        PHA              ; minus 1, since RTS increments the address by 1.
        RTS              ; RTS does the absolute indexed indirect jump.

If you're using something like a Commodore 64 where you really can't use the CMOS version, you could still make a macro of the above routine, to get it on a single line.




To review:  Stack addressing is a form of indexed addressing.  The stack pointer (register S) points to the next memory location to be written.  After a byte is pushed onto the stack, the pointer register S is automatically decremented.  If you pull a byte off the stack, S is incremented first and then the memory location it is pointing to is read.  If you push another byte, you can see that that that memory location will get overwritten.  It's not a problem though, because it was no longer needed.  That's the nature of a stack.  The memory is automatically freed up when it is no longer needed, without leaving gaps and fragmenting the stack area.  The self-maintaining nature keeps all the free space together at one end without any need to pack the stack area.




So far, regarding the hardware stack, we have only approached it as a last-on-first-off memory.  That's not to say there aren't tricks to read or replace items in the stack that aren't at the top.  There are!  It's just that the additions to, and removals from, the stack only happen at the top.  Stack items further down the stack are still accessible, without having to pull everything above them off first.

On the 6502, it starts with TSX, the "Transfer Stack pointer to X register" instruction.  (Previously we touched on TXS to go the opposite direction (ie, X to Stack pointer) in the reset routine to initialize the stack pointer.  Don't confuse them.)  TSX lets us use X instead of S for indexing, and the instructions available in the abs,X addressing mode allow changing the base address, so we can index different depths into the stack, without even changing the X value, or even necessarily caring what it is.

Note that stack addressing is the same as a 100,S addressing mode, with a hypothetical preceding INS or trailing DES (INncrement S or DEcrement S) instruction built in, for pulling or pushing a byte, respectively.  You could synthesize the stack instructions with the abs,X addressing mode.  (Part of the penalty is that you lose the use of X for other things if you don't save it somewhere else—possibly on the stack itself.)  You could for example synthesize PHA with STA $100,X, DEX, and PLA with INX, LDA $100,X.  But why would you?  You wouldn't.  It would be a waste, since PHA and PLA are built in.

But! there is a related technique for gaining stack-relative addressing.  Again, stack addressing is indexed already; so stack relative addressing is doubly indexed.  When you write the program, you won't normally know the value of S (the stack pointer register) at any given point; but if you do for example TSX, ADC $105,X (it was mentioned near the top of this section that you cannot directly pull a byte off the stack and add it to the accumulator's content all in one step), you have the base value $100, plus the index 5 that you added, plus the X index which you got from S, and now you've just accessed the fifth byte on the stack without first pulling off everything above it to get there.  Also, it's still there for future use.  You accessed it without pulling (or in the language of some other processors, "popping") it.  Going in order then is not required, and you're not limited to just PLA, PLX, and PLY, (as we showed with the ADC example).  (To make things easier, the 65816 has additional built-in stack-relative addressing modes.  These will be covered briefly in section 10 and further in section 13.)

Note that for something like the TSX, LDA $105,X example above to always work, you must start the program with (or have in the reset routine) LDX #$FF, TXS (or replace $FF with whatever is the highest address in page 1 that you want the hardware stack to use).  Otherwise, with a random start point, if the stack wraps from $100 to $1FF to $1FE long before the stack space is full, the abs,X indexing could put you into page 2, completely out of the stack area.  For example, if you started randomly at $107 and put nine bytes on the stack, they would be at $107 through $100, then wrap to $1FF, and S would contain $FE.  Now if you do TSX, LDA $105,X, $105+$FE puts you at $203, which is out of the stack area.

Another way to do stack-relative addressing in stack frames (which will be discussed later, in Section 14, on local variables and environments) is the (ZP),Y addressing mode.  You use a pair of ZP bytes as a virtual processor register (call it FramePtrInZP for sake of discussion), put the base address of the stack frame there, load Y with the desired index value into the frame, then do for example LDA (FramePtrInZP),Y.  Rob Finch mentions this on the forum, here.  There are a couple of sources of inefficiency with this, that this pair of ZP bytes for the pointer will have to be pushed and pulled sometimes too, and that the program will have to keep modifying Y; but you may find the technique is appropriate sometimes nonetheless.

Armed with this info, you could build a set of subroutines and macros that operate on parameters that get put on the hardware stack.  Again one of the attractions of using the stack for this is that you can get away with fewer variables, and nested routines can use the same stack replacements of variables without interfering with each other; IOW, one doesn't have to worry about overwriting data that a pending routine still needs.  Also, when a "variable" on the stack is no longer needed, it ceases to exist, and the space it occupied is freed up for something else to use it.  Automatic and efficient garbage collection (on a small scale)!

A problem with the fewer hardware stack addressing modes however is that for certain operations, you might need one or more pairs of bytes in ZP to act as virtual processor registers to transfer numbers to.  For example, you can't do LDA (100,X) (although the 65816 lets you do LDA (sr,S),Y which would be like an LDA (100,X),Y but using the stack pointer without needing TSX first).  So what do you do?  After deriving the address in place (on the hardware stack), you would transfer it to ZP for the operation, like this example:


        LDA  104,X
        STA  temp+1    ; (in ZP)
        LDA  103,X
        STA  temp
        LDA  (temp)

The older NMOS 6502 does not have LDA (temp)—it's either LDA (ZP,X) or LDA (ZP),Y—so we would have to use LDY #0, LDA (temp),Y, using Y since we want to keep X intact for continued stack-relative addressing.  If Y's value must be kept, you'd have to keep a copy elsewhere.  There again you can't just use PHY and PLY like you can on the CMOS 6502.  You have to go through A.  It's so much better to just use the CMOS processor, unless it's on something like the C64 which used the 6510 which was never available in CMOS!

Suppose you wanted STA instead of LDA, and you wanted to preserve all the registers.  You could do:


        PHA
           LDA  104,X
           STA  temp+1    ; (in ZP)
           LDA  103,X
           STA  temp
        PLA
        STA    (temp)

(Again if you're using an NMOS 6502, you'll have to add the use of Y back in.)  Note that the X does not need to change, because although we pushed something else (A) onto the stack (and then pulled it back off), we did not need to access that, and the other stuff stayed where it was.  If there was a chance that a pending routine might still need what was in temp, you could push that too (although it gets slightly more tricky)!  Variables like temp above could in a way be viewed as processor registers which you might have to save on the stack like A, X, and/or Y.  And if an ISR might use them, it will definitely need to save them.  IOW, we still haven't gotten completely away from saving variables.

With a little creativity and knowledge of the instruction set, you can do nearly anything you could want; but efficiency will suffer.  This is one of the reasons to have a virtual stack for data in ZP.

Even if you are using the 65816 touted above, you still won't have any stack-relative addressing modes at all for many of the instructions.  Here's a comparison of addressing modes relevant to both hardware and virtual stack usage:


              6502  65c02  65816
ADC abs,X      x      x      x     ; This set of 7 addressing modes is also true
ADC abs,Y      x      x      x     ; of AND, CMP, EOR, LDA, ORA, SBC, and STA.
ADC ZP,X       x      x      x
ADC (ZP,X)     x      x      x
ADC (ZP),Y     x      x      x
ADC sr,S                     x     ; "sr" stands for "stack relative."  These two could
ADC (sr,S),Y                 x     ; also be written as <offset>,S and (<offset,S>),Y
                                   ; to highlight the fact that the instruction's
                                   ; operand is an unsigned eight-bit offset.

ASL abs,X      x      x      x     ; This pair of addressing modes is also true of
ASL ZP,X       x      x      x     ; BIT, DEC, INC, LDY, LSR, ROL, ROR,
                                   ; and of STZ except on NMOS 6502 which has no STZ.

JMP (abs,X)           x      x
JSR (abs,X)                  x
LDX abs,Y      x      x      x
LDX ZP,Y       x      x      x
STX ZP,Y       x      x      x
STY ZP,X       x      x      x
TXY                          x
TYX                          x
TSC                          x     ; Transfer the 16-bit S to the always-16-bit accum C.
TCD                          x     ; Transfer 16-bit accum C to 16-bit direct-page register D.

So even with the '816 with its sr,S and (sr,S),Y (remember "sr" stands for "stack-relative," and these could also be written as <offset>,S and (<offset,S>),Y) addressing modes, the lack of them in so many instructions that do have ZP indexed (if not also indirect) addressing modes strengthens the case for using a data stack in ZP.  (The 65816's "ZP" is actually called "DP" for "direct page" because it can be moved around.  It is not stuck at addresses 0-$FF, nor does it have to start on a page boundary.)  (And again, separating the data and return stacks solves certain other problems anyway.  This will be discussed briefly in the next section.)

Note that the '816 adds PEA, PEI, and PER, the Push Effective Absolute (ie, 16-bit, regardless of register-width settings, and is neither indirect nor relative), Indirect, and Relative instructions.  PER is always for addresses, whereas PEA and PEI can be anything.  Section 13 has 6502 routines to synthesize these.  These are useful especially in relocatable code and synthesizing more instructions, and they preserve the benefit of also having a data stack that's separate from the return stack.

The '816 also has a 16-bit stack pointer, something which is more needed with the 816's greater number of things to save and restore in ISRs, and which is more useful with its sr,S addressing modes and its much greater suitability for multitasking.




If we use a virtual stack in ZP, subroutine return addresses and a few register savings still happen on the hardware stack, but addressing data on the virtual stack has these advantages:


Since we don't need X for the hardware stack now, we'll use it as a pointer into the virtual stack in ZP.  Repeating the LDA routine above but modified for the data virtual stack in ZP, we get:


        LDA  (2,X)

and repeating the STA above but modified for the data virtual stack in ZP, we get:

        STA  (2,X)

Wow— was that too simple?  (The 2 in "LDA(2,X)" can be adjusted to go different depths into the stack, for example LDA(4,X).)

Now suppose you want to add-with-carry a number in a memory location whose address is in the stack.  With the virtual stack in ZP it becomes just:


        ADC  (2,X)

Again, simple.  There's no need to first copy the address to a ZP location like there was when we tried to keep everything on the hardware stack.  Something else that also makes it easier on the programmer is that without return addresses mixed in, it's easier to know how deep to index into the stack.

A side note about using ZP addressing in the hardware stack area in the 6502 and '816:  Since the '816 lets you move the direct page (DP, like the 6502's ZP) around, and since the 256 bytes of DP do not need to start on a page boundary but can start anywhere in the 64K of bank 0, and since it can overlap the hardware stack area, you can use it to get DP addressing into hardware-stack frames.  There is more on this here on the 6502.org wiki.  Also, BDD has source code here on this site that uses this technique for generating disc-resident bitmaps that define filesystem structure for his mkfs 816NIX filesystem generator program.  Bit-twiddling of stack elements becomes more convenient too, since you can use TRB, TSB, etc..

There is a way to get this on a 6502 too, (but with a big caveat!).  If RAM ignores A8 in a teensy system with only 256 bytes of RAM, 100-1FF becomes a duplicate of 00-FF, making the hardware stack addressable in page 0 also, opening up ZP addressing modes for it.  There could be more than 256 bytes of RAM, but using A8 in the rest of RAM but not in the first two pages will take some extra logic.  If that is not applied, you can't have any contiguous section of code or data that crosses the boundary from an even-numbered page to an odd one.

Going back to addressing modes that are both indirect and indexed:  There's no JMP((addr),Y) (where addr points to a table, and you read the Y and Y+1 bytes into the table to find out where to jump to), but you can do the following, taking the address from the ZP data stack (code corrected 2/10/23):



                     ; With table address and the offset on the data stack,
        JSR  PLUS    ; add them together to get the address of an address.
                     ; (This allows offsets of >255.) 
        LDA  (0,X)   ; Read the resulting calculated addr, low byte first,
        TAY          ; and save it to push after you've pushed the high byte.
        INC  0,X     ; Prepare to read the high byte (see the note below),
        LDA  (0,X)   ; then read it, 
        PHA          ; push it onto the hardware (page-1) stack,
        PHY          ; then push the low byte.
        INX          ; Drop the cell from the data (page-0) stack.
        INX
        RTS          ; Use RTS to make the actual jump.  Remember that the
                     ; address derived will need to be the destination-1.


If you have an NMOS 6502, the PHY will need to be replaced with TYA, PHA.  If the table straddles a page boundary, you'll want it to start on an even address so the INC can't go from FF to 00 and require extra instructions to see if incrementing the high byte is also necessary.

Or, without even using the hardware stack (but having a caveat):



                     ; With table address and the offset on the data stack,
        JSR  PLUS    ; add them together.  (This allows offsets of >255.) 
        JMP  (0,X)   ; (Absolute, not ZP, so operand low byte will be 00.)


The caveat is that the address on the ZP data stack will still be there.  Wherever you jump to will need to start by removing it.  (Another caveat is that the NMOS 6502 does not have JMP(addr,X) but that really need apply only to something like the Commodore 64 which used the 6510 which was never available in CMOS.  For other 6502 computers, just plug a 65c02 in the socket!  It has lots of advantages.)

In the next section are examples of using the hardware stack for passing parameters to a subroutine.




4. virtual stacks <--Previous   |   Next--> 6. passing parameters

last updated Feb 10, 2023