Before we start...
If you have just stumbled upon my SPO600 series of blog posts, it has been created to document and share my learnings as I progress through my Software Portability and Optimization college course.
In this post, I will focus on explaining the logic and calculating the execution time and memory usage of a simple 6502 Assembly program. I will also review the code modifications I made to improve the program's performance.
Check out my 1-st post about 6502 Assembly: 6502 Assembly Intro
Find the 6502 emulator I used here: 6502 Emulator
Nice and colourful 6502 reference: Ultimate 6502 Reference
$
= hexadecimal (e. g.$0200
=0x0200
)
Let's take a look at the program
The following code fills the emulator's display with the yellow colour:
lda #$00 ; set a pointer in memory location $40 to point to $0200
sta $40 ; ... low byte ($00) goes in address $40
lda #$02
sta $41 ; ... high byte ($02) goes into address $41
lda #$07 ; colour number
ldy #$00 ; set index to 0
loop: sta ($40),y ; set pixel colour at the address (pointer)+Y
iny ; increment index
bne loop ; continue until done the page (256 pixels)
inc $41 ; increment the page
ldx $41 ; get the current page number
cpx #$06 ; compare with 6
bne loop ; continue until done all pages
At first glance, even despite the extensive comments, the code may appear confusing to someone new to assembly. As for me, it wasn't until I did a line-by-line breakdown while referencing the 6502 documentation that I was able to fully wrap my head around how the program works.
In-depth explanation of the code's logic
Setting the values
The first section of the program is dedicated to setting the values that are going to be used later in the code (think variable initializations, except we are manually loading and putting each value into memory, accumulator or Y register).
lda #$00 ; load the low byte of the address ($00) into the accumulator
sta $40 ; store the low byte in location $40 of the Zero Page
lda #$02 ; load the high byte ($02) into the accumulator
sta $41 ; store the high byte in the consequent memory location ($41)
lda #$07 ; load the colour number into the accumulator
ldy #$00 ; set index to 0 (loaded into Y register)
Notes:
We are storing the address (
$0200
), which points to the start of the memory region reserved for the bitmapped display ($0200
-$05FF
, = 4 pages of memory). See why it needs two consecutive memory locations to be stored in my previous blog.You can find information about the colour values by clicking the "Notes" button in the emulator (warning! you will need to do a bit of scrolling).
The value in the Y register will be used to iterate through the pixels within each of the display's memory pages.
Colouring the pages - main functionality
We colour the pages within the loop by traversing through each byte of the 4 pages of memory reserved for the bitmapped display and storing the colour value in each memory location.
loop: sta ($40),y ; store the colour number in the calculated address
iny ; increment y
bne loop ; continue to colour the page until y overflows
The code below is only executed when the program traverses all of the pixels in the current page and the Y register overflows. It handles switching to the next page once the current one is filled. The program stops when it attempts to switch to page 6, as we are only colouring bitmapped display pages (2, 3, 4, and 5).
inc $41 ; increment the page number
ldx $41 ; load current page number into X register
cpx #$06 ; compare the page number with 6
bne loop ; continue to colour until all pages are done
Notes:
The Y register can hold values from 0 to 255 and wraps around (goes to 0) after reaching its capacity, making it perfect for indexing when traversing pages (as memory page size is 256 bytes).
The
BNE
instruction allows the program to jump to a specified label (in our caseloop
) if the last operation did not result in 0.
Calculating the run-time
We can calculate the program's execution time by counting the total number of CPU cycles it takes to complete. To do this, we can refer to the 6502 documentation, which specifies the number of cycles for each instruction and addressing mode. Nonetheless, since some instructions are executed multiple times, it's important to analyze the program’s logic to determine how many times the loop executes/how many times the BNE
condition is met.
Tip: Use spreadsheets to make your calculations easier.
Instructions | Cycles | Count | Alt-Cycles | Alt-Count | Totals |
---|---|---|---|---|---|
lda #$00 | 2 | 1 | 2 | ||
sta $40 | 3 | 1 | 3 | ||
lda #$02 | 2 | 1 | 2 | ||
sta $41 | 3 | 1 | 3 | ||
lda #$07 | 2 | 1 | 2 | ||
ldy #$00 | 2 | 1 | 2 | ||
loop: | |||||
sta ($40),y | 6 | 1024 | 6144 | ||
iny | 2 | 1024 | 2048 | ||
bne loop | 3 | 1020 | 2 | 4 | 3068 |
inc $41 | 5 | 4 | 20 | ||
ldx $41 | 3 | 4 | 12 | ||
cpx #$06 | 2 | 4 | 8 | ||
bne loop | 3 | 3 | 2 | 1 | 11 |
Notes:
-
STA ($40),Y
andINY
are executed 1024 times because the bitmapped display has 1024 bytes (256 bytes * 4 pages) that need to be coloured. - Instructions like
BNE
(branch on not equal) have varying cycle counts depending on whether the branch is taken or not. In our case: 1-stBNE
- the branch is not taken 4 times (each time the Y register overflows after filling a page); 2-ndBNE
: The branch is not taken once when all pages are filled, and the program reaches its end.
Knowing the total number of cycles for each instruction, we can calculate the execution time by summing up the totals and multiplying the result by the CPU's clock speed.
Result:
Runtime = 11,325 cycles x 1 MHz = 11,325 uS (microseconds)
Runtime = 11,325 uS / 1,000,000 = 0.011325 S (seconds)
Calculating the memory usage
To calculate the memory usage of the program, we need to sum up the size of all the instructions along with the size of each pointer and variable used.
- You can find the number of bytes each instruction uses depending on its addressing mode in the 6502 documentation.
In the given program, we are only storing one 16-bit (= 2-byte) address in memory, thus, the total memory usage of the program will be equal to: total instruction size + 2 bytes.
The number of bytes each instruction occupies according to the docs:
Instruction | Bytes |
---|---|
lda #$00 | 2 |
sta $40 | 2 |
lda #$02 | 2 |
sta $41 | 2 |
lda #$07 | 2 |
ldy #$00 | 2 |
loop: | |
sta ($40),y | 2 |
iny | 1 |
bne loop | 2 |
inc $41 | 2 |
ldx $41 | 2 |
cpx #$06 | 2 |
bne loop | 2 |
Result:
Memory usage = 25 bytes + 2 bytes = 27 bytes
Optimizing Performance
I considered a couple of ways to optimize the code.
My initial option was loading the number of pages to be coloured into the X register: instead of repeatedly loading the value from memory location $41
and comparing it to 6 to track the coloured pages, the value in the X register is decremented each time a page is coloured.
lda #$00
sta $40
lda #$02
sta $41
lda #$07
ldx #$04 ; Load the number of pages to be coloured into the X register
ldy #$00
loop:
sta ($40),y
iny
bne loop
inc $41
dex ; decrementing the value (instead of ldx $41 and cpx #$06)
bne loop
But! this change only brings a slight improvement in code performance, as the operation that consumes the most cycles is the process of writing the colour data to the display memory.
To achieve a more drastic runtime improvement, I started looking into alternative optimization methods and learned about loop unrolling (an optimization technique that decreases loop overhead by executing multiple iterations of the loop body in a single pass).
However, upon re-examining the code logic, I realized that colouring a certain number of pixels at once might not be the optimal strategy (as it creates a lot of redundant code). So, instead, I went for a more logical solution: colouring the pages concurrently. This approach improves the performance while keeping the code clean and eliminating the need to switch from one page to another (since after 256 iterations each page will be fully coloured).
lda #$07 ; load the colour number into the accumulator
ldy #$00 ; set index to 0 (loaded into Y register)
loop:
sta $0200, y ; colour a pixel on page 2
sta $0300, y ; colour a pixel on page 3
sta $0400, y ; colour a pixel on page 4
sta $0500, y ; colour a pixel on page 5
iny ; go to the next pixel
bne loop ; continue until Y overflows
Let's calculate the execution time of the optimized code:
Instruction | Cycles | Cycles Count | Alt-Cycles | Alt-Count | Totals |
---|---|---|---|---|---|
lda #$07 | 2 | 1 | 2 | ||
ldy #$00 | 2 | 1 | 2 | ||
loop: | |||||
sta $0200, y | 5 | 256 | 1280 | ||
sta $0300, y | 5 | 256 | 1280 | ||
sta $0400, y | 5 | 256 | 1280 | ||
sta $0500, y | 5 | 256 | 1280 | ||
iny | 2 | 256 | 512 | ||
bne loop | 3 | 255 | 2 | 1 | 767 |
Result:
Runtime = 6,403 cycles x 1 MHz = 6,403 uS
Runtime = 6,403 uS / 1,000,000 = 0.006403 S
6,403 uS vs. 11,325 uS - the optimized version is almost twice as fast!
Afterthoughts
Analyzing the 6502 Assembly code was challenging. As I am new to 6502 Assembly, I had a hard time understanding the nitty-gritty of the code's logic. Nevertheless, my struggles were worthwhile. This experience deepened my understanding of how the 6502 CPU operates, made me more comfortable in writing code in Assembly, and introduced me to the concept of loop unrolling.
Stay tuned! In the next blog post in the Software Optimization Series, I will share how I experimented with the logic of the given Assembly code.