Ideas for improving LCD speed

I can answer my own question after reading spi_set_baudrate() in the pico-sdk.

The SPI clock must be an even subdivision of another clock. At the default system clock of 125 MHz, the fastest even subdivision is 62.5 MHz, which is what I got.

The trick I found while reading is that after I called set_sys_clock_khz(), the SPI clock defaults to referencing a separate 48 MHz clock, so the fastest even subdivision is the 24 MHz I saw.
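As a rough illustration of that divider search, here is a host-side C sketch modelled on my reading of spi_set_baudrate(). The name model_spi_baudrate() is made up for this example, and the real SDK function also programs the SPI registers, so treat this as an approximation rather than the SDK's own code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of the divider search in spi_set_baudrate().
 * freq_in stands in for the SPI source clock in Hz; the return value is the
 * baud rate the two dividers can actually deliver. */
static uint32_t model_spi_baudrate(uint32_t freq_in, uint32_t baudrate)
{
    uint32_t prescale, postdiv;

    assert(baudrate > 0 && baudrate <= freq_in);

    /* Smallest even prescale (2..254) that brings the request within
     * reach of the 1..256 post-divider. */
    for (prescale = 2; prescale <= 254; prescale += 2) {
        if (freq_in < (uint64_t)(prescale + 2) * 256 * baudrate)
            break;
    }
    assert(prescale <= 254); /* request too slow for this source clock */

    /* Walk the post-divide down from 256 and stop at the smallest value
     * whose output still does not exceed the requested baud rate. */
    for (postdiv = 256; postdiv > 1; --postdiv) {
        if (freq_in / (prescale * (postdiv - 1)) > baudrate)
            break;
    }

    return freq_in / (prescale * postdiv);
}
```

Because the prescale is never less than 2, the model cannot return more than half the source clock: a 125 MHz source tops out at 62.5 MHz, a 48 MHz source at 24 MHz, and a 200 MHz source at 100 MHz.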

However, you can request the system clock as the reference instead by adding this line to CMakeLists.txt:

add_compile_definitions(PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK=1)

It is not clear to me if changing this setting might mess up other peripherals, but my program runs OK.

With that in place I can ask for a 150 MHz system clock, get a 75 MHz SPI clock, and get close to 30 fps. Asking for 200 MHz and 100 MHz also works, and I get ~40 fps.

This is great work Geoff!

One caveat about changing the peripheral clock: watch out for it affecting the SD card as well, which is also on SPI.

Unless you change it during runtime it's unlikely you'll run into an issue, but it's good to keep in mind in case something crops up.

I wondered if the repeated drawing was hiding any screen tearing, so I tried 200 MHz / 100 MHz in my Mandelbrot program, since it only writes once, and there were no artifacts. It seems my LCD can handle 100 MHz writes.

Thanks @maple. I wondered what else was on SPI. My programs don't use the SD card so far, but I can see a use coming up. I will watch for any effects and report what I see.

I didn't use the SD card in my project, but still wondered if this would work.

I decided to apply my speedup method to Blair's text starter and try the "test fat32" feature. I got failures at 200 MHz / 100 MHz, but success at 195 MHz / 97.5 MHz. I'm not sure how thorough Blair's tests are, but it looks pretty decent.

Checking vcocalc.py, the next "simple" speed is 220 MHz / 110 MHz, but I got no output from my raycaster, so I'm going to assume 200 MHz / 100 MHz is the LCD speed limit.

These are my best guesses for the LCD and SD card speed limits, at least on my PicoCalc.

If you want to know how to squeeze frames out of the ILI9488, check out my project on GitHub:

Thanks for the link, but could you be more specific? Which file, and which function within it do you think I could learn from?

I'm a retired civil engineer learning programming as something to keep my brain ticking. Your code base is a bit large for me to make sense of, as C++ is a language I haven't tackled yet. So far I sort of speak Python (now abandoned), C (favourite), 6502 assembly (Ben Eater project) and RP2040 assembly (for PicoCalc).

My raycaster (Wolf3D sort of) uses Blair's lcd.c file, which uses the pico-sdk function spi_write16_blocking(). I'm sending a 320x200x16 bit screen (1,024,000 bits) with a 100 MHz SPI clock. This should take 0.0102 seconds, and my send_window_buffer() function currently takes 0.0118 seconds. Given there's overhead outside the actual screen bits, I don't know if this can go much faster. With the calcs and drawing the buffer on core 0 and the SPI writing on core 1, I'm currently getting 50 fps.
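The arithmetic behind those two timings can be checked with a small host-side C sketch. The function names are invented for this example, and it assumes one bit per SPI clock with no command, chip-select or loop overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal time in seconds to clock one frame out over SPI, assuming one bit
 * per SPI clock and ignoring every other source of overhead. */
static double frame_transfer_seconds(uint32_t width, uint32_t height,
                                     uint32_t bits_per_pixel, uint32_t spi_hz)
{
    assert(spi_hz > 0);
    uint64_t bits = (uint64_t)width * height * bits_per_pixel;
    return (double)bits / (double)spi_hz;
}

/* Fraction of a measured send time that lies above the ideal time. */
static double overhead_fraction(double measured_s, double ideal_s)
{
    assert(measured_s > 0.0);
    return (measured_s - ideal_s) / measured_s;
}
```

For 320 x 200 x 16 bits at 100 MHz the ideal time is 0.01024 s, so roughly 13% of the measured 0.0118 s is overhead on top of the raw pixel bits.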

Your GitHub readme mentions using DMA. No idea how to do this. Would it be faster than spi_write16_blocking()?

Sorry I took so long to see this. I love Ben Eater's stuff! The file to look at would be src\platform\pico\Ili9488Display.cpp.
But wow, it sounds like you're doing just fine already. How do you get 100 MHz? Are you overclocked? Are you on a Pico 1? I'm also rendering on Core1, but stuck at 62.5 MHz. You should be able to use DMA with that SDK function.

I read above a bit. Thank you for that tip! I was able to boost my SPI speed to 75MHz.

Good question. I should summarize what I've done so it's all in one posting. These changes are all to Blair's starter program (v0.14), but I'm sure you can work out how to make them in your program.

On the system clock speed, I found 200 MHz works for my LCD, but I needed to slow down to 195 MHz for my SD card to work, and I think your game is using the SD card for level data. Since 200 MHz is now approved by Raspberry Pi, I'm not sure if this still counts as overclocking.

CMakeLists.txt
Add:
#Have the SPI clock reference the system clock, rather than a 48 MHz clock
add_compile_definitions(PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK=1)

picocalc.h
Add:
#define SYS_CLOCK (200000000) // System clock in Hz

picocalc.c
In picocalc_init() add before sb_init():
set_sys_clock_hz(SYS_CLOCK, false); // Increase system clock speed
stdio_init_all(); // Re-init after clock change

lcd.h
Add:
#include "picocalc.h"
Change:
#define LCD_BAUDRATE (SYS_CLOCK / 2)


I've just discovered I can shorten the keyboard polling, which removes some stutter from my frame rate. See my keyboard thread started yesterday. So far it's working for me, but it's not very thoroughly tested.

southbridge.h
Change:
#define SB_BAUDRATE (100000) // 100 kHz in place of original 10 kHz


I'm now sending from my window_buffer (320x200 for an old-school look) to the LCD from core 1 in 10 horizontal slices. Core 1 lets core 0 know as it completes each slice, so core 0 can start rendering the next frame in that slice of the window_buffer even while the lower slices are still sending. In the right scenario core 0 only has to render the bottom slice before telling core 1 to start sending again. In this case I'm now around 70+ FPS. However, as I approach sprites this can drop to the mid-30s. Obviously I need to review my sprite rendering. This is tough, since when I'm very close to 2 sprites every slice is drawn 3 times (1: walls, 2: far sprite, 3: near sprite).
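One way to picture that slice handshake is a progress counter written by the sender and read by the renderer. The sketch below is a guess at the general shape, not the actual raycaster code: every name in it is hypothetical, and it uses C11 atomics so it can run on a host machine, whereas on the Pico the SDK's inter-core FIFO or a single-writer volatile counter might be used instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FRAME_ROWS   200  /* window_buffer height from the post */
#define SLICE_COUNT  10   /* horizontal slices per frame */
#define SLICE_ROWS   (FRAME_ROWS / SLICE_COUNT)

/* Hypothetical shared counter: how many slices of the current frame the
 * sender (core 1) has finished pushing to the LCD. */
static atomic_int slices_sent = 0;

/* First window_buffer row covered by a slice. */
static int slice_first_row(int slice)
{
    assert(slice >= 0 && slice < SLICE_COUNT);
    return slice * SLICE_ROWS;
}

/* Sender side: call when a new frame starts going out. */
static void sender_frame_start(void)
{
    atomic_store_explicit(&slices_sent, 0, memory_order_release);
}

/* Sender side: call after each slice has gone out over SPI. */
static void sender_slice_done(void)
{
    atomic_fetch_add_explicit(&slices_sent, 1, memory_order_release);
}

/* Renderer side: a slice may be overwritten with the next frame only once
 * the sender has finished with it. */
static bool renderer_may_draw(int slice)
{
    return slice < atomic_load_explicit(&slices_sent, memory_order_acquire);
}
```

With this shape the renderer (core 0) would poll renderer_may_draw() for each slice from the top down, so it trails the sender by at most one slice and only the bottom slice is left to render once the send completes.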

I'll look at your code, but I expect I'll stick with what I'm doing. I can't picture how DMA can push pixels over the SPI bus any faster than what I'm already doing.

Thanks, and have fun.
Geoff

I forgot to confirm that I am on a Pico 1. Its limitations (small memory, only integer hardware) make it more fun. I especially liked programming my own 16.16 fixed point math in assembly for this raycaster.

Thanks for that excellent explanation, Geoff! Those are some impressive numbers. I'd really like to see your code.

DMA might help more than you think, so it might be worth looking into further. You pay a small setup price, then hand it off. It doesn't increase the bandwidth, but it might increase your throughput, and it's hardware you aren't using.

I also wanted to be a bit constrained by the hardware for my project. It's no wonder we came to very similar solutions for the render pipeline, as the Pico is nicely limited in RAM. I actually got a stack overflow due to not storing App statically!

A little update on my project: I did some simple profiling via the IO pins, toggling pin 3 on at the beginning of renderAndFlushFrame() and off again at the bottom. I saw that moving the hot functions from XIP to SRAM really helped smooth things out. You just wrap the function name in the __no_inline_not_in_flash_func() macro. My game renderer was choking on line-dense areas before the change: trouble spots showed Core1 pegged for 1-2+ seconds. Afterwards it was a regular clock signal, as it was able to breathe while the main loop (Core0) maintained 30 FPS. No hiccups.

Also, they've corrected the typo in the PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK definition, just an FYI.

My raycaster needs floating point values, but the floats library is too slow. I implemented 16.16 fixed point in assembly (multiply, divide, square and root) so I could take advantage of the fast integer hardware, and all of them are in SRAM since they're called from all over my code. I never considered whether any of my C functions are used enough to justify being in SRAM.

I copy-pasted PICO_CLOCK… from the C SDK manual, but I now see the back of the manual corrects the spelling. Now that you've pointed it out I'll fix my code.

It's not a question of being used enough; it's about whether you need predictable timing and what the access pattern is. If a function is sufficiently hot (and short!) it will just remain permanently cached anyway.
For divide, an assembly routine isn't necessary: the Pico has an 8-cycle hardware divider. That is extremely hard to beat.

I don't think the divide hardware will work.

Let's say I want to do 3.5625 / 2.25 = 1.5833.

Translated to 16.16 fixed point that becomes hex 0x00039000 / 0x00024000 = 0x00019555 (if I did my math right), or in integer 233472 / 147456 = 103765.

However, if I give the divide hardware dividend = 233472 and divisor = 147456, I believe it's going to come back with a quotient = 1 = 0x00000001 and a remainder = 86016 = 0x00015000. I don't see how this can be converted to the correct 16.16 fixed point answer.

This is 128-bit/64->64 using two 64-bit divides
your case would be 64-bit/32->32 using two 32-bit divides.

I'm like 95% sure the pico-sdk implements something like this already; the SDK's handbook has some benchmarks that strongly suggest it.
I had to look into this to perform division-by-constant-multiplication for cases where gcc can't do it.

Thank you. Your post forced me to dig into the SDK.

Tricky to understand how to use the hardware divider, but using div_s64s64() with some type conversion and shifting is faster than my assembly division.
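For anyone following along, here is a minimal sketch of what "type conversion and shifting" could look like. It is an assumption about the general approach rather than my exact code: fix16_t and fix16_div() are names made up for this example, and plain C division stands in for div_s64s64() so the sketch runs on any host.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fix16_t;  /* 16.16 fixed point: real value * 65536 */

/* Divide two 16.16 values. The dividend is widened to 64 bits and scaled
 * up by 2^16 first, so the integer quotient is already a 16.16 value.
 * On the Pico, the '/' below is where div_s64s64(scaled, b) would go. */
static fix16_t fix16_div(fix16_t a, fix16_t b)
{
    assert(b != 0);
    /* Multiply rather than shift so negative dividends stay well defined. */
    int64_t scaled = (int64_t)a * 65536;
    return (fix16_t)(scaled / b);
}
```

With the earlier example, fix16_div(233472, 147456) gives 103765 (0x00019555), whereas the unscaled integer divide 233472 / 147456 gives only a quotient of 1 with remainder 86016.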

I had improved my sprite code so my frame rate range was 52-77 FPS. After switching to the hardware divider I'm now at 57-77 FPS.