I can answer my own question after reading spi_set_baudrate() in the pico-sdk.
SPI clock must be an even subdivision of another clock. At the default main clock of 125 MHz, the even subdivision is 62.5 MHz, as I got.
The trick I found while reading is that once I called set_sys_clock_khz(), the SPI clock defaults to referencing a separate 48 MHz clock, and thus the even subdivision is the 24 MHz I saw.
However, you can request that it use the system clock instead by adding this line to CMakeLists.txt:
add_compile_definitions(PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK=1)
It is not clear to me if changing this setting might mess up other peripherals, but my program runs OK.
With that in place I can ask for a 150 MHz system clock, get a 75 MHz SPI clock, and get close to 30 fps. Asking for 200 MHz gives a 100 MHz SPI clock, which also works, and I get ~40 fps.
I wondered if the repeated drawing was hiding any screen tearing, so I tried the 200 MHz / 100 MHz in Mandelbrot since it only writes once and there were no artifacts. Seems like my LCD can handle 100 MHz writes.
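For anyone who wants to check these numbers without a board, here is a minimal host-side sketch of the divider search as I understand it from spi_set_baudrate(): an even prescaler from 2 to 254, followed by a post-divider from 1 to 256. The function name spi_achievable_baud() and the 100 MHz request used in the note below are my own assumptions for illustration; this is a model, not the SDK source.

```c
#include <stdint.h>

// Host-side model of the divider search; not the SDK source itself.
// freq_in   : clock feeding the SPI block, in Hz
// requested : baud rate asked for, in Hz
// returns   : baud rate the dividers can actually produce, in Hz
//             (0 if the request is too low for this input clock)
uint32_t spi_achievable_baud(uint32_t freq_in, uint32_t requested)
{
    uint32_t prescale, postdiv;

    // Smallest even prescale (2..254) that brings the target within
    // reach of the 1..256 post-divider.
    for (prescale = 2; prescale <= 254; prescale += 2) {
        if (freq_in < (uint64_t)(prescale + 2) * 256 * requested)
            break;
    }
    if (prescale > 254)
        return 0;

    // Largest post-divide (1..256) that keeps the output <= requested.
    for (postdiv = 256; postdiv > 1; --postdiv) {
        if (freq_in / (prescale * (postdiv - 1)) > requested)
            break;
    }
    return freq_in / (prescale * postdiv);
}
```

Assuming a request of 100 MHz, this model gives 62.5 MHz from a 125 MHz input, 24 MHz from 48 MHz, 75 MHz from 150 MHz and 100 MHz from 200 MHz, which matches what I measured.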
Thanks @maple. I wondered what else was on SPI. My programs don't use the SD card so far, but I can see a use coming up. I will watch for any effects and report what I see.
I didn't use the SD card in my project, but still wondered if this would work.
I decided to apply my speedup method to Blair's text starter and try the "test fat32" feature. I got failures at 200 MHz / 100 MHz, but success at 195 MHz / 97.5 MHz. Not sure how thorough Blair's tests are, but it looks pretty decent.
Checking vcocalc.py, the next "simple" speed is 220 MHz / 110 MHz, but I got no output from my raycaster, so I'm going to assume 200 MHz / 100 MHz is the LCD speed limit.
These are my best guesses for the LCD and SD card speed limits, at least on my PicoCalc.
Thanks for the link, but could you be more specific? Which file, and which function within it do you think I could learn from?
I'm a retired civil engineer learning programming as something to keep my brain ticking. Your code base is a bit large for me to make sense of, as C++ is a language I haven't tackled yet. So far I sort of speak Python (now abandoned), C (favourite), 6502 assembly (Ben Eater project) and RP2040 assembly (for PicoCalc).
My raycaster (Wolf3D sort of) uses Blair's lcd.c file, which uses the pico-sdk function spi_write16_blocking(). I'm sending a 320x200x16 bit screen (1,024,000 bits) with a 100 MHz SPI clock. This should take 0.0102 seconds, and my send_window_buffer() function currently takes 0.0118 seconds. Given there's overhead outside the actual screen bits, I don't know if this can go much faster. With the calcs and drawing the buffer on core 0 and the SPI writing on core 1, I'm currently getting 50 fps.
Your GitHub readme mentions using DMA. No idea how to do this. Would it be faster than spi_write16_blocking()?
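The transfer-time estimate above is just bits divided by clock rate. A minimal sketch of that arithmetic, with a helper name (spi_frame_time_ns) that is purely illustrative:

```c
#include <stdint.h>

// Time on the wire for one frame, in nanoseconds, ignoring all
// command, chip-select and function-call overhead.
uint64_t spi_frame_time_ns(uint32_t width, uint32_t height,
                           uint32_t bits_per_pixel, uint32_t spi_hz)
{
    uint64_t bits = (uint64_t)width * height * bits_per_pixel;
    return bits * 1000000000ull / spi_hz;
}
```

For 320x200x16 at 100 MHz this gives 10,240,000 ns (10.24 ms). Against the measured 11.8 ms, the bus is already carrying pixel bits roughly 87% of the time, so the remaining headroom on the sending side is small.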
Sorry I took so long to see this. I love Ben Eater's stuff! The file to look at would be src\platform\pico\Ili9488Display.cpp.
But, wow, sounds like you're doing just fine already. How do you get 100 MHz? Are you overclocked? Are you on a Pico 1? I'm also rendering on Core1, but stuck at 62.5 MHz. You should be able to use DMA with that sdk function.
Good question. I should summarize what I've done so it's all in one posting. These changes are all to Blair's starter program (v0.14), but I'm sure you can work out how to make them in your program.
On the system clock speed, I found 200 MHz works for my LCD, but I needed to slow down to 195 MHz for my SD card to work, and I think your game is using the SD card for level data. Since 200 MHz is now approved by Raspberry Pi, I'm not sure if this still counts as over-clocking.
CMakeLists.txt
Add: #Have the SPI clock reference the system clock, rather than a 48 MHz clock
add_compile_definitions(PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK=1)
picocalc.h
Add: #define SYS_CLOCK (200000000) // System clock in Hz
picocalc.c
In picocalc_init() add before sb_init():
set_sys_clock_hz(SYS_CLOCK, false); // Increase system clock speed
stdio_init_all(); // Re-init after clock change
I've just discovered I can get shorter keyboard polling, which removes some stutter from my framerate. See my keyboard thread started yesterday. So far it's working for me, but it's not very thoroughly tested.
southbridge.h
Change: #define SB_BAUDRATE (100000) // 100 kHz in place of original 10 kHz
I'm now sending from my window_buffer (320x200 for an old-school look) to the LCD from core 1 in 10 horizontal slices. Core 1 lets core 0 know as it completes each slice, so core 0 can start rendering the next frame in that slice of the window_buffer even while the lower slices are still sending. In the right scenario core 0 only has to render the bottom slice before telling core 1 to start sending again. In this case I'm now around 70+ FPS. However, as I approach sprites this can drop to the mid-30s. Obviously I need to review my sprite rendering. This is tough, since when I'm very close to 2 sprites every slice is drawn 3 times (1: walls, 2: far sprite, 3: near sprite).
I'll look at your code, but I expect I'll stick with what I'm doing. I can't picture how DMA can push pixels over the SPI bus any faster than what I'm already doing.
I forgot to confirm that I am on a Pico 1. Its limitations (small memory, only integer hardware) make it more fun. I especially liked programming my own 16.16 fixed point math in assembly for this raycaster.
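For anyone trying to picture the slice hand-off described above, here is a rough host-side sketch of the bookkeeping. It is not the actual code from the raycaster: the function names, the shared counter and the use of C11 atomics are all assumptions made for illustration, and on the Pico the notification could just as well go through the SDK's inter-core FIFO.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WINDOW_ROWS 200  // window_buffer height from the post
#define SLICE_COUNT 10   // horizontal slices from the post
#define SLICE_ROWS  (WINDOW_ROWS / SLICE_COUNT)

// Number of slices of the current frame already sent to the LCD.
// Only one core writes it at any moment: the renderer resets it
// between frames, the sender bumps it while a frame is going out.
static atomic_int slices_sent;

// Renderer core: call once the whole frame is drawn, just before
// telling the sender core to start.
void begin_frame_send(void)
{
    atomic_store(&slices_sent, 0);
}

// Sender core: call after each slice has gone out over SPI.
void mark_slice_sent(void)
{
    atomic_store(&slices_sent, atomic_load(&slices_sent) + 1);
}

// Renderer core: true once the given slice has been sent, so its
// rows may be overwritten with the next frame.
bool slice_free_to_render(int slice)
{
    return atomic_load(&slices_sent) > slice;
}

// First window_buffer row belonging to a slice.
int slice_first_row(int slice)
{
    return slice * SLICE_ROWS;
}
```

In this sketch the renderer works through slices 0 to 9 of the next frame, waiting on slice_free_to_render() before touching each one, so it only stalls if it catches up with the sender.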
Thanks for that excellent explanation, Geoff! Those are some impressive numbers. I'd really like to see your code.
DMA might help more than you think, and might be worth looking into more. You pay a small setup price, then hand it off. It doesn't increase the bandwidth, but might increase your throughput, and it's hardware you aren't using.
I also wanted to be a bit constrained by the hardware for my project. It's no wonder we came to very similar solutions for the render pipeline, as the Pico is nicely limited in RAM. I actually got a stack overflow due to not storing App statically!
A little update on my project: I did some simple profiling via the IO pins, toggling pin 3 on at the beginning of renderAndFlushFrame() and off again at the bottom. I saw that moving the hot functions from XIP to SRAM really helped smooth things out. You just wrap the function name in the __no_inline_not_in_flash_func() macro. My game renderer was choking on line-dense areas before the change: trouble spots showed Core1 pegged for 1-2+ seconds. After, it was a regular clock signal, as it was able to breathe while the main loop (Core0) maintained 30 FPS. No hiccups.
Also, they've corrected the typo in the PICO_CLOCK_AJDUST_PERI_CLOCK_WITH_SYS_CLOCK definition, just an fyi.
My raycaster needs floating point values, but the floats library is too slow. I implemented 16.16 fixed point in assembly (multiply, divide, square and root) so I could take advantage of the fast integer hardware, and all of them are in SRAM since they're called from all over my code. I never considered whether any of my C functions are used enough to justify being in SRAM.
I copy-pasted PICO_CLOCK… from the C SDK manual, but I now see the back of the manual corrects the spelling. Now that you've pointed it out I'll fix my code.
It's not a question of being used enough, it's about whether you need predictable timing and what the access pattern is. If a function is sufficiently hot (and short!) it will just remain permanently cached anyway.
For divide, an assembly routine isn't necessary; the Pico has an 8-cycle hardware divider. That is extremely hard to beat.
Translated to 16.16 fixed point that becomes hex 0x00039000 / 0x00024000 = 0x00019555 (if I did my math right) or in integer 233472 / 147456 = 103765.
However, if I give the divide hardware dividend = 233472 and divisor = 147456, I believe it's going to come back with a quotient = 1 = 0x00000001 and a remainder = 86016 = 0x00015000. I don't see how this can be converted to the correct 16.16 fixed point answer.
This is 128-bit/64->64 using two 64-bit divides
Your case would be 64-bit/32->32 using two 32-bit divides.
I'm like 95 % sure the pico sdk implements something like this already; the sdk's handbook has some benchmarks that strongly suggest it.
I had to look into this to perform division-by-constant-multiplication for cases where gcc can't do it.
Thank you. Your post forced me to dig into the SDK.
Tricky to understand how to use the hardware divider, but using div_s64s64() with some type conversion and shifting is faster than my assembly division.
I had improved my sprite code so my frame rate range was 52-77 FPS. After switching to the hardware divider I'm now at 57-77 FPS.
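For anyone following along, the "type conversion and shifting" amounts to widening the dividend to 64 bits and scaling it by 2^16 before dividing, so the fraction bits survive the integer divide. A minimal portable sketch, where the fix16_t and fix16_div names are illustrative rather than taken from my code:

```c
#include <stdint.h>

typedef int32_t fix16_t; // 16.16 fixed point (illustrative name)

// Divide two 16.16 values. The dividend is widened to 64 bits and
// multiplied by 65536 (the same as a 16-bit left shift, but well
// defined for negative values) before the integer divide.
// No check is made for b == 0 or for a quotient that overflows 32 bits.
fix16_t fix16_div(fix16_t a, fix16_t b)
{
    int64_t scaled = (int64_t)a * 65536;
    return (fix16_t)(scaled / b);
}
```

With the earlier example, fix16_div(0x00039000, 0x00024000) returns 0x00019555. On the Pico, the plain `/` would be swapped for div_s64s64(scaled, b) from the SDK's divider header, which is what I ended up doing.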