Check out my game in development:
KeithDobbelaere/GeometryVibes3D-PicoCalc
A faithful RP2040 port of Geometry Vibes 3D, targeting the ClockworkPi PicoCalc.
Current state: basic gameplay loop (scrolling level, ship movement, and collision) rendered in wireframe “fake 3D” using fixed-point math.
Features (so far)
- Fixed-point 3D camera + projection (no frame-time floats required)
- Streaming level playback (GVL1 / 56-bit column format), reads columns on demand from storage
- Ship controls (keyboard input; 45° up/down travel like the original)
- Collision detection against level geometry, including rotation/inversion modifiers
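As a rough illustration of what a fixed-point camera projection can look like, here is a minimal sketch. It is not taken from the repo: the Q16.16 format, the `FOCAL` constant, the screen-center values, and all type and function names are assumptions made for the example.

```cpp
#include <cstdint>

// Q16.16 fixed-point: 16 integer bits, 16 fractional bits (assumed format).
using fx = int32_t;
constexpr int FX_SHIFT = 16;
constexpr fx fx_from_int(int v) { return static_cast<fx>(v) * (1 << FX_SHIFT); }

// Multiply two Q16.16 values, widening to 64 bits for the intermediate product.
inline fx fx_mul(fx a, fx b) {
    return static_cast<fx>((static_cast<int64_t>(a) * b) >> FX_SHIFT);
}

// Divide two Q16.16 values, pre-scaling the numerator in 64 bits.
inline fx fx_div(fx a, fx b) {
    return static_cast<fx>((static_cast<int64_t>(a) * (int64_t{1} << FX_SHIFT)) / b);
}

struct Vec3fx { fx x, y, z; };
struct Point2i { int x, y; };

// Hypothetical camera constants for a 320x320 screen.
constexpr int SCREEN_CX = 160;
constexpr int SCREEN_CY = 160;
constexpr fx FOCAL = fx_from_int(160);

// Project a camera-space point (z > 0) to pixel coordinates.
inline Point2i project(const Vec3fx& p) {
    fx inv_z = fx_div(fx_from_int(1), p.z);   // one division per vertex
    fx scale = fx_mul(FOCAL, inv_z);          // focal / z
    int sx = SCREEN_CX + (fx_mul(p.x, scale) >> FX_SHIFT);
    int sy = SCREEN_CY - (fx_mul(p.y, scale) >> FX_SHIFT);  // screen y grows downward
    return {sx, sy};
}
```

Computing `1/z` once and multiplying keeps the per-vertex cost at a single division, which matters on a core without a hardware divide instruction.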
Hardware integration highlights
- ILI9488 320×320 display: dual-core line raster + slab binning + SPI DMA streaming (~35 FPS)
- SD card + FAT32: stream levels/*.BIN columns on demand (no full level in RAM)
- PicoCalc keyboard: polled input via the device driver layer
Toolchain
Current state: playable wireframe “fake 3D” implementation with a level-select menu, HUD, portal/ship effects, and stable fixed-rate rendering on the ILI9488.
Features (Updated)
- Fixed-point 3D camera + projection (no frame-time floats required)
- Streaming level playback using the GVL1 / 56-bit column format
- Level-select menu with highlighted selection
- HUD layer with:
- Ship controls with 45° up/down travel like the original
- Collision detection against level geometry, including rotation/inversion modifiers
- Wireframe effects, including:
  - animated portal rays
  - ship trail
  - ship explosion chunks
Hardware / platform highlights
- ILI9488 320×320 display
- SD card + FAT32
- PicoCalc keyboard
  - polled through the platform/input layer
Rendering notes
- Fixed-capacity render lists using static storage
- Major-axis slab line rasterization for cleaner wireframe output
- ROM-resident 8×8 bitmap font
- Cached screen-space text objects for HUD and menu rendering
Tools
Toolchain
First release ready to download and try out!
Geometry Vibes 3D for PicoCalc - v0.4.0-beta.1
First public PicoCalc release with animated obstacle groups, updated tooling, and gameplay progression.
Highlights
- Playable wireframe Geometry Vibes 3D experience on ClockworkPi PicoCalc
- Title screen, level select, HUD, portal effects, ship trail, and explosion effects
- Animated obstacle groups with grouped primitive definitions in the level format
- Runtime rendering and collision support for animated groups
- Updated Python level editor with:
Included
Controls
Notes
- This is an early public release and is being published as a pre-release while broader testing continues.
- Animated obstacle collision appears solid in current testing, but more gameplay testing is still planned.
Installation
- Put the PicoCalc into BOOTSEL mode
- Copy the included .uf2 to the device
- Make sure the SD card contains the required files:
Nice job for the optimization of the screen rendering!
Thank you. That display class took the longest, but without it, the game wouldn’t be possible. Utilizing both cores to ping-pong slabs using DMA over SPI was a challenge.
The game is fully featured now, and fun to play, so check it out—I have v0.5.0-beta.1 posted on GitHub. I think we squeezed quite a bit out of the Pico 1 on this platform, and the SPI display was a real challenge. Thanks to anyone who’s shown an interest, so far. Thanks to BlairLeduc for his driver code and to Kuratius for the camera optimizations.
I should also mention, there’s a full-featured editor if you want to create custom levels with custom animated groups, primitive painting, star placement, custom colors, etc. Then it’s as simple as clicking Export Bin and you have a new level. All 10 default levels are in the repo as *.json files and can be loaded into the editor if you prefer to just modify those.
I think you could even implement some rudimentary 3D, as long as it's limited to scenes where overdraw isn't a problem and only solid-color triangles are used. I think a Duff's device (jumping into an unrolled loop via a switch statement) and some assembler using fixed registers to use stm instructions could help with that. You have nearly 10x the raw horsepower of the GBA, and people manage to get 10-15 fps using software rendering on it for fairly complex scenes, like with OpenLara. Thumb is on average maybe 2-3x slower than ARM because the instructions aren't very powerful, but the higher clock might compensate for it.
It’s probably not necessary for this project except for maybe the rectfill function, but it might be useful if the renderer is used for other things.
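For anyone unfamiliar with the trick, here is a minimal sketch of a Duff's device applied to a solid-color span fill. The function name `fillSpan` and the 16-bit pixel type are assumptions for illustration; it does not show the stm/assembler part of the idea, and it is not code from the project.

```cpp
#include <cstddef>
#include <cstdint>

// Fill `count` 16-bit pixels with `color` using Duff's device.
// The switch jumps into the middle of an 8x unrolled loop to absorb the
// remainder (count % 8); the do/while then runs the remaining full passes.
inline void fillSpan(uint16_t* dst, std::size_t count, uint16_t color) {
    if (count == 0) return;                 // the do/while would otherwise write once
    std::size_t passes = (count + 7) / 8;   // total loop passes, including the partial one
    switch (count % 8) {
        case 0: do { *dst++ = color;
        case 7:      *dst++ = color;
        case 6:      *dst++ = color;
        case 5:      *dst++ = color;
        case 4:      *dst++ = color;
        case 3:      *dst++ = color;
        case 2:      *dst++ = color;
        case 1:      *dst++ = color;
                } while (--passes > 0);
    }
}
```

Whether this beats a plain loop depends on the compiler and optimization level, so it would need measuring on the target before adopting it.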
Also, I left some explanations on the commit for the comments you added to my code. I think you were confused about how it works, so maybe they're helpful.
It might be worth checking if the game code can be compressed enough to execute from RAM, although it's probably not a very high priority in any case.
Another thing I noticed is that GCC seems to be very reluctant to use the uxth instruction for extracting the lower 16 bits of an unsigned int, even though it ought to be faster than shifting or applying a bitmask via bic. I don't really know what's up with that. Even a cast to uint16_t seems to get compiled to bit shifts instead of uxth. Maybe it's hoping to just use strh instead and skip the instruction?
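To make the comparison concrete, these are three source-level ways to take the low 16 bits that should all be equivalent. The function names are made up for the example; which instruction each one compiles to depends on the compiler version and flags, so no particular output is claimed here.

```cpp
#include <cstdint>

// Three equivalent ways to keep only the low 16 bits of a 32-bit value.
// On Cortex-M0+ a compiler could implement any of them with uxth,
// a shift pair, or a mask; inspect the generated assembly to see which.
inline uint32_t low16_mask(uint32_t v)  { return v & 0xFFFFu; }
inline uint32_t low16_cast(uint32_t v)  { return static_cast<uint16_t>(v); }
inline uint32_t low16_shift(uint32_t v) { return (v << 16) >> 16; }
```

Compiling a file like this with `arm-none-eabi-gcc -mcpu=cortex-m0plus -mthumb -O2 -S` and reading the output is one way to check what a given GCC version actually picks.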
Which commit are you referring to?
aa8748b but I think you found it already.
Regarding the uxth issue, a GCC developer responded to my bug report; apparently cortex-m0plus has a broken cost model.
Also, I have a question about how the renderer works. My understanding is that each core renders into its own buffer and then kicks off a DMA when it is done.
Does this actually require two cores to work?
Can the DMA transfers from both cores run simultaneously?
Or is the issue that a single core can't run DMA while rendering to a separate part of the buffer due to access contention?
I thought I remembered reading that the memory banks of the Pi Pico are interleaved, so memory contention should be minimal.
https://forums.raspberrypi.com/viewtopic.php?f=145&t=311811
The DMA is separate hardware from the CPUs. It's a ping-pong/double-buffered setup. While one buffer is being sent out through the display pipeline, the other is being built. Then they swap. The second core helps manage that pipeline, but the actual SPI data movement is still done by DMA.
When I first started, single core, no DMA, I was only getting ~19 FPS pushing very, very rudimentary graphics via SPI. If you increase SLAB_ROWS past 32, you run out of RAM. So, it's all to work around SPI and RAM limitations.
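The ping-pong idea can be sketched on a desktop with a stand-in for the DMA channel. Everything here is hypothetical scaffolding (`FakeDma`, `renderSlab`, `renderFrame`, the 16-bit pixel type); it only models the ordering of render, wait, start, and swap, not the real display class or the second core.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int W = 320;
constexpr int H = 320;
constexpr int SLAB_ROWS = 32;

// Stand-in for the SPI DMA channel. On hardware the transfer runs in the
// background; here it is completed lazily in wait(), so the source buffer
// must stay untouched between start() and wait(), just like a real DMA.
struct FakeDma {
    const uint16_t* src = nullptr;
    std::size_t count = 0;
    std::vector<uint16_t> panel;  // pixels the "display" has received

    void start(const uint16_t* data, std::size_t n) { src = data; count = n; }
    void wait() {
        if (src) panel.insert(panel.end(), src, src + count);
        src = nullptr;
    }
};

// Hypothetical slab renderer: writes each pixel's row index as its color.
inline void renderSlab(uint16_t* slab, int y0, int rows) {
    for (int r = 0; r < rows; ++r)
        for (int x = 0; x < W; ++x)
            slab[r * W + x] = static_cast<uint16_t>(y0 + r);
}

inline void renderFrame(FakeDma& dma) {
    static uint16_t slabs[2][W * SLAB_ROWS];  // the two ping-pong buffers
    int ping = 0;
    for (int y0 = 0; y0 < H; y0 += SLAB_ROWS) {
        int rows = (H - y0 < SLAB_ROWS) ? (H - y0) : SLAB_ROWS;
        renderSlab(slabs[ping], y0, rows);  // build while the other slab is in flight
        dma.wait();                         // previous transfer must finish first
        dma.start(slabs[ping], static_cast<std::size_t>(W) * rows);
        ping ^= 1;                          // swap buffers
    }
    dma.wait();                             // drain the final slab
}
```

With a single buffer, the next `renderSlab` call would overwrite pixels that the transfer has not sent yet, which is the hazard the second buffer removes.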
There’s probably a lot of room for improvement. I just quit when I got the frame rates I wanted.
To this point, maybe we could put the hottest display routines into SRAM. I’d never considered that before.
I think it depends on what the access pattern to them is. If it's a very slow routine that takes a lot of time but isn't hot in the sense of being called often enough to remain in cache (like in a loop), then the RAM will make it faster.
For functions inside loops it will probably not be noticeable, unless they contain extremely large unrolled loops or switch statements, because the cache will handle the accesses.
It will probably make performance more predictable, though.
I'm also not entirely certain how static (rather than malloc'ed) memory is handled on the Pi Pico. In the absolute worst case they might set the cache to write-through mode and have it resident in flash or something. I would hope only static const arrays are in flash.
I haven’t done any profiling, so I have no idea where the bottlenecks would be. On the display side, Core0 handles binning, and Core1 is basically the renderer. I might try marking some functions for SRAM storage—some of the more self-contained draw functions, maybe:
drawLineIntoSlab()
drawLineXMajorIntoSlab()
drawLineYMajorIntoSlab()
I’d do them one at a time and see if it improves anything.
So, per Google, we'd just wrap the function in __no_inline_not_in_flash_func(…), right?
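A minimal sketch of how that macro is typically applied: it wraps the function name in the definition rather than the whole function. The routine below (`drawHSpanIntoSlab`) is a made-up stand-in, not one of the project's draw functions, and the fallback `#define` exists only so the snippet also compiles on a desktop. Applying the macro to a class member function may need a different form, which is not shown here.

```cpp
#include <cstdint>

// In a Pico SDK build, include "pico/stdlib.h" before this point; the SDK
// then provides the macro and the function is placed in RAM.
// The fallback below is only for desktop builds and does nothing special.
#ifndef __no_inline_not_in_flash_func
#define __no_inline_not_in_flash_func(name) name
#endif

constexpr int W = 320;  // slab width in pixels

// Hypothetical hot routine: fill pixels x0..x1 (inclusive) on one slab row.
// Note that the macro wraps only the function NAME in the definition.
void __no_inline_not_in_flash_func(drawHSpanIntoSlab)(uint16_t* slab, int row,
                                                      int x0, int x1,
                                                      uint16_t color) {
    uint16_t* p = slab + row * W + x0;
    for (int x = x0; x <= x1; ++x) *p++ = color;
}
```

Moving one function at a time and re-measuring, as suggested above, makes it easy to tell which move actually helped.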
From what I understand, yes. But there's a good chance this isn't very noticeable on functions that are called more than once per frame. On Nintendo DS there is a similar situation, with ITCM (static instruction memory hooked up directly to the CPU, same speed as cache) being the equivalent of RAM and RAM being the equivalent of flash, since they are hooked up to different bus clocks (the cache clock is tied to the CPU clock at 60 to 120 MHz, RAM is tied to the 33 MHz bus clock), and there is an additional cache system on top of it. Typically ITCM is most beneficial for things that run on an interrupt or that need to run during a DMA, since on Nintendo DS starting a DMA blocks accesses to anything other than cache and ITCM.
I wired pins 2, 3, and 21 to Core0, Core1, and DMA respectively, and hooked them up to my scope. Unsurprisingly, Core1 is doing the vast majority of the work in the display class but does have a tiny bit of down-time. More importantly, though, DMA stays pegged the whole time. So, I think we're maxed on throughput.
void Ili9488Display::renderAndFlushFrame(const Frame& f) {
    probe_on(PIN_PROBE_CORE1);
    …
        if (slabIndex != 0) {
            wait_for_spi_dma_idle();
            probe_off(PIN_PROBE_DMA);
        }
        start_dma_slab(slab, W * rows);
        probe_on(PIN_PROBE_DMA);
        ping ^= 1;
    }
    wait_for_spi_dma_idle();
    spi_set_format(spi1, 8, SPI_CPOL_0, SPI_CPHA_0, SPI_MSB_FIRST);
    gpio_put(PIN_CS, 1);
    probe_off(PIN_PROBE_CORE1);
}
I had my daughter playing while I watched the scope, and when entering denser regions, Core1 does become 100% utilized. So there is room to improve!