Like many others I bought the DevTerm with the R01 module to give the RISC-V CPU a test drive. I wanted to get others’ impressions. It seems like the RISC-V core is really slooow for a 64bit CPU running @ 1GHz. I haven’t tried it yet… but I’m pretty sure my PocketCHIP with an Allwinner R8 ARM32 single core @ 1GHz will out-compile my new 64bit barn burner.
This isn’t a complaint. I was just thinking it would be a better cruncher. Anyone else have any impressions/experience?
EDIT: This thread has garnered some good technical information. After running through various benchmarking efforts I wanted to sum up my findings here:
As far as integer, floating point and general execution (decisions, branching, calls, …) go the D1 is on par with my ARM64 board. This means that with a more modern core implementation the RISC-V would likely be impressive.
The 16bit RAM bus is a contributing factor to the slowness, but it is nowhere near enough to account for my poor compile times. Until further evidence surfaces I can only chalk that up to software (gcc / g++).
While the D1 does have hardware to accelerate low level 2D video operations, handle the 90degree screen rotation and do video stream decoding, it doesn’t appear that the supplied kernel or Xorg have support for them. Since a 24bit frame buffer for this size screen is sizeable, video is sluggish, especially if the single core is busy with something else.
There are some software package hints below which do improve the speed with which package removal and installation happen.
To sum up I’d say that the RISC-V 64 architecture seems to perform about the same as ARM64. This is a budget SOC and it performs as such. What really seems to be missing is the software optimizations that are common for the other architectures. So with time we could see this board perform on par with things like the Pine A64, … if it was limited to one core.
My takeaway is that the RISC-V architecture is intriguing enough I want to take a deeper dive into it. And… well… that’s what I bought this for. But if you’re looking to play movies or compete in the latest “3D shoot 'em up” you want a different compute module!
EDIT: I added the A64 time from a Pine A64 LTS SBC and notes below.
OK! Off to the races. g++ compiling one of my simpler libraries:
Allwinner A64 (ARM64): 23.9s
Allwinner R8 (Oooold ARM32): 56.4s
Allwinner D1 (RISC64) on the R-01 module: 115.1s
Yup! It’s WAAY slower… Half the speed of ARM32… Will have to try some other things out. I wonder how much of this is G++ version differences. Deb 8 vs. Deb11+
The Pine A64 LTS is running Armbian based on Deb10. For full disclosure it’s also running an eMMC module instead of uSD for storage. According to the various load meters on the R-01 this is mostly CPU intensive, which fits my general experience.
To compare apples-and-apples the time above is from a single threaded build on the A64. Running a 4 thread (4 cores) build resulted in an expected reduction of build time to 9.3s.
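To put a number on that 4-thread result, here is a small sketch of the arithmetic using only the two A64 times quoted above. The function names are just illustrative. It works out to roughly a 2.6x speedup rather than a perfect 4x.

```python
# Build times quoted in the post (seconds) for the Pine A64 LTS.
single_thread_s = 23.9
four_thread_s = 9.3
cores = 4


def speedup(serial_s: float, parallel_s: float) -> float:
    """How many times faster the parallel build is than the serial one."""
    return serial_s / parallel_s


def efficiency(serial_s: float, parallel_s: float, n_cores: int) -> float:
    """Fraction of the ideal n-core speedup that was actually achieved."""
    return speedup(serial_s, parallel_s) / n_cores


print(f"speedup:    {speedup(single_thread_s, four_thread_s):.2f}x")
print(f"efficiency: {efficiency(single_thread_s, four_thread_s, cores):.0%}")
print(f"ideal time: {single_thread_s / cores:.1f}s")
```

The gap to the ideal figure is normal for a real build, since linking and any serial steps do not spread across cores.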
I just performed some purely cerebral busy work for a single CPU core. This finds the prime numbers between 1 and 131,072 in the most inefficient means possible. This is just some plain ole integer math with some register moves, stack operations and sub routine calls thrown in for good measure. These are the core operations of most programs. The results are interesting:
Pine64 A64 (ARM64): 14.7s @1.1GHz
CPI R-01 (RISC-V): 21.9s
CHIP R8 (ARM32): 42.0s
For comparison my AMD Ryzen @ 3.5GHz takes 2.9s. But that’s hardly fair. And I realized I hadn’t verified the clock rate on the A64 so I did and was surprised it was slightly faster at 1.1GHz, instead of the 1GHz of the other two. So it does get a speed advantage… but certainly not enough to make up for the tests in the previous post. I’ll have to see if I can tune it back to an even 1GHz.
This test shows a doubling of performance when comparing the 64bit RISC-V CPU to the older ARM 32bit CPU. I find this fascinating since all the math in this code is well within range of a 32bit int. I’m going to have to do some encryption tests that would make use of the extended word size.
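The actual test program isn't posted in this thread, so the following is only a guess at the general shape of such a "most inefficient means possible" prime search: plain trial division with one function call per candidate. It is written in Python for readability, so its timings are not comparable with the figures above. `is_prime` and `count_primes` are made-up names.

```python
import time


def is_prime(n: int) -> bool:
    """Deliberately naive: try every divisor from 2 up to n - 1."""
    if n < 2:
        return False
    for d in range(2, n):
        if n % d == 0:
            return False
    return True


def count_primes(limit: int) -> int:
    """Count the primes in 1..limit inclusive, one call per candidate."""
    return sum(1 for n in range(1, limit + 1) if is_prime(n))


if __name__ == "__main__":
    limit = 8_192  # the post used 131,072; smaller here to keep the run short
    start = time.perf_counter()
    found = count_primes(limit)
    elapsed = time.perf_counter() - start
    print(f"{found} primes up to {limit} in {elapsed:.2f}s")
```

Everything here fits comfortably in a 32bit int, which matches the observation above that the word size alone shouldn't explain the difference.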
Looking at this I’m beginning to suspect the bottle-neck is the D1’s 16bit RAM data bus. The A64 has a 32bit bus. Still this is interesting and gives me hope that the RISC-V in a proper vehicle will have competitive performance. This reminds me of the original IBM PC with the 8088 instead of the 8086. There is no way the 8088 was performing anywhere near what the 8086 manual spec’d. A Z80 with a fraction of the clock rate could easily trounce it.
I wonder if the bottleneck might be disk speed? I haven’t done any benchmarks on mine but the time when I most notice it is slow (and I mean REALLY slow) is when I’m installing/updating via apt. It takes way longer than I would expect but in normal usage it feels mostly fine.
Yes, I’ve considered this. I simply hadn’t gotten around to testing it yet. Well… not until this moment. I’ve used a range of SBCs and most have had uSD for storage. But the R-01 just felt more sluggish than the others. I was hoping the new architecture without the legacy baggage of the other common architectures would bring surprising levels of performance. So far I am surprised… but in the negative direction.
So I’ve done some sustained read tests this morning and I would say that the R-01 is on par with other SBCs I’ve used:
R-01 (D1, 1xRISC-V64, 1.0GHz): 23.8MB/s
OrangePi Zero (H2+,4xARM32, 1.2GHz): 23.7MB/s
Pine64 (A64, 4xARM64, 1.1GHz): 22.0MB/s (eMMC module)
CHIP (R8, 1xARM32, 1.0GHz): 10.5MB/s (nand flash)
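The post doesn't say which tool produced these numbers. As a rough idea of what a sustained read test measures, here is a minimal sketch that reads a file from start to finish in 1MiB chunks and reports MB/s. The function name and chunk size are arbitrary choices, not anything from the thread.

```python
import os
import tempfile
import time


def sustained_read_mb_s(path: str, chunk_size: int = 1024 * 1024) -> float:
    """Read `path` start to finish in fixed-size chunks; return MB/s (10^6 bytes)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / 1e6 / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Demo on a small scratch file; point it at a large file for a real test.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    print(f"{sustained_read_mb_s(tmp.name):.1f} MB/s")
    os.remove(tmp.name)
```

For a figure that reflects the storage rather than RAM, the file needs to be much larger than free memory, or the page cache needs to be dropped first. The scratch-file demo will mostly be measuring the cache.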
The reason I hadn’t rushed to run these tests was the results of the compile tests near the top. I’ve learned that compiling is mostly CPU intensive. Way back before multi-core chips I was running multiple microprocessor machines (2xPIII@600 and such). What I found is that the more cores you can throw at a multi-file compile the faster it is, the time being divided linearly by the number of CPUs. In other words: 4 CPUs cut the compile to 1/4th of a single core’s time.
These storage benchmarks bring that out in sharp relief. The slowest storage I have was the CHIP with the R8. Yet it compiles my library in half the time of the R-01.
I have this eMMC microSD card for the Raspberry Pi which I’ve been meaning to try, but it doesn’t quite fit into the slot when the case is on (and I’ve not popped it back off lately).
Also, I’ve been adding swap when I’ve been setting up new SD cards, but almost never find myself hitting it…
More experiments to be done
uSD is definitely slow. At least in terms of today’s storage. A typical USB3 attached SSD is delivering between 200 to 300MB/s. That’s one of the reasons I bought the eMMC module for my Pine A64. It _felt_ faster, but not like WOW fast. Turns out according to the numbers above it’s slightly slower. I suppose it would be fascinating to restore the FS to uSD and do the benchmark again. But that’s a project for another time… and maybe a different forum.
I bring that up because I’m really curious to know if your USB stick would be any faster. I have a USB adapter for my eMMC module… maybe I can do both tests at the same time!
But I think in short the R-01’s problem is somewhere else if half the CPU and storage speed can deliver twice the performance. :-/
yeah I saw in another thread that apt slowness seems to come from systemd errors
Yeah systemd has to go! The APT/dpkg data structures aren’t particularly efficient either.
I don’t think you’d be able to use a USB->eMMC adapter as that would require boot from USB? This has the advantage that it goes straight in to the microSD / TF slot - not sure what kernel options or mods would be needed to have it work from USB. I know that the Pi can benefit from booting from SSD over USB but that required various firmware changes before it got to that point.
The systemd issues appear to be at least partly related to kernel options / functions (not) being present, and I’ve not dared to try a kernel rebuild yet.
I’ve mostly improved apt performance (not in the download/unpack/install cycle, but in the scripts and triggers sections) by removing the motd-news-update packages that were seemingly getting it stuck.
I wouldn’t use it for booting. Just test the eMMC module via a USB connection. I suppose if I were to attempt to run off of the USB I’d configure a uSD card to be the boot loader and point it to mounting the USB device as root. I once had a PPC Mac booting off of a “compact flash” card and mounting Linux from a SATA RAID controller.
re. kernel rebuild: LOL! I haven’t custom built a kernel in a long while. Takes about a day just to cruise through the options anymore. Everything has turned into such a huge time-suck. But one of these days I’d like to attempt a kernel build again. Probably better get set up for cross building from x86 because there isn’t enough time left in the universe to compile a kernel on the R-01.
I was toying with fbench to give me some floating point ops to test. I ran it on my same three machines for 1M iterations. Here are the results:
Pine A64 (A64, ARM64, 1.1GHz): 10.7s
R-01 (D1, RISC-V 64, 1.0GHz): 12.6s
CHIP (R8, ARM32, 1.0GHz): 53.1s
The RISC-V 64 is no slouch in the FLOPs department. The variation between the D1 and A64 seems to be very close to what I calculate the clock speed difference at.
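To make the clock-speed comparison concrete, here is a quick sketch that scales each fbench time to a notional 1.0GHz. It assumes run time is inversely proportional to clock rate, which is a simplification. The times and clocks are the ones listed above.

```python
# fbench wall times (seconds) and clock rates (GHz) as quoted in the post.
results = {
    "Pine A64 (ARM64)": (10.7, 1.1),
    "R-01 D1 (RISC-V 64)": (12.6, 1.0),
    "CHIP R8 (ARM32)": (53.1, 1.0),
}


def normalized_to_1ghz(seconds: float, ghz: float) -> float:
    """Estimate the run time at 1.0GHz, assuming time scales inversely with clock."""
    return seconds * ghz


for name, (secs, ghz) in results.items():
    scaled = normalized_to_1ghz(secs, ghz)
    print(f"{name:22s} {secs:5.1f}s measured, {scaled:5.2f}s at 1GHz")
```

Under that assumption the A64 lands at about 11.8s against the D1’s 12.6s, so a small gap remains after the clock difference is taken out.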
Given there are not a whole lot of FLOPs in a compiler and the R-01’s performance is respectable I’m leaning to the RAM subsystem being slow. Now to come up with a test for that…
I ran “mbw” the Memory BandWidth benchmark on all three devices. I don’t have the exact figures and logs. My battery went flat and I had to wait in line for the only USB-C cable we have in the house to recharge my DevTerm. I have USB-C cables on order. :-/ So all I have at the moment are my general recollections and impressions. I started running 200MB tests. But the CHIP (R8) only has 512MB and that means that allocating 400MB (200*2) of RAM isn’t really an option. So I need to re-plan and re-run the tests so I can get accurate results.
Basically the A64 (64bit ARM) did the test in ~5s, while the D1 (R-01) did it in just under 9s (50% longer). But the R8 running half the test did it in 13s. My general feeling is that while a 16bit RAM bus on the D1 is a handicap, it’s significantly faster than the R8. Since my compile times are twice as long as the R8… this isn’t the smoking gun. Just an indicator that a better SOC/MoBo architecture would bring more performance to the table.
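For anyone without mbw handy, the idea behind this kind of test can be sketched as a timed bulk copy between two buffers. This is not mbw itself, just an illustration with a made-up function name. It also shows why a 200MB test needs about 400MB of RAM: the source and destination buffers both have to exist at once.

```python
import time


def copy_bandwidth_mib_s(size_mib: int, repeats: int = 3) -> float:
    """Copy a size_mib buffer `repeats` times; return the best rate in MiB/s."""
    src = bytearray(size_mib * 1024 * 1024)
    dst = bytearray(len(src))  # second buffer, so RAM use is 2 * size_mib
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment is a bulk in-place copy
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, size_mib / elapsed)
    return best


if __name__ == "__main__":
    size = 32  # MiB per buffer; small enough for a 512MB board
    print(f"{size} MiB copy: {copy_bandwidth_mib_s(size):.0f} MiB/s")
```

Picking a buffer size that fits the smallest board (the 512MB CHIP) and using it on all three keeps the comparison even.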
In short the RISC-V CPU seems to be average in performance compared with the ARM 64. It would be hard to compare with an X86 architecture since they typically don’t run this slow. This leads me to believe that the thing that is slower is newer releases of software, potentially even the newness of the RISC-V 64 implementation in the kernel. So that is what I’ll pursue next, after a break to do some more real work with the DevTerm.
Oh! and, yes, as @andypiper points out (above) removing those Debian packages has certainly improved apt-get’s finish times. In other words software was at the heart of that problem.
(I’d have quoted Andy here… but that feature doesn’t work… at least not in a way that is obvious to me. I don’t have the patience to figure it out)
So, after package updates, my R01 is much much faster than it used to be. It reminds me of an Rpi1 or Rpi2. The main thing is just that there is zero accelerated video. Also, any disk access is very slow.
Just as a follow-on from the benchmarking efforts, I’ve been trying to run the sbc-bench tool with little success.
I wonder how much this could be stripped down to a very basic Linux installation (no X, reduced services etc) to make it more effective. I noticed various things causing CPU load even without me doing anything actively…
Thanks @andypiper for pushing on this. I haven’t written much more on this channel about the things I’ve done to try and pin down the sluggishness. Yes, there is a ton of software that could go away. I trimmed 2G before I even began with the speed testing. There are still WAAAY too many services starting at boot and a bunch more getting loaded over dbus. IMO all this bloat needs to go.
Frankly I got tired of the hours and hours spent reviewing the list of installed software and removing packages and wanted to do something fun / useful with my DevTerm. So I put the “fat trim” on hold.
Some of the other things I’ve tried to find the sluggishness:
- g++ v9 instead of 10 - saved 10secs, but not the 50% reduction I was looking for.
- compile without X11 - no appreciable difference
- compile without X11 and over an SSH connection, thinking maybe WiFi + SSH would show less overhead. No appreciable difference.
According to “gkrellm” my CPU use is pretty low most of the time, until I start to compile or install software. At this point I think most of the compiler slowness is likely in the compiler itself. I want to work more on pinpointing the slowness but I have bigger fish to fry right now. Frankly I didn’t think anyone else cared so I hadn’t planned on writing anymore here. If you’re interested I’ll post whatever else I find.
I’m curious about this sbc-bench. I might take a look at that.
I do think one of the reasons the video is so slow is that it appears to be a 90degree rotation all in software. The video RAM buffer should need just under 2MB so that’s a pretty large transform to be done in software on a single core unit. If you had multiple cores then one could be busy doing the transform and others could do work. But having to share the core would definitely impact anything working the video. I didn’t think that would impact a compiler run significantly. But I tested it anyways and it doesn’t appear to hinder my compile times.
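As a sanity check on the "just under 2MB" figure and on how much work a software rotation involves, here is a minimal sketch. The panel resolution isn't stated in this thread; 1280x480 at 3 bytes per pixel is assumed here, and the clockwise direction is picked arbitrarily. The function names are illustrative and this is not how the kernel or Xorg actually does it.

```python
def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Size of one full frame in bytes."""
    return width * height * bytes_per_pixel


def rotate90_cw(src: bytes, width: int, height: int, bpp: int = 3) -> bytearray:
    """Rotate a row-major width x height image 90 degrees clockwise.

    The result is `height` pixels wide and `width` pixels tall.
    """
    dst = bytearray(len(src))
    for y in range(height):
        for x in range(width):
            s = (y * width + x) * bpp
            # Source pixel (x, y) lands at column (height - 1 - y), row x.
            d = (x * height + (height - 1 - y)) * bpp
            dst[d:d + bpp] = src[s:s + bpp]
    return dst


if __name__ == "__main__":
    w, h = 1280, 480  # assumed panel size, not taken from the thread
    size = framebuffer_bytes(w, h)
    print(f"{size} bytes per frame ({size / 2**20:.2f} MiB)")
    print(rotate90_cw(b"ABCD", 2, 2, bpp=1))  # tiny 2x2 demonstration
```

With those assumptions a frame is 1,843,200 bytes, and every one of its 614,400 pixels has to be moved individually on each refresh, with scattered writes that are not kind to a small cache.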
One praise I will say about the DevTerm. I think the WiFi connection is the best of any WiFi device I have, including my laptop. It’s still WiFi, but it’s fairly pleasant.
I ran into something yesterday that likely has considerable bearing on the performance issue:
I was compiling Nettle, the encryption library, for i386 and amd64 architectures. I noticed several architecture specific directories that contain optimized assembler for specific architectures. And guess what? No RISC-V! This was the latest release tar-ball.
This reminded me that not only the kernel is going to need some time before the RISC-V support is up to speed but so will many other software packages. I noticed this with Android tablets and phones in the past: web browsers were slooooow… Same with rPi and other SBCs. Well… the problem wasn’t really the ARM CPUs! It was that desktops (Intel) had JIT compiler support and the JIT compilers hadn’t been written for the ARM architecture yet. Took years for the ARM chip to get the same treatment.
Law #4 bytes! So churning through megs of JS on a site, without the JIT compilation was hundreds to thousands of times slower on ARM until the compilers arrived.
In short it’s going to be a while before RISC-V gets enough penetration to get first class treatment by various software projects. I’ve been looking forward to learning a new machine language and that is why I have an R-01. It will be fun to compare to X86 and ARM.
You cannot expect stellar performance from the Allwinner D1 because it is NOT meant to be used as a desktop replacement in any way. It is meant to be used in embedded systems with little to no interface, and the CPU is mostly for driving the dedicated hardware around it (like the original Raspberry Pi if you look closely).
RV64IMAFDCVU may sound impressive, but this is just:
RV64I: 64 bit variant integer
M: Standard integer multiplication & division
A: Atomic instructions
F: Single precision floats
D: Double precision floats
C: Compressed instructions
V: Vector operations
And U is not standard.
Only 32K of I and D cache
Only 5 stage pipeline
Only one core at 1GHz, and I suspect it runs in order without any speculative branching.
You cannot really compare that with any reasonably modern CPU: the pipeline is really short, the cache is really small, and if it is in order that will have a huge performance impact for desktop purposes.
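The letter soup in an ISA string like RV64IMAFDCVU can be unpacked mechanically. Here is a small sketch using only the letter meanings listed above. `decode_isa` is a made-up helper, and it only handles single-letter extensions, so real-world strings with multi-letter extension names would need more work.

```python
import re

# Meanings as given in the list above; anything else is reported as unknown.
EXTENSIONS = {
    "I": "base integer instructions",
    "M": "integer multiplication and division",
    "A": "atomic instructions",
    "F": "single-precision floats",
    "D": "double-precision floats",
    "C": "compressed instructions",
    "V": "vector operations",
}


def decode_isa(isa: str):
    """Split e.g. 'RV64IMAFDCVU' into (64, [(letter, meaning), ...])."""
    m = re.fullmatch(r"RV(32|64|128)([A-Z]+)", isa.upper())
    if m is None:
        raise ValueError(f"not a simple RV ISA string: {isa!r}")
    width = int(m.group(1))
    letters = [(ch, EXTENSIONS.get(ch, "not in the list above")) for ch in m.group(2)]
    return width, letters


if __name__ == "__main__":
    width, letters = decode_isa("RV64IMAFDCVU")
    print(f"{width}-bit")
    for letter, meaning in letters:
        print(f"  {letter}: {meaning}")
```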
Thanks @Godzil those are some good details. Where did you get that alphabet hash (RV64IMAFDCVU)? Considering all the spy tools that speculative* give the cyber-bad-guys I’m not sure it’s such a good thing even though it is a significant boost in performance.
Considering all the handicaps you mention it does surprisingly well in the benchmarks. Fortunately for me I still remember how to get things done in only a few million clock cycles. Having a thousand, million seems REALLY HUGE.
More info about the D1: D1 - linux-sunxi.org