From what I remember, the ST Cortex-M4 parts have some "wannabe cache"-like feature called the "ART Accelerator" or something like that, which is supposedly mainly there to reduce flash wait states. But if it works like a normal data/instruction cache (I don't know any details here, I'd have to check the friendly manual), then a plain for loop is probably as good as it gets when accessing adjacent flash memory. That could mean that manual loop unrolling is actually harmful for optimization.
At any rate, it's fairly safe to assume that flash wait states are a bottleneck, so you should focus on minimizing branches.
For example, something like `column_sum & 1` is 0 or 1, so there shouldn't need to be a branch there. You have to disassemble to tell if it makes any difference, but maybe code like this will eliminate the branch:

```c
uint32_t bit = column_sum & 1;
even_column_code ^= mask * bit;
odd_column_code  ^= (7 - mask) * bit;
```
Come up with similar tricks to get rid of as many of those slow `if` statements as possible!
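As a sketch of the general trick (the function and variable names here are made up for illustration): turn the condition into an all-ones/all-zeros mask and AND it with the operand, so a conditional XOR becomes straight-line code:

```c
#include <stdint.h>

/* Branch-free equivalent of: if (x & 1) value ^= pattern;
   Negating the 0/1 condition in unsigned arithmetic yields an
   all-ones (0xFFFFFFFF) or all-zeros mask. */
static uint32_t cond_xor(uint32_t value, uint32_t pattern, uint32_t x)
{
    uint32_t mask = (uint32_t)-(x & 1u);  /* 0xFFFFFFFF or 0 */
    return value ^ (pattern & mask);
}
```

Unsigned negation is well-defined here (it wraps modulo 2^32), which is another reason to keep this kind of arithmetic in unsigned types.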
In general, you should never do bitwise arithmetic on small integer types or signed integer types - the former get implicitly promoted to the latter. See Implicit type promotion rules. What you are risking is that upon setting the MSB at any point, you could end up shifting a negative number, which is poorly-defined behavior and almost always a bug.
This means that all your `uint8_t` should be swapped for `uint32_t` - which is unlikely to affect performance on a Cortex M.
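To illustrate the hazard (a minimal sketch; the helper names are mine): the operand of `~` or `<<` is promoted to signed `int` first, so an expression that looks like 8-bit unsigned math actually happens in 32-bit signed math:

```c
#include <stdint.h>

/* a is promoted to (signed) int before ~ is applied, so the result
   is a sign-extended int, not an 8-bit value: ~0x55 == -0x56. */
static int bitwise_not_promoted(uint8_t a)
{
    return ~a;
}

/* Casting back truncates to the 8 bits you probably wanted. */
static uint8_t bitwise_not_8bit(uint8_t a)
{
    return (uint8_t)~a;
}

/* With uint8_t x = 0x80, the expression x << 24 would shift a 1 into
   the sign bit of the promoted int - undefined behavior. The same
   arithmetic done in uint32_t is well-defined: */
static uint32_t shift_safe(uint32_t x)
{
    return x << 24;
}
```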
The usual mini-optimization of promising no pointer aliasing between parameters is possible:

```c
static void compute256 (const uint8_t* restrict data, uint8_t* restrict code)
```

It may or may not have an effect; gcc seems more likely than other compilers to take advantage of it.
> However, I can't really use optimizations, as it tends to break some of STM's HAL
Well, you are done for then. When something like that happens, everyone needs to raise a support ticket with ST asking why their code sucks. If enough people do it, they will eventually have to hire professional programmers to fix the so-called "HAL".
Now if you are already stuck with their bloatware, what you can perhaps do is to play around with local optimization per translation unit: `#pragma GCC optimize ("O3")` and `#pragma GCC optimize ("O0")`. Brittle bloatware code gets the `-O0`, properly written C code gets optimized.
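A rough sketch of how that could look inside one translation unit (the function names are hypothetical, and the pragmas are gcc-specific - each one applies to all functions defined after it):

```c
#include <stdint.h>

/* Fragile vendor code: everything from here on compiles at -O0. */
#pragma GCC optimize ("O0")
void vendor_hal_init(void)   /* hypothetical HAL-style function */
{
    /* ... vendor code that misbehaves when optimized ... */
}

/* Hot, properly written code: switch to -O3 from here on. */
#pragma GCC optimize ("O3")
uint32_t checksum(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```

Alternatively, gcc also allows this per function with `__attribute__((optimize("O0")))`, which can be less brittle than pragma ordering.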
For what it's worth, one of the most likely reasons for hardware drivers breaking upon optimization is missing `volatile` qualifiers for register accesses or for variables shared with ISRs, DMA etc. So if you are lucky, the problem is just something trivial like that, not one of the hardest bugs to track down.
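A minimal sketch of both patterns (the register address and all names here are made up for illustration):

```c
#include <stdint.h>

/* Hardware register access must go through a volatile pointer, or the
   optimizer may cache, reorder or delete the reads/writes.
   The address below is hypothetical. */
#define TIMER_CNT  (*(volatile uint32_t *)0x40000024u)

/* A flag shared between main-line code and an ISR also needs volatile;
   without it, the busy-wait loop below may be "optimized" into an
   infinite loop that never re-reads the flag. */
static volatile uint32_t tick_elapsed = 0;

void SysTick_Handler(void)      /* hypothetical ISR name */
{
    tick_elapsed = 1;
}

void wait_for_tick(void)
{
    while (!tick_elapsed)       /* re-read on every iteration */
        ;
    tick_elapsed = 0;
}
```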