Answer by Lundin for Software ECC embedded with a parallel NAND Flash interface


From what I remember, the ST Cortex M4 parts have some "wannabe cache"-like feature, the "ART accelerator" or something like that. It is supposedly mainly there to reduce flash wait states. But if it works like a normal data/instruction cache (I don't know the details here, I'd have to check the friendly manual), then regular for loops are probably as good as it gets when accessing adjacent flash memory. That could mean that manual loop unrolling is actually harmful for optimization.

At any rate, it's fairly safe to assume that flash wait states are a bottleneck, so you should focus on minimizing branches.

For example, column_sum & 1 is either 0 or 1, so there shouldn't need to be a branch there. You have to disassemble to tell if it makes any difference, but maybe code like this will eliminate the branch:

uint32_t bit = column_sum & 1;
even_column_code ^= mask * bit;
odd_column_code  ^= (7 - mask) * bit;

Come up with similar tricks to get rid of as many of those slow if statements as possible!
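
If you want to verify, a small self-contained test along these lines (the if-based variant is my guess at roughly what your original code does, and the function names are placeholders) can be compiled with arm-none-eabi-gcc -O2 -S so you can compare the disassembly of the two versions:

#include <stdint.h>

/* If-based variant - my assumption of roughly what the original looks like. */
uint32_t with_branch (uint32_t column_sum, uint32_t mask,
                      uint32_t even_column_code, uint32_t odd_column_code)
{
    if (column_sum & 1u)
    {
        even_column_code ^= mask;
        odd_column_code  ^= 7u - mask;
    }
    return even_column_code ^ odd_column_code;
}

/* Branch-free variant from above. */
uint32_t without_branch (uint32_t column_sum, uint32_t mask,
                         uint32_t even_column_code, uint32_t odd_column_code)
{
    uint32_t bit = column_sum & 1u;
    even_column_code ^= mask * bit;
    odd_column_code  ^= (7u - mask) * bit;
    return even_column_code ^ odd_column_code;
}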


In general, you should never do bitwise arithmetic on small integer types or signed integer types - the former get implicitly promoted to the latter. See Implicit type promotion rules. What you are risking is that upon setting the MSB at any point, you could end up shifting a negative number or shifting into the sign bit, which is poorly-defined behavior and almost always a bug.

This means that all your uint8_t should be swapped for uint32_t - which is unlikely to affect performance on a Cortex M.
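
A contrived sketch of the hazard (not taken from your code):

#include <stdint.h>

uint32_t promotion_hazard (void)
{
    uint8_t byte = 0x80u;

    /* byte is implicitly promoted to (signed) int before the shift, so this
       shifts a 1 into the sign bit of a 32 bit int - undefined behavior,
       not the 0x80000000u one might expect. */
    uint32_t bad = byte << 24;

    /* With a 32 bit unsigned operand there is no promotion and no problem. */
    uint32_t word = 0x80u;
    uint32_t good = word << 24;

    return bad ^ good;
}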


The usual mini-optimization of promising the compiler that there is no pointer aliasing between the parameters is also possible:

static void compute256 (const uint8_t* restrict data, uint8_t* restrict code)

It may or may not have an effect; gcc seems more likely than other compilers to take advantage of it.
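
To illustrate what restrict buys (a generic sketch, not your actual compute256 body):

#include <stdint.h>

/* Without restrict, the compiler must assume that writing through code could
   modify what data points at, so data[0] has to be re-read on every iteration. */
void no_restrict (const uint8_t *data, uint8_t *code)
{
    for (uint32_t i = 0; i < 8u; i++)
    {
        code[i] = data[0] ^ (uint8_t)i;
    }
}

/* With restrict, the compiler is allowed to keep data[0] in a register for
   the whole loop. */
void with_restrict (const uint8_t *restrict data, uint8_t *restrict code)
{
    for (uint32_t i = 0; i < 8u; i++)
    {
        code[i] = data[0] ^ (uint8_t)i;
    }
}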


However, I can't really use optimizations, as it tends to break some of STM's HAL

Well, you are done for then. When something like that happens, everyone needs to raise a support ticket with ST asking why their code sucks. If enough people do it, they will eventually have to hire professional programmers to fix the so-called "HAL".

Now if you are already stuck with their bloatware, what you can perhaps do is play around with local optimization per translation unit: #pragma GCC optimize ("O3") and #pragma GCC optimize ("O0"). Brittle bloatware code gets -O0, properly written C code gets optimized.
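
Something along these lines (the file names are just placeholders); the pragma applies to every function defined after it in that translation unit:

/* ecc.c - your own, properly written code: */
#pragma GCC optimize ("O3")

/* stm32_hal_whatever.c - the brittle vendor file that breaks: */
#pragma GCC optimize ("O0")

gcc also has a per-function __attribute__((optimize("O0"))) if only a single HAL function turns out to be the culprit.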

For what it's worth, one of the most likely reasons for hardware drivers breaking upon optimization is missing volatile qualifiers for register access or for variables shared with ISRs or DMA etc. So if you are lucky, the problem is just something trivial like that rather than one of the hardest bugs to track down.
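
The classic pattern looks something like this (a generic sketch; the handler name is a placeholder, not STM-specific code):

#include <stdbool.h>

/* Flag shared between an interrupt handler and background code. Without
   volatile, the optimizer may read it once, cache it in a register and turn
   the while loop below into an infinite loop at -O2, even though the ISR
   keeps writing to the variable. */
static volatile bool transfer_done = false;

void DMA_TransferComplete_Handler (void)   /* placeholder ISR name */
{
    transfer_done = true;
}

void wait_for_transfer (void)
{
    while (!transfer_done)
    {
        /* busy-wait */
    }
}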

