From what I remember, the ST Cortex-M4 parts have some "wannabe cache"-like feature called the "ART Accelerator" or something like that, which is supposedly mainly there to reduce flash wait states. But if it works like a normal data/instruction cache (I don't know any details here, I'd have to check the friendly manual), then a plain for loop is probably as good as it gets when accessing adjacent flash memory. That could mean that manual loop unrolling is actually harmful for optimization.
At any rate, it's fairly safe to assume that flash wait states are a bottleneck, so you should focus on minimizing branches.
For example, something like `column_sum & 1` is 0 or 1, so there shouldn't need to be a branch there. You have to disassemble to tell if it makes any difference, but maybe code like this will eliminate the branch:

```c
uint32_t bit = column_sum & 1;
even_column_code ^= mask * bit;
odd_column_code  ^= (7 - mask) * bit;
```
Come up with similar tricks to get rid of as many of those slow `if` statements as possible!
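As a sketch of the general trick (the function and variable names here are made up for illustration): turn the condition into an all-ones/all-zeros mask and AND it with the operand, so a conditional XOR becomes straight-line code:

```c
#include <stdint.h>

/* Branch-free equivalent of: if (x & 1) value ^= pattern;
   Negating the 0/1 condition in unsigned arithmetic yields an
   all-ones (0xFFFFFFFF) or all-zeros mask. */
static uint32_t cond_xor(uint32_t value, uint32_t pattern, uint32_t x)
{
    uint32_t mask = (uint32_t)-(x & 1u);  /* 0xFFFFFFFF or 0 */
    return value ^ (pattern & mask);
}
```

Unsigned negation is well-defined here (it wraps modulo 2^32), which is another reason to keep this kind of arithmetic in unsigned types.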
In general, you should never do bitwise arithmetic on small integer types or signed integer types - the former get implicitly promoted to the latter. See Implicit type promotion rules. What you are risking is that upon setting the MSB at any point, you could end up shifting a negative number, which is poorly-defined behavior and almost always a bug.
This means that all your `uint8_t` should be swapped for `uint32_t` - which is unlikely to affect performance on a Cortex M.
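To illustrate the hazard (a minimal sketch; the helper names are mine): the operand of `~` or `<<` is promoted to signed `int` first, so an expression that looks like 8-bit unsigned math actually happens in 32-bit signed math:

```c
#include <stdint.h>

/* a is promoted to (signed) int before ~ is applied, so the result
   is a sign-extended int, not an 8-bit value: ~0x55 == -0x56. */
static int bitwise_not_promoted(uint8_t a)
{
    return ~a;
}

/* Casting back truncates to the 8 bits you probably wanted. */
static uint8_t bitwise_not_8bit(uint8_t a)
{
    return (uint8_t)~a;
}

/* With uint8_t x = 0x80, the expression x << 24 would shift a 1 into
   the sign bit of the promoted int - undefined behavior. The same
   arithmetic done in uint32_t is well-defined: */
static uint32_t shift_safe(uint32_t x)
{
    return x << 24;
}
```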
The usual mini-optimization of promising no pointer aliasing between parameters is possible:

```c
static void compute256 (const uint8_t* restrict data, uint8_t* restrict code)
```

It may or may not have an effect; gcc seems more likely than other compilers to take advantage of it.
> However, I can't really use optimizations, as it tends to break some of STM's HAL
Well, you are done for then. When something like that happens, everyone needs to raise a support ticket with ST asking why their code sucks. If enough people do it, they will eventually have to hire professional programmers to fix the so-called "HAL".
Now if you are already stuck with their bloatware, what you can perhaps do is to play around with local optimization per translation unit: `#pragma GCC optimize ("O3")` and `#pragma GCC optimize ("O0")`. Brittle bloatware code gets the `-O0`, properly written C code gets optimized.
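A rough sketch of how that could look inside one translation unit (the function names are hypothetical, and the pragmas are gcc-specific - each one applies to all functions defined after it):

```c
#include <stdint.h>

/* Fragile vendor code: everything from here on compiles at -O0. */
#pragma GCC optimize ("O0")
void vendor_hal_init(void)   /* hypothetical HAL-style function */
{
    /* ... vendor code that misbehaves when optimized ... */
}

/* Hot, properly written code: switch to -O3 from here on. */
#pragma GCC optimize ("O3")
uint32_t checksum(const uint8_t *data, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```

Alternatively, gcc also allows this per function with `__attribute__((optimize("O0")))`, which can be less brittle than pragma ordering.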
For what it's worth, one of the most likely reasons for hardware drivers breaking upon optimization is missing `volatile` qualifiers for register accesses or for variables shared with ISRs, DMA etc. So if you are lucky, the problem is just something trivial like that, not one of the hardest bugs to track down.
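A minimal sketch of both patterns (the register address and all names here are made up for illustration):

```c
#include <stdint.h>

/* Hardware register access must go through a volatile pointer, or the
   optimizer may cache, reorder or delete the reads/writes.
   The address below is hypothetical. */
#define TIMER_CNT  (*(volatile uint32_t *)0x40000024u)

/* A flag shared between main-line code and an ISR also needs volatile;
   without it, the busy-wait loop below may be "optimized" into an
   infinite loop that never re-reads the flag. */
static volatile uint32_t tick_elapsed = 0;

void SysTick_Handler(void)      /* hypothetical ISR name */
{
    tick_elapsed = 1;
}

void wait_for_tick(void)
{
    while (!tick_elapsed)       /* re-read on every iteration */
        ;
    tick_elapsed = 0;
}
```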