Answer by Lundin for c - memcopy for embedded system

Unfortunately, this function is quite naively written to the point where you would be much better off with a plain byte-by-byte loop:

for(size_t i=0; i<size; i++) { dst[i] = src[i]; }

Benchmark and disassemble! If you can't beat the above performance with your own version, then you shouldn't be writing your own version.
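As a rough starting point for such a benchmark, the baseline loop can be wrapped in a function and timed against any candidate. This is only a sketch: the names copy_bytes, copy_fn and time_copy are hypothetical, and it assumes a hosted environment where clock() is available.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Baseline: the plain byte-by-byte loop (hypothetical name copy_bytes). */
static void *copy_bytes(void * restrict dst, const void * restrict src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < size; i++) {
        d[i] = s[i];
    }
    return dst;
}

/* Any function with the same signature as memcpy can be passed in. */
typedef void *(*copy_fn)(void * restrict, const void * restrict, size_t);

/* Returns the CPU time in seconds spent on `reps` calls of `fn`.
   Assumption: clock() resolution is good enough for the chosen reps. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t size, unsigned reps)
{
    clock_t start = clock();

    for (unsigned r = 0; r < reps; r++) {
        fn(dst, src, size);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(copy_bytes, ...), time_copy(memcpy, ...) and the same for your own version gives comparable numbers. On a real embedded target, a hardware timer or cycle counter would likely replace clock(), and the disassembly should still be checked to confirm that the optimizer did not remove the loop being measured.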

  • As pointed out by other reviews, you have a major bug in the form of misaligned access at the beginning of the copy. You only seem to handle misalignment at the end.

  • There are also strict aliasing violations. In fact strict aliasing means that memcpy cannot be efficiently implemented in standard C - you must compile with something like gcc -fno-strict-aliasing or the compiler might go haywire. Many of the traditional embedded systems compilers don't abuse strict aliasing, but gcc has a history of doing so and it is getting increasingly popular.

  • restrict qualifying the pointers will help fix pointer aliasing hiccups and may optimize the code a tiny bit. This is the standard lib memcpy:

      void *memcpy(void * restrict s1,
                   const void * restrict s2,
                   size_t n);

  • Rolling your own memcpy probably means it should be static inline and placed in a header, or otherwise it will by definition always be much slower than the standard lib one.

  • You need to minimize the number of branches. The only branches that should exist in this function, if any, are the ones correcting for misaligned addresses at the beginning and the end of the data. Branches are very bad since they are likely the major bottleneck here - if the CPU can't utilize data and instruction cache, then the whole algorithm is pointless since it is then almost certainly far worse than the previously mentioned for loop, at least on all high-end CPUs. Even mid- to low-end CPUs with no cache have instruction pipelining and might benefit from having no branches.

    In particular, the checks against NULL are just pointless bloat that shouldn't exist in a library function. They should be carried out by the caller, if they are needed at all.

  • unsigned int is not guaranteed to be the fastest aligned type supported - it is true on 16 to 32 bit systems but not on 64 bit systems. The correct type to use in this case for maximum portability is uint_fast16_t. That one covers all systems with alignment requirements from 16 to 64 bit systems.

    However, 16 bit systems with alignment requirements are somewhat rare. I think a few oddball ones have them, but many 16 bitters don't, so consider whether you actually need portability to those that do. Otherwise uint_fast32_t would work. Similarly, portability to 64 bitters might be irrelevant if you only target embedded systems.

    IMPORTANT: Most 8 and 16 bit systems have no alignment requirements and there's no obvious benefit of doing 16 bit word-sized copies on 8 bit systems. On such systems, these kinds of "aligned chunk" memcpy implementations are just slow bloat - a plain byte-by-byte for loop would be much faster there. If you mean to target such systems, you will have to roll out a completely different implementation.

  • Avoid / and % unless you have reason to believe that the compiler can optimize them away. Division is very CPU-intensive, particularly on low-end systems. In this case you shouldn't need it: the "number of chunks" calculation isn't adding any meaningful information to the algorithm. This is a typical case of "write code so that the programmer understands what's going on", which is normally a good thing, but not when writing library quality code. Instead simply check for the end address when iterating.

  • copy is a bad function name since various library functions with that name have existed over the years.

  • Don't include library headers that you don't use.
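
Several of the points above (restrict, static inline, byte copies at both the beginning and the end, a fast word type, end-address iteration instead of division) could be combined roughly as follows. This is a sketch under stated assumptions, not a tuned implementation: the names word_t and copy_aligned are hypothetical, uintptr_t is assumed to exist, the word size is assumed to be a power of two equal to its alignment, and the may_alias attribute is a GCC/Clang extension used here instead of compiling with -fno-strict-aliasing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word type. may_alias is a GCC/Clang extension (assumption)
   that lets this type alias other objects despite strict aliasing. */
typedef uint_fast32_t __attribute__((may_alias)) word_t;

static inline void *copy_aligned(void * restrict dst,
                                 const void * restrict src,
                                 size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char * const end = d + n;              /* iterate towards this address */
    const uintptr_t mask = sizeof(word_t) - 1u;     /* assumes a power-of-two size  */

    /* Word copies are only possible if both pointers are misaligned
       by the same amount; otherwise fall through to the byte loop. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & mask) == 0) {
        /* Beginning: copy bytes until the destination is word-aligned. */
        while (d != end && ((uintptr_t)d & mask) != 0) {
            *d++ = *s++;
        }
        /* Middle: copy whole words while at least one full word remains. */
        while ((size_t)(end - d) >= sizeof(word_t)) {
            *(word_t *)d = *(const word_t *)s;
            d += sizeof(word_t);
            s += sizeof(word_t);
        }
    }

    /* End (or everything, if the alignments differ): copy remaining bytes. */
    while (d != end) {
        *d++ = *s++;
    }
    return dst;
}
```

There are no NULL checks and no division or modulo; the mask replaces % and the loops compare against the end address. Whether this actually beats the plain byte loop or the standard memcpy on a given target is exactly what the benchmark and disassembly should answer.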

