One of my contributions to Postgres 9.5 (back in 2015) was a two-stage optimisation of the CRC computation code. First, switching to a faster algorithm; and second, to use the Intel SSE4.2 CRC instructions where available. I was delighted to have the opportunity to implement such a dramatic performance improvement (CRC computation used to be at the top of the profile on every streaming replica by some distance).
Optimising something by writing assembly (even if it was only a couple of instructions, later replaced by compiler intrinsics) is always fun, but here the algorithm change was also a substantial improvement, in that it used a lookup table to process eight input bytes at a time. This technique is known as “slicing-by-N” (where N depends on the size of the lookup table), and was originally described here:
Frank L. Berry, Michael E. Kounavis, "Novel Table Lookup-Based Algorithms for High-Performance CRC Generation", IEEE Transactions on Computers, vol. 57, no. , pp. 1550-1560, November 2008, doi:10.1109/TC.2008.85
This paper, having been published in a prestigious IEEE journal, is of course not easily available for download (not when I looked in 2015, and apparently not today). I was able to find what I needed to implement the technique thanks to other reference materials, notably including Stephan Brumme's Fast CRC32 page (now considerably expanded since 2015), but I never actually got to read what Messrs. Kounavis and Berry had to say about their technique.
Recently, I had occasion to look at CRC32 implementations again, and I found a different paper that I had looked at briefly the last time: Cyclic Redundancy Check Generation Using Multiple Lookup Table Algorithms by Indu I. and Manu T.S. from TKM Institute of Technology, Kollam, in Kerala (my mother's home state in South India). I remember noting that there was something odd about the paper, but not having time enough to give it more than a passing glance. This time, I spent a while reading it, and it's certainly very odd.
ABSTRACT: The primary goal of this paper is to generate cyclic redundancy check (CRC) using multiple lookup table algorithms. A compact architecture of CRC algorithm (Slicing-by-N algorithm) based on multiple lookup tables (LUT) approach is proposed. This algorithm can ideally read large amounts of data at a time, while optimizing their memory requirement to meet the constraints of specific computer architectures. The focus of this paper is the comparison of two algorithms. These two algorithms are Slicing by-N-algorithm and Sarwate algorithm, in which slicing by-N-algorithm can read arbitrarily 512 bits at a time, but Sarwate algorithm, which can read only 8 bits at a time. This paper proposes the generation of CRC using slicing by 8 algorithm. In this, message bits are chunked to 8 blocks. All are processed at a time. Proposed Slicing-by-8 algorithm can read 64 bits of input data at a time and it doubles the performance of existing implementations of Sarwate algorithm.
Is this paper claiming to have invented the slicing-by-N algorithm? It's hard to tell from the blather in the abstract, but going through the remaining blather (and effort that, in retrospect, I cannot in good conscience commend to the reader) suggests that this is indeed the case.
Recently time is the major concern. So in order to process large amount of data at a time, Multiple Lookup based approach is more efficient. Multiple Lookup based approach contains five CRC algorithms, called Slicing by-N algorithm (N ϵ 4, 8, 16, 32, 64), which is used to read up to 512 bits at a time. So performance of the system should be increased. Here proposing Slicing by-8 algorithm to read 64 bits at a time. Here proposed an efficient design of CRC generator using Slicing by-N algorithm (N=8). In this algorithm, input message stream is sliced into N slices and each slice has 8 bits. So using this Slicing by-8 algorithm, it can read 64 bits at a time and it triples the performance of existing implementation of Sarwate algorithm.
Oho, so it triples the performance of existing implementations of the Sarwate algorithm, does it? Funny the abstract claims a paltry doubling in performance then. The paper goes on to describe CRC computation with block diagrams, and then has some more blather about VHDL and MATLAB and some screenshots of “simulation waveforms”, all of which seems to amount to showing that the various CRC algorithms produce the same results and that processing more input bytes at a time is faster than not doing so. I made judicious use of the fast-forward button to reach the conclusion, which begins with
The design of CRC generator using Multiple Look Up based approach is proposed. In this paper, slicing by-8 algorithm is designed, and compares this algorithm with the existing algorithms, that is, with Sarwate algorithm and LFSR method.
So yeah, they're claiming in a slightly roundabout way to have invented the slicing-by-8 CRC algorithm. However, the authors cite the Kounavis and Berry paper anyway, perhaps so that any criticism can be blamed on some sort of unfortunate misunderstanding. I didn't find any citations of this paper in the minute or two I spent on the search, but Citeseer and Researchgate link to it, and it's quite prominent in search results, so it's probably only a matter of time before someone cites it.
The paper was published in "International Journal of Modern Engineering Research” (ijmer.com) in 2012; the journal's name alone reminded me of the scammers, Academic Journals Online, whom I encountered a few years ago. IJMER does not, however, appear to be one of the AJO journals. Perhaps it's a second cousin.
Unfortunately, the authors include no contact information in the paper, so I was not able to send them a link to this page.