One of my contributions to Postgres 9.5 (back in 2015) was a two-stage
optimisation of the CRC computation code. First, switching to a faster
algorithm; and second, to use the Intel SSE4.2 CRC instructions where
available. I was delighted to have the opportunity to implement such a
dramatic performance improvement (CRC computation used to be at the top
of the profile on every streaming replica by some distance).
Optimising something by writing assembly (even if it was only a couple
of instructions, later replaced by compiler intrinsics) is always fun,
but here the algorithm change was also a substantial improvement, in
that it used a lookup table to process eight input bytes at a time. This
technique is known as “slicing-by-N” (where N depends on the size of the
lookup table), and was originally described here:
Frank L. Berry, Michael E. Kounavis, "Novel Table Lookup-Based
Algorithms for High-Performance CRC Generation", IEEE Transactions on
Computers, vol. 57, no. , pp. 1550-1560, November 2008,
doi:10.1109/TC.2008.85
This paper, having been published in a prestigious IEEE journal, is of
course not easily available for download (not when I looked in 2015, and
apparently not today). I was able to find what I needed to implement the
technique thanks to other reference materials, notably including
Stephan Brumme's Fast CRC32 page
(now considerably expanded since 2015), but I never actually got to read
what Messrs. Berry and Kounavis had to say about their technique.
Recently, I had occasion to look at CRC32 implementations again, and I
found a different paper that I had looked at briefly the last
time:
Cyclic Redundancy Check Generation Using Multiple Lookup Table Algorithms
by Indu I. and Manu T.S. from TKM Institute of Technology, Kollam, in
Kerala (my mother's home state in South India). I remember noting that
there was something odd about the paper, but not having time enough to
give it more than a passing glance. This time, I spent a while reading
it, and it's certainly very odd.
ABSTRACT: The primary goal of this paper is to generate cyclic
redundancy check (CRC) using multiple lookup table algorithms. A compact
architecture of CRC algorithm (Slicing-by-N algorithm) based on multiple
lookup tables (LUT) approach is proposed. This algorithm can ideally
read large amounts of data at a time, while optimizing their memory
requirement to meet the constraints of specific computer architectures.
The focus of this paper is the comparison of two algorithms. These two
algorithms are Slicing by-N-algorithm and Sarwate algorithm, in which
slicing by-N-algorithm can read arbitrarily 512 bits at a time, but
Sarwate algorithm, which can read only 8 bits at a time. This paper
proposes the generation of CRC using slicing by 8 algorithm. In this,
message bits are chunked to 8 blocks. All are processed at a time.
Proposed Slicing-by-8 algorithm can read 64 bits of input data at a time
and it doubles the performance of existing implementations of Sarwate
algorithm.
Is this paper claiming to have invented the slicing-by-N
algorithm? It's hard to tell from the blather in the abstract, but going
through the remaining blather (an effort that, in retrospect, I cannot
in good conscience commend to the reader) suggests that this is indeed
the case.
Recently time is the major concern. So in order to
process large amount of data at a time, Multiple Lookup
based approach is more efficient. Multiple Lookup based
approach contains five CRC algorithms, called Slicing by-N
algorithm (N ϵ 4, 8, 16, 32, 64), which is used to read up to
512 bits at a time. So performance of the system should be
increased. Here proposing Slicing by-8 algorithm to read 64
bits at a time. Here proposed an efficient design of CRC
generator using Slicing by-N algorithm (N=8). In this
algorithm, input message stream is sliced into N slices and
each slice has 8 bits. So using this Slicing by-8 algorithm, it
can read 64 bits at a time and it triples the performance of
existing implementation of Sarwate algorithm.
Oho, so it triples the performance of existing implementations of the
Sarwate algorithm, does it? Funny the abstract claims a paltry doubling
in performance then. The paper goes on to describe CRC computation with
block diagrams, and then has some more blather about VHDL and MATLAB and
some screenshots of “simulation waveforms”, all of which seems to amount
to showing that the various CRC algorithms produce the same results and
that processing more input bytes at a time is faster than not doing so.
I made judicious use of the fast-forward button to reach the conclusion,
which begins with
The design of CRC generator using Multiple Look Up based approach is
proposed. In this paper, slicing by-8 algorithm is designed, and
compares this algorithm with the existing algorithms, that is, with
Sarwate algorithm and LFSR method.
So yeah, they're claiming in a slightly roundabout way to have invented
the slicing-by-8 CRC algorithm. However, the authors cite the Kounavis
and Berry paper anyway, perhaps so that any criticism can be blamed on
some sort of unfortunate misunderstanding. I didn't find any citations
of this paper in the minute or two I spent on the search, but Citeseer
and Researchgate link to it, and it's quite prominent in search results,
so it's probably only a matter of time before someone cites it.
The paper was published in "International Journal of Modern Engineering
Research” (ijmer.com) in 2012; the journal's name alone reminded me of
the scammers, Academic Journals Online,
whom I encountered a few years ago. IJMER does not, however, appear to
be one of the AJO journals. Perhaps it's a second cousin.
Unfortunately, the authors include no contact information in the paper,
so I was not able to send them a link to this page.