SSE4.2 is a subset of SSE4 which is a SIMD CPU instruction set. SSE4.2 first become available in Nehalem-based Core i7. Apart from string and text instructions SSE4.2 has also introduced CRC32 instructions to compute cyclic redundancy checks which is used for detecting errors in data transfer protocols.

SSE4.2 provides access to following intrinisics which can be used for CRC calculation:-

- unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
- unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v)
- unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v)
- unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)

For now let us focus on _mm_crc32_u8 as writing portable implementation for it enables us to use it for implementing other 3 functions.

From the Intel Intrinisics Guide we can see the pseudocode for _mm_crc32_u32 as given below.

We can observe from here that the polynomial is different from the more common CRC-IEEE implementation. So, the CRC implemented in SSE4.2 instruction set actually uses Castagnoli polynomial and thus it is known as crc32c (‘c’ here denotes the polynomial). CRC-32C is used by formats like iSCSI, SCTP, G.hn payload, SSE4.2, Btrfs, ext4, Ceph etc.

Now before proceeding any further, lets look at some basics about correct way of implementing CRC methods. We will look at formulaic implementation here as it is simpler to understand than table lookup method.

There are 2 forms of CRC:-

a) Normal Polynomial form (CRC bits are shifted left)

b) Reflected Polynomial form (CRC bits are shifted right and reverse polynomial is used)

Apart from this the results of both forms are swapped if we reflect both input bytes during processing and calculated CRC.

For CRC-32C we have to either use :-

a) Normal form and reflect each byte and calculated CRC in the end and use normal form Castagnoli polynomial (0x1EDC6F41).

b) Reflected Polynomial form in which we don’t have to reflect input or output CRC bits and use reverse form Castagnoli polynomial (0x82F63B78).

Using the second method we get a simpler implementation.

One more thing to notice is that SSE4.2 CRC intrinisics can be used as helper functions for generating CRC-32C for strings but they are different in one aspect.

A formulaic CRC implementation looks something like

```
f(prevcrc,char)
{
currcrc = ~prevcrc
<LOGIC>
return ~currcrc
}
```

But SSE4.2 crc intrinisics actually computes the “LOGIC” part and not the function f.

So for calculating CRC-32C for a single byte we can do something like

```
f(prevcrc,char)
{
currcrc = ~prevcrc
currcrc = _mm_crc32_u8(currcrc,char)
return ~currcrc
}
```

After all these observations, we arrive at following implementation for simde_mm_crc32_u8 :-

SIMDE_FUNCTION_ATTRIBUTES uint32_t simde_mm_crc32_u8(uint32_t prevcrc, uint8_t v) { #if defined(SIMDE_X86_SSE4_2_NATIVE) return _mm_crc32_u8(prevcrc, v); #else uint32_t crc = prevcrc; crc ^= v; for(int bit = 0 ; bit < 8 ; bit++) { if (crc & 1) crc = (crc >> 1) ^ UINT32_C(0x82f63b78); else crc = (crc >> 1); } return crc; #endif } #if defined(SIMDE_X86_SSE4_2_ENABLE_NATIVE_ALIASES) #define _mm_crc32_u8(prevcrc, v) simde_mm_crc32_u8(prevcrc, v) #endif

Other CRC functions can be computed by calling _mm_crc32_u8. For example simde_mm_crc32_u16 can be implemented as:-

SIMDE_FUNCTION_ATTRIBUTES uint32_t simde_mm_crc32_u16(uint32_t prevcrc, uint16_t v) { #if defined(SIMDE_X86_SSE4_2_NATIVE) return _mm_crc32_u16(prevcrc, v); #else uint32_t crc = prevcrc; crc = simde_mm_crc32_u8(crc, v & 0xff); crc = simde_mm_crc32_u8(crc, (v >> 8) & 0xff); return crc; #endif } #if defined(SIMDE_X86_SSE4_2_ENABLE_NATIVE_ALIASES) #define _mm_crc32_u16(prevcrc, v) simde_mm_crc32_u16(prevcrc, v) #endif

Similarly, _mm_crc32_u32 and _mm_crc32_u64 can be implemented whose implementation can be found here.

That’s it from my side. I wish to write more such articles in future.

Thank You

**References**:-

https://github.com/nemequ/simde

https://en.wikipedia.org/wiki/SSE4

https://github.com/Michaelangel007/crc32/blob/master/README.md