* Add support for arm64 using NEON instructions
Specifically, this uses the PMULL/PMULL2 polynomial multiplication instructions, followed by a two-step reduction.
* Add ARM performance numbers
* Format the performance table
* Refactor the NEON version and the 256-bit wide version
* Expand test slice beyond 32 bytes (for AVX2 and NEON) and test galMulSliceXor explicitly.
* Fix ARM code that was missing a function.
* Fix missing newline
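
For reference, here is a minimal pure-Go sketch of the multiply-then-reduce idea behind the NEON path, plus a scalar version of the XOR variant that the expanded test targets. It is not the assembly from this change. `clmul8`, `reduce`, `galMul` and `refGalMulSliceXor` are hypothetical names, the field polynomial 0x11D is an assumption, and the bitwise reduction loop stands in for the two-step PMULL-based reduction.

```go
package main

import "fmt"

// poly is the assumed GF(2^8) field polynomial x^8+x^4+x^3+x^2+1.
const poly uint16 = 0x11D

// clmul8 is a software stand-in for an 8-bit carry-less (polynomial)
// multiply, which is what PMULL computes per lane. The result has up to 15 bits.
func clmul8(a, b byte) uint16 {
	var p uint16
	for i := uint(0); i < 8; i++ {
		if b&(1<<i) != 0 {
			p ^= uint16(a) << i
		}
	}
	return p
}

// reduce folds the 15-bit product back into 8 bits modulo poly.
// This is a plain bitwise reduction, not the two-step method.
func reduce(p uint16) byte {
	for bit := uint(14); bit >= 8; bit-- {
		if p&(1<<bit) != 0 {
			p ^= poly << (bit - 8)
		}
	}
	return byte(p)
}

// galMul multiplies two field elements: polynomial multiply, then reduce.
func galMul(a, b byte) byte {
	return reduce(clmul8(a, b))
}

// refGalMulSliceXor is a scalar reference: out[i] ^= c * in[i].
// It assumes len(out) >= len(in).
func refGalMulSliceXor(c byte, in, out []byte) {
	for i := range in {
		out[i] ^= galMul(c, in[i])
	}
}

func main() {
	// 33 bytes, so one element falls past a 32-byte vector width.
	in := make([]byte, 33)
	out := make([]byte, 33)
	for i := range in {
		in[i] = byte(i)
		out[i] = 0xFF
	}
	refGalMulSliceXor(2, in, out)
	fmt.Printf("%#02x %#02x\n", galMul(0x80, 2), out[32])
}
```

A scalar reference like this can be compared byte for byte against the vectorized output for lengths that are not a multiple of the vector width.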