Optimisations for the single-table region multiplication and for carry-less
multiplication, using NEON's polynomial multiplication of 8-bit values.
The single polynomial multiplication is not that useful on its own, but the
vector version is useful for region multiplication.
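
For readers unfamiliar with the two techniques, here is a minimal scalar
sketch in portable C. It is not the NEON code from this change. clmul8()
stands in for one lane of a polynomial multiply such as vmull_p8, and
gf4_region_mul() shows the single-table idea for w=4, where each byte holds
two 4-bit symbols. All function names are hypothetical, and the polynomial
x^4 + x + 1 (0x13) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed primitive polynomial for GF(2^4): x^4 + x + 1. */
#define GF4_POLY 0x13u

/* Carry-less (polynomial) multiply of two 8-bit values over GF(2).
 * Scalar stand-in for what one lane of an 8-bit polynomial multiply yields. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);  /* XOR instead of add: no carries */
    return r;
}

/* Multiply two 4-bit symbols in GF(2^4): carry-less product, then reduce. */
static uint8_t gf4_mul(uint8_t a, uint8_t b)
{
    assert(a < 16 && b < 16);
    uint16_t p = clmul8(a, b);                  /* product has degree <= 6 */
    for (int bit = 6; bit >= 4; bit--)
        if (p & (1u << bit))
            p ^= (uint16_t)(GF4_POLY << (bit - 4));
    return (uint8_t)p;
}

/* Multiply every 4-bit symbol in src by the constant c, writing to dst.
 * A 16-entry table is built once per call; hi[] is the same table
 * pre-shifted so the two nibble results can simply be ORed together. */
static void gf4_region_mul(const uint8_t *src, uint8_t *dst, size_t n, uint8_t c)
{
    uint8_t lo[16], hi[16];
    for (int v = 0; v < 16; v++) {
        lo[v] = gf4_mul(c, (uint8_t)v);
        hi[v] = (uint8_t)(lo[v] << 4);
    }
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(hi[src[i] >> 4] | lo[src[i] & 0x0f]);
}
```

A vector version would presumably replace the per-byte lookups with a
16-byte table-lookup instruction and the inner loop of clmul8() with a
single polynomial multiply per lane, which is where the region speedup
would come from.
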
Selected time_tool.sh results for a 1.7 GHz Cortex-A9:
Region Best (MB/s): 672.72 W-Method: 4 -m CARRY_FREE -
Region Best (MB/s): 265.84 W-Method: 4 -m BYTWO_p -
Region Best (MB/s): 329.41 W-Method: 4 -m TABLE -r DOUBLE -
Region Best (MB/s): 278.63 W-Method: 4 -m TABLE -r QUAD -
Region Best (MB/s): 329.81 W-Method: 4 -m TABLE -r QUAD -r LAZY -
Region Best (MB/s): 1318.03 W-Method: 4 -m TABLE -r SIMD -
Region Best (MB/s): 165.15 W-Method: 4 -m TABLE -r NOSIMD -
Region Best (MB/s): 99.73 W-Method: 4 -m LOG -