Optimisations for the 4,4 split table region multiplication and for
carry-less multiplication using NEON's polynomial multiply long (VMULL.P8).
arm: w8: NEON carry-less multiplication
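
For readers unfamiliar with the two techniques named above, here is a minimal
portable C sketch of what they compute for w=8. It is not the NEON code from
this change. The names clmul8, gf8_reduce, gf8_mul and
gf8_region_mul_split44 are hypothetical, and the polynomial 0x11d is an
assumption about the w=8 field in use. clmul8 is a scalar stand-in for what
one lane of a polynomial multiply long produces; the split 4,4 routine shows
the two 16-entry lookups that a SIMD table instruction can perform on many
bytes at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed primitive polynomial for GF(2^8): x^8 + x^4 + x^3 + x^2 + 1. */
#define GF8_POLY 0x11d

/* Carry-less 8x8 -> 16 bit multiply: shift-and-XOR instead of shift-and-add.
 * Scalar stand-in for a single lane of a polynomial multiply long. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* Reduce a product of degree at most 14 modulo GF8_POLY. */
static uint8_t gf8_reduce(uint16_t p)
{
    for (int i = 14; i >= 8; i--)
        if (p & (1u << i))
            p ^= (uint16_t)(GF8_POLY << (i - 8));
    return (uint8_t)p;
}

/* Field multiplication: carry-less multiply followed by reduction. */
static uint8_t gf8_mul(uint8_t a, uint8_t b)
{
    return gf8_reduce(clmul8(a, b));
}

/* Split 4,4 region multiply: dst[k] = c * src[k] for n bytes.
 * Two 16-entry tables hold c times every low nibble and every high nibble;
 * each source byte then costs two lookups and one XOR. */
static void gf8_region_mul_split44(uint8_t *dst, const uint8_t *src,
                                   size_t n, uint8_t c)
{
    uint8_t lo[16], hi[16];
    for (int i = 0; i < 16; i++) {
        lo[i] = gf8_mul(c, (uint8_t)i);
        hi[i] = gf8_mul(c, (uint8_t)(i << 4));
    }
    for (size_t k = 0; k < n; k++)
        dst[k] = (uint8_t)(lo[src[k] & 0x0f] ^ hi[src[k] >> 4]);
}
```

The split works because multiplication by a constant is linear over XOR, so
c * (hi_nibble ^ lo_nibble) equals (c * hi_nibble) ^ (c * lo_nibble).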
Selected time_tool.sh results for a 1.7 GHz Cortex-A9:
Region Best (MB/s): 375.86 W-Method: 8 -m CARRY_FREE -
Region Best (MB/s): 142.94 W-Method: 8 -m TABLE -
Region Best (MB/s): 225.01 W-Method: 8 -m TABLE -r DOUBLE -
Region Best (MB/s): 211.23 W-Method: 8 -m TABLE -r DOUBLE -r LAZY -
Region Best (MB/s): 160.09 W-Method: 8 -m LOG -
Region Best (MB/s): 123.61 W-Method: 8 -m LOG_ZERO -
Region Best (MB/s): 123.85 W-Method: 8 -m LOG_ZERO_EXT -
Region Best (MB/s): 1183.79 W-Method: 8 -m SPLIT 8 4 -r SIMD -
Region Best (MB/s): 177.68 W-Method: 8 -m SPLIT 8 4 -r NOSIMD -
Region Best (MB/s): 87.85 W-Method: 8 -m COMPOSITE 2 - -
Region Best (MB/s): 428.59 W-Method: 8 -m COMPOSITE 2 - -r ALTMAP -