Frank Wessels
2f8e50e65c
Better test coverage for AVX512 (parallel version) ( #134 )
2020-05-07 09:28:23 +02:00
Frank Wessels
1b9e129671
Avx512 parallel81 ( #131 )
* AVX512 routine for 8x1 parallel processing (WIP)
* Testing and integration of Parallel81 assembly routine
2020-05-06 12:32:31 +02:00
Klaus Post
de70cc155f
AVX512 parallel processing ( #120 )
Do concurrent processing in AVX512 mode and split jobs by cache size.
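A minimal sketch of the idea, not the code from this commit: the input is cut into ranges no larger than a cache budget, and each range is handled by its own goroutine. The names `splitByCache`, `processConcurrently`, and `xorInto`, and the 256 KiB budget, are assumptions made for illustration; a plain XOR stands in for the real AVX512 kernel.

```go
package main

import (
	"fmt"
	"sync"
)

// splitByCache divides total bytes into [start, end) ranges of at most
// chunk bytes each. chunk is a hypothetical cache budget, not a value
// taken from the library.
func splitByCache(total, chunk int) [][2]int {
	if chunk <= 0 {
		chunk = total
	}
	var jobs [][2]int
	for start := 0; start < total; start += chunk {
		end := start + chunk
		if end > total {
			end = total
		}
		jobs = append(jobs, [2]int{start, end})
	}
	return jobs
}

// xorInto is a stand-in for the real encode kernel.
func xorInto(dst, src []byte) {
	for i := range src {
		dst[i] ^= src[i]
	}
}

// processConcurrently runs one goroutine per cache-sized range.
// The ranges do not overlap, so no locking is needed on dst.
func processConcurrently(dst, src []byte, chunk int) {
	var wg sync.WaitGroup
	for _, j := range splitByCache(len(src), chunk) {
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			xorInto(dst[start:end], src[start:end])
		}(j[0], j[1])
	}
	wg.Wait()
}

func main() {
	src := make([]byte, 1<<20)
	for i := range src {
		src[i] = byte(i)
	}
	dst := make([]byte, len(src))
	processConcurrently(dst, src, 256<<10)
	fmt.Println(len(splitByCache(len(src), 256<<10))) // prints 4
}
```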
2020-05-04 09:17:40 +02:00
Klaus Post
65df535980
Make single goroutine encodes more efficient ( #122 )
Calculate the optimal per-round size to keep data in cache when not using WithAutoGoroutines.
```
λ benchcmp before.txt after.txt
benchmark                          old ns/op     new ns/op     delta
BenchmarkParallel_8x8x05M-16       675225        321053        -52.45%
BenchmarkParallel_20x10x05M-16     3471988       600740        -82.70%
BenchmarkParallel_8x8x1M-16        3948606       728093        -81.56%
BenchmarkParallel_8x8x8M-16        47361588      5976467       -87.38%
BenchmarkParallel_8x8x32M-16       195044200     24365474      -87.51%

benchmark                          old MB/s     new MB/s     speedup
BenchmarkParallel_8x8x05M-16       6211.71      13064.22     2.10x
BenchmarkParallel_20x10x05M-16     3020.10      17454.73     5.78x
BenchmarkParallel_8x8x1M-16        2124.45      11521.34     5.42x
BenchmarkParallel_8x8x8M-16        1416.95      11228.85     7.92x
BenchmarkParallel_8x8x32M-16       1376.28      11017.04     8.00x
```
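One way to picture a per-round size, as a rough sketch under stated assumptions rather than the library's actual calculation: divide a cache budget by the number of shards touched per round, round down to a multiple of 64 bytes, then walk all shards window by window in a single goroutine. `perRoundSize`, `encodeInRounds`, and the budget values are hypothetical, and XOR again stands in for the real kernel.

```go
package main

import "fmt"

// perRoundSize returns how many bytes of each shard to process per round
// so that one window of every shard fits in cacheBytes. The 64-byte
// rounding and the 64-byte floor are assumptions for this sketch.
func perRoundSize(cacheBytes, shards int) int {
	if shards <= 0 {
		return 0
	}
	per := cacheBytes / shards
	per &^= 63 // round down to a multiple of 64
	if per < 64 {
		per = 64
	}
	return per
}

// encodeInRounds XORs every data shard into parity one window at a time
// and returns the number of rounds taken. All shards are assumed to have
// the same length as parity.
func encodeInRounds(data [][]byte, parity []byte, per int) int {
	size := len(parity)
	if per <= 0 {
		per = size
	}
	rounds := 0
	for start := 0; start < size; start += per {
		end := start + per
		if end > size {
			end = size
		}
		// Finish this window for every shard before moving on,
		// so the window stays hot in cache.
		for _, d := range data {
			for i := start; i < end; i++ {
				parity[i] ^= d[i]
			}
		}
		rounds++
	}
	return rounds
}

func main() {
	data := [][]byte{make([]byte, 1000), make([]byte, 1000), make([]byte, 1000)}
	parity := make([]byte, 1000)
	per := perRoundSize(1024, len(data)+1)
	fmt.Println(per, encodeInRounds(data, parity, per)) // prints 256 4
}
```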
2020-05-03 19:37:22 +02:00
Frank Wessels
0b98f5350a
Refactor AVX512 code to use Go assembly instructions. ( #121 )
Additionally, there is a small performance improvement from using VPTERNLOGD (instead of two VPXORD instructions).
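To illustrate why one VPTERNLOGD can replace two VPXORD instructions, the sketch below emulates the instruction's per-bit truth-table lookup in plain Go. With the immediate 0x96 the table is a three-way XOR. The helper `ternlog32` is hypothetical, and the commit message does not state which immediate the assembly actually uses.

```go
package main

import "fmt"

// ternlog32 emulates the per-bit behaviour of a ternary-logic instruction:
// for each bit position, the bits of a, b and c form a 3-bit index
// (a is the most significant), and the result bit is imm's bit at that index.
func ternlog32(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for bit := uint(0); bit < 32; bit++ {
		idx := ((a>>bit)&1)<<2 | ((b>>bit)&1)<<1 | (c>>bit)&1
		out |= uint32(imm>>idx&1) << bit
	}
	return out
}

func main() {
	a, b, c := uint32(0xDEADBEEF), uint32(0x12345678), uint32(0x0F0F0F0F)
	// 0x96 = 0b10010110 is the truth table of a XOR b XOR c.
	fmt.Println(ternlog32(a, b, c, 0x96) == a^b^c) // prints true
}
```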
2020-05-03 13:43:52 +02:00
Klaus Post
d2cfcb8065
Add commandline arg to disable asm for tests. ( #116 )
* Add commandline test args
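A sketch of how a command-line switch could select the pure Go path, assuming nothing about the commit's real flag names: `-disable-asm`, `options`, and `parseOptions` are all made up for this example, and a `flag.FlagSet` is used so the parsing can be exercised without touching the process arguments.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the hypothetical setting derived from the command line.
type options struct {
	useAsm bool
}

// parseOptions parses args with a private FlagSet. The flag name
// "disable-asm" is an assumption, not the name used in the repository.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	disable := fs.Bool("disable-asm", false, "force the pure Go code path (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return options{useAsm: !*disable}, nil
}

func main() {
	opts, err := parseOptions([]string{"-disable-asm"})
	if err != nil {
		panic(err)
	}
	fmt.Println(opts.useAsm) // prints false
}
```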
2020-04-22 15:38:21 +02:00
Klaus Post
101092fa3b
Make AVX512 short tests ( #114 )
Tests are timing out. Use shorter tests for -short.
2020-01-18 14:50:31 +01:00
Frank Wessels
79aee05119
AVX512 accelerated version resulting in a 4x speed improvement over AVX2 ( #91 )
Encoding with AVX512 has been accelerated on Intel CPUs. This gives per-core speedups of up to 4x compared to AVX2, as can be seen in the following table:
```
$ benchcmp avx2.txt avx512.txt
benchmark                      AVX2 MB/s    AVX512 MB/s    speedup
BenchmarkEncode8x8x1M-72       1681.35      4125.64        2.45x
BenchmarkEncode8x4x8M-72       1529.36      5507.97        3.60x
BenchmarkEncode8x8x8M-72       791.16       2952.29        3.73x
BenchmarkEncode8x8x32M-72      573.26       2168.61        3.78x
BenchmarkEncode12x4x12M-72     1234.41      4912.37        3.98x
BenchmarkEncode16x4x16M-72     1189.59      5138.01        4.32x
BenchmarkEncode24x8x24M-72     690.68       2583.70        3.74x
BenchmarkEncode24x8x48M-72     674.20       2643.31        3.92x
```
2019-02-10 11:17:23 +01:00