Klaus Post
daf81ef0bd
Use VPTERNLOGD on GOAMD64=v4 ( #182 )
...
* Use VPTERNLOGD on GOAMD64=v4
* Bump to Go 1.18
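For readers unfamiliar with VPTERNLOGD: it evaluates an arbitrary three-input boolean function per bit, selected by an 8-bit truth table. The sketch below emulates one 32-bit lane in plain Go and checks that the immediate `0x96` yields a three-way XOR. It assumes the commit uses the instruction to fold two XORs into one, which the message does not state; `ternlog32` is a hypothetical helper, not part of the library.

```go
package main

import "fmt"

// ternlog32 emulates one 32-bit lane of VPTERNLOGD. For every bit position
// the bits of a, b and c form a 3-bit index (a is the high bit) into the
// 8-entry truth table imm; the selected table bit becomes the result bit.
func ternlog32(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for i := 0; i < 32; i++ {
		idx := (a>>i&1)<<2 | (b>>i&1)<<1 | (c >> i & 1)
		out |= uint32(imm>>idx&1) << i
	}
	return out
}

func main() {
	a, b, c := uint32(0xdeadbeef), uint32(0x01234567), uint32(0x89abcdef)
	// 0x96 has bits 1, 2, 4 and 7 set: the indexes with odd parity,
	// which is exactly a XOR b XOR c.
	fmt.Println(ternlog32(a, b, c, 0x96) == a^b^c)
}
```

With `GOAMD64=v4` the toolchain may assume AVX-512 is available, so a single instruction can replace two dependent XORs in the inner loop.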
2022-03-16 11:10:29 +01:00
Klaus Post
1bb4d699e1
avx2: Improve speed when > 10 input or output shards. ( #174 )
...
Speeds include limiting the number of goroutines on all AVX2 paths.
Before/after:
```
benchmark old ns/op new ns/op delta
BenchmarkGalois128K-32 2240 2240 +0.00%
BenchmarkGalois1M-32 19578 18891 -3.51%
BenchmarkGaloisXor128K-32 2798 2852 +1.93%
BenchmarkGaloisXor1M-32 23334 23345 +0.05%
BenchmarkEncode2x1x1M-32 34357 34370 +0.04%
BenchmarkEncode10x2x10000-32 3210 3093 -3.64%
BenchmarkEncode100x20x10000-32 362925 148214 -59.16%
BenchmarkEncode17x3x1M-32 323767 224157 -30.77%
BenchmarkEncode10x4x16M-32 8376895 8376737 -0.00%
BenchmarkEncode5x2x1M-32 68365 66861 -2.20%
BenchmarkEncode10x2x1M-32 101407 93023 -8.27%
BenchmarkEncode10x4x1M-32 171880 155477 -9.54%
BenchmarkEncode50x20x1M-32 3704691 3015047 -18.62%
BenchmarkEncode17x3x16M-32 10279233 10106658 -1.68%
BenchmarkEncode_8x4x8M-32 3438245 3326479 -3.25%
BenchmarkEncode_12x4x12M-32 6632257 6581637 -0.76%
BenchmarkEncode_16x4x16M-32 10815755 10788377 -0.25%
BenchmarkEncode_16x4x32M-32 21029061 21507995 +2.28%
BenchmarkEncode_16x4x64M-32 42145450 43876850 +4.11%
BenchmarkEncode_8x5x8M-32 4543208 3846378 -15.34%
BenchmarkEncode_8x6x8M-32 5065494 4397218 -13.19%
BenchmarkEncode_8x7x8M-32 5818995 4962884 -14.71%
BenchmarkEncode_8x9x8M-32 6215449 6114898 -1.62%
BenchmarkEncode_8x10x8M-32 6923415 6610501 -4.52%
BenchmarkEncode_8x11x8M-32 7365988 7010473 -4.83%
BenchmarkEncode_8x8x05M-32 150857 136820 -9.30%
BenchmarkEncode_8x8x1M-32 256722 254854 -0.73%
BenchmarkEncode_8x8x8M-32 5547790 5422048 -2.27%
BenchmarkEncode_8x8x32M-32 23038643 22705859 -1.44%
BenchmarkEncode_24x8x24M-32 27729259 30332216 +9.39%
BenchmarkEncode_24x8x48M-32 53865705 61187658 +13.59%
BenchmarkVerify10x2x10000-32 8769 8154 -7.01%
BenchmarkVerify10x2x1M-32 516149 476180 -7.74%
BenchmarkVerify5x2x1M-32 443888 419541 -5.48%
BenchmarkVerify10x4x1M-32 1030299 948021 -7.99%
BenchmarkVerify50x20x1M-32 7209689 6186891 -14.19%
BenchmarkVerify10x4x16M-32 17774456 17681879 -0.52%
BenchmarkReconstruct10x2x10000-32 3352 3256 -2.86%
BenchmarkReconstruct50x5x50000-32 166417 140900 -15.33%
BenchmarkReconstruct10x2x1M-32 189711 174615 -7.96%
BenchmarkReconstruct5x2x1M-32 128080 126520 -1.22%
BenchmarkReconstruct10x4x1M-32 273312 254017 -7.06%
BenchmarkReconstruct50x20x1M-32 3628812 3192474 -12.02%
BenchmarkReconstruct10x4x16M-32 8562186 8781479 +2.56%
BenchmarkReconstructData10x2x10000-32 3241 3116 -3.86%
BenchmarkReconstructData50x5x50000-32 162520 134794 -17.06%
BenchmarkReconstructData10x2x1M-32 171253 161955 -5.43%
BenchmarkReconstructData5x2x1M-32 102215 106942 +4.62%
BenchmarkReconstructData10x4x1M-32 225593 219969 -2.49%
BenchmarkReconstructData50x20x1M-32 2515311 2129721 -15.33%
BenchmarkReconstructData10x4x16M-32 6980308 6698111 -4.04%
BenchmarkReconstructP10x2x10000-32 924 937 +1.35%
BenchmarkReconstructP10x5x20000-32 1639 1703 +3.90%
BenchmarkSplit10x4x160M-32 4984993 4898045 -1.74%
BenchmarkSplit5x2x5M-32 380415 221446 -41.79%
BenchmarkSplit10x2x1M-32 58761 53335 -9.23%
BenchmarkSplit10x4x10M-32 643188 410959 -36.11%
BenchmarkSplit50x20x50M-32 1843879 1647205 -10.67%
BenchmarkSplit17x3x272M-32 3684920 3613951 -1.93%
BenchmarkParallel_8x8x64K-32 7022 6630 -5.58%
BenchmarkParallel_8x8x05M-32 348308 348369 +0.02%
BenchmarkParallel_20x10x05M-32 575672 581028 +0.93%
BenchmarkParallel_8x8x1M-32 716033 697167 -2.63%
BenchmarkParallel_8x8x8M-32 5716048 5616437 -1.74%
BenchmarkParallel_8x8x32M-32 22650878 22098667 -2.44%
BenchmarkParallel_8x3x1M-32 406839 399125 -1.90%
BenchmarkParallel_8x4x1M-32 459107 463890 +1.04%
BenchmarkParallel_8x5x1M-32 527488 520334 -1.36%
BenchmarkStreamEncode10x2x10000-32 6013 5878 -2.25%
BenchmarkStreamEncode100x20x10000-32 503124 267894 -46.75%
BenchmarkStreamEncode17x3x1M-32 1561838 1376618 -11.86%
BenchmarkStreamEncode10x4x16M-32 19124427 17762582 -7.12%
BenchmarkStreamEncode5x2x1M-32 429701 384666 -10.48%
BenchmarkStreamEncode10x2x1M-32 801257 763637 -4.70%
BenchmarkStreamEncode10x4x1M-32 876065 820744 -6.31%
BenchmarkStreamEncode50x20x1M-32 7205112 6081398 -15.60%
BenchmarkStreamEncode17x3x16M-32 27182786 26117143 -3.92%
BenchmarkStreamVerify10x2x10000-32 13767 14026 +1.88%
BenchmarkStreamVerify50x5x50000-32 826983 690453 -16.51%
BenchmarkStreamVerify10x2x1M-32 1238566 1182591 -4.52%
BenchmarkStreamVerify5x2x1M-32 892661 806301 -9.67%
BenchmarkStreamVerify10x4x1M-32 1676394 1631495 -2.68%
BenchmarkStreamVerify50x20x1M-32 10877875 10037678 -7.72%
BenchmarkStreamVerify10x4x16M-32 27599576 30435400 +10.27%

benchmark old MB/s new MB/s speedup
BenchmarkGalois128K-32 58518.53 58510.17 1.00x
BenchmarkGalois1M-32 53558.10 55507.44 1.04x
BenchmarkGaloisXor128K-32 46839.74 45961.09 0.98x
BenchmarkGaloisXor1M-32 44936.98 44917.46 1.00x
BenchmarkEncode2x1x1M-32 91561.27 91524.11 1.00x
BenchmarkEncode10x2x10000-32 37385.54 38792.54 1.04x
BenchmarkEncode100x20x10000-32 3306.47 8096.40 2.45x
BenchmarkEncode17x3x1M-32 64773.49 93557.14 1.44x
BenchmarkEncode10x4x16M-32 28039.15 28039.68 1.00x
BenchmarkEncode5x2x1M-32 107365.88 109781.16 1.02x
BenchmarkEncode10x2x1M-32 124083.62 135266.27 1.09x
BenchmarkEncode10x4x1M-32 85408.99 94419.71 1.11x
BenchmarkEncode50x20x1M-32 19812.81 24344.67 1.23x
BenchmarkEncode17x3x16M-32 32642.93 33200.32 1.02x
BenchmarkEncode_8x4x8M-32 29277.52 30261.21 1.03x
BenchmarkEncode_12x4x12M-32 30355.67 30589.14 1.01x
BenchmarkEncode_16x4x16M-32 31023.66 31102.39 1.00x
BenchmarkEncode_16x4x32M-32 31912.44 31201.82 0.98x
BenchmarkEncode_16x4x64M-32 31846.32 30589.65 0.96x
BenchmarkEncode_8x5x8M-32 24003.28 28351.84 1.18x
BenchmarkEncode_8x6x8M-32 23184.41 26707.91 1.15x
BenchmarkEncode_8x7x8M-32 21623.86 25354.03 1.17x
BenchmarkEncode_8x9x8M-32 22943.85 23321.13 1.02x
BenchmarkEncode_8x10x8M-32 21809.31 22841.68 1.05x
BenchmarkEncode_8x11x8M-32 21637.77 22735.06 1.05x
BenchmarkEncode_8x8x05M-32 55606.22 61311.47 1.10x
BenchmarkEncode_8x8x1M-32 65351.80 65830.73 1.01x
BenchmarkEncode_8x8x8M-32 24193.01 24754.07 1.02x
BenchmarkEncode_8x8x32M-32 23303.06 23644.60 1.01x
BenchmarkEncode_24x8x24M-32 29041.76 26549.54 0.91x
BenchmarkEncode_24x8x48M-32 29900.52 26322.51 0.88x
BenchmarkVerify10x2x10000-32 13685.12 14717.10 1.08x
BenchmarkVerify10x2x1M-32 24378.43 26424.72 1.08x
BenchmarkVerify5x2x1M-32 16535.79 17495.41 1.06x
BenchmarkVerify10x4x1M-32 14248.35 15484.96 1.09x
BenchmarkVerify50x20x1M-32 10180.79 11863.85 1.17x
BenchmarkVerify10x4x16M-32 13214.53 13283.71 1.01x
BenchmarkReconstruct10x2x10000-32 35799.16 36854.89 1.03x
BenchmarkReconstruct50x5x50000-32 33049.47 39034.89 1.18x
BenchmarkReconstruct10x2x1M-32 66326.88 72061.06 1.09x
BenchmarkReconstruct5x2x1M-32 57308.21 58014.92 1.01x
BenchmarkReconstruct10x4x1M-32 53711.74 57791.66 1.08x
BenchmarkReconstruct50x20x1M-32 20227.09 22991.67 1.14x
BenchmarkReconstruct10x4x16M-32 27432.37 26747.32 0.98x
BenchmarkReconstructData10x2x10000-32 37030.86 38511.87 1.04x
BenchmarkReconstructData50x5x50000-32 33842.07 40802.85 1.21x
BenchmarkReconstructData10x2x1M-32 73475.57 77693.87 1.06x
BenchmarkReconstructData5x2x1M-32 71809.58 68635.57 0.96x
BenchmarkReconstructData10x4x1M-32 65073.27 66736.88 1.03x
BenchmarkReconstructData50x20x1M-32 29181.41 34464.76 1.18x
BenchmarkReconstructData10x4x16M-32 33649.09 35066.75 1.04x
BenchmarkReconstructP10x2x10000-32 129819.98 128086.76 0.99x
BenchmarkReconstructP10x5x20000-32 183073.89 176202.21 0.96x
BenchmarkParallel_8x8x64K-32 149327.33 158153.67 1.06x
BenchmarkParallel_8x8x05M-32 24083.89 24079.69 1.00x
BenchmarkParallel_20x10x05M-32 27322.20 27070.35 0.99x
BenchmarkParallel_8x8x1M-32 23430.78 24064.83 1.03x
BenchmarkParallel_8x8x8M-32 23480.86 23897.31 1.02x
BenchmarkParallel_8x8x32M-32 23701.99 24294.27 1.02x
BenchmarkParallel_8x3x1M-32 28351.11 28899.03 1.02x
BenchmarkParallel_8x4x1M-32 27407.34 27124.76 0.99x
BenchmarkParallel_8x5x1M-32 25842.27 26197.58 1.01x
BenchmarkStreamEncode10x2x10000-32 16629.76 17012.26 1.02x
BenchmarkStreamEncode100x20x10000-32 1987.58 3732.83 1.88x
BenchmarkStreamEncode17x3x1M-32 11413.34 12948.97 1.13x
BenchmarkStreamEncode10x4x16M-32 8772.66 9445.26 1.08x
BenchmarkStreamEncode5x2x1M-32 12201.21 13629.70 1.12x
BenchmarkStreamEncode10x2x1M-32 13086.64 13731.34 1.05x
BenchmarkStreamEncode10x4x1M-32 11969.16 12775.92 1.07x
BenchmarkStreamEncode50x20x1M-32 7276.61 8621.18 1.18x
BenchmarkStreamEncode17x3x16M-32 10492.40 10920.52 1.04x
BenchmarkStreamVerify10x2x10000-32 7264.00 7129.49 0.98x
BenchmarkStreamVerify50x5x50000-32 6046.07 7241.62 1.20x
BenchmarkStreamVerify10x2x1M-32 8466.05 8866.77 1.05x
BenchmarkStreamVerify5x2x1M-32 5873.31 6502.39 1.11x
BenchmarkStreamVerify10x4x1M-32 6254.95 6427.09 1.03x
BenchmarkStreamVerify50x20x1M-32 4819.76 5223.20 1.08x
BenchmarkStreamVerify10x4x16M-32 6078.79 5512.40 0.91x
```
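The goroutine limit mentioned in the commit message can be illustrated with a buffered channel used as a semaphore. This is a minimal sketch of the general technique, not the library's code; `forEachLimited` and the limit of 4 are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// forEachLimited calls fn(i) for every i in [0, n) while keeping at most
// limit goroutines running at any time.
func forEachLimited(n, limit int, fn func(i int)) {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // one slot per allowed goroutine
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while limit goroutines are active
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			fn(i)
		}(i)
	}
	wg.Wait()
}

func main() {
	out := make([]int, 100)
	forEachLimited(len(out), 4, func(i int) { out[i] = i * i })
	fmt.Println(out[9], out[99])
}
```

Capping concurrency this way avoids spawning one goroutine per shard when the shard count is large, which is where the table above shows the biggest gains.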
2021-12-09 12:28:44 +01:00
Klaus Post
7761c8f7cd
Use Workflows ( #169 )
...
* Use Workflows
* Go 1.17 build tags
* Do races separately.
2021-09-01 18:55:02 +02:00
Klaus Post
7daa20bf74
Generate AVX2 code ( #141 )
...
Replaces AVX2 up to 10x8 configurations with specific generated functions.
If code size is a concern, `-tags=nogen` can be used to exclude the generated code.
The biggest speedup is seen when the workload is not memory constrained.
```
benchmark old MB/s new MB/s speedup
BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x
BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x
BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x
BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x
BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x
BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x
BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x
BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x
BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x
BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x
BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x
BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x
BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x
BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x
BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x
BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x
BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x
BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x
BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x
BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x
BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x

benchmark old MB/s new MB/s speedup
BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x
BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x
BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x
BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x
BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x
BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x
BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x
BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x
BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x
BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x
BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x
BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x
BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x
BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x
BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x
BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x
BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x
BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x
```
2020-05-20 12:48:34 +02:00
Frank Wessels
1b9e129671
Avx512 parallel81 ( #131 )
...
* AVX512 routine for 8x1 parallel processing (WIP)
* Testing and integration of Parallel81 assembly routine
2020-05-06 12:32:31 +02:00
Klaus Post
dccac354fe
Add cross compilation ( #127 )
...
* Add cross compilation
Add 386 as a 32-bit test, plus arm64 and ppc64le since they have assembly.
2020-05-04 21:19:49 +02:00
Klaus Post
de70cc155f
AVX512 parallel processing ( #120 )
...
Do concurrent processing in AVX512 mode and split jobs by cache size.
2020-05-04 09:17:40 +02:00
Klaus Post
d2cfcb8065
Add commandline arg to disable asm for tests. ( #116 )
...
* Add commandline test args
2020-04-22 15:38:21 +02:00
Klaus Post
0883d2f011
Only enable AVX512 on AMD64
...
Fixes #102
2019-05-26 12:12:55 +02:00
Klaus Post
a9588190c0
Optimize pure Go version. ( #96 )
...
* Optimize pure Go version.
* Update docs. Add Go 1.12 CI
* Avoid dst bounds check when using noasm; ~40-50% faster.
* Convert multiply table to a slice whenever used.
* Split on 32 byte boundaries instead of 16 byte.
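The bounds-check and table-to-slice items above can be illustrated with a small sketch. It is not the library's code: `gfMul`, `mulSlice` and the reduction constant `0x1d` are assumptions made for the example. The idea is that reslicing `out` to `len(in)` up front and indexing a 256-entry slice with a `byte` lets the compiler drop per-iteration bounds checks.

```go
package main

import "fmt"

// gfMul multiplies two GF(2^8) elements by shift-and-add, reducing with the
// low byte 0x1d of an assumed field polynomial.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// mulSlice writes table[in[i]] to out[i] for every input byte.
// out must be at least as long as in.
func mulSlice(table *[256]byte, in, out []byte) {
	mt := table[:256]   // a slice of known length 256; a byte index is always in range
	out = out[:len(in)] // single check here instead of one per iteration
	for i, v := range in {
		out[i] = mt[v]
	}
}

func main() {
	var table [256]byte
	for i := range table {
		table[i] = gfMul(2, byte(i)) // lookup table for multiplication by 2
	}
	in := []byte{1, 0x80, 3}
	out := make([]byte, len(in))
	mulSlice(&table, in, out)
	fmt.Println(out)
}
```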
2019-03-08 10:49:27 +01:00
Frank Wessels
79aee05119
AVX512 accelerated version resulting in a 4x speed improvement over AVX2 ( #91 )
...
Performance on AVX512 has been accelerated for Intel CPUs. This gives per-core speedups of up to 4x compared to AVX2, as shown in the following table:
```
$ benchcmp avx2.txt avx512.txt
benchmark AVX2 MB/s AVX512 MB/s speedup
BenchmarkEncode8x8x1M-72 1681.35 4125.64 2.45x
BenchmarkEncode8x4x8M-72 1529.36 5507.97 3.60x
BenchmarkEncode8x8x8M-72 791.16 2952.29 3.73x
BenchmarkEncode8x8x32M-72 573.26 2168.61 3.78x
BenchmarkEncode12x4x12M-72 1234.41 4912.37 3.98x
BenchmarkEncode16x4x16M-72 1189.59 5138.01 4.32x
BenchmarkEncode24x8x24M-72 690.68 2583.70 3.74x
BenchmarkEncode24x8x48M-72 674.20 2643.31 3.92x
```
2019-02-10 11:17:23 +01:00