Klaus Post
ab26eb4126
Add WithInversionCache and use pointer methods ( #160 )
...
There appear to be writes to value receivers.
Add `WithInversionCache(bool)` to disable cache.
Fixes #159
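A minimal sketch of why writes to value receivers are a problem. The `counter` type and its methods are hypothetical and only illustrate the Go semantics; they are not taken from this package:

```go
package main

import "fmt"

// counter is a hypothetical type used only for illustration.
type counter struct {
	hits int
}

// incByValue has a value receiver: it modifies a copy, so the write is lost.
func (c counter) incByValue() { c.hits++ }

// incByPointer has a pointer receiver: the write is visible to the caller.
func (c *counter) incByPointer() { c.hits++ }

func main() {
	var c counter
	c.incByValue()
	fmt.Println(c.hits) // 0
	c.incByPointer()
	fmt.Println(c.hits) // 1
}
```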
2021-01-13 10:21:28 +01:00
Klaus Post
7c8682430c
tests: Set full data size as number of bytes ( #157 )
...
* Clean up deps.
* tests: Set full data size as number of bytes
Use total data size (data+parity) as benchmark sizes for more consistent benchmarks.
2020-12-18 09:09:17 +01:00
Klaus Post
6e6fbcd31d
Update README.md
2020-12-17 11:35:52 +01:00
Michael Cook
c81ca04b16
Sanity check error on SwapRows ( #156 )
2020-12-17 09:38:25 +01:00
Klaus Post
60143f4a15
Update README.md
2020-12-09 22:56:37 +01:00
Klaus Post
519603f6e1
Update packages ( #154 )
...
* Update packages
Update cpuid and clean up generated.
2020-12-09 22:56:01 +01:00
Klaus Post
653e76aa26
Faster AVX2 encoding ( #153 )
...
* Remove 50% of bounds checks when copying.
* Use RIP only addressing, free one register.
```
benchmark old MB/s new MB/s speedup
BenchmarkGalois128K-32 57663.49 58005.87 1.01x
BenchmarkGalois1M-32 49479.31 49848.29 1.01x
BenchmarkGaloisXor128K-32 46310.69 46501.88 1.00x
BenchmarkGaloisXor1M-32 43804.86 43984.39 1.00x
BenchmarkEncode10x2x10000-32 25926.93 27457.75 1.06x
BenchmarkEncode100x20x10000-32 2635.82 2818.95 1.07x
BenchmarkEncode17x3x1M-32 63215.11 61576.76 0.97x
BenchmarkEncode10x4x16M-32 19551.54 19505.07 1.00x
BenchmarkEncode5x2x1M-32 79612.06 81985.14 1.03x
BenchmarkEncode10x2x1M-32 121478.29 127739.41 1.05x
BenchmarkEncode10x4x1M-32 70757.61 74423.67 1.05x
BenchmarkEncode50x20x1M-32 19811.96 20103.32 1.01x
BenchmarkEncode17x3x16M-32 27202.10 27825.34 1.02x
BenchmarkEncode_8x4x8M-32 19029.04 19701.31 1.04x
BenchmarkEncode_12x4x12M-32 22449.87 22480.51 1.00x
BenchmarkEncode_16x4x16M-32 24536.74 24672.24 1.01x
BenchmarkEncode_16x4x32M-32 24381.34 24981.99 1.02x
BenchmarkEncode_16x4x64M-32 24717.69 25086.94 1.01x
BenchmarkEncode_8x5x8M-32 16763.51 17154.04 1.02x
BenchmarkEncode_8x6x8M-32 15067.22 15205.87 1.01x
BenchmarkEncode_8x7x8M-32 13156.38 13589.40 1.03x
BenchmarkEncode_8x9x8M-32 11363.74 11523.70 1.01x
BenchmarkEncode_8x10x8M-32 10359.37 10474.91 1.01x
BenchmarkEncode_8x11x8M-32 9627.07 9463.24 0.98x
BenchmarkEncode_8x8x05M-32 30104.80 32634.89 1.08x
BenchmarkEncode_8x8x1M-32 36497.28 36425.88 1.00x
BenchmarkEncode_8x8x8M-32 12186.19 11602.41 0.95x
BenchmarkEncode_8x8x32M-32 11670.72 11413.71 0.98x
BenchmarkEncode_24x8x24M-32 21709.83 21652.50 1.00x
BenchmarkEncode_24x8x48M-32 22494.40 22280.59 0.99x
BenchmarkVerify10x2x10000-32 10567.56 10483.91 0.99x
BenchmarkVerify50x5x50000-32 28102.84 27923.63 0.99x
BenchmarkVerify10x2x1M-32 30298.33 30106.18 0.99x
BenchmarkVerify5x2x1M-32 16115.91 15847.03 0.98x
BenchmarkVerify10x4x1M-32 15382.13 14852.68 0.97x
BenchmarkVerify50x20x1M-32 8476.02 8466.24 1.00x
BenchmarkVerify10x4x16M-32 15101.03 15434.71 1.02x
BenchmarkReconstruct10x2x10000-32 26228.18 26960.19 1.03x
BenchmarkReconstruct50x5x50000-32 31091.42 30975.82 1.00x
BenchmarkReconstruct10x2x1M-32 58548.87 60281.92 1.03x
BenchmarkReconstruct5x2x1M-32 39499.23 41791.80 1.06x
BenchmarkReconstruct10x4x1M-32 41448.60 43053.15 1.04x
BenchmarkReconstruct50x20x1M-32 17185.99 17354.67 1.01x
BenchmarkReconstruct10x4x16M-32 18798.60 18847.43 1.00x
BenchmarkReconstructData10x2x10000-32 27208.48 27538.38 1.01x
BenchmarkReconstructData50x5x50000-32 32135.65 32078.91 1.00x
BenchmarkReconstructData10x2x1M-32 63180.19 67332.17 1.07x
BenchmarkReconstructData5x2x1M-32 47532.85 49932.17 1.05x
BenchmarkReconstructData10x4x1M-32 50059.14 52323.15 1.05x
BenchmarkReconstructData50x20x1M-32 26679.75 26714.11 1.00x
BenchmarkReconstructData10x4x16M-32 24854.99 24527.23 0.99x
BenchmarkReconstructP10x2x10000-32 115089.87 113229.75 0.98x
BenchmarkReconstructP10x5x20000-32 129838.75 132871.10 1.02x
BenchmarkParallel_8x8x64K-32 69951.43 69980.44 1.00x
BenchmarkParallel_8x8x05M-32 11752.94 11724.35 1.00x
BenchmarkParallel_20x10x05M-32 18553.93 18613.33 1.00x
BenchmarkParallel_8x8x1M-32 11639.19 11746.86 1.01x
BenchmarkParallel_8x8x8M-32 11799.36 11685.63 0.99x
BenchmarkParallel_8x8x32M-32 11510.94 11791.72 1.02x
BenchmarkParallel_8x3x1M-32 20268.92 20678.21 1.02x
BenchmarkParallel_8x4x1M-32 17616.05 17856.17 1.01x
BenchmarkParallel_8x5x1M-32 15590.87 15872.42 1.02x
BenchmarkStreamEncode10x2x10000-32 14917.08 15408.39 1.03x
BenchmarkStreamEncode100x20x10000-32 2014.81 2077.31 1.03x
BenchmarkStreamEncode17x3x1M-32 11839.37 12434.80 1.05x
BenchmarkStreamEncode10x4x16M-32 9151.14 9206.98 1.01x
BenchmarkStreamEncode5x2x1M-32 13598.55 13663.56 1.00x
BenchmarkStreamEncode10x2x1M-32 13192.91 13453.41 1.02x
BenchmarkStreamEncode10x4x1M-32 12109.90 12050.68 1.00x
BenchmarkStreamEncode50x20x1M-32 8640.73 8370.10 0.97x
BenchmarkStreamEncode17x3x16M-32 10473.17 10527.04 1.01x
BenchmarkStreamVerify10x2x10000-32 7032.23 7128.82 1.01x
BenchmarkStreamVerify50x5x50000-32 13023.46 13109.31 1.01x
BenchmarkStreamVerify10x2x1M-32 11941.63 11949.91 1.00x
BenchmarkStreamVerify5x2x1M-32 8029.93 8263.39 1.03x
BenchmarkStreamVerify10x4x1M-32 8137.82 8271.11 1.02x
BenchmarkStreamVerify50x20x1M-32 7378.87 7708.81 1.04x
BenchmarkStreamVerify10x4x16M-32 8973.18 8955.29 1.00x
```
2020-11-10 14:39:23 +01:00
Klaus Post
04d4482b55
Test gzip on 390x ( #150 )
...
* Upgrade CI to Go 1.15
* Test unzip
2020-09-01 13:14:21 +02:00
Klaus Post
11742a626c
Upgrade CI to Go 1.15 ( #151 )
...
Removes Go 1.12
2020-09-01 13:13:31 +02:00
Klaus Post
7daa20bf74
Generate AVX2 code ( #141 )
...
Replaces AVX2 up to 10x8 configurations with specific generated functions.
If code size is a concern, `-tags=nogen` can be used.
Biggest speedup when not memory constrained.
```
benchmark old MB/s new MB/s speedup
BenchmarkEncode_8x5x8M 5895.75 9648.18 1.64x
BenchmarkEncode_8x5x8M-4 16773.41 17220.67 1.03x
BenchmarkEncode_8x5x8M-16 18263.12 17176.28 0.94x
BenchmarkEncode_8x6x8M 5075.89 8548.39 1.68x
BenchmarkEncode_8x6x8M-4 14559.83 15370.95 1.06x
BenchmarkEncode_8x6x8M-16 16183.37 15291.98 0.94x
BenchmarkEncode_8x7x8M 4481.18 7015.60 1.57x
BenchmarkEncode_8x7x8M-4 12835.35 13695.90 1.07x
BenchmarkEncode_8x7x8M-16 14246.94 13737.36 0.96x
BenchmarkEncode_8x8x05M 5569.95 7947.70 1.43x
BenchmarkEncode_8x8x05M-4 17334.91 25271.37 1.46x
BenchmarkEncode_8x8x05M-16 29349.42 35043.36 1.19x
BenchmarkEncode_8x8x1M 4830.58 7891.32 1.63x
BenchmarkEncode_8x8x1M-4 17531.36 27371.42 1.56x
BenchmarkEncode_8x8x1M-16 29593.98 39241.09 1.33x
BenchmarkEncode_8x8x8M 3953.66 6584.26 1.67x
BenchmarkEncode_8x8x8M-4 11527.34 12331.23 1.07x
BenchmarkEncode_8x8x8M-16 12718.89 12173.08 0.96x
BenchmarkEncode_8x8x32M 3927.51 6195.91 1.58x
BenchmarkEncode_8x8x32M-4 11490.85 11424.39 0.99x
BenchmarkEncode_8x8x32M-16 12506.09 11888.55 0.95x
benchmark old MB/s new MB/s speedup
BenchmarkParallel_8x8x64K 5490.24 6959.57 1.27x
BenchmarkParallel_8x8x64K-4 21078.94 29557.51 1.40x
BenchmarkParallel_8x8x64K-16 57508.45 73672.54 1.28x
BenchmarkParallel_8x8x1M 4755.49 7667.84 1.61x
BenchmarkParallel_8x8x1M-4 11818.66 12013.49 1.02x
BenchmarkParallel_8x8x1M-16 12923.12 12109.42 0.94x
BenchmarkParallel_8x8x8M 3973.94 6525.85 1.64x
BenchmarkParallel_8x8x8M-4 11725.68 11312.46 0.96x
BenchmarkParallel_8x8x8M-16 12608.20 11484.98 0.91x
BenchmarkParallel_8x3x1M 14139.71 17993.04 1.27x
BenchmarkParallel_8x3x1M-4 21805.97 23053.92 1.06x
BenchmarkParallel_8x3x1M-16 24673.05 23596.71 0.96x
BenchmarkParallel_8x4x1M 10617.88 14474.54 1.36x
BenchmarkParallel_8x4x1M-4 18635.82 18965.65 1.02x
BenchmarkParallel_8x4x1M-16 21518.12 20171.47 0.94x
BenchmarkParallel_8x5x1M 8669.88 11833.96 1.36x
BenchmarkParallel_8x5x1M-4 16321.00 17500.30 1.07x
BenchmarkParallel_8x5x1M-16 17267.16 17191.04 1.00x
```
2020-05-20 12:48:34 +02:00
Frank Wessels
01b307ec91
Minor refactor for arm NEON version using macros ( #147 )
...
* Minor refactor for arm NEON version using macros
2020-05-19 15:03:47 +02:00
Frank Wessels
6fbce20c81
Updated performance numbers for Graviton2 on ARM ( #146 )
2020-05-15 11:09:11 +02:00
Klaus Post
e8fdfd6630
Update readme and re-allow s390x failure.
2020-05-14 14:29:53 +02:00
Klaus Post
f338110979
Make sure assembler is formatted ( #145 )
...
* Make sure assembler is formatted
2020-05-14 12:04:55 +02:00
Frank Wessels
27f8a7b6bf
Small optimization to parallel82 for AVX512 by reducing the number of VSHUFI64X2 instructions in the core loop ( #143 )
2020-05-14 10:19:23 +02:00
Frank Wessels
2475ea7519
Use proper NEON assembly instructions for ARM ( #144 )
...
* Use proper NEON assembly instructions for ARM
* Updated performance numbers for ARM
2020-05-14 10:18:32 +02:00
Klaus Post
c83b7b4a38
Allow s390x failures ( #142 )
...
The s390x seems quite unstable, so we allow failures on it.
2020-05-13 17:04:16 +02:00
Klaus Post
cf8495259a
Add pure XOR for 1 parity ( #138 )
...
WithFastOneParityMatrix will switch the matrix to a simple xor if there is only one parity shard.
The PAR1 matrix already has this property so it has little effect there.
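A rough sketch of what a pure XOR parity shard means in practice. `xorParity` is a hypothetical helper for illustration, not the package's code, and it assumes all shards have equal length:

```go
package main

import "fmt"

// xorParity returns the byte-wise XOR of all shards.
// All shards are assumed to have the same length.
func xorParity(shards [][]byte) []byte {
	if len(shards) == 0 {
		return nil
	}
	parity := make([]byte, len(shards[0]))
	for _, s := range shards {
		for i, b := range s {
			parity[i] ^= b
		}
	}
	return parity
}

func main() {
	data := [][]byte{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
	parity := xorParity(data)

	// Pretend data[1] was lost: XOR the survivors with the parity shard.
	recovered := xorParity([][]byte{data[0], data[2], parity})
	fmt.Println(recovered) // [4 5 6]
}
```

With a single parity shard, the same XOR routine serves for both encoding and recovering one missing shard.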
2020-05-13 11:10:58 +02:00
Frank Wessels
d6d9fba4f9
Take vshufi64x2 out of main loop and initialize upfront (for parallel 81 only) ( #139 )
2020-05-13 10:59:26 +02:00
Frank Wessels
d5afb5f48e
Faster arm64 implementation that does not use PMULL instruction ( #140 )
...
* Faster arm64 implementation that does not use PMULL instruction
* Add NEON version for sliceXor
2020-05-13 10:24:22 +02:00
Klaus Post
2df03bd4d1
CI: test more archs ( #135 )
...
* ci: test more architectures
2020-05-09 10:35:17 +02:00
Frank Wessels
2f8e50e65c
Better test coverage for AVX512 (parallel version) ( #134 )
2020-05-07 09:28:23 +02:00
Klaus Post
696c4018f8
bench: Fix reconstruct benchmarks ( #133 )
...
Always corrupt at least one shard and don't shuffle shards.
2020-05-06 15:42:49 +02:00
Klaus Post
151d8c7a05
Tweak concurrency ( #132 )
2020-05-06 15:42:30 +02:00
Klaus Post
96dc2a5aa4
Update README
2020-05-06 13:47:25 +02:00
Klaus Post
3067f8aed5
asmfmt
2020-05-06 12:36:43 +02:00
Frank Wessels
1b9e129671
Avx512 parallel81 ( #131 )
...
* AVX512 routine for 8x1 parallel processing (WIP)
* Testing and integration of Parallel81 assembly routine
2020-05-06 12:32:31 +02:00
Klaus Post
cb7a0b5aef
Do fast multiplication by one ( #130 )
...
When multiplying by one we can use faster math.
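A sketch of the idea, assuming a multiply-and-add loop over GF(2^8). The names `gfMul` and `mulAdd` and the reduction polynomial 0x11d are assumptions made for this example, not the package's actual routines:

```go
package main

import "fmt"

// gfMul multiplies two elements of GF(2^8) bit by bit.
// The reduction polynomial 0x11d is an assumption for this sketch.
func gfMul(a, b byte) byte {
	var p byte
	for b > 0 {
		if b&1 != 0 {
			p ^= a
		}
		carry := a & 0x80
		a <<= 1
		if carry != 0 {
			a ^= 0x1d
		}
		b >>= 1
	}
	return p
}

// mulAdd computes out[i] ^= c*in[i]; len(out) must be at least len(in).
func mulAdd(c byte, in, out []byte) {
	if c == 1 {
		// Fast path: 1*x == x, so only the XOR remains.
		for i, v := range in {
			out[i] ^= v
		}
		return
	}
	for i, v := range in {
		out[i] ^= gfMul(c, v)
	}
}

func main() {
	in := []byte{10, 20, 30}
	out := []byte{1, 2, 3}
	mulAdd(1, in, out)
	fmt.Println(out) // [11 22 29]
}
```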
2020-05-06 11:14:25 +02:00
Klaus Post
0e9e10435f
avx2: Add 64 bytes per loop processing ( #128 )
...
* avx2: Add 64 bytes per loop processing
Not super clean benchmark run, but `BenchmarkGalois` is consistently faster.
```
benchmark old ns/op new ns/op delta
BenchmarkGalois128K-32 2551 2261 -11.37%
BenchmarkGalois1M-32 22492 21107 -6.16%
BenchmarkGaloisXor128K-32 2972 2808 -5.52%
BenchmarkGaloisXor1M-32 25181 23951 -4.88%
BenchmarkEncode10x2x10000-32 5081 4722 -7.07%
BenchmarkEncode100x20x10000-32 383800 346655 -9.68%
BenchmarkEncode17x3x1M-32 264806 263191 -0.61%
BenchmarkEncode10x4x16M-32 8337857 8376910 +0.47%
BenchmarkEncode5x2x1M-32 77119 73598 -4.57%
BenchmarkEncode10x2x1M-32 108424 102423 -5.53%
BenchmarkEncode10x4x1M-32 194427 184301 -5.21%
BenchmarkEncode50x20x1M-32 3870301 3747639 -3.17%
BenchmarkEncode17x3x16M-32 10617586 10602449 -0.14%
BenchmarkEncode_8x4x8M-32 3227254 3229451 +0.07%
BenchmarkEncode_12x4x12M-32 6841898 6847261 +0.08%
BenchmarkEncode_16x4x16M-32 11153469 11048738 -0.94%
BenchmarkEncode_16x4x32M-32 21947506 21826647 -0.55%
BenchmarkEncode_16x4x64M-32 43163608 42971338 -0.45%
BenchmarkEncode_8x5x8M-32 3856675 3780730 -1.97%
BenchmarkEncode_8x6x8M-32 4322023 4437109 +2.66%
BenchmarkEncode_8x7x8M-32 5011434 4959623 -1.03%
BenchmarkEncode_8x9x8M-32 6243694 6098824 -2.32%
BenchmarkEncode_8x10x8M-32 6724456 6657099 -1.00%
BenchmarkEncode_8x11x8M-32 7207693 7340332 +1.84%
BenchmarkEncode_8x8x05M-32 176877 172183 -2.65%
BenchmarkEncode_8x8x1M-32 309716 301743 -2.57%
BenchmarkEncode_8x8x8M-32 5498952 5489078 -0.18%
BenchmarkEncode_8x8x32M-32 22630195 22557074 -0.32%
BenchmarkEncode_24x8x24M-32 28488886 28220702 -0.94%
BenchmarkEncode_24x8x48M-32 56124735 54862495 -2.25%
BenchmarkVerify10x2x10000-32 9874 9356 -5.25%
BenchmarkVerify50x5x50000-32 175610 159735 -9.04%
BenchmarkVerify10x2x1M-32 331276 311726 -5.90%
BenchmarkVerify5x2x1M-32 265466 248075 -6.55%
BenchmarkVerify10x4x1M-32 701627 606420 -13.57%
BenchmarkVerify50x20x1M-32 4338171 4245635 -2.13%
BenchmarkVerify10x4x16M-32 12312830 11932698 -3.09%
BenchmarkReconstruct10x2x10000-32 1594 1504 -5.65%
BenchmarkReconstruct50x5x50000-32 95101 79558 -16.34%
BenchmarkReconstruct10x2x1M-32 38479 37225 -3.26%
BenchmarkReconstruct5x2x1M-32 30968 30013 -3.08%
BenchmarkReconstruct10x4x1M-32 81630 75350 -7.69%
BenchmarkReconstruct50x20x1M-32 1136952 1040156 -8.51%
BenchmarkReconstruct10x4x16M-32 685408 656484 -4.22%
BenchmarkReconstructData10x2x10000-32 1609 1486 -7.64%
BenchmarkReconstructData50x5x50000-32 87090 71512 -17.89%
BenchmarkReconstructData10x2x1M-32 31497 30347 -3.65%
BenchmarkReconstructData5x2x1M-32 23379 22611 -3.28%
BenchmarkReconstructData10x4x1M-32 63853 61035 -4.41%
BenchmarkReconstructData50x20x1M-32 1048807 966201 -7.88%
BenchmarkReconstructData10x4x16M-32 866658 892252 +2.95%
BenchmarkReconstructP10x2x10000-32 544 540 -0.74%
BenchmarkReconstructP10x5x20000-32 1242 1206 -2.90%
BenchmarkSplit10x4x160M-32 2735508 2743214 +0.28%
BenchmarkSplit5x2x5M-32 276232 288523 +4.45%
BenchmarkSplit10x2x1M-32 44389 45517 +2.54%
BenchmarkSplit10x4x10M-32 477282 460888 -3.43%
BenchmarkSplit50x20x50M-32 1608821 1602105 -0.42%
BenchmarkSplit17x3x272M-32 2035932 2034705 -0.06%
BenchmarkParallel_8x8x05M-32 346733 351837 +1.47%
BenchmarkParallel_20x10x05M-32 577127 586232 +1.58%
BenchmarkParallel_8x8x1M-32 722453 729294 +0.95%
BenchmarkParallel_8x8x8M-32 5717650 5817130 +1.74%
BenchmarkParallel_8x8x32M-32 22914260 24132696 +5.32%
BenchmarkStreamEncode10x2x10000-32 6703131 7141021 +6.53%
BenchmarkStreamEncode100x20x10000-32 38175873 39767386 +4.17%
BenchmarkStreamEncode17x3x1M-32 8920549 9218973 +3.35%
BenchmarkStreamEncode10x4x16M-32 21841702 21784898 -0.26%
BenchmarkStreamEncode5x2x1M-32 4088001 3247404 -20.56%
BenchmarkStreamEncode10x2x1M-32 5860652 5932381 +1.22%
BenchmarkStreamEncode10x4x1M-32 7555172 7589960 +0.46%
BenchmarkStreamEncode50x20x1M-32 30006814 30250054 +0.81%
BenchmarkStreamEncode17x3x16M-32 32757489 32818254 +0.19%
BenchmarkStreamVerify10x2x10000-32 6714996 6831093 +1.73%
BenchmarkStreamVerify50x5x50000-32 18525904 18761767 +1.27%
BenchmarkStreamVerify10x2x1M-32 5232278 5444148 +4.05%
BenchmarkStreamVerify5x2x1M-32 3673843 3755283 +2.22%
BenchmarkStreamVerify10x4x1M-32 7184419 7185293 +0.01%
BenchmarkStreamVerify50x20x1M-32 28441187 28574766 +0.47%
BenchmarkStreamVerify10x4x16M-32 8538440 8668614 +1.52%
benchmark old MB/s new MB/s speedup
BenchmarkGalois128K-32 51374.59 57976.36 1.13x
BenchmarkGalois1M-32 46620.03 49679.10 1.07x
BenchmarkGaloisXor128K-32 44106.22 46671.56 1.06x
BenchmarkGaloisXor1M-32 41641.82 43779.89 1.05x
BenchmarkEncode10x2x10000-32 19682.61 21176.81 1.08x
BenchmarkEncode100x20x10000-32 2605.52 2884.71 1.11x
BenchmarkEncode17x3x1M-32 67316.54 67729.50 1.01x
BenchmarkEncode10x4x16M-32 20121.74 20027.93 1.00x
BenchmarkEncode5x2x1M-32 67984.17 71236.47 1.05x
BenchmarkEncode10x2x1M-32 96710.29 102377.00 1.06x
BenchmarkEncode10x4x1M-32 53931.74 56894.82 1.05x
BenchmarkEncode50x20x1M-32 13546.44 13989.82 1.03x
BenchmarkEncode17x3x16M-32 26862.29 26900.64 1.00x
BenchmarkEncode_8x4x8M-32 20794.42 20780.27 1.00x
BenchmarkEncode_12x4x12M-32 22069.16 22051.88 1.00x
BenchmarkEncode_16x4x16M-32 24067.44 24295.58 1.01x
BenchmarkEncode_16x4x32M-32 24461.59 24597.04 1.01x
BenchmarkEncode_16x4x64M-32 24876.09 24987.40 1.00x
BenchmarkEncode_8x5x8M-32 17400.71 17750.24 1.02x
BenchmarkEncode_8x6x8M-32 15527.19 15124.46 0.97x
BenchmarkEncode_8x7x8M-32 13391.15 13531.04 1.01x
BenchmarkEncode_8x9x8M-32 10748.26 11003.58 1.02x
BenchmarkEncode_8x10x8M-32 9979.82 10080.80 1.01x
BenchmarkEncode_8x11x8M-32 9310.73 9142.48 0.98x
BenchmarkEncode_8x8x05M-32 23713.12 24359.50 1.03x
BenchmarkEncode_8x8x1M-32 27084.87 27800.50 1.03x
BenchmarkEncode_8x8x8M-32 12203.94 12225.89 1.00x
BenchmarkEncode_8x8x32M-32 11861.83 11900.28 1.00x
BenchmarkEncode_24x8x24M-32 21200.54 21402.01 1.01x
BenchmarkEncode_24x8x48M-32 21522.77 22017.95 1.02x
BenchmarkVerify10x2x10000-32 10127.24 10688.01 1.06x
BenchmarkVerify50x5x50000-32 28472.25 31301.75 1.10x
BenchmarkVerify10x2x1M-32 31652.63 33637.74 1.06x
BenchmarkVerify5x2x1M-32 19749.74 21134.27 1.07x
BenchmarkVerify10x4x1M-32 14944.92 17291.25 1.16x
BenchmarkVerify50x20x1M-32 12085.46 12348.87 1.02x
BenchmarkVerify10x4x16M-32 13625.80 14059.87 1.03x
BenchmarkReconstruct10x2x10000-32 62723.68 66470.81 1.06x
BenchmarkReconstruct50x5x50000-32 52575.87 62847.32 1.20x
BenchmarkReconstruct10x2x1M-32 272507.04 281685.84 1.03x
BenchmarkReconstruct5x2x1M-32 169299.03 174685.39 1.03x
BenchmarkReconstruct10x4x1M-32 128455.17 139161.42 1.08x
BenchmarkReconstruct50x20x1M-32 46113.48 50404.73 1.09x
BenchmarkReconstruct10x4x16M-32 244777.11 255561.72 1.04x
BenchmarkReconstructData10x2x10000-32 62160.46 67305.98 1.08x
BenchmarkReconstructData50x5x50000-32 57411.81 69917.97 1.22x
BenchmarkReconstructData10x2x1M-32 332909.82 345526.29 1.04x
BenchmarkReconstructData5x2x1M-32 224254.60 231868.74 1.03x
BenchmarkReconstructData10x4x1M-32 164216.61 171799.68 1.05x
BenchmarkReconstructData50x20x1M-32 49988.98 54262.82 1.09x
BenchmarkReconstructData10x4x16M-32 193585.15 188032.29 0.97x
BenchmarkReconstructP10x2x10000-32 183806.57 185284.57 1.01x
BenchmarkReconstructP10x5x20000-32 160985.46 165852.51 1.03x
BenchmarkParallel_8x8x05M-32 12096.63 11921.17 0.99x
BenchmarkParallel_20x10x05M-32 18168.91 17886.72 0.98x
BenchmarkParallel_8x8x1M-32 11611.28 11502.36 0.99x
BenchmarkParallel_8x8x8M-32 11737.14 11536.42 0.98x
BenchmarkParallel_8x8x32M-32 11714.78 11123.31 0.95x
BenchmarkStreamEncode10x2x10000-32 14.92 14.00 0.94x
BenchmarkStreamEncode100x20x10000-32 26.19 25.15 0.96x
BenchmarkStreamEncode17x3x1M-32 1998.28 1933.60 0.97x
BenchmarkStreamEncode10x4x16M-32 7681.28 7701.31 1.00x
BenchmarkStreamEncode5x2x1M-32 1282.50 1614.48 1.26x
BenchmarkStreamEncode10x2x1M-32 1789.18 1767.55 0.99x
BenchmarkStreamEncode10x4x1M-32 1387.89 1381.53 1.00x
BenchmarkStreamEncode50x20x1M-32 1747.23 1733.18 0.99x
BenchmarkStreamEncode17x3x16M-32 8706.79 8690.67 1.00x
BenchmarkStreamVerify10x2x10000-32 14.89 14.64 0.98x
BenchmarkStreamVerify50x5x50000-32 269.89 266.50 0.99x
BenchmarkStreamVerify10x2x1M-32 2004.05 1926.06 0.96x
BenchmarkStreamVerify5x2x1M-32 1427.08 1396.13 0.98x
BenchmarkStreamVerify10x4x1M-32 1459.51 1459.34 1.00x
BenchmarkStreamVerify50x20x1M-32 1843.41 1834.79 1.00x
BenchmarkStreamVerify10x4x16M-32 19649.04 19353.98 0.98x
```
2020-05-05 16:36:01 +02:00
Klaus Post
abb309aca7
Fix stream allocations ( #129 )
...
Numbers speak for themselves:
```
benchmark old ns/op new ns/op delta
BenchmarkStreamEncode10x2x10000-32 4792420 7937 -99.83%
BenchmarkStreamEncode100x20x10000-32 38424066 473285 -98.77%
BenchmarkStreamEncode17x3x1M-32 8195036 1482191 -81.91%
BenchmarkStreamEncode10x4x16M-32 21356715 18051773 -15.47%
BenchmarkStreamEncode5x2x1M-32 3295827 412301 -87.49%
BenchmarkStreamEncode10x2x1M-32 5249011 798828 -84.78%
BenchmarkStreamEncode10x4x1M-32 6392974 904818 -85.85%
BenchmarkStreamEncode50x20x1M-32 29083474 7199282 -75.25%
BenchmarkStreamEncode17x3x16M-32 32451850 28036421 -13.61%
BenchmarkStreamVerify10x2x10000-32 4858416 12988 -99.73%
BenchmarkStreamVerify50x5x50000-32 17047361 377003 -97.79%
BenchmarkStreamVerify10x2x1M-32 4869964 887214 -81.78%
BenchmarkStreamVerify5x2x1M-32 3282999 591669 -81.98%
BenchmarkStreamVerify10x4x1M-32 5824392 1230888 -78.87%
BenchmarkStreamVerify50x20x1M-32 27301648 6204613 -77.27%
BenchmarkStreamVerify10x4x16M-32 8508963 18845695 +121.48%
benchmark old MB/s new MB/s speedup
BenchmarkStreamEncode10x2x10000-32 20.87 12599.82 603.73x
BenchmarkStreamEncode100x20x10000-32 26.03 2112.89 81.17x
BenchmarkStreamEncode17x3x1M-32 2175.19 12026.65 5.53x
BenchmarkStreamEncode10x4x16M-32 7855.71 9293.94 1.18x
BenchmarkStreamEncode5x2x1M-32 1590.76 12716.14 7.99x
BenchmarkStreamEncode10x2x1M-32 1997.66 13126.43 6.57x
BenchmarkStreamEncode10x4x1M-32 1640.20 11588.81 7.07x
BenchmarkStreamEncode50x20x1M-32 1802.70 7282.50 4.04x
BenchmarkStreamEncode17x3x16M-32 8788.80 10172.93 1.16x
BenchmarkStreamVerify10x2x10000-32 20.58 7699.20 374.11x
BenchmarkStreamVerify50x5x50000-32 293.30 13262.49 45.22x
BenchmarkStreamVerify10x2x1M-32 2153.15 11818.75 5.49x
BenchmarkStreamVerify5x2x1M-32 1596.98 8861.17 5.55x
BenchmarkStreamVerify10x4x1M-32 1800.32 8518.86 4.73x
BenchmarkStreamVerify50x20x1M-32 1920.35 8449.97 4.40x
BenchmarkStreamVerify10x4x16M-32 19717.11 8902.41 0.45x
```
2020-05-05 16:35:35 +02:00
Klaus Post
dccac354fe
Add cross compilation ( #127 )
...
* Add cross compilation
Add 386 as 32 bit test, arm64 and ppc64le since they have assembly.
2020-05-04 21:19:49 +02:00
Klaus Post
f525ef0450
Clean up build tags ( #126 )
...
Move non-amd64 code to a separate file and remove references in other files.
Fixes #125
2020-05-04 20:06:47 +02:00
Klaus Post
a0556fddfa
Add go.sum as well.
2020-05-04 10:19:03 +02:00
Klaus Post
de70cc155f
AVX512 parallel processing ( #120 )
...
Do concurrent processing in AVX512 mode and split jobs by cache size.
2020-05-04 09:17:40 +02:00
Klaus Post
e920b5fec3
Add direct modules support ( #124 )
...
* Add direct modules support
* Add tests with various assembly disabled.
* Add Go 1.14 - remove 1.11
2020-05-03 21:53:25 +02:00
Klaus Post
d069fb1019
Remove a bounds check in pure Go ( #123 )
...
About 40% faster on the pure Go `BenchmarkGalois` operation.
```
benchmark old ns/op new ns/op delta
BenchmarkParallel_8x8x05M-8 2990849 2763554 -7.60%
BenchmarkParallel_8x8x1M-8 4941575 5061619 +2.43%
BenchmarkParallel_8x8x8M-8 34257722 33192541 -3.11%
BenchmarkParallel_8x8x32M-8 143157262 131654688 -8.03%
BenchmarkGalois128K-8 64201 38374 -40.23%
BenchmarkGalois1M-8 507053 307236 -39.41%
BenchmarkGaloisXor128K-8 63815 63157 -1.03%
BenchmarkGaloisXor1M-8 506369 505641 -0.14%
BenchmarkEncode10x2x10000-8 96414 92781 -3.77%
BenchmarkEncode100x20x10000-8 3188549 3238299 +1.56%
BenchmarkEncode17x3x1M-8 3741349 3633535 -2.88%
BenchmarkEncode10x4x16M-8 41628596 40306100 -3.18%
BenchmarkEncode5x2x1M-8 724162 699137 -3.46%
BenchmarkEncode10x2x1M-8 1451401 1423224 -1.94%
BenchmarkEncode10x4x1M-8 2839382 2740249 -3.49%
BenchmarkEncode50x20x1M-8 68415407 67015156 -2.05%
BenchmarkEncode17x3x16M-8 53734221 51784418 -3.63%
BenchmarkEncode_8x4x8M-8 16826004 16013691 -4.83%
BenchmarkEncode_12x4x12M-8 37544203 36392439 -3.07%
BenchmarkEncode_16x4x16M-8 66070450 69062838 +4.53%
BenchmarkEncode_16x4x32M-8 133905200 130529500 -2.52%
BenchmarkEncode_16x4x64M-8 281313400 265809900 -5.51%
BenchmarkEncode_8x5x8M-8 20789000 19866553 -4.44%
BenchmarkEncode_8x6x8M-8 25027385 25087290 +0.24%
BenchmarkEncode_8x7x8M-8 29156578 28231372 -3.17%
BenchmarkEncode_8x9x8M-8 37286413 37383431 +0.26%
BenchmarkEncode_8x10x8M-8 41722722 39786752 -4.64%
BenchmarkEncode_8x11x8M-8 45692118 43409812 -4.99%
BenchmarkEncode_8x8x05M-8 2358946 2298631 -2.56%
BenchmarkEncode_8x8x1M-8 4551026 4357599 -4.25%
BenchmarkEncode_8x8x8M-8 33596074 31951653 -4.89%
BenchmarkEncode_8x8x32M-8 135030488 127382850 -5.66%
BenchmarkEncode_24x8x24M-8 297317050 301777575 +1.50%
BenchmarkEncode_24x8x48M-8 611638100 596134400 -2.53%
BenchmarkVerify10x2x10000-8 103723 103523 -0.19%
BenchmarkVerify50x5x50000-8 2170780 2148170 -1.04%
BenchmarkVerify10x2x1M-8 1693351 1676973 -0.97%
BenchmarkVerify5x2x1M-8 997721 995888 -0.18%
BenchmarkVerify10x4x1M-8 3354687 3296939 -1.72%
BenchmarkVerify50x20x1M-8 67491300 66890056 -0.89%
BenchmarkVerify10x4x16M-8 44195152 44356146 +0.36%
BenchmarkReconstruct10x2x10000-8 24720 23373 -5.45%
BenchmarkReconstruct50x5x50000-8 880988 858684 -2.53%
BenchmarkReconstruct10x2x1M-8 387655 368900 -4.84%
BenchmarkReconstruct5x2x1M-8 191067 175841 -7.97%
BenchmarkReconstruct10x4x1M-8 1040639 1004731 -3.45%
BenchmarkReconstruct50x20x1M-8 28507103 28467956 -0.14%
BenchmarkReconstruct10x4x16M-8 15829872 15225654 -3.82%
BenchmarkReconstructData10x2x10000-8 24369 23374 -4.08%
BenchmarkReconstructData50x5x50000-8 865039 852456 -1.45%
BenchmarkReconstructData10x2x1M-8 383240 366751 -4.30%
BenchmarkReconstructData5x2x1M-8 183644 170444 -7.19%
BenchmarkReconstructData10x4x1M-8 1010537 969151 -4.10%
BenchmarkReconstructData50x20x1M-8 28288428 28051051 -0.84%
BenchmarkReconstructData10x4x16M-8 15048840 14443250 -4.02%
BenchmarkReconstructP10x2x10000-8 3219 3122 -3.01%
BenchmarkReconstructP10x5x20000-8 23574 22704 -3.69%
BenchmarkSplit10x4x160M-8 2822150 2735071 -3.09%
BenchmarkSplit5x2x5M-8 409699 311346 -24.01%
BenchmarkSplit10x2x1M-8 43767 40247 -8.04%
BenchmarkSplit10x4x10M-8 741097 566888 -23.51%
BenchmarkSplit50x20x50M-8 1913475 1682060 -12.09%
BenchmarkSplit17x3x272M-8 2059505 2095628 +1.75%
BenchmarkStreamEncode10x2x10000-8 8517255 5226284 -38.64%
BenchmarkStreamEncode100x20x10000-8 41903836 40969212 -2.23%
BenchmarkStreamEncode17x3x1M-8 12038007 14129765 +17.38%
BenchmarkStreamEncode10x4x16M-8 56512840 54821895 -2.99%
BenchmarkStreamEncode5x2x1M-8 5326508 3966411 -25.53%
BenchmarkStreamEncode10x2x1M-8 6924358 6589396 -4.84%
BenchmarkStreamEncode10x4x1M-8 9016080 8459049 -6.18%
BenchmarkStreamEncode50x20x1M-8 93583042 94021200 +0.47%
BenchmarkStreamEncode17x3x16M-8 76643714 74750193 -2.47%
BenchmarkStreamVerify10x2x10000-8 8311646 5162179 -37.89%
BenchmarkStreamVerify50x5x50000-8 19015944 18352626 -3.49%
BenchmarkStreamVerify10x2x1M-8 5738380 5441592 -5.17%
BenchmarkStreamVerify5x2x1M-8 3462751 3328057 -3.89%
BenchmarkStreamVerify10x4x1M-8 6735717 6381116 -5.26%
BenchmarkStreamVerify50x20x1M-8 29844543 29416921 -1.43%
BenchmarkStreamVerify10x4x16M-8 8512699 8375778 -1.61%
benchmark old MB/s new MB/s speedup
BenchmarkParallel_8x8x05M-8 1402.38 1517.72 1.08x
BenchmarkParallel_8x8x1M-8 1697.56 1657.30 0.98x
BenchmarkParallel_8x8x8M-8 1958.94 2021.81 1.03x
BenchmarkParallel_8x8x32M-8 1875.11 2038.94 1.09x
BenchmarkGalois128K-8 2041.59 3415.64 1.67x
BenchmarkGalois1M-8 2067.98 3412.93 1.65x
BenchmarkGaloisXor128K-8 2053.92 2075.33 1.01x
BenchmarkGaloisXor1M-8 2070.77 2073.76 1.00x
BenchmarkEncode10x2x10000-8 1037.19 1077.81 1.04x
BenchmarkEncode100x20x10000-8 313.62 308.80 0.98x
BenchmarkEncode17x3x1M-8 4764.54 4905.91 1.03x
BenchmarkEncode10x4x16M-8 4030.21 4162.45 1.03x
BenchmarkEncode5x2x1M-8 7239.93 7499.07 1.04x
BenchmarkEncode10x2x1M-8 7224.58 7367.61 1.02x
BenchmarkEncode10x4x1M-8 3692.97 3826.57 1.04x
BenchmarkEncode50x20x1M-8 766.33 782.34 1.02x
BenchmarkEncode17x3x16M-8 5307.84 5507.69 1.04x
BenchmarkEncode_8x4x8M-8 3988.40 4190.72 1.05x
BenchmarkEncode_12x4x12M-8 4021.79 4149.07 1.03x
BenchmarkEncode_16x4x16M-8 4062.87 3886.83 0.96x
BenchmarkEncode_16x4x32M-8 4009.34 4113.02 1.03x
BenchmarkEncode_16x4x64M-8 3816.89 4039.51 1.06x
BenchmarkEncode_8x5x8M-8 3228.09 3377.98 1.05x
BenchmarkEncode_8x6x8M-8 2681.42 2675.01 1.00x
BenchmarkEncode_8x7x8M-8 2301.67 2377.10 1.03x
BenchmarkEncode_8x9x8M-8 1799.82 1795.15 1.00x
BenchmarkEncode_8x10x8M-8 1608.45 1686.71 1.05x
BenchmarkEncode_8x11x8M-8 1468.72 1545.94 1.05x
BenchmarkEncode_8x8x05M-8 1778.04 1824.70 1.03x
BenchmarkEncode_8x8x1M-8 1843.23 1925.05 1.04x
BenchmarkEncode_8x8x8M-8 1997.52 2100.33 1.05x
BenchmarkEncode_8x8x32M-8 1987.96 2107.31 1.06x
BenchmarkEncode_24x8x24M-8 2031.43 2001.41 0.99x
BenchmarkEncode_24x8x48M-8 1974.96 2026.32 1.03x
BenchmarkVerify10x2x10000-8 964.10 965.97 1.00x
BenchmarkVerify50x5x50000-8 2303.32 2327.56 1.01x
BenchmarkVerify10x2x1M-8 6192.31 6252.79 1.01x
BenchmarkVerify5x2x1M-8 5254.86 5264.53 1.00x
BenchmarkVerify10x4x1M-8 3125.70 3180.45 1.02x
BenchmarkVerify50x20x1M-8 776.82 783.81 1.01x
BenchmarkVerify10x4x16M-8 3796.17 3782.39 1.00x
BenchmarkReconstruct10x2x10000-8 4045.30 4278.40 1.06x
BenchmarkReconstruct50x5x50000-8 5675.45 5822.87 1.03x
BenchmarkReconstruct10x2x1M-8 27049.21 28424.40 1.05x
BenchmarkReconstruct5x2x1M-8 27440.02 29815.96 1.09x
BenchmarkReconstruct10x4x1M-8 10076.27 10436.39 1.04x
BenchmarkReconstruct50x20x1M-8 1839.15 1841.68 1.00x
BenchmarkReconstruct10x4x16M-8 10598.45 11019.04 1.04x
BenchmarkReconstructData10x2x10000-8 4103.60 4278.25 1.04x
BenchmarkReconstructData50x5x50000-8 5780.09 5865.40 1.01x
BenchmarkReconstructData10x2x1M-8 27360.79 28590.95 1.04x
BenchmarkReconstructData5x2x1M-8 28549.19 30760.16 1.08x
BenchmarkReconstructData10x4x1M-8 10376.42 10819.53 1.04x
BenchmarkReconstructData50x20x1M-8 1853.37 1869.05 1.01x
BenchmarkReconstructData10x4x16M-8 11148.51 11615.96 1.04x
BenchmarkReconstructP10x2x10000-8 31068.70 32026.22 1.03x
BenchmarkReconstructP10x5x20000-8 8484.08 8808.93 1.04x
BenchmarkStreamEncode10x2x10000-8 11.74 19.13 1.63x
BenchmarkStreamEncode100x20x10000-8 23.86 24.41 1.02x
BenchmarkStreamEncode17x3x1M-8 1480.79 1261.58 0.85x
BenchmarkStreamEncode10x4x16M-8 2968.74 3060.31 1.03x
BenchmarkStreamEncode5x2x1M-8 984.30 1321.82 1.34x
BenchmarkStreamEncode10x2x1M-8 1514.33 1591.31 1.05x
BenchmarkStreamEncode10x4x1M-8 1163.01 1239.59 1.07x
BenchmarkStreamEncode50x20x1M-8 560.24 557.63 1.00x
BenchmarkStreamEncode17x3x16M-8 3721.28 3815.54 1.03x
BenchmarkStreamVerify10x2x10000-8 12.03 19.37 1.61x
BenchmarkStreamVerify50x5x50000-8 262.94 272.44 1.04x
BenchmarkStreamVerify10x2x1M-8 1827.30 1926.97 1.05x
BenchmarkStreamVerify5x2x1M-8 1514.08 1575.36 1.04x
BenchmarkStreamVerify10x4x1M-8 1556.74 1643.25 1.06x
BenchmarkStreamVerify50x20x1M-8 1756.73 1782.27 1.01x
BenchmarkStreamVerify10x4x16M-8 19708.46 20030.64 1.02x
```
2020-05-03 19:38:55 +02:00
Klaus Post
65df535980
Make single goroutine encodes more efficient ( #122 )
...
Calculate the optimal per round size to keep data in cache when not using WithAutoGoroutines.
```
λ benchcmp before.txt after.txt
benchmark old ns/op new ns/op delta
BenchmarkParallel_8x8x05M-16 675225 321053 -52.45%
BenchmarkParallel_20x10x05M-16 3471988 600740 -82.70%
BenchmarkParallel_8x8x1M-16 3948606 728093 -81.56%
BenchmarkParallel_8x8x8M-16 47361588 5976467 -87.38%
BenchmarkParallel_8x8x32M-16 195044200 24365474 -87.51%
benchmark old MB/s new MB/s speedup
BenchmarkParallel_8x8x05M-16 6211.71 13064.22 2.10x
BenchmarkParallel_20x10x05M-16 3020.10 17454.73 5.78x
BenchmarkParallel_8x8x1M-16 2124.45 11521.34 5.42x
BenchmarkParallel_8x8x8M-16 1416.95 11228.85 7.92x
BenchmarkParallel_8x8x32M-16 1376.28 11017.04 8.00x
```
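One way to picture the per-round size calculation, as a hedged sketch only. The function `perRound`, its formula, and the cache and alignment values are assumptions for illustration and may differ from what the commit actually does:

```go
package main

import "fmt"

// perRound returns how many bytes of each shard to process per pass so that
// one slice of every shard fits in cacheSize together. Hypothetical formula:
// divide the cache evenly between shards and round down to the alignment.
func perRound(cacheSize, totalShards, align int) int {
	n := cacheSize / totalShards
	n -= n % align
	if n < align {
		n = align
	}
	return n
}

func main() {
	// 32 KiB cache, 8 data + 8 parity shards, 64-byte alignment (all assumed).
	fmt.Println(perRound(32<<10, 16, 64)) // 2048
}
```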
2020-05-03 19:37:22 +02:00
Frank Wessels
0b98f5350a
Refactor AVX512 code to use Go assembly instructions. ( #121 )
...
Additionally there is a small performance improvement using VPTERNLOGD (instead of two VPXORD instructions).
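To show why one ternary-logic instruction can replace two XORs, here is a small Go emulation of the per-bit truth-table lookup. `ternlog` is a hypothetical helper, and the immediate 0x96 (the three-input XOR truth table) is not stated in the commit; it is used here as an assumption:

```go
package main

import "fmt"

// ternlog emulates a bitwise ternary-logic operation on 32-bit values:
// imm is an 8-entry truth table indexed by the bits of a, b and c.
func ternlog(a, b, c uint32, imm uint8) uint32 {
	var out uint32
	for i := uint(0); i < 32; i++ {
		idx := ((a>>i)&1)<<2 | ((b>>i)&1)<<1 | (c>>i)&1
		out |= uint32(imm>>idx&1) << i
	}
	return out
}

func main() {
	a, b, c := uint32(0xF0F0F0F0), uint32(0xCCCCCCCC), uint32(0xAAAAAAAA)
	// Truth table 0x96 has bits set where an odd number of inputs are 1.
	fmt.Println(ternlog(a, b, c, 0x96) == a^b^c) // true
}
```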
2020-05-03 13:43:52 +02:00
Klaus Post
17098a4f19
Use stream test options ( #118 )
2020-04-22 17:22:16 +02:00
Klaus Post
c3634dce94
Use CPU cache to set minSplitSize ( #117 )
...
Use L1 cache size to set default split size.
2020-04-22 16:12:18 +02:00
Klaus Post
d2cfcb8065
Add commandline arg to disable asm for tests. ( #116 )
...
* Add commandline test args
2020-04-22 15:38:21 +02:00
Klaus Post
0abe9de20c
Update tests ( #115 )
...
Don't create new slices.
2020-02-21 11:30:44 -08:00
Klaus Post
101092fa3b
Make AVX512 short tests ( #114 )
...
Tests are timing out. Use shorter tests for -short.
2020-01-18 14:50:31 +01:00
Klaus Post
70d6279761
Update travis script
2019-09-27 16:33:57 -07:00
Christian Muehlhaeuser
4681100338
Removed unused struct members ( #106 )
...
creads & cwrites both seem to be unused.
2019-09-27 16:31:11 -07:00
Christian Muehlhaeuser
993c27a5ba
Avoid unnecessary conversion ( #107 )
...
No need to convert to byte here.
2019-09-27 16:30:54 -07:00
Andreas Auernhammer
1f1369aa84
limit capacity of shards to shard size ( #109 )
...
This commit limits the capacity (in addition
to the length) of each shard to the shard size.
Before this change, the following code behaved in
an unexpected way:
```
shards, err := encoder.Split(buffer)
// ... (handle err)
shards[0] = shards[0][:cap(shards[0])]
```
Instead of restoring the length of `shards[0]` to
the shard size, it assigns the entire memory of `buffer`
to `shards[0]`.
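The effect of capping capacity can be sketched with Go's full (three-index) slice expression. `splitCapped` is a hypothetical helper, not the package's `Split`, and it assumes `len(buf)` is a multiple of `n`:

```go
package main

import "fmt"

// splitCapped cuts buf into n equal shards and caps each shard's capacity
// at its length using a full (three-index) slice expression.
// Hypothetical helper; len(buf) is assumed to be a multiple of n.
func splitCapped(buf []byte, n int) [][]byte {
	size := len(buf) / n
	shards := make([][]byte, n)
	for i := range shards {
		shards[i] = buf[i*size : (i+1)*size : (i+1)*size]
	}
	return shards
}

func main() {
	buf := make([]byte, 12)
	shards := splitCapped(buf, 3)

	// Re-slicing to capacity now stays within the shard.
	s := shards[0][:cap(shards[0])]
	fmt.Println(len(s), cap(s)) // 4 4
}
```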
2019-09-27 16:30:26 -07:00
dssysolyatin
7890684129
Improve quick check for case when dataOnly is true ( #105 )
2019-06-25 16:30:44 +02:00
dssysolyatin
ec2eb9fb8c
Split: Reduce memory allocation ( #103 )
...
* [Split] Reduce memory allocation in Split function
2019-06-25 16:28:24 +02:00
Klaus Post
0883d2f011
Only enable AVX512 on AMD64
...
Fixes #102
2019-05-26 12:12:55 +02:00