Commit Graph

16 Commits (jerasure-matrix)

Author SHA1 Message Date
Klaus Post 3a82d28edb
Add GF16 AVX2, AVX512 and SSSE3 (#193)
* Add GF16 AVX2
* Add SSSE3 fallback.
* Fix reconstruction being skipped if the first shard was empty.
* Combine lookups in pure Go
* Faster xor on pure Go.
* Add 4way butterfly AVX2.
* Add fftDIT4 avx2. Add avx512 version. Add noescape.
* Remove +build space. Do size-varied 800x200 bench.
* Use VPTERNLOGD for avx512.
* Remove refMulAdd inner loop bounds checks. ~10-20% faster
2022-07-26 12:37:28 +02:00
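The reconstruction fix in #193 is easiest to see through the package's public API. The sketch below uses the documented New/Encode/Reconstruct calls to drop the first shard and rebuild it from the rest; it illustrates the flow the fix concerns, not the internals of the change.

```
package main

import (
	"bytes"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(10, 3) // 10 data + 3 parity shards
	if err != nil {
		log.Fatal(err)
	}

	shards := make([][]byte, 13)
	for i := range shards {
		shards[i] = make([]byte, 1024)
	}
	shards[0][0] = 42 // put something in the first data shard

	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Drop the first shard and rebuild it from the remaining shards.
	want := append([]byte(nil), shards[0]...)
	shards[0] = nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	if !bytes.Equal(shards[0], want) {
		log.Fatal("reconstruction mismatch")
	}
}
```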
Klaus Post 5593e2b2dd
Unroll pure go xor loop (#172)
* Unroll pure go xor loop

Testing with `go test -bench=x1x -tags=noasm -short`

```
before:
BenchmarkEncode2x1x1M-32           13658             87980 ns/op        35754.96 MB/s

after:
BenchmarkEncode2x1x1M-32           21633             55498 ns/op        56682.24 MB/s
```
2021-12-02 16:39:56 +01:00
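What the unrolling amounts to in the pure-Go path: process a few 64-bit words per iteration and mop up the tail byte by byte. The sketch below only illustrates the technique; the actual loop in the package differs.

```
package xorsketch

import "encoding/binary"

// xorSlice XORs src into dst, two 8-byte words per iteration, with a
// scalar loop for the remaining bytes.
func xorSlice(dst, src []byte) {
	n := len(dst)
	if len(src) < n {
		n = len(src)
	}
	i := 0
	for ; i+16 <= n; i += 16 {
		binary.LittleEndian.PutUint64(dst[i:],
			binary.LittleEndian.Uint64(dst[i:])^binary.LittleEndian.Uint64(src[i:]))
		binary.LittleEndian.PutUint64(dst[i+8:],
			binary.LittleEndian.Uint64(dst[i+8:])^binary.LittleEndian.Uint64(src[i+8:]))
	}
	for ; i < n; i++ {
		dst[i] ^= src[i]
	}
}
```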
Klaus Post 7761c8f7cd
Use Workflows (#169)
* Use Workflows
* Go 1.17 build tags
* Do races separately.
2021-09-01 18:55:02 +02:00
Klaus Post cb7a0b5aef
Do fast by-one multiplication (#130)
When multiplying by one we can use faster math.
2020-05-06 11:14:25 +02:00
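In GF(2^8), multiplication by one is the identity, so the per-byte table lookup can be skipped: a multiply becomes a copy, and a multiply-and-add becomes a plain XOR. A sketch of the idea with a hypothetical helper; the library's real code path is its own.

```
// galMulSliceSketch multiplies every byte of in by c into out.
// For c == 1 the table lookup is pointless: the result is just a copy.
// (For the multiply-and-add variant, c == 1 reduces to out[i] ^= in[i].)
func galMulSliceSketch(c byte, in, out []byte, mulTable *[256][256]byte) {
	if c == 1 {
		copy(out, in)
		return
	}
	mt := mulTable[c][:]
	for i, v := range in {
		out[i] = mt[v]
	}
}
```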
Klaus Post f525ef0450
Clean up build tags (#126)
Move non-amd64 code to a separate file and remove references in other files.

Fixes #125
2020-05-04 20:06:47 +02:00
Klaus Post de70cc155f
AVX512 parallel processing (#120)
Do concurrent processing in AVX512 mode and split jobs by cache size.
2020-05-04 09:17:40 +02:00
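A rough sketch of the splitting idea: break the shard length into chunks small enough to stay cache-resident and hand each chunk to its own goroutine. The chunk size and the worker callback below are illustrative, not the library's actual values.

```
package chunksketch

import "sync"

// processConcurrent encodes byte columns [start:stop) of every shard on
// separate goroutines. encodeChunk is a hypothetical worker; chunkSize is
// an illustrative cache-friendly value.
func processConcurrent(shardLen int, encodeChunk func(start, stop int)) {
	const chunkSize = 512 << 10
	var wg sync.WaitGroup
	for start := 0; start < shardLen; start += chunkSize {
		stop := start + chunkSize
		if stop > shardLen {
			stop = shardLen
		}
		wg.Add(1)
		go func(start, stop int) {
			defer wg.Done()
			encodeChunk(start, stop)
		}(start, stop)
	}
	wg.Wait()
}
```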
Klaus Post d2cfcb8065
Add commandline arg to disable asm for tests. (#116)
* Add commandline test args
2020-04-22 15:38:21 +02:00
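A sketch of how such a test flag can look: a package-level flag in a _test.go file that routes constructors through options disabling the assembler paths. The flag and option names here are assumptions; the repo's actual test flags may differ.

```
package reedsolomon_test

import (
	"flag"

	"github.com/klauspost/reedsolomon"
)

// Illustrative test flag; the actual flag and option names in the repo differ.
var noAsm = flag.Bool("no-asm", false, "disable assembler code paths in tests")

func testOptions() []reedsolomon.Option {
	if *noAsm {
		return []reedsolomon.Option{
			reedsolomon.WithSSSE3(false), // assumption: testing options like these exist
			reedsolomon.WithAVX2(false),
		}
	}
	return nil
}
```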
Klaus Post a9588190c0
Optimize pure Go version. (#96)
* Optimize pure Go version.
* Update docs. Add Go 1.12 CI

* Avoid dst bounds check when using noasm ~ 40-50% faster.
* Convert multiply table to a slice whenever used.
* Split on 32 byte boundaries instead of 16 byte.
2019-03-08 10:49:27 +01:00
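The "avoid dst bounds check" item is the standard Go bounds-check-elimination trick: reslice the destination to the loop bound before the loop so the compiler can prove every index is in range. A minimal sketch with hypothetical names; the real loop in the package is more involved.

```
// galMulSliceXorRef multiplies each byte of in by c and XORs it into out.
// Reslicing out to len(in) before the loop lets the compiler prove out[n]
// is always in range and drop the per-iteration bounds check.
func galMulSliceXorRef(c byte, in, out []byte, mulTable *[256][256]byte) {
	mt := mulTable[c][:]
	out = out[:len(in)] // single up-front check instead of one per byte
	for n, v := range in {
		out[n] ^= mt[v]
	}
}
```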
Frank Wessels 79aee05119 AVX512 accelerated version resulting in a 4x speed improvement over AVX2 (#91)
This adds an AVX512-accelerated code path for Intel CPUs, giving per-core speedups of up to 4x compared to AVX2, as shown in the following table:

```
$ benchcmp avx2.txt avx512.txt
benchmark                      AVX2 MB/s    AVX512 MB/s   speedup
BenchmarkEncode8x8x1M-72       1681.35      4125.64       2.45x
BenchmarkEncode8x4x8M-72       1529.36      5507.97       3.60x
BenchmarkEncode8x8x8M-72        791.16      2952.29       3.73x
BenchmarkEncode8x8x32M-72       573.26      2168.61       3.78x
BenchmarkEncode12x4x12M-72     1234.41      4912.37       3.98x
BenchmarkEncode16x4x16M-72     1189.59      5138.01       4.32x
BenchmarkEncode24x8x24M-72      690.68      2583.70       3.74x
BenchmarkEncode24x8x48M-72      674.20      2643.31       3.92x
```
2019-02-10 11:17:23 +01:00
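An AVX512 path has to be gated at runtime. The library does its own CPU feature detection; the sketch below only shows how such gating typically looks using golang.org/x/sys/cpu, and the exact feature set an encode loop needs is an assumption.

```
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// hasAVX512 reports whether the base AVX512 features an accelerated
// encode loop would typically need are present.
func hasAVX512() bool {
	return cpu.X86.HasAVX512F && cpu.X86.HasAVX512BW && cpu.X86.HasAVX512VL
}

func main() {
	if hasAVX512() {
		fmt.Println("using AVX512 code path")
	} else {
		fmt.Println("falling back to AVX2/SSSE3/pure Go")
	}
}
```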
Frank Wessels 8885f3a1c7 Feature/ppc support (#88)
Add accelerated PPC support.
2018-12-18 20:39:59 +01:00
Klaus Post 454fd91890
Maintenance updates. (#86)
* Add gcc go build tags.
* Update Travis.
* Fix typo
2018-11-12 13:25:55 +01:00
Frank Wessels 7b88f42e61 Add NEON support for ARM64 (#62)
* Add support for arm64 using NEON instructions

Specifically, it uses the PMULL/PMULL2 polynomial multiplication instructions, followed by a reduction step (actually two steps).

* Add ARM performance numbers

* Formatting for performance table

* Refactoring of NEON version and 256-bit wide version

* Expand test slice beyond 32 (for AVX2 and NEON) and test galMulSliceXor explicitly.

* Fix ARM code that was missing a function.

* Fix missing newline
2017-08-26 11:47:42 +02:00
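The PMULL-then-reduce sequence is the hardware form of ordinary GF(2^8) multiplication: first a carry-less (polynomial) multiply, then reduction modulo the field polynomial. A pure-Go worked equivalent, assuming the 0x11D polynomial commonly used for Reed-Solomon:

```
package gfsketch

// gfMul multiplies a and b in GF(2^8), mirroring the two NEON steps:
// a carry-less multiply followed by reduction modulo x^8+x^4+x^3+x^2+1 (0x11D).
func gfMul(a, b byte) byte {
	var prod uint16
	for i := uint(0); i < 8; i++ { // carry-less (polynomial) multiplication
		if b&(1<<i) != 0 {
			prod ^= uint16(a) << i
		}
	}
	for i := 15; i >= 8; i-- { // reduction step(s)
		if prod&(1<<uint(i)) != 0 {
			prod ^= 0x11D << uint(i-8)
		}
	}
	return byte(prod)
}
```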
chenzhongtao d78bf472d8 Add Update parity function (#60)
Add Update parity function
2017-08-20 11:42:39 +02:00
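The point of Update is to refresh parity from just the data shards that changed, instead of re-encoding everything. A usage sketch, assuming the Update(shards, newDatashards) form added here, where unchanged data shards are passed as nil:

```
package main

import (
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(10, 3)
	if err != nil {
		log.Fatal(err)
	}

	shards := make([][]byte, 13)
	for i := range shards {
		shards[i] = make([]byte, 1024)
	}
	if err := enc.Encode(shards); err != nil { // initial parity
		log.Fatal(err)
	}

	// Only data shard 2 changes; all other entries stay nil.
	newData := make([][]byte, 10)
	newData[2] = make([]byte, 1024)
	newData[2][0] = 0xFF

	// Parity shards in `shards` are rewritten using just the delta.
	if err := enc.Update(shards, newData); err != nil {
		log.Fatal(err)
	}
}
```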
Klaus Post 5abf0ee302 Add options (#46)
* Add options

Make constants changeable as options.

The API remains backwards compatible.

* Update documentation.

* Fix line endings

* fmt

* fmt

* Use functions for parameters.

Much neater.
2017-02-19 11:13:22 +01:00
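"Use functions for parameters" is the functional-options pattern, which is also why the API stays backwards compatible: the variadic opts can simply be omitted. A generic sketch of the shape; the names and defaults below are illustrative, not the package's actual option set.

```
package optsketch

// options collects tunables that used to be package-level constants.
type options struct {
	maxGoroutines int
	minSplitSize  int
}

// Option mutates the configuration before the encoder is built.
type Option func(*options)

// WithMaxGoroutines caps the number of goroutines used per call.
func WithMaxGoroutines(n int) Option {
	return func(o *options) { o.maxGoroutines = n }
}

type Encoder struct{ opt options }

// New keeps its old two-argument form; options are strictly additive.
func New(dataShards, parityShards int, opts ...Option) (*Encoder, error) {
	o := options{maxGoroutines: 256, minSplitSize: 512} // illustrative defaults
	for _, opt := range opts {
		opt(&o)
	}
	return &Encoder{opt: o}, nil
}
```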
Klaus Post f1c2cf4160 Don't use assembler on app engine. 2015-06-21 22:54:13 +02:00
Klaus Post 5aa37c3492 Add AMD64 SSE3 Galois multiplication. Approximately 5-10x faster.
```
benchmark                         old MB/s     new MB/s     speedup
BenchmarkEncode10x2x10000         333.31       5827.17      17.48x
BenchmarkEncode10x2x10000-2       431.20       2802.53      6.50x
BenchmarkEncode10x2x10000-4       553.98       2432.95      4.39x
BenchmarkEncode10x2x10000-8       585.79       3469.61      5.92x
BenchmarkEncode100x20x10000       32.59        583.40       17.90x
BenchmarkEncode100x20x10000-2     59.52        726.70       12.21x
BenchmarkEncode100x20x10000-4     108.04       1363.25      12.62x
BenchmarkEncode100x20x10000-8     113.76       1274.62      11.20x
BenchmarkEncode17x3x1M            215.28       3141.85      14.59x
BenchmarkEncode17x3x1M-2          398.76       3650.12      9.15x
BenchmarkEncode17x3x1M-4          655.32       6071.11      9.26x
BenchmarkEncode17x3x1M-8          832.16       6616.47      7.95x
BenchmarkEncode10x4x16M           154.48       1357.30      8.79x
BenchmarkEncode10x4x16M-2         295.62       2377.92      8.04x
BenchmarkEncode10x4x16M-4         529.89       3519.49      6.64x
BenchmarkEncode10x4x16M-8         632.11       4521.90      7.15x
BenchmarkEncode5x2x1M             327.87       4879.09      14.88x
BenchmarkEncode5x2x1M-2           576.11       2599.20      4.51x
BenchmarkEncode5x2x1M-4           1043.65      3559.12      3.41x
BenchmarkEncode5x2x1M-8           1227.77      4255.34      3.47x
BenchmarkEncode10x2x1M            321.24       4574.68      14.24x
BenchmarkEncode10x2x1M-2          587.73       3100.28      5.28x
BenchmarkEncode10x2x1M-4          1101.96      4770.32      4.33x
BenchmarkEncode10x2x1M-8          1217.08      5812.17      4.78x
BenchmarkEncode10x4x1M            155.34       2037.27      13.11x
BenchmarkEncode10x4x1M-2          298.38       2470.97      8.28x
BenchmarkEncode10x4x1M-4          548.67       3603.15      6.57x
BenchmarkEncode10x4x1M-8          625.23       4827.42      7.72x
BenchmarkEncode50x20x1M           31.37        347.65       11.08x
BenchmarkEncode50x20x1M-2         59.81        713.28       11.93x
BenchmarkEncode50x20x1M-4         105.34       1175.47      11.16x
BenchmarkEncode50x20x1M-8         123.84       1491.91      12.05x
BenchmarkEncode17x3x16M           209.55       1861.59      8.88x
BenchmarkEncode17x3x16M-2         394.19       3331.73      8.45x
BenchmarkEncode17x3x16M-4         643.30       4942.74      7.68x
BenchmarkEncode17x3x16M-8         839.64       6213.43      7.40x
```
2015-06-21 21:23:22 +02:00
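In scalar terms, the accelerated multiply splits each byte into low and high nibbles and XORs two 16-entry table lookups; PSHUFB simply performs those lookups on 16 bytes at a time, which is where the 5-10x comes from. A sketch of the nibble-table math, reusing the gfMul helper sketched above (the table-building names here are hypothetical):

```
// buildNibbleTables precomputes c*v and c*(v<<4) for all 16 nibble values,
// using the gfMul reference multiply from the GF(2^8) sketch above.
func buildNibbleTables(c byte) (low, high [16]byte) {
	for v := 0; v < 16; v++ {
		low[v] = gfMul(c, byte(v))
		high[v] = gfMul(c, byte(v<<4))
	}
	return
}

// galMulSliceNibble multiplies every byte of in by c. Since multiplication by
// a constant is linear over GF(2), c*b = low[b&0xF] ^ high[b>>4]; the SIMD
// version performs exactly these lookups 16 bytes per PSHUFB.
func galMulSliceNibble(c byte, in, out []byte) {
	low, high := buildNibbleTables(c)
	out = out[:len(in)]
	for i, b := range in {
		out[i] = low[b&0xF] ^ high[b>>4]
	}
}
```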