Reed-Solomon Erasure Coding in Go
 
 
Klaus Post 1bb4d699e1
avx2: Improve speed when > 10 input or output shards. (#174)
Speeds include a limit on the number of goroutines for all AVX2 paths.

Before/after
```
benchmark                                 old ns/op     new ns/op     delta
BenchmarkGalois128K-32                    2240          2240          +0.00%
BenchmarkGalois1M-32                      19578         18891         -3.51%
BenchmarkGaloisXor128K-32                 2798          2852          +1.93%
BenchmarkGaloisXor1M-32                   23334         23345         +0.05%
BenchmarkEncode2x1x1M-32                  34357         34370         +0.04%
BenchmarkEncode10x2x10000-32              3210          3093          -3.64%
BenchmarkEncode100x20x10000-32            362925        148214        -59.16%
BenchmarkEncode17x3x1M-32                 323767        224157        -30.77%
BenchmarkEncode10x4x16M-32                8376895       8376737       -0.00%
BenchmarkEncode5x2x1M-32                  68365         66861         -2.20%
BenchmarkEncode10x2x1M-32                 101407        93023         -8.27%
BenchmarkEncode10x4x1M-32                 171880        155477        -9.54%
BenchmarkEncode50x20x1M-32                3704691       3015047       -18.62%
BenchmarkEncode17x3x16M-32                10279233      10106658      -1.68%
BenchmarkEncode_8x4x8M-32                 3438245       3326479       -3.25%
BenchmarkEncode_12x4x12M-32               6632257       6581637       -0.76%
BenchmarkEncode_16x4x16M-32               10815755      10788377      -0.25%
BenchmarkEncode_16x4x32M-32               21029061      21507995      +2.28%
BenchmarkEncode_16x4x64M-32               42145450      43876850      +4.11%
BenchmarkEncode_8x5x8M-32                 4543208       3846378       -15.34%
BenchmarkEncode_8x6x8M-32                 5065494       4397218       -13.19%
BenchmarkEncode_8x7x8M-32                 5818995       4962884       -14.71%
BenchmarkEncode_8x9x8M-32                 6215449       6114898       -1.62%
BenchmarkEncode_8x10x8M-32                6923415       6610501       -4.52%
BenchmarkEncode_8x11x8M-32                7365988       7010473       -4.83%
BenchmarkEncode_8x8x05M-32                150857        136820        -9.30%
BenchmarkEncode_8x8x1M-32                 256722        254854        -0.73%
BenchmarkEncode_8x8x8M-32                 5547790       5422048       -2.27%
BenchmarkEncode_8x8x32M-32                23038643      22705859      -1.44%
BenchmarkEncode_24x8x24M-32               27729259      30332216      +9.39%
BenchmarkEncode_24x8x48M-32               53865705      61187658      +13.59%
BenchmarkVerify10x2x10000-32              8769          8154          -7.01%
BenchmarkVerify10x2x1M-32                 516149        476180        -7.74%
BenchmarkVerify5x2x1M-32                  443888        419541        -5.48%
BenchmarkVerify10x4x1M-32                 1030299       948021        -7.99%
BenchmarkVerify50x20x1M-32                7209689       6186891       -14.19%
BenchmarkVerify10x4x16M-32                17774456      17681879      -0.52%
BenchmarkReconstruct10x2x10000-32         3352          3256          -2.86%
BenchmarkReconstruct50x5x50000-32         166417        140900        -15.33%
BenchmarkReconstruct10x2x1M-32            189711        174615        -7.96%
BenchmarkReconstruct5x2x1M-32             128080        126520        -1.22%
BenchmarkReconstruct10x4x1M-32            273312        254017        -7.06%
BenchmarkReconstruct50x20x1M-32           3628812       3192474       -12.02%
BenchmarkReconstruct10x4x16M-32           8562186       8781479       +2.56%
BenchmarkReconstructData10x2x10000-32     3241          3116          -3.86%
BenchmarkReconstructData50x5x50000-32     162520        134794        -17.06%
BenchmarkReconstructData10x2x1M-32        171253        161955        -5.43%
BenchmarkReconstructData5x2x1M-32         102215        106942        +4.62%
BenchmarkReconstructData10x4x1M-32        225593        219969        -2.49%
BenchmarkReconstructData50x20x1M-32       2515311       2129721       -15.33%
BenchmarkReconstructData10x4x16M-32       6980308       6698111       -4.04%
BenchmarkReconstructP10x2x10000-32        924           937           +1.35%
BenchmarkReconstructP10x5x20000-32        1639          1703          +3.90%
BenchmarkSplit10x4x160M-32                4984993       4898045       -1.74%
BenchmarkSplit5x2x5M-32                   380415        221446        -41.79%
BenchmarkSplit10x2x1M-32                  58761         53335         -9.23%
BenchmarkSplit10x4x10M-32                 643188        410959        -36.11%
BenchmarkSplit50x20x50M-32                1843879       1647205       -10.67%
BenchmarkSplit17x3x272M-32                3684920       3613951       -1.93%
BenchmarkParallel_8x8x64K-32              7022          6630          -5.58%
BenchmarkParallel_8x8x05M-32              348308        348369        +0.02%
BenchmarkParallel_20x10x05M-32            575672        581028        +0.93%
BenchmarkParallel_8x8x1M-32               716033        697167        -2.63%
BenchmarkParallel_8x8x8M-32               5716048       5616437       -1.74%
BenchmarkParallel_8x8x32M-32              22650878      22098667      -2.44%
BenchmarkParallel_8x3x1M-32               406839        399125        -1.90%
BenchmarkParallel_8x4x1M-32               459107        463890        +1.04%
BenchmarkParallel_8x5x1M-32               527488        520334        -1.36%
BenchmarkStreamEncode10x2x10000-32        6013          5878          -2.25%
BenchmarkStreamEncode100x20x10000-32      503124        267894        -46.75%
BenchmarkStreamEncode17x3x1M-32           1561838       1376618       -11.86%
BenchmarkStreamEncode10x4x16M-32          19124427      17762582      -7.12%
BenchmarkStreamEncode5x2x1M-32            429701        384666        -10.48%
BenchmarkStreamEncode10x2x1M-32           801257        763637        -4.70%
BenchmarkStreamEncode10x4x1M-32           876065        820744        -6.31%
BenchmarkStreamEncode50x20x1M-32          7205112       6081398       -15.60%
BenchmarkStreamEncode17x3x16M-32          27182786      26117143      -3.92%
BenchmarkStreamVerify10x2x10000-32        13767         14026         +1.88%
BenchmarkStreamVerify50x5x50000-32        826983        690453        -16.51%
BenchmarkStreamVerify10x2x1M-32           1238566       1182591       -4.52%
BenchmarkStreamVerify5x2x1M-32            892661        806301        -9.67%
BenchmarkStreamVerify10x4x1M-32           1676394       1631495       -2.68%
BenchmarkStreamVerify50x20x1M-32          10877875      10037678      -7.72%
BenchmarkStreamVerify10x4x16M-32          27599576      30435400      +10.27%

benchmark                                 old MB/s      new MB/s      speedup
BenchmarkGalois128K-32                    58518.53      58510.17      1.00x
BenchmarkGalois1M-32                      53558.10      55507.44      1.04x
BenchmarkGaloisXor128K-32                 46839.74      45961.09      0.98x
BenchmarkGaloisXor1M-32                   44936.98      44917.46      1.00x
BenchmarkEncode2x1x1M-32                  91561.27      91524.11      1.00x
BenchmarkEncode10x2x10000-32              37385.54      38792.54      1.04x
BenchmarkEncode100x20x10000-32            3306.47       8096.40       2.45x
BenchmarkEncode17x3x1M-32                 64773.49      93557.14      1.44x
BenchmarkEncode10x4x16M-32                28039.15      28039.68      1.00x
BenchmarkEncode5x2x1M-32                  107365.88     109781.16     1.02x
BenchmarkEncode10x2x1M-32                 124083.62     135266.27     1.09x
BenchmarkEncode10x4x1M-32                 85408.99      94419.71      1.11x
BenchmarkEncode50x20x1M-32                19812.81      24344.67      1.23x
BenchmarkEncode17x3x16M-32                32642.93      33200.32      1.02x
BenchmarkEncode_8x4x8M-32                 29277.52      30261.21      1.03x
BenchmarkEncode_12x4x12M-32               30355.67      30589.14      1.01x
BenchmarkEncode_16x4x16M-32               31023.66      31102.39      1.00x
BenchmarkEncode_16x4x32M-32               31912.44      31201.82      0.98x
BenchmarkEncode_16x4x64M-32               31846.32      30589.65      0.96x
BenchmarkEncode_8x5x8M-32                 24003.28      28351.84      1.18x
BenchmarkEncode_8x6x8M-32                 23184.41      26707.91      1.15x
BenchmarkEncode_8x7x8M-32                 21623.86      25354.03      1.17x
BenchmarkEncode_8x9x8M-32                 22943.85      23321.13      1.02x
BenchmarkEncode_8x10x8M-32                21809.31      22841.68      1.05x
BenchmarkEncode_8x11x8M-32                21637.77      22735.06      1.05x
BenchmarkEncode_8x8x05M-32                55606.22      61311.47      1.10x
BenchmarkEncode_8x8x1M-32                 65351.80      65830.73      1.01x
BenchmarkEncode_8x8x8M-32                 24193.01      24754.07      1.02x
BenchmarkEncode_8x8x32M-32                23303.06      23644.60      1.01x
BenchmarkEncode_24x8x24M-32               29041.76      26549.54      0.91x
BenchmarkEncode_24x8x48M-32               29900.52      26322.51      0.88x
BenchmarkVerify10x2x10000-32              13685.12      14717.10      1.08x
BenchmarkVerify10x2x1M-32                 24378.43      26424.72      1.08x
BenchmarkVerify5x2x1M-32                  16535.79      17495.41      1.06x
BenchmarkVerify10x4x1M-32                 14248.35      15484.96      1.09x
BenchmarkVerify50x20x1M-32                10180.79      11863.85      1.17x
BenchmarkVerify10x4x16M-32                13214.53      13283.71      1.01x
BenchmarkReconstruct10x2x10000-32         35799.16      36854.89      1.03x
BenchmarkReconstruct50x5x50000-32         33049.47      39034.89      1.18x
BenchmarkReconstruct10x2x1M-32            66326.88      72061.06      1.09x
BenchmarkReconstruct5x2x1M-32             57308.21      58014.92      1.01x
BenchmarkReconstruct10x4x1M-32            53711.74      57791.66      1.08x
BenchmarkReconstruct50x20x1M-32           20227.09      22991.67      1.14x
BenchmarkReconstruct10x4x16M-32           27432.37      26747.32      0.98x
BenchmarkReconstructData10x2x10000-32     37030.86      38511.87      1.04x
BenchmarkReconstructData50x5x50000-32     33842.07      40802.85      1.21x
BenchmarkReconstructData10x2x1M-32        73475.57      77693.87      1.06x
BenchmarkReconstructData5x2x1M-32         71809.58      68635.57      0.96x
BenchmarkReconstructData10x4x1M-32        65073.27      66736.88      1.03x
BenchmarkReconstructData50x20x1M-32       29181.41      34464.76      1.18x
BenchmarkReconstructData10x4x16M-32       33649.09      35066.75      1.04x
BenchmarkReconstructP10x2x10000-32        129819.98     128086.76     0.99x
BenchmarkReconstructP10x5x20000-32        183073.89     176202.21     0.96x
BenchmarkParallel_8x8x64K-32              149327.33     158153.67     1.06x
BenchmarkParallel_8x8x05M-32              24083.89      24079.69      1.00x
BenchmarkParallel_20x10x05M-32            27322.20      27070.35      0.99x
BenchmarkParallel_8x8x1M-32               23430.78      24064.83      1.03x
BenchmarkParallel_8x8x8M-32               23480.86      23897.31      1.02x
BenchmarkParallel_8x8x32M-32              23701.99      24294.27      1.02x
BenchmarkParallel_8x3x1M-32               28351.11      28899.03      1.02x
BenchmarkParallel_8x4x1M-32               27407.34      27124.76      0.99x
BenchmarkParallel_8x5x1M-32               25842.27      26197.58      1.01x
BenchmarkStreamEncode10x2x10000-32        16629.76      17012.26      1.02x
BenchmarkStreamEncode100x20x10000-32      1987.58       3732.83       1.88x
BenchmarkStreamEncode17x3x1M-32           11413.34      12948.97      1.13x
BenchmarkStreamEncode10x4x16M-32          8772.66       9445.26       1.08x
BenchmarkStreamEncode5x2x1M-32            12201.21      13629.70      1.12x
BenchmarkStreamEncode10x2x1M-32           13086.64      13731.34      1.05x
BenchmarkStreamEncode10x4x1M-32           11969.16      12775.92      1.07x
BenchmarkStreamEncode50x20x1M-32          7276.61       8621.18       1.18x
BenchmarkStreamEncode17x3x16M-32          10492.40      10920.52      1.04x
BenchmarkStreamVerify10x2x10000-32        7264.00       7129.49       0.98x
BenchmarkStreamVerify50x5x50000-32        6046.07       7241.62       1.20x
BenchmarkStreamVerify10x2x1M-32           8466.05       8866.77       1.05x
BenchmarkStreamVerify5x2x1M-32            5873.31       6502.39       1.11x
BenchmarkStreamVerify10x4x1M-32           6254.95       6427.09       1.03x
BenchmarkStreamVerify50x20x1M-32          4819.76       5223.20       1.08x
BenchmarkStreamVerify10x4x16M-32          6078.79       5512.40       0.91x 
```
2021-12-09 12:28:44 +01:00
.github/workflows Use Workflows (#169) 2021-09-01 18:55:02 +02:00
_gen avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
examples Use Workflows (#169) 2021-09-01 18:55:02 +02:00
.gitignore fix example error (#53) 2017-06-06 22:26:01 +02:00
.travis.yml Update go116 (#164) 2021-03-26 10:07:40 +01:00
LICENSE Add Backblaze to LICENSE. 2015-06-19 16:35:13 +02:00
README.md Update docs 2021-11-16 11:47:50 +01:00
appveyor.yml Submit a new appveyor CI config. 2016-06-03 00:57:56 -07:00
examples_test.go Update docs 2021-11-16 11:47:50 +01:00
galois.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galoisAvx512_amd64.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galoisAvx512_amd64.s Fix build tags for gccgo (#163) 2021-03-18 13:39:19 +01:00
galoisAvx512_amd64_test.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_amd64.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_amd64.s Fix build tags for gccgo (#163) 2021-03-18 13:39:19 +01:00
galois_arm64.go Unroll pure go xor loop (#172) 2021-12-02 16:39:56 +01:00
galois_arm64.s Fix build tags for gccgo (#163) 2021-03-18 13:39:19 +01:00
galois_gen_amd64.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_gen_amd64.s avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_gen_none.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_gen_switch_amd64.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_noasm.go Unroll pure go xor loop (#172) 2021-12-02 16:39:56 +01:00
galois_notamd64.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
galois_ppc64le.go Use Workflows (#169) 2021-09-01 18:55:02 +02:00
galois_ppc64le.s Fix build tags for gccgo (#163) 2021-03-18 13:39:19 +01:00
galois_test.go Do fast by one multiplication (#130) 2020-05-06 11:14:25 +02:00
gentables.go Use Workflows (#169) 2021-09-01 18:55:02 +02:00
go.mod Upgrade avo to avoid bp (#166) 2021-04-26 11:40:31 +02:00
go.sum Upgrade avo to avoid bp (#166) 2021-04-26 11:40:31 +02:00
inversion_tree.go Add WithInversionCache and use pointer methods (#160) 2021-01-13 10:21:28 +01:00
inversion_tree_test.go Add Inverse Matrix caching in a Thread-Safe Lookup Tree (#36) 2016-09-12 21:31:07 +02:00
matrix.go Sanity check error on SwapRows (#156) 2020-12-17 09:38:25 +01:00
matrix_test.go Use Workflows (#169) 2021-09-01 18:55:02 +02:00
options.go Add WithInversionCache and use pointer methods (#160) 2021-01-13 10:21:28 +01:00
reedsolomon.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
reedsolomon_test.go avx2: Improve speed when > 10 input or output shards. (#174) 2021-12-09 12:28:44 +01:00
streaming.go Fix stream allocations (#129) 2020-05-05 16:35:35 +02:00
streaming_test.go Allow zero parity shards (#161) 2021-03-08 16:13:24 +01:00

README.md

Reed-Solomon


Reed-Solomon Erasure Coding in Go, with speeds exceeding 1GB/s/cpu core implemented in pure Go.

This is a Go port of the JavaReedSolomon library released by Backblaze, with some additional optimizations.

For an introduction on erasure coding, see the post on the Backblaze blog.

Package home: https://github.com/klauspost/reedsolomon

Godoc: https://pkg.go.dev/github.com/klauspost/reedsolomon?tab=doc

Installation

To get the package use the standard:

go get -u github.com/klauspost/reedsolomon

Using Go modules is recommended.

Changes

2021

  • Add progressive shard encoding.
  • Wider AVX2 loops
  • Limit concurrency on AVX2, since we are likely memory bound.
  • Allow 0 parity shards.
  • Allow disabling inversion cache.
  • Faster AVX2 encoding.

May 2020

  • ARM64 optimizations, up to 2.5x faster.
  • Added WithFastOneParityMatrix for faster operation with 1 parity shard.
  • Much better performance when using a limited number of goroutines.
  • AVX512 is now using multiple cores.
  • Stream processing overhaul, big speedups in most cases.
  • AVX512 optimizations

March 6, 2019

The pure Go implementation is about 30% faster. Minor tweaks to assembler implementations.

February 8, 2019

AVX512 accelerated version added for Intel Skylake CPUs. This can give up to a 4x speed improvement as compared to AVX2. See here for more details.

December 18, 2018

Assembly code for ppc64le has been contributed. This boosts performance by about 10x on this platform.

November 18, 2017

Added WithAutoGoroutines which will attempt to calculate the optimal number of goroutines to use based on your expected shard size and detected CPU.

October 1, 2017

  • Cauchy Matrix is now an option. Thanks to templexxx for the basis of this.

  • Default maximum number of goroutines has been increased for better multi-core scaling.

  • After several requests, Reconstruct and ReconstructData now accept slices of zero length but sufficient capacity, which are used instead of allocating new memory.

August 26, 2017

  • The Encoder() now contains an Update function contributed by chenzhongtao.

  • Frank Wessels kindly contributed ARM 64 bit assembly, which gives a huge performance boost on this platform.

July 20, 2017

ReconstructData added to Encoder interface. This can cause compatibility issues if you implement your own Encoder. A simple workaround can be added:

func (e *YourEnc) ReconstructData(shards [][]byte) error {
	// Delegate to the full Reconstruct, which also restores the data shards.
	return e.Reconstruct(shards)
}

You can of course also do your own implementation. The StreamEncoder handles this without modifying the interface. This is a good lesson on why returning interfaces is not a good design.

Usage

This section assumes you know the basics of Reed-Solomon encoding. A good start is this Backblaze blog post.

This package performs the calculation of the parity sets. The usage is therefore relatively simple.

First of all, you need to choose your distribution of data and parity shards. A 'good' distribution is very subjective, and will depend a lot on your usage scenario. A good starting point is more than 5 and at most 256 data shards (the maximum supported number), with at least 2 parity shards and fewer parity shards than data shards.

To create an encoder with 10 data shards (where your data goes) and 3 parity shards (calculated):

    enc, err := reedsolomon.New(10, 3)

This encoder will work for all parity sets with this distribution of data and parity shards. The error will only be set if you specify 0 or negative values in any of the parameters, or if you specify more than 256 data shards.

If you will primarily be using it with one shard size it is recommended to use WithAutoGoroutines(shardSize) as an additional parameter. This will attempt to calculate the optimal number of goroutines to use for the best speed. It is not required that all shards are this size.

The data you send and receive is a simple slice of byte slices, [][]byte. In the example above, the top slice must have a length of 13.

    data := make([][]byte, 13)

You should then fill the 10 first slices with equally sized data, and create parity shards that will be populated with parity data. In this case we create the data in memory, but you could for instance also use mmap to map files.

    // Create all shards, size them at 50000 each
    for i := range data {
        data[i] = make([]byte, 50000)
    }

    // Fill some data into the data shards
    for i, in := range data[:10] {
        for j := range in {
            in[j] = byte((i + j) & 0xff)
        }
    }

To populate the parity shards, you simply call Encode() with your data.

    err = enc.Encode(data)

The only case where you should get an error is if the data shards aren't of equal size. The last 3 shards now contain parity data. You can verify this by calling Verify():

    ok, err := enc.Verify(data)

The final (and important) part is to be able to reconstruct missing shards. For this to work, you need to know which parts of your data are missing. The encoder does not know which parts are invalid, so if data corruption is a likely scenario, you need to implement a hash check for each shard.

If a byte has changed in your set, and you don't know which it is, there is no way to reconstruct the data set.
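A per-shard hash check could look roughly like the sketch below. It uses only the standard library; the helper names shardSums and dropCorrupt are made up for this illustration and are not part of the package. The idea is to store one digest per shard at encode time and, before reconstruction, set every shard whose digest no longer matches to nil.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// shardSums returns one SHA-256 digest per shard. Store these next to the
// shards when they are written. (Hypothetical helper, not a library function.)
func shardSums(shards [][]byte) [][sha256.Size]byte {
	sums := make([][sha256.Size]byte, len(shards))
	for i, s := range shards {
		sums[i] = sha256.Sum256(s)
	}
	return sums
}

// dropCorrupt sets every shard whose digest does not match to nil and returns
// the number of shards that are now missing, including shards that were
// already nil. (Hypothetical helper, not a library function.)
func dropCorrupt(shards [][]byte, sums [][sha256.Size]byte) int {
	missing := 0
	for i, s := range shards {
		if s == nil || sha256.Sum256(s) != sums[i] {
			shards[i] = nil
			missing++
		}
	}
	return missing
}

func main() {
	shards := [][]byte{[]byte("aaaa"), []byte("bbbb"), []byte("cccc")}
	sums := shardSums(shards)

	shards[1][0] ^= 0xff // simulate silent corruption of one shard

	missing := dropCorrupt(shards, sums)
	fmt.Println(missing, shards[1] == nil) // prints "1 true"
}
```

If the returned count is larger than the number of parity shards, reconstruction cannot succeed, so the count is worth checking before calling Reconstruct().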

To indicate missing data, you set the shard to nil before calling Reconstruct():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct the missing shards
    err = enc.Reconstruct(data)

The missing data and parity shards will be recreated. If more than 3 shards are missing, the reconstruction will fail.

If you are only interested in the data shards (for reading purposes) you can call ReconstructData():

    // Delete two data shards
    data[3] = nil
    data[7] = nil
    
    // Reconstruct just the missing data shards
    err = enc.ReconstructData(data)

So to sum up reconstruction:

  • The number of data/parity shards must match the numbers used for encoding.
  • The order of shards must be the same as used when encoding.
  • You may only supply data you know is valid.
  • Invalid shards should be set to nil.

For complete examples of an encoder and decoder see the examples folder.

Splitting/Joining Data

You might have a large slice of data. To help you split this, there are some helper functions that can split and join a single byte slice.

   bigfile, _ := ioutil.ReadFile("myfile.data")
   
   // Split the file
   split, err := enc.Split(bigfile)

This will split the file into the number of data shards set when creating the encoder and create empty parity shards.

An important thing to note is that you have to keep track of the exact input size. If the size of the input isn't divisible by the number of data shards, extra zeros will be inserted in the last shard.
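The padding arithmetic can be sketched as follows, assuming the per-shard size is the input length divided by the number of data shards, rounded up. The function names are illustrative only, and this is not claimed to be the package's exact implementation.

```go
package main

import "fmt"

// perShardSize returns the size of each data shard when dataLen bytes are
// spread over dataShards shards, rounding up. (Illustrative helper.)
func perShardSize(dataLen, dataShards int) int {
	return (dataLen + dataShards - 1) / dataShards
}

// padding returns how many zero bytes are needed at the end so that all
// data shards have the same size. (Illustrative helper.)
func padding(dataLen, dataShards int) int {
	return perShardSize(dataLen, dataShards)*dataShards - dataLen
}

func main() {
	// 1005 bytes over 10 data shards: 101 bytes per shard, 5 bytes of padding.
	fmt.Println(perShardSize(1005, 10), padding(1005, 10)) // prints "101 5"
}
```

Because the padding cannot be told apart from real trailing zeros, the original length (1005 in this example) has to be stored separately and passed to Join().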

To join a data set, use the Join() function, which will join the shards and write it to the io.Writer you supply:

   // Join a data set and write it to io.Discard.
   err = enc.Join(io.Discard, data, len(bigfile))

Progressive encoding

It is possible to encode individual shards using EncodeIdx:

	// EncodeIdx will add parity for a single data shard.
	// Parity shards should start out as 0. The caller must zero them.
	// Data shards must be delivered exactly once. There is no check for this.
	// The parity shards will always be updated and the data shards will remain the same.
	EncodeIdx(dataShard []byte, idx int, parity [][]byte) error

This allows progressively encoding the parity by sending individual data shards. There is no requirement on shards being delivered in order, but when sent in order it allows encoding shards one at a time, effectively allowing the operation to be streaming.

The result will be the same as encoding all shards at once. There is a minor speed penalty using this method, so send all shards at once if they are available.

Example

func test() {
    // Create an encoder with 7 data and 3 parity slices.
    enc, _ := reedsolomon.New(7, 3)

    // This will be our output parity.
    parity := make([][]byte, 3)
    for i := range parity {
        parity[i] = make([]byte, 10000)
    }

    for i := 0; i < 7; i++ {
        // Send data shards one at a time.
        _ = enc.EncodeIdx(make([]byte, 10000), i, parity)
    }

    // parity now contains parity, as if all data was sent in one call.
}

Streaming/Merging

It might seem like a limitation that all data should be in memory, but an important property is that as long as the number of data/parity shards are the same, you can merge/split data sets, and they will remain valid as a separate set.

    // Split the data set of 50000 elements into two of 25000
    splitA := make([][]byte, 13)
    splitB := make([][]byte, 13)
    
    // Merge into a 100000 element set
    merged := make([][]byte, 13)
    
    for i := range data {
      splitA[i] = data[i][:25000]
      splitB[i] = data[i][25000:]
      
      // Concatenate it to itself
      merged[i] = append(make([]byte, 0, len(data[i])*2), data[i]...)
      merged[i] = append(merged[i], data[i]...)
    }
    
    // Each part should still verify as ok.
    ok, err := enc.Verify(splitA)
    if ok && err == nil {
        log.Println("splitA ok")
    }
    
    ok, err = enc.Verify(splitB)
    if ok && err == nil {
        log.Println("splitB ok")
    }
    
    ok, err = enc.Verify(merged)
    if ok && err == nil {
        log.Println("merge ok")
    }

This means that if you have a data set that may not fit into memory, you can split processing into smaller blocks. For the best throughput, don't use too small blocks.

This also means that you can divide big input up into smaller blocks, and do reconstruction on parts of your data. This doesn't give the same flexibility of a higher number of data shards, but it will be much more performant.
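Block-wise processing can be sketched with plain slicing, since a sub-range taken at the same offsets from every shard is itself a valid set. The helpers window and forEachBlock below are hypothetical names for this illustration. They assume all shards are non-nil and of equal length. In real use the callback would call Encode, Verify or Reconstruct on each block.

```go
package main

import "fmt"

// window returns a view of bytes [off, off+n) of every shard, clipped to the
// shard length. Nothing is copied; the result aliases the input shards.
func window(shards [][]byte, off, n int) [][]byte {
	out := make([][]byte, len(shards))
	for i, s := range shards {
		end := off + n
		if end > len(s) {
			end = len(s)
		}
		out[i] = s[off:end]
	}
	return out
}

// forEachBlock calls fn for consecutive blocks of at most blockSize bytes per
// shard and stops at the first error.
func forEachBlock(shards [][]byte, blockSize int, fn func(block [][]byte) error) error {
	if len(shards) == 0 || blockSize <= 0 {
		return nil
	}
	for off := 0; off < len(shards[0]); off += blockSize {
		if err := fn(window(shards, off, blockSize)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	shards := make([][]byte, 13)
	for i := range shards {
		shards[i] = make([]byte, 50000)
	}

	blocks := 0
	_ = forEachBlock(shards, 20000, func(block [][]byte) error {
		blocks++
		fmt.Println(len(block), len(block[0]))
		return nil
	})
	fmt.Println("blocks:", blocks)
}
```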

Streaming API

A streaming API is available for fully streaming operation. It offers the same operations as the in-memory API, but works on streams. To use it, call the NewStream function to create the encoding/decoding interfaces.

You can use WithConcurrentStreams to ready an interface that reads/writes concurrently from the streams.

You can specify the size of each operation using WithStreamBlockSize. This will set the size of each read/write operation.

Input is delivered as []io.Reader, output as []io.Writer, and functionality corresponds to the in-memory API. Each stream must supply the same amount of data, just as each slice must be the same size with the in-memory API. If an error occurs in relation to a stream, a StreamReadError or StreamWriteError will help you determine which stream was the offender.

There is no buffering or timeouts/retry specified. If you want to add that, you need to add it to the Reader/Writer.

For complete examples of a streaming encoder and decoder see the examples folder.

Advanced Options

You can modify internal options which affect how jobs are split between and processed by goroutines.

To create options, use the WithXXX functions. You can supply options to New and NewStream. If no Options are supplied, default options are used.

Example of how to supply options:

    enc, err := reedsolomon.New(10, 3, reedsolomon.WithMaxGoroutines(25))

Performance

Performance depends mainly on the number of parity shards. In rough terms, doubling the number of parity shards will double the encoding time.

Here are the throughput numbers with some different selections of data and parity shards. For reference each shard is 1MB random data, and 16 CPU cores are used for encoding.

Data  Parity  Go MB/s  SSSE3 MB/s  AVX2 MB/s
5     2       14287    66355       108755
8     8       5569     34298       70516
10    4       6766     48237       93875
50    20      1540     12130       22090

The throughput numbers here are based on the combined size of the encoded data and parity shards.
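As a worked example of that convention, the sketch below converts a benchmark time into MB/s by counting both data and parity bytes. It assumes 1 MB means 10^6 bytes in the reported figure and that a "1M" shard is 2^20 bytes; both are assumptions of this sketch. The input of 155477 ns/op is the new BenchmarkEncode10x4x1M-32 time from the commit benchmarks at the top of this page.

```go
package main

import "fmt"

// throughputMBps converts a time per operation into MB/s (1 MB = 1e6 bytes),
// counting data and parity shards as processed bytes. (Illustrative helper.)
func throughputMBps(dataShards, parityShards, shardSize int, nsPerOp float64) float64 {
	totalBytes := float64((dataShards + parityShards) * shardSize)
	// bytes per nanosecond * 1e9 ns/s / 1e6 bytes/MB = bytes/ns * 1e3
	return totalBytes / nsPerOp * 1e3
}

func main() {
	// 10 data + 4 parity shards of 1 MiB each, 155477 ns per encode.
	fmt.Printf("%.0f MB/s\n", throughputMBps(10, 4, 1<<20, 155477))
}
```

The result is roughly 94,400 MB/s, in line with the 94419.71 MB/s listed for that benchmark.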

If runtime.GOMAXPROCS is set to a value higher than 1, the encoder will use multiple goroutines to perform the calculations in Verify, Encode and Reconstruct.

Example of performance scaling on AMD Ryzen 3950X - 16 physical cores, 32 logical cores, AVX2. The example uses 10 blocks with 1MB data each and 4 parity blocks.

Threads  Speed
1        9979 MB/s
2        18870 MB/s
4        33697 MB/s
8        51531 MB/s
16       59204 MB/s

Benchmarking Reconstruct() followed by a Verify() (=all) versus just calling ReconstructData() (=data) gives the following result:

benchmark                            all MB/s     data MB/s    speedup
BenchmarkReconstruct10x2x10000-8     2011.67      10530.10     5.23x
BenchmarkReconstruct50x5x50000-8     4585.41      14301.60     3.12x
BenchmarkReconstruct10x2x1M-8        8081.15      28216.41     3.49x
BenchmarkReconstruct5x2x1M-8         5780.07      28015.37     4.85x
BenchmarkReconstruct10x4x1M-8        4352.56      14367.61     3.30x
BenchmarkReconstruct50x20x1M-8       1364.35      4189.79      3.07x
BenchmarkReconstruct10x4x16M-8       1484.35      5779.53      3.89x

Performance on AVX512

The performance on AVX512 has been accelerated for Intel CPUs. This gives speedups on a per-core basis typically up to 2x compared to AVX2 as can be seen in the following table:

[...]

This speedup has been achieved by computing multiple parity blocks in parallel as opposed to one after the other. In doing so it is possible to minimize the memory bandwidth required for loading all data shards. At the same time the calculations are performed in the 512-bit wide ZMM registers and the surplus of ZMM registers (32 in total) is used to keep more data around (most notably the matrix coefficients).

Performance on ARM64 NEON

By exploiting NEON instructions the performance for ARM has been accelerated. Below are the performance numbers for a single core on an EC2 m6g.16xlarge (Graviton2) instance (Amazon Linux 2):

BenchmarkGalois128K-64        119562     10028 ns/op        13070.78 MB/s
BenchmarkGalois1M-64           14380     83424 ns/op        12569.22 MB/s
BenchmarkGaloisXor128K-64      96508     12432 ns/op        10543.29 MB/s
BenchmarkGaloisXor1M-64        10000    100322 ns/op        10452.13 MB/s

Performance on ppc64le

The performance for ppc64le has been accelerated. This gives roughly a 10x performance improvement on this architecture, as can be seen below:

benchmark                      old MB/s     new MB/s     speedup
BenchmarkGalois128K-160        948.87       8878.85      9.36x
BenchmarkGalois1M-160          968.85       9041.92      9.33x
BenchmarkGaloisXor128K-160     862.02       7905.00      9.17x
BenchmarkGaloisXor1M-160       784.60       6296.65      8.03x

asm2plan9s

asm2plan9s is used for assembling the AVX2 instructions into their BYTE/WORD/LONG equivalents.

Links

License

This code, like the original JavaReedSolomon, is published under an MIT license. See the LICENSE file for more information.