Code structure as of 7/20/2012

written by Jim.

Ok -- once again, I have messed with the structure. My goal is a structure that is flexible and efficient. It's similar to what we had before, but better because it makes things like Euclid's method much cleaner.

I think we're ready to hack.

Files

GNUmakefile: Makefile
README: Empty readme
explanation.html: This file.
gf.c: Main gf routines
gf.h: Main gf prototypes and typedefs
gf_int.h: Prototypes and typedefs for common routines for the internal gf implementations.
gf_method.c: Code to help parse argc/argv to define the method. This way, various programs can be consistent with how they handle the command line.
gf_method.h: Prototypes for ibid.
gf_methods.c: This program prints out how to define the various methods on the command line. My idea is to beef this up so that you can give it a method spec on the command line, and it will tell you whether it's valid, or why it's invalid. I haven't written that part yet.
gf_mult.c: Program to do single multiplication.
gf_div.c: Program to do single divisions -- it's created in the makefile with a sed script on gf_mult.c.
gf_time.c: Time tester
gf_unit.c: Unit tester
gf_54.c: A simple example program that multiplies 5 and 4 in GF(2^4).
gf_w4.c: Implementation of code for w = 4. (For now, only SHIFT and LOG, plus EUCLID & MATRIX).
gf_w8.c: Implementation of code for w = 8. (For now, only SHIFT plus EUCLID & MATRIX).
gf_w16.c: Implementation of code for w = 16. (For now, only SHIFT plus EUCLID & MATRIX).
gf_w32.c: Implementation of code for w = 32. (For now, only SHIFT plus EUCLID & MATRIX).
gf_w64.c: Implementation of code for w = 64. (For now, only SHIFT and EUCLID).
I don't have gf_w128.c or gf_gen.c yet.

Prototypes and typedefs in gf.h

The main structure that users will see is in gf.h, and it is of type gf_t:

typedef struct gf {
  gf_func_a_b    multiply;
  gf_func_a_b    divide;
  gf_func_a      inverse;
  gf_region      multiply_region;
  void           *scratch;
} gf_t;

We can beef it up later with buf-buf or buf-acc. The problem is that the paper is already bloated, so right now, I want to keep it lean.

The types of the procedures are big unions, so that they work with the following types of arguments:

typedef uint8_t     gf_val_4_t;
typedef uint8_t     gf_val_8_t;
typedef uint16_t    gf_val_16_t;
typedef uint32_t    gf_val_32_t;
typedef uint64_t    gf_val_64_t;
typedef uint64_t    *gf_val_128_t;
typedef uint32_t    gf_val_gen_t;   /* The intent here is for general values <= 32 */

To use a gf_t, you need to create one with gf_init_easy() or gf_init_hard(). Let's concentrate on the former:

extern int gf_init_easy(gf_t *gf, int w, int mult_type);

You pass it memory for a gf_t, a value of w and a variable that says how to do multiplication. The valid values of mult_type are enumerated in gf.h:

typedef enum {GF_MULT_DEFAULT,
              GF_MULT_SHIFT,
              GF_MULT_GROUP,
              GF_MULT_BYTWO_p,
              GF_MULT_BYTWO_b,
              GF_MULT_TABLE,
              GF_MULT_LOG_TABLE,
              GF_MULT_SPLIT_TABLE,
              GF_MULT_COMPOSITE } gf_mult_type_t;

After creating the gf_t, you use its multiply method to multiply, using the union's fields to work with the various types. It looks easier than my explanation. For example, suppose you wanted to multiply 5 and 4 in GF(2⁴). You can do it as in gf_54.c:

#include <stdio.h>
#include <stdlib.h>
#include "gf.h"

int main()
{
  gf_t gf;

  gf_init_easy(&gf, 4, GF_MULT_DEFAULT);
  printf("%d\n", gf.multiply.w4(&gf, 5, 4));
  exit(0);
}

If you wanted to multiply in GF(2⁸), then you'd have to use 8 as a parameter to gf_init_easy, and call the multiplier as gf.multiply.w8().

When you're done with your gf_t, you should call gf_free() on it so that it can free memory that it has allocated. We'll talk more about memory later, but if you create your gf_t with gf_init_easy, then it calls malloc(), and if you care about freeing memory, you'll have to call gf_free().

Memory allocation

Each implementation of a multiplication technique keeps around its own data. For example, GF_MULT_TABLE keeps around multiplication and division tables, and GF_MULT_LOG_TABLE maintains log and antilog tables. This data is stored in the pointer scratch. My intent is that the memory that is there is all that's required. In other words, the multiply(), divide(), inverse() and multiply_region() calls don't do any memory allocation. Moreover, gf_init_easy() only allocates one chunk of memory -- the one in scratch.

If you don't want to have the initialization call allocate memory, you can use gf_init_hard():

extern int gf_init_hard(gf_t *gf,
                        int w,
                        int mult_type,
                        int region_type,
                        int divide_type,
                        uint64_t prim_poly,
                        int arg1,
                        int arg2,
                        gf_t *base_gf,
                        void *scratch_memory);

The first three parameters are the same as gf_init_easy(). The region_type and divide_type arguments say how to perform multiply_region and division. Their values are also defined in gf.h. You can mix the region_type values (e.g. "DOUBLE" and "SSE"):

#define GF_REGION_DEFAULT      (0x0)
#define GF_REGION_SINGLE_TABLE (0x1)
#define GF_REGION_DOUBLE_TABLE (0x2)
#define GF_REGION_QUAD_TABLE   (0x4)
#define GF_REGION_LAZY         (0x8)
#define GF_REGION_SSE          (0x10)
#define GF_REGION_NOSSE        (0x20)
#define GF_REGION_STDMAP       (0x40)
#define GF_REGION_ALTMAP       (0x80)
#define GF_REGION_CAUCHY       (0x100)

typedef uint32_t gf_region_type_t;

typedef enum { GF_DIVIDE_DEFAULT,
               GF_DIVIDE_MATRIX,
               GF_DIVIDE_EUCLID } gf_division_type_t;

You can change the primitive polynomial with prim_poly, give additional arguments with arg1 and arg2 and give a base Galois Field for composite fields. Finally, you can pass it a pointer to memory in scratch_memory. That way, you can avoid having gf_init_hard() call malloc().

There is a procedure called gf_scratch_size() that lets you know the minimum size for scratch_memory, depending on w, the multiplication type and the arguments:

extern int gf_scratch_size(int w,
                           int mult_type,
                           int region_type,
                           int divide_type,
                           int arg1,
                           int arg2);

You can specify default arguments in gf_init_hard():

region_type = GF_REGION_DEFAULT
divide_type = GF_DIVIDE_DEFAULT
prim_poly = 0
arg1 = 0
arg2 = 0
base_gf = NULL
scratch_memory = NULL

If any argument is equal to its default, then default actions are taken (e.g. a standard primitive polynomial is used, or memory is allocated for scratch_memory). In fact, gf_init_easy() simply calls gf_init_hard() with the default parameters.

gf_free() frees memory that was allocated with gf_init_easy() or gf_init_hard(). The recursive parameter is in case you use composite fields, and want to recursively free the base fields. If you pass scratch_memory to gf_init_hard(), then you typically don't need to call gf_free(). It won't hurt to call it, though.

gf_mult and gf_div

For the moment, I have few things completely implemented, but that's because I want to be able to explain the structure, and how to specify methods. In particular, for w=4, I have implemented SHIFT and LOG. For w=8, 16, 32, 64 I have implemented SHIFT. For all w ≤ 32, I have implemented both Euclid's algorithm for inversion, and the matrix method for inversion. For w=64, it's just Euclid. You can test these all with gf_mult and gf_div. Here are a few calls:

UNIX> gf_mult 7 11 4                - Default
4
UNIX> gf_mult 7 11 4 SHIFT - -      - Use shift
4
UNIX> gf_mult 7 11 4 LOG - -        - Use logs
4
UNIX> gf_div 4 7 4                  - Default
11
UNIX> gf_div 4 7 4 LOG - -          - Use logs
11
UNIX> gf_div 4 7 4 LOG - EUCLID     - Use Euclid instead of logs
11
UNIX> gf_div 4 7 4 LOG - MATRIX     - Use Matrix inversion instead of logs
11
UNIX> gf_div 4 7 4 SHIFT - -        - Default
11
UNIX> gf_div 4 7 4 SHIFT - EUCLID   - Use Euclid (which is the default)
11
UNIX> gf_div 4 7 4 SHIFT - MATRIX   - Use Matrix inversion instead of Euclid
11
UNIX> gf_mult 200 211 8        - The remainder are shift/Euclid
201
UNIX> gf_div 201 211 8
200
UNIX> gf_mult 60000 65111 16
63515
UNIX> gf_div 63515 65111 16
60000
UNIX> gf_mult abcd0001 9afbf788 32h
b0359681
UNIX> gf_div b0359681 9afbf788 32h
abcd0001
UNIX> gf_mult abcd00018c8b8c8a 9afbf7887f6d8e5b 64h
3a7def35185bd571
UNIX> gf_div 3a7def35185bd571 9afbf7887f6d8e5b 64h
abcd00018c8b8c8a
UNIX>

You can see all the methods with gf_methods. We have a lot of implementing to do:

UNIX> gf_methods
To specify the methods, do one of the following: 
       - leave empty to use defaults
       - use a single dash to use defaults
       - specify MULTIPLY REGION DIVIDE

Legal values of MULTIPLY:
       SHIFT: shift
       GROUP g_mult g_reduce: the Group technique - see the paper
       BYTWO_p: BYTWO doubling the product.
       BYTWO_b: BYTWO doubling b (more efficient thatn BYTWO_p)
       TABLE: Full multiplication table
       LOG:   Discrete logs
       LOG_ZERO: Discrete logs with a large table for zeros
       SPLIT g_a g_b: Split tables defined by g_a and g_b
       COMPOSITE k l [METHOD]: Composite field, recursively specify the
                               method of the base field in GF(2^l)

Legal values of REGION: Specify multiples with commas e.g. 'DOUBLE,LAZY'
       -: Use defaults
       SINGLE/DOUBLE/QUAD: Expand tables
       LAZY: Lazily create table (only applies to TABLE and SPLIT)
       SSE/NOSSE: Use 128-bit SSE instructions if you can
       CAUCHY/ALTMAP/STDMAP: Use different memory mappings

Legal values of DIVIDE:
       -: Use defaults
       MATRIX: Use matrix inversion
       EUCLID: Use the extended Euclidian algorithm.

See the user's manual for more information.
There are many restrictions, so it is better to simply use defaults in most cases.
UNIX>

gf_unit and gf_time

gf_unit.c is a unit tester, and gf_time.c is a time tester. They are called as follows:

UNIX> gf_unit w tests seed [METHOD] 
UNIX> gf_time w tests seed size(bytes) iterations [METHOD]

The tests parameter is one or more of the following characters:

A: Do all tests
S: Test only single operations (multiplication/division)
R: Test only region operations
V: Verbose Output

seed is a seed for srand48() -- using -1 defaults to the current time.

For example, testing the defaults with w=4:

UNIX> gf_unit 4 AV 1 LOG - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
Testing buffer-constant, src != dest, xor = 0
Testing buffer-constant, src != dest, xor = 1
Testing buffer-constant, src == dest, xor = 0
Testing buffer-constant, src == dest, xor = 1
UNIX> gf_unit 4 AV 1 SHIFT - -
Seed: 1
Testing single multiplications/divisions.
Testing Inversions.
No multiply_region.
UNIX>

There is no multiply_region() method defined for SHIFT. Thus, the procedures are NULL and the unit tester ignores them.

At the moment, I only have the unit tester working for w=4.

gf_time takes the size of an array (in bytes) and a number of iterations, and tests the speed of both single and region operations. The tests are:

A: All
S: All Single Operations
R: All Region Operations
M: Single: Multiplications
D: Single: Divisions
I: Single: Inverses
B: Region: Multiply_Region

Here are some examples with SHIFT and LOG on my mac.

UNIX> gf_time 4 A 1 102400 1024 LOG - -
Seed: 1
Multiply:   0.538126 s      185.830 Mega-ops/s
Divide:     0.520825 s      192.003 Mega-ops/s
Inverse:    0.631198 s      158.429 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.478395 s      209.032 MB/s
Buffer-Const,s!=d,xor=1:    0.524245 s      190.751 MB/s
Buffer-Const,s==d,xor=0:    0.471851 s      211.931 MB/s
Buffer-Const,s==d,xor=1:    0.528275 s      189.295 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - EUCLID
Seed: 1
Multiply:   0.555512 s      180.014 Mega-ops/s
Divide:     5.359434 s       18.659 Mega-ops/s
Inverse:    4.911719 s       20.359 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.496097 s      201.573 MB/s
Buffer-Const,s!=d,xor=1:    0.538536 s      185.689 MB/s
Buffer-Const,s==d,xor=0:    0.485564 s      205.946 MB/s
Buffer-Const,s==d,xor=1:    0.540227 s      185.107 MB/s
UNIX> gf_time 4 A 1 102400 1024 LOG - MATRIX
Seed: 1
Multiply:   0.544005 s      183.822 Mega-ops/s
Divide:     7.602822 s       13.153 Mega-ops/s
Inverse:    7.000564 s       14.285 Mega-ops/s
Buffer-Const,s!=d,xor=0:    0.474868 s      210.585 MB/s
Buffer-Const,s!=d,xor=1:    0.527588 s      189.542 MB/s
Buffer-Const,s==d,xor=0:    0.473130 s      211.358 MB/s
Buffer-Const,s==d,xor=1:    0.529877 s      188.723 MB/s
UNIX> gf_time 4 A 1 102400 1024 SHIFT - -
Seed: 1
Multiply:   2.708842 s       36.916 Mega-ops/s
Divide:     8.756882 s       11.420 Mega-ops/s
Inverse:    5.695511 s       17.558 Mega-ops/s
UNIX>

At the moment, I only have the timer working for w=4.

Walking you through LOG

To see how scratch is used to store data, let's look at what happens when you call gf_init_easy(&gf, 4, GF_MULT_LOG_TABLE). First, gf_init_easy() calls gf_init_hard() with default parameters. This is in gf.c.

gf_init_hard()'s first job is to set up the scratch. The scratch's type is gf_internal_t, defined in gf_int.h:

typedef struct {
  int mult_type;
  int region_type;
  int divide_type;
  int w;
  uint64_t prim_poly;
  int free_me;
  int arg1;
  int arg2;
  gf_t *base_gf;
  void *private;
} gf_internal_t;

All the fields are straightforward, with the exception of private. That is a (void *) which points to the implementation's private data.

Here's the code for gf_init_hard():

int gf_init_hard(gf_t *gf, int w, int mult_type, 
                        int region_type,
                        int divide_type,
                        uint64_t prim_poly,
                        int arg1, int arg2,
                        gf_t *base_gf,
                        void *scratch_memory) 
{
  int sz;
  gf_internal_t *h;


  if (scratch_memory == NULL) {
    sz = gf_scratch_size(w, mult_type, region_type, divide_type, arg1, arg2);
    if (sz <= 0) return 0;
    h = (gf_internal_t *) malloc(sz);
    h->free_me = 1;
  } else {
    h = scratch_memory;
    h->free_me = 0;
  }
  gf->scratch = (void *) h;
  h->mult_type = mult_type;
  h->region_type = region_type;
  h->divide_type = divide_type;
  h->w = w;
  h->prim_poly = prim_poly;
  h->arg1 = arg1;
  h->arg2 = arg2;
  h->base_gf = base_gf;
  h->private = (void *) gf->scratch;
  h->private += (sizeof(gf_internal_t));

  switch(w) {
    case 4: return gf_w4_init(gf);
    case 8: return gf_w8_init(gf);
    case 16: return gf_w16_init(gf);
    case 32: return gf_w32_init(gf);
    case 64: return gf_w64_init(gf);
    case 128: return gf_dummy_init(gf);
    default: return 0;
  }
}

The first thing it does is determine if it has to allocate space for scratch. If it must, it uses gf_scratch_size() to figure out how big the space must be. It then sets gf->scratch to this space, and sets all of the fields of the scratch to the arguments in gf_init_hard(). The private pointer is set to the space just after the gf_internal_t header, i.e. sizeof(gf_internal_t) bytes past gf->scratch. Again, it is up to gf_scratch_size() to make sure there is enough space for the scratch, and for all of the private data needed by the implementation.

Once the scratch is set up, gf_init_hard() calls gf_w4_init(). This is in gf_w4.c, and it is a simple dispatcher to the various initialization routines, plus it sets EUCLID and MATRIX if need be:

int gf_w4_init(gf_t *gf)
{
  gf_internal_t *h;

  h = (gf_internal_t *) gf->scratch;
  if (h->prim_poly == 0) h->prim_poly = 0x13;

  gf->multiply.w4 = NULL;
  gf->divide.w4 = NULL;
  gf->inverse.w4 = NULL;
  gf->multiply_region.w4 = NULL;

  switch(h->mult_type) {
    case GF_MULT_SHIFT:     if (gf_w4_shift_init(gf) == 0) return 0; break;
    case GF_MULT_LOG_TABLE: if (gf_w4_log_init(gf) == 0) return 0; break;
    case GF_MULT_DEFAULT:   if (gf_w4_log_init(gf) == 0) return 0; break;
    default: return 0;
  }
  if (h->divide_type == GF_DIVIDE_EUCLID) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_euclid;
  } else if (h->divide_type == GF_DIVIDE_MATRIX) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
    gf->inverse.w4 = gf_w4_matrix;
  }

  if (gf->inverse.w4 != NULL && gf->divide.w4 == NULL) {
    gf->divide.w4 = gf_w4_divide_from_inverse;
  }
  if (gf->inverse.w4 == NULL && gf->divide.w4 != NULL) {
    gf->inverse.w4 = gf_w4_inverse_from_divide;
  }
  return 1;
}

The code in gf_w4_log_init() sets up the log and antilog tables, and sets the multiply.w4, divide.w4 etc routines to be the ones for logs. The tables are put into gf->scratch->private, which is typecast to a struct gf_logtable_data *:

struct gf_logtable_data {
    gf_val_4_t      log_tbl[GF_FIELD_SIZE];
    gf_val_4_t      antilog_tbl[GF_FIELD_SIZE * 2];
    gf_val_4_t      *antilog_tbl_div;
};
.......

static 
int gf_w4_log_init(gf_t *gf)
{
  gf_internal_t *h;
  struct gf_logtable_data *ltd;
  int i, b;

  h = (gf_internal_t *) gf->scratch;
  ltd = h->private;

  ltd->log_tbl[0] = 0;

  ltd->antilog_tbl_div = ltd->antilog_tbl + (GF_FIELD_SIZE-1);
  b = 1;
  for (i = 0; i < GF_FIELD_SIZE-1; i++) {
      ltd->log_tbl[b] = (gf_val_8_t)i;
      ltd->antilog_tbl[i] = (gf_val_8_t)b;
      ltd->antilog_tbl[i+GF_FIELD_SIZE-1] = (gf_val_8_t)b;
      b <<= 1;
      if (b & GF_FIELD_SIZE) {
          b = b ^ h->prim_poly;
      }
  }
    
  gf->inverse.w4 = gf_w4_inverse_from_divide;
  gf->divide.w4 = gf_w4_log_divide;
  gf->multiply.w4 = gf_w4_log_multiply;
  gf->multiply_region.w4 = gf_w4_log_multiply_region;
  return 1;
}

And of course the individual routines use h->private to access the tables:

static
inline
gf_val_8_t gf_w4_log_multiply (gf_t *gf, gf_val_8_t a, gf_val_8_t b)
{
  struct gf_logtable_data *ltd;
    
  ltd = (struct gf_logtable_data *) ((gf_internal_t *) (gf->scratch))->private;
  return (a == 0 || b == 0) ? 0 : ltd->antilog_tbl[(unsigned)(ltd->log_tbl[a] + ltd->log_tbl[b])];
}

Finally, it's important that the proper sizes are put into gf_w4_scratch_size() for each implementation:

int gf_w4_scratch_size(int mult_type, int region_type, int divide_type, int arg1, int arg2)
{
  int region_tbl_size;
  switch(mult_type)
  {
    case GF_MULT_DEFAULT:
    case GF_MULT_LOG_TABLE:
      return sizeof(gf_internal_t) + sizeof(struct gf_logtable_data) + 64;
      break;
    case GF_MULT_SHIFT:
      return sizeof(gf_internal_t);
      break;
    default:
      return -1;
   }
}

I hope that's enough explanation for y'all to start implementing. Let me know if you have problems -- thanks -- Jim

The initial structure has been set for w=4, 8, 16, 32 and 64, with implementations of SHIFT and EUCLID, and for w <= 32, MATRIX. There are some weird caveats:

For w=32 and w=64, the primitive polynomial does not have the leading one.
I'd like for naming to be:
For example, the log techniques for w=4 are:
```
gf_w4_log_multiply()
gf_w4_log_divide()
gf_w4_log_multiply_region()
gf_w4_log_init()
```
I'd also like a header block on implementations that says who wrote it.

Things we need to Implement: w=4

SHIFT Done - Jim

BYTWO_p Done - Jim

BYTWO_b Done - Jim

BYTWO_p, SSE Done - Jim

BYTWO_b, SSE Done - Jim

Single TABLE Done - Jim

Double TABLE Done - Jim

Double TABLE, SSE Done - Jim

Quad TABLE Done - Jim

Lazy Quad TABLE Done - Jim

LOG Done - Jim

Things we need to Implement: w=8

SHIFT Done - Jim

BYTWO_p Done - Jim

BYTWO_b Done - Jim

BYTWO_p, SSE Done - Jim

BYTWO_b, SSE Done - Jim

Single TABLE Done - Kevin

Double TABLE Done - Jim

Lazy Double TABLE Done - Jim

Split 2 1 (Half) SSE Done - Jim

Composite, k=2 Done - Kevin (alt mapping not passing unit test)

LOG Done - Kevin

LOG ZERO Done - Jim

Things we need to Implement: w=16

SHIFT Done - Jim

BYTWO_p Done - Jim

BYTWO_b Done - Jim

BYTWO_p, SSE Done - Jim

BYTWO_b, SSE Done - Jim

Lazy TABLE Done - Jim

Split 4 16 No-SSE, lazy Done - Jim

Split 4 16 SSE, lazy Done - Jim

Split 4 16 SSE, lazy, alternate mapping Done - Jim

Split 8 16, lazy Done - Jim

Composite, k=2, stdmap recursive Done - Kevin

Composite, k=2, altmap recursive Done - Kevin

Composite, k=2, stdmap inline Done - Kevin

LOG Done - Kevin

LOG ZERO Done - Kevin

Group 4 4 Done - Jim: I don't see a reason to implement others, although 4-8 will be faster, and 8 8 will have faster region ops. They'll never beat SPLIT.

Things we need to Implement: w=32

SHIFT Done - Jim

BYTWO_p Done - Jim

BYTWO_b Done - Jim

BYTWO_p, SSE Done - Jim

BYTWO_b, SSE Done - Jim

Split 2 32,lazy Done - Jim

Split 2 32, SSE, lazy Done - Jim

Split 4 32, lazy Done - Jim

Split 4 32, SSE,ALTMAP lazy Done - Jim

Split 4 32, SSE, lazy Done - Jim

Split 8 8 Done - Jim

Group, g_s == g_r Done - Jim

Group, any g_s and g_r Done - Jim

Composite, k=2, stdmap recursive Done - Kevin

Composite, k=2, altmap recursive Done - Kevin

Composite, k=2, stdmap inline Done - Kevin

Things we need to Implement: w=64

SHIFT Done - Jim

BYTWO_p -

BYTWO_b -

BYTWO_p, SSE -

BYTWO_b, SSE -

Split 16 1 SSE, maybe lazy -

Split 8 1 lazy -

Split 8 8 -

Split 8 8 lazy -

Group -

Composite, k=2, alternate mapping -

Things we need to Implement: w=128

SHIFT Done - Will

BYTWO_p -

BYTWO_b -

BYTWO_p, SSE -

BYTWO_b, SSE -

Split 32 1 SSE, maybe lazy -

Split 16 1 lazy -

Split 16 16 - Maybe that's insanity -

Split 16 16 lazy -

Group (SSE) -

Composite, k=?, alternate mapping -

Things we need to Implement: w=general between 1 & 32

CAUCHY Region (SSE XOR) Done - Jim

SHIFT Done - Jim

TABLE Done - Jim

LOG Done - Jim

BYTWO_p Done - Jim

BYTWO_b Done - Jim

Group, g_s == g_r Done - Jim

Group, any g_s and g_r Done - Jim

Split - do we need it? Done - Jim

Composite - do we need it? -

Split - do we need it? -

Logzero? -
