18 Jan 19:25

MrUnbelievable92

f544443

v2.9.99 Latest

Known Issues

half8/16 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
scalar and vector half conversion operators and related functions differ from the slightly incorrect Unity.Mathematics implementation. Specifically, if a float or double value is converted to a half, and if the value to be converted is exactly halfway between two adjacent representable half values, Unity's implementation rounds up while this library's implementation truncates, which matches the hardware default rounding direction when converting a double to a float. The same behavior occurs when converting any integer type to half values, because Unity.Mathematics does not implement an optimized integer to half conversion operator/function but rather first converts the integer to a float implicitly, before casting to half
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
LLVM, in many cases, generates very poor code for all vectors with small fields (16 bit and below). This can be fixed by, for instance, byte16 only having two ulongs as fields and exposing the individual bytes as properties. This is API breaking, since you cannot take the address of properties, affecting unsafe pointer code; in and ref code referencing those fields can be fixed with those fields becoming ref properties, while out used on those fields may only be used if the vector is already initialized. This will have to be changed for much better performance but this change is reserved for version 3.0

Fixes

Fixed managed fallback of roundto[all integer types] for float, double and quadruple arguments for values near 0
Fixed SIMD div and rem for Unity.Mathematics integer vector types
Fixed float8 != comparison if compiling for AVX(2)
Fixed toboolsafe for double4 and all signed integer vector types
Fixed SSE2 fallback for converting a quarter vector to any integer vector
Fixed software implemented floating point conversion from wider types to narrower types sometimes not rounding to the nearest representable value
Fixed int, long and Int128 minmag and maxmag always- (minmag) or never (maxmag) returning the respective MinValue, if either argument is equal to MinValue

Additions

Added scoped registry support
Added half16, quarter16 and quarter32 due to more and more quarter and half function overloads having been implemented that are faster when not casting to a float scalar/vector
Added bits_select(a, b, c) for each scalar- and vector integer type. This performs the same operation as select, just for each bit, and thus takes in a c that is of the same type as a and b
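A minimal scalar sketch of a per-bit select may help to picture the operation. It assumes the operand order follows Unity.Mathematics' select (the result takes b where c is set and a otherwise); the class and method names are hypothetical and this is not the library's implementation.

```csharp
using System;

public static class BitsSelectSketch
{
    // Hypothetical stand-in, not MaxMath code.
    // For every bit position: take the bit from b where c has a 1, otherwise the bit from a.
    public static uint BitsSelect(uint a, uint b, uint c)
    {
        return (a & ~c) | (b & c);
    }
}
```

With a = 0xFF00FF00, b = 0x0F0F0F0F and c = 0x0000FFFF, the upper 16 bits come from a and the lower 16 bits from b.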
Added cand for each vector integer type. These reduce vectors to a scalar integer of that type by applying bitwise AND operations between each element
Added reverse to reverse the element order in a vector of any type
Added a shuffle overload for all vector types that does not take a Unity.Mathematics.ShuffleComponent as an argument. The second parameter is a vector with the same number of not necessarily identical elements as the first parameter, holding the indices pointing to elements in the first parameter, determining the order of elements in the returned vector. Example: shuffle(new int4(9, 99, 999, 9999), new byte4(0, 3, 3, 2)); // result: int4(9, 9999, 9999, 999)
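The index-vector semantics can be modelled with plain arrays. The following is only an illustrative sketch with hypothetical names, using arrays in place of the int4 and byte4 vector types:

```csharp
using System;

public static class ShuffleSketch
{
    // Hypothetical array-based model of an index-vector shuffle, not MaxMath code.
    // Element i of the result is the source element selected by indices[i].
    public static int[] Shuffle(int[] source, byte[] indices)
    {
        int[] result = new int[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            result[i] = source[indices[i]];
        }
        return result;
    }
}
```

Calling Shuffle(new[] { 9, 99, 999, 9999 }, new byte[] { 0, 3, 3, 2 }) reproduces the example from the release note: { 9, 9999, 9999, 999 }.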
Added mulwide for scalar and vector integer types, which performs full precision multiplication and returns the respective low and high halves of the result as out parameters of the same type as the input parameters
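As a rough illustration of a widening multiplication for 32-bit operands, consider the sketch below. The method name and the order of the out parameters (low, then high) are assumptions made for this example and may differ from the actual mulwide signature.

```csharp
using System;

public static class MulWideSketch
{
    // Hypothetical sketch, not MaxMath code: full 64-bit product of two 32-bit values,
    // returned as two 32-bit halves of the same type as the inputs.
    public static void MulWide(uint x, uint y, out uint low, out uint high)
    {
        ulong product = (ulong)x * y;
        low = unchecked((uint)product);   // lower 32 bits
        high = (uint)(product >> 32);     // upper 32 bits
    }
}
```

For x = y = 0xFFFFFFFF the full product is 0xFFFFFFFE00000001, so low is 0x00000001 and high is 0xFFFFFFFE.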
Added floortoint, ceiltoint and trunctoint for all combinations of floating point types (input parameter) and integer types (return type). trunctoint wraps default floating point to integer casting, while offering an optional Promise parameter when quarter, half or quadruple are involved
Added tohalfunsafe, converting any other scalar or vector type to a half scalar or vector type, utilizing faster and smaller code paths via a Promise parameter
Added tofloatunsafe and todoubleunsafe with quarter, half and quadruple parameter types, utilizing faster and smaller code paths via a Promise parameter
Added the "trivial" quadruple overloads lerp, unlerp, remap, clamp, saturate, dot, frac, sign, modf, length, lengthsq, distance, distancesq, smoothstep, step, avg, fastrcp, div, divrem, dad, dsub, addsaturated, subsaturated, mulsaturated, divsaturated, tobytesaturated, tosbytesaturated, toushortsaturated, toshortsaturated, touintsaturated, tointsaturated, toulongsaturated, tolongsaturated, touint128saturated, toint128saturated, toquartersaturated, tohalfsaturated, tofloatsaturated, todoublesaturated, ceilmultiple, truncmultiple, roundmultiple, floormultiple, reversebytes, negate, maxmag, minmag, minmaxmag, minmax, angledelta, angledeltasgn, angledeltadeg, angledeltasgndeg, smoothlerp, pingpong, repeat, tobool, toboolsafe, toquarterunsafe, tohalfunsafe, toquadruple, toquadruplesafe, exp2 (integer parameters)
Missing quadruple overloads are: tan, tanh, atan, atan2, cos, cosh, acos, sin, sinh, asin, sincos, asinh, acosh, atanh, pow, exp, exp2, exp10, log, log2, log10, erf, erfc, gamma, hypot

Improvements

Added (u)long vector /, % and divrem overloads for (s)byte, (u)short and (u)int divisors, with a latency of ~5 fewer cycles and while using 4 fewer instructions
Optimized (u)long and (U)Int128 intsqrt algorithms by replacing a loop based algorithm with straight line code. For 64-bit integers, this implementation is up to two times faster. For 128-bit integers it is up to 14 times faster, yet a little slower if the argument is below ~2^57. These intsqrt versions can now be constant-evaluated at compile time. If the global compilation option for OptimizeFor is set to OptimizeFor.Size, the much smaller loop based algorithm is chosen at compile time.
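For readers unfamiliar with loop based integer square roots, the classic digit-by-digit method is sketched below. This is a textbook algorithm chosen for illustration; the release note does not say which loop the library uses, and the straight line replacement is not shown here. All names are hypothetical.

```csharp
using System;

public static class IntSqrtSketch
{
    // Hypothetical sketch, not MaxMath code: floor(sqrt(x)) via the digit-by-digit method.
    public static uint IntSqrt(ulong x)
    {
        ulong result = 0;
        ulong bit = 1UL << 62;          // highest power of four representable in 64 bits

        while (bit > x)                 // skip powers of four above the argument
        {
            bit >>= 2;
        }

        while (bit != 0)
        {
            if (x >= result + bit)
            {
                x -= result + bit;
                result = (result >> 1) + bit;
            }
            else
            {
                result >>= 1;
            }
            bit >>= 2;
        }

        return (uint)result;
    }
}
```

The loop runs once per pair of input bits, which is the kind of data-dependent iteration that straight line code avoids.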
Optimized (s)byte vector square and myByteVector_0 * myByteVector_0 when compiling for x86 with SSE4 or higher, having derived and implemented a novel 8-bit integer-square algorithm due to the lack of a native SIMD 8 bit multiplication instruction on x86. The implementation of the algorithm has a latency of 5 or 6 cycles instead of the 8 or 9 cycles (CPU specific) associated with generalized 8 bit integer multiplication implemented in software, at the cost of 17 (highly parallel) instructions and 4 constants instead of 6 instructions and 1 constant, and is thus only used if COMPILATION_OPTIONS.OPTIMIZE_FOR is set to OptimizeFor.Performance. Most notably, this algorithm reduces the latency of (s)byte intpow, since squaring is part of its loop
Optimized vector (s)byte intsqrt, having derived and implemented a novel 8-bit integer-square-root algorithm, reducing the latency by 2 to 5 (unsigned) - or 5 to 8 cycles (signed), respectively, and removing up to 8 instructions, except for non 16/32-element (s)byte vectors, where a respective 9 or 4 instructions (unsigned), or 7 or 2 instructions (signed) were added instead. The previously implemented algorithm is selected for those vectors if the global compilation option for OptimizeFor is set to OptimizeFor.Size
Optimized vector [] operator to avoid repeated memory reads and writes when multiple scalar values are sequentially assigned to non-constant indices of a vector
Optimized count for bool2 and bool4 inputs when compiling for an architecture that supports SIMD
Optimized scalar fallback- and vectorized versions of bits_depositparallel and bits_extractparallel by utilizing an O(log2(n)) algorithm over the previously implemented O(n) algorithm, where n is the bit-width of the respective scalar datatype. Additionally, this algorithm is ~50x to ~100x faster when mask is constant
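To make the semantics concrete, here is a simple reference model of parallel bit deposit and extract in the style of the x86 PDEP and PEXT instructions. It is the straightforward linear-time form (one iteration per set mask bit), shown only for illustration; it is neither the library's previous implementation nor the O(log2(n)) algorithm, and the names are hypothetical.

```csharp
using System;

public static class ParallelBitsSketch
{
    // Hypothetical reference model, not MaxMath code.
    // Scatters the low bits of 'value' to the positions of the set bits in 'mask'.
    public static uint DepositParallel(uint value, uint mask)
    {
        uint result = 0;
        for (uint bit = 1; mask != 0; bit <<= 1)
        {
            uint lowest = mask & (~mask + 1);   // isolate the lowest set mask bit
            if ((value & bit) != 0)
            {
                result |= lowest;
            }
            mask &= mask - 1;                   // clear the lowest set mask bit
        }
        return result;
    }

    // Gathers the bits of 'value' at the set positions of 'mask' into the low bits of the result.
    public static uint ExtractParallel(uint value, uint mask)
    {
        uint result = 0;
        for (uint bit = 1; mask != 0; bit <<= 1)
        {
            uint lowest = mask & (~mask + 1);
            if ((value & lowest) != 0)
            {
                result |= bit;
            }
            mask &= mask - 1;
        }
        return result;
    }
}
```

With mask 0x1A (bits 1, 3 and 4 set), depositing 0x5 yields 0x12, and extracting from 0x12 with the same mask returns 0x5.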
Reduced latency, code size and constant data for (U)Int128 bits_depositparallel by varying (but substantial) amounts if compiling for X86.BMI2 (= AVX2) by utilizing the 64-bit hardware-supported variants, analogous to how bits_extractparallel has already been implemented for (U)Int128
Reduced latency (up to 1 cycle), code size (up to 2 instructions) and constant data (up to 48 bytes) of most vector constructors that combine multiple vectors together, if compiling for ARM or x86.SSE4 or higher
Reduced latency of vector (u)long % operator by 9 to 11 cycles and removed 15 instructions if only the remainder is used; using the / operator in the same context or using divrem results in 8 fewer instructions instead
Reduced latency of int8 / and % operators by 3 cycles and removed at least 2 instructions
Reduced latency of vector Divider<(u)int> initialization by 3 to 4 cycles and removed 3 or 5 instructions
Reduced latency of vector double to (u)long conversion by 2 cycles if compiling for AVX2
Reduced latency of scalar quarter to float and double conversion by 4 cycles and removed 7 instructions. If compiling for BMI2 (i.e. AVX2), 3 further instructions are replaced by 2 instructions with the same latency, saving another 1 byte in code size
Reduced latency of vector quarter to float and double conversion by up to 3 cycles and removed up to 2 instructions
Reduced latency of scalar- and vector software implemented narrowing floating point to floating point type conversion by up to 3 cycles and removed up to 4 i...

18 Jan 14:11

MrUnbelievable92

2.9.9

4760658

v2.9.9

Known Issues

half8/16 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
scalar and vector half conversion operators and related functions differ from the slightly incorrect Unity.Mathematics implementation. Specifically, if a float or double value is converted to a half, and if the value to be converted is exactly halfway between two adjacent representable half values, Unity's implementation rounds up while this library's implementation truncates, which matches the hardware default rounding direction when converting a double to a float. The same behavior occurs when converting any integer type to half values, because Unity.Mathematics does not implement an optimized integer to half conversion operator/function but rather first converts the integer to a float implicitly, before casting to half
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
LLVM, in many cases, generates very poor code for all vectors with small fields (16 bit and below). This can be fixed by, for instance, byte16 only having two ulongs as fields and exposing the individual bytes as properties. This is API breaking, since you cannot take the address of properties, affecting unsafe pointer code; in and ref code referencing those fields can be fixed with those fields becoming ref properties, while out used on those fields may only be used if the vector is already initialized. This will have to be changed for much better performance but this change is reserved for version 3.0

Fixes

Fixed managed fallback of roundto[all integer types] for float, double and quadruple arguments for values near 0
Fixed SIMD div and rem for Unity.Mathematics integer vector types
Fixed float8 != comparison if compiling for AVX(2)
Fixed toboolsafe for double4 and all signed integer vector types
Fixed SSE2 fallback for converting a quarter vector to any integer vector
Fixed software implemented floating point conversion from wider types to narrower types sometimes not rounding to the nearest representable value
Fixed int, long and Int128 minmag and maxmag always- (minmag) or never (maxmag) returning the respective MinValue, if either argument is equal to MinValue

Additions

Added half16, quarter16 and quarter32 due to more and more quarter and half function overloads having been implemented that are faster when not casting to a float scalar/vector
Added bits_select(a, b, c) for each scalar- and vector integer type. This performs the same operation as select, just for each bit, and thus takes in a c that is of the same type as a and b
Added cand for each vector integer type. These reduce vectors to a scalar integer of that type by applying bitwise AND operations between each element
Added reverse to reverse the element order in a vector of any type
Added a shuffle overload for all vector types that does not take a Unity.Mathematics.ShuffleComponent as an argument. The second parameter is a vector with the same number of not necessarily identical elements as the first parameter, holding the indices pointing to elements in the first parameter, determining the order of elements in the returned vector. Example: shuffle(new int4(9, 99, 999, 9999), new byte4(0, 3, 3, 2)); // result: int4(9, 9999, 9999, 999)
Added mulwide for scalar and vector integer types, which performs full precision multiplication and returns the respective low and high halves of the result as out parameters of the same type as the input parameters
Added floortoint, ceiltoint and trunctoint for all combinations of floating point types (input parameter) and integer types (return type). trunctoint wraps default floating point to integer casting, while offering an optional Promise parameter when quarter, half or quadruple are involved
Added tohalfunsafe, converting any other scalar or vector type to a half scalar or vector type, utilizing faster and smaller code paths via a Promise parameter
Added tofloatunsafe and todoubleunsafe with quarter, half and quadruple parameter types, utilizing faster and smaller code paths via a Promise parameter
Added the "trivial" quadruple overloads lerp, unlerp, remap, clamp, saturate, dot, frac, sign, modf, length, lengthsq, distance, distancesq, smoothstep, step, avg, fastrcp, div, divrem, dad, dsub, addsaturated, subsaturated, mulsaturated, divsaturated, tobytesaturated, tosbytesaturated, toushortsaturated, toshortsaturated, touintsaturated, tointsaturated, toulongsaturated, tolongsaturated, touint128saturated, toint128saturated, toquartersaturated, tohalfsaturated, tofloatsaturated, todoublesaturated, ceilmultiple, truncmultiple, roundmultiple, floormultiple, reversebytes, negate, maxmag, minmag, minmaxmag, minmax, angledelta, angledeltasgn, angledeltadeg, angledeltasgndeg, smoothlerp, pingpong, repeat, tobool, toboolsafe, toquarterunsafe, tohalfunsafe, toquadruple, toquadruplesafe, exp2 (integer parameters)
Missing quadruple overloads are: tan, tanh, atan, atan2, cos, cosh, acos, sin, sinh, asin, sincos, asinh, acosh, atanh, pow, exp, exp2, exp10, log, log2, log10, erf, erfc, gamma, hypot

Improvements

Added (u)long vector /, % and divrem overloads for (s)byte, (u)short and (u)int divisors, with a latency of ~5 fewer cycles and while using 4 fewer instructions
Optimized (u)long and (U)Int128 intsqrt algorithms by replacing a loop based algorithm with straight line code. For 64-bit integers, this implementation is up to two times faster. For 128-bit integers it is up to 14 times faster, yet a little slower if the argument is below ~2^57. These intsqrt versions can now be constant-evaluated at compile time. If the global compilation option for OptimizeFor is set to OptimizeFor.Size, the much smaller loop based algorithm is chosen at compile time.
Optimized (s)byte vector square and myByteVector_0 * myByteVector_0 when compiling for x86 with SSE4 or higher, having derived and implemented a novel 8-bit integer-square algorithm due to the lack of a native SIMD 8 bit multiplication instruction on x86. The implementation of the algorithm has a latency of 5 or 6 cycles instead of the 8 or 9 cycles (CPU specific) associated with generalized 8 bit integer multiplication implemented in software, at the cost of 17 (highly parallel) instructions and 4 constants instead of 6 instructions and 1 constant, and is thus only used if COMPILATION_OPTIONS.OPTIMIZE_FOR is set to OptimizeFor.Performance. Most notably, this algorithm reduces the latency of (s)byte intpow, since squaring is part of its loop
Optimized vector (s)byte intsqrt, having derived and implemented a novel 8-bit integer-square-root algorithm, reducing the latency by 2 to 5 (unsigned) - or 5 to 8 cycles (signed), respectively, and removing up to 8 instructions, except for non 16/32-element (s)byte vectors, where a respective 9 or 4 instructions (unsigned), or 7 or 2 instructions (signed) were added instead. The previously implemented algorithm is selected for those vectors if the global compilation option for OptimizeFor is set to OptimizeFor.Size
Optimized vector [] operator to avoid repeated memory reads and writes when multiple scalar values are sequentially assigned to non-constant indices of a vector
Optimized count for bool2 and bool4 inputs when compiling for an architecture that supports SIMD
Optimized scalar fallback- and vectorized versions of bits_depositparallel and bits_extractparallel by utilizing an O(log2(n)) algorithm over the previously implemented O(n) algorithm, where n is the bit-width of the respective scalar datatype. Additionally, this algorithm is ~50x to ~100x faster when mask is constant
Reduced latency, code size and constant data for (U)Int128 bits_depositparallel by varying (but substantial) amounts if compiling for X86.BMI2 (= AVX2) by utilizing the 64-bit hardware-supported variants, analogous to how bits_extractparallel has already been implemented for (U)Int128
Reduced latency (up to 1 cycle), code size (up to 2 instructions) and constant data (up to 48 bytes) of most vector constructors that combine multiple vectors together, if compiling for ARM or x86.SSE4 or higher
Reduced latency of vector (u)long % operator by 9 to 11 cycles and removed 15 instructions if only the remainder is used; using the / operator in the same context or using divrem results in 8 fewer instructions instead
Reduced latency of int8 / and % operators by 3 cycles and removed at least 2 instructions
Reduced latency of vector Divider<(u)int> initialization by 3 to 4 cycles and removed 3 or 5 instructions
Reduced latency of vector double to (u)long conversion by 2 cycles if compiling for AVX2
Reduced latency of scalar quarter to float and double conversion by 4 cycles and removed 7 instructions. If compiling for BMI2 (i.e. AVX2), 3 further instructions are replaced by 2 instructions with the same latency, saving another 1 byte in code size
Reduced latency of vector quarter to float and double conversion by up to 3 cycles and removed up to 2 instructions
Reduced latency of scalar- and vector software implemented narrowing floating point to floating point type conversion by up to 3 cycles and removed up to 4 instructions
Reduced laten...

26 Oct 19:25

MrUnbelievable92

2.9.0

7f4752f

v2.9.0

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does

Fixes

Fixed XML documentation not showing descriptions for valid Promise flags
Fixed cminmax documentation
bitmask64 with numBits equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
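One common way such an edge case arises in C# is the shift-count masking of 64-bit shifts. The sketch below illustrates that pitfall under the assumption that the mask starts at bit 0; whether this was the actual cause of the fixed bug is not stated in the release note, and the names are hypothetical.

```csharp
using System;

public static class BitmaskSketch
{
    // Hypothetical sketch, not MaxMath code.
    // C# uses only the low 6 bits of the shift count for 64-bit operands,
    // so (1UL << 64) evaluates to 1 and the naive formula returns 0 for numBits == 64.
    public static ulong NaiveMask(int numBits)
    {
        return (1UL << numBits) - 1;
    }

    // Handles the full-width case explicitly.
    public static ulong Mask(int numBits)
    {
        return numBits >= 64 ? ulong.MaxValue : (1UL << numBits) - 1;
    }
}
```

NaiveMask(64) returns 0, while Mask(64) returns a value with all 64 bits set.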
Fixed uint8 to float8 type conversion if compiling for AVX2
Fixed incorrect mod implementations
(ISSUE #16) Fixed float and double (r)cbrt edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector float implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid Promise flags allow for call-site selection of faster code paths

Additions

`Divider<T>`

Divider<T> is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type T, a Divider<T> instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms, typically used by compilers for compile time constant divisors.
Divider<T> was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
Divider<T> is NOT meant to replace division operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable i is the divisor in a for-loop).
Numerous Promise flags allow for faster operations, provided that the Divider<T> instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Burst's inability to compile-time evaluate typeof(T) often requires explicit initialization (example: new Divider<byte>((byte)42)). DEBUG only validity checks ensure correct initialization and usage.
The current Divider API consists of...:

/ and % operators:
- LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of that type
- LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of that type
- LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
DivRem member methods
EvenlyDivides member methods
T Divisor as a readonly property
public const Promises within Divider<T>, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
Get/SetInnerDivider<U> methods: get or set a scalar- or vector Divider<U> within a Divider<T>
Component shuffles: Divider<T>.wzxy swizzle "operators" as properties.

NOTE: Get/SetInnerDivider<U> methods and Divider<T>.[a][b][c][d] properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a Divider<int2> from a Divider<int4>. Get/SetInnerDivider<U> are placeholders both for these operations as well as for the v[a]_[b] properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.
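The general idea of replacing division by a precomputed multiplication and shift can be sketched for 16-bit unsigned operands as follows. This uses one well-known scheme (a rounded-up reciprocal scaled by 2^32); it is not claimed to be either of the two algorithms Divider<T> selects from, and the type and member names are hypothetical.

```csharp
using System;

// Hypothetical sketch, not MaxMath code: division of 16-bit unsigned values
// by a fixed divisor using one multiplication and one shift.
public readonly struct DividerSketch
{
    private readonly ulong magic;

    public ushort Divisor { get; }

    public DividerSketch(ushort divisor)
    {
        if (divisor == 0)
        {
            throw new DivideByZeroException();
        }

        Divisor = divisor;
        // ceil(2^32 / divisor); exact for every 16-bit numerator with this scaling
        magic = 0xFFFFFFFFUL / divisor + 1;
    }

    public ushort Divide(ushort n)
    {
        return (ushort)((magic * n) >> 32);
    }

    public ushort Remainder(ushort n)
    {
        return (ushort)(n - Divide(n) * Divisor);
    }
}
```

The cost of computing the magic constant is paid once in the constructor, which is why a gain is only to be expected when the same divisor is reused: new DividerSketch(7).Divide(1000) yields 142 and Remainder(1000) yields 6.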

`quadruple` (PREVIEW)

Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE754 compliant and in the typical 1 sign bit, 15 exponent bits, 112 mantissa bits format.
NOTE: quadruple is in preview for an unforeseeable amount of time. This means that it is neither completely optimized, nor are all maxmath functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, -(unary), +(binary), -(binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign

Functions

Added isnormal and issubnormal functions for floating point types
Added hypot and inthypot functions for calculating [int]sqrt(a * a + b * b) without overflow, unless an optional Promise parameter with its NoOverflow flag set is passed as a compile time constant argument
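The overflow problem and one textbook way around it can be sketched for doubles: scale by the larger magnitude so that the squared term never exceeds 1. This is only an illustration under that assumption; the technique used by the library is not described in the release note, special values (NaN, infinities) are not handled here, and the names are hypothetical.

```csharp
using System;

public static class HypotSketch
{
    // Hypothetical sketch, not MaxMath code: sqrt(a * a + b * b) without
    // overflowing in the intermediate squares.
    public static double Hypot(double a, double b)
    {
        a = Math.Abs(a);
        b = Math.Abs(b);
        double max = Math.Max(a, b);
        double min = Math.Min(a, b);

        if (max == 0.0)
        {
            return 0.0;
        }

        double ratio = min / max;               // always in [0, 1]
        return max * Math.Sqrt(1.0 + ratio * ratio);
    }
}
```

For inputs such as 3e200 and 4e200 the naive formula overflows to infinity because the squares exceed the double range, while the scaled form stays finite.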
Added roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
Added cor and cxor. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
Split approx into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
Added roundmultiple(x, m), floormultiple(x, m), ceilmultiple(x, m) and truncmultiple(x, m) for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x to the nearest greater multiple of m)
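For floating point inputs, the four rounding modes can be pictured as dividing by m, rounding the quotient and multiplying back. The sketch below assumes a positive m and uses hypothetical names; tie handling in the rounded variant follows Math.Round (ties to even), which may differ from the library's behavior.

```csharp
using System;

public static class MultipleSketch
{
    // Hypothetical sketches, not MaxMath code. 'm' is assumed to be positive.
    public static double FloorMultiple(double x, double m)
    {
        return Math.Floor(x / m) * m;
    }

    public static double CeilMultiple(double x, double m)
    {
        return Math.Ceiling(x / m) * m;
    }

    public static double TruncMultiple(double x, double m)
    {
        return Math.Truncate(x / m) * m;
    }

    public static double RoundMultiple(double x, double m)
    {
        return Math.Round(x / m) * m;
    }
}
```

With m = 5: CeilMultiple(7, 5) gives 10, FloorMultiple(7, 5) gives 5, and for negative input FloorMultiple(-7, 5) gives -10 while TruncMultiple(-7, 5) gives -5.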
Added a whole stack of bit manipulation functions for all scalar- and vector integer types: parityodd, parityeven, countzerobits, l1cnt, t1cnt, lzmask, tzmask, l1mask, t1mask, bits_extractlowest0, bits_masktolowest, bits_masktolowest0, bits_maskfromlowest, bits_maskfromlowest0, bits_setlowest, bits_surroundlowest and bits_surroundlowest0

Global Compilation Options

Added Global Compilation Options for OptimizeFor, FloatMode and FloatPrecision. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to OptimizeFor.Performance, FloatMode.Default and FloatPrecision.Standard, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs

Improvements

Performance

Implemented optimized (u)long vector to float vector type conversion operators
Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
Implemented an AssumeRangeAttribute equivalent for all vectorized functions with known return value ranges
Implemented more optimal (U)Int128 comparison operators
Implemented optimal (U)Int128 multiplication operations with- and division and modulo operations by compile time constants
Implemented optimal (U)Int128 division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any variation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()
Increased valid Promise.Unsafe0 range for (u)long intcbrt from [0, 2^46] to [0, 2^48]
Added an optional Promise parameter to gamma
Added an optional Promise parameter to erf(c)
Added an optional Promise parameter to gcd and lcm
Added quarter and half scalar- and vector function overloads for min, max, minmax, clamp, saturate, isinrange, trunc, round, ceil, floor and sign
Removed the only non-optimizing branch in vector code in the entire library within the long2/3/4 >> operator if the shift amount is not a comp...

20 Oct 19:01

MrUnbelievable92

2.3.5

e5427c0

v2.3.5

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
optimized (U)Int128 comparison operators didn't make it into this release
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths selected at compile-time-only with compile-time-only value range checks.
AVX2 (s)byte32 all_dif lookup tables are currently way too large (kilobytes)

Fixes

(Issue #10) bool8/16/32 are now blittable when not used within an IJob

Additions

added comb(n, k) for scalar- and vector integer types. This is known as the binomial coefficient or "n choose k". An optional Promise parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows (which is not true for most solutions found online that claim it), uses an O(min(k, n - k)) algorithm with respect to time
added perm(n, k) for scalar- and vector integer types. This is known as "k-permutations of n". An optional Promise parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows, uses an O(k) algorithm with respect to time
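An O(min(k, n - k)) binomial coefficient can be sketched with the multiplicative formula, splitting each step so that the full intermediate product is never formed. This is a generic illustration with hypothetical names, not the library's algorithm; returning 0 for k > n is an assumption, and for extremely large n the small correction term could itself overflow.

```csharp
using System;

public static class CombSketch
{
    // Hypothetical sketch, not MaxMath code: "n choose k" in min(k, n - k) iterations.
    public static ulong Comb(ulong n, ulong k)
    {
        if (k > n)
        {
            return 0;   // assumption: no valid selections
        }
        if (k > n - k)
        {
            k = n - k;  // symmetry: C(n, k) == C(n, n - k)
        }

        ulong result = 1;
        for (ulong i = 1; i <= k; i++)
        {
            ulong factor = n - k + i;
            // result * factor is always divisible by i. Splitting result into
            // quotient and remainder avoids computing result * factor directly.
            result = (result / i) * factor + (result % i) * factor / i;
        }
        return result;
    }
}
```

After iteration i the variable result holds C(n - k + i, i), so every intermediate value is itself a binomial coefficient no larger than the final result.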
added nextgreater(x) for all types. For integer types, it is a wrapper function for addsaturated(x, 1). For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations.
added nextsmaller(x) for all types. For integer types, it is a wrapper function for subsaturated(x, 1). For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations
added nexttoward(from, to) for all types, returning the next representable integer/floating point value(s) in a given direction, unless from is equal to to. For floating point types, from is returned if from is NaN or infinite. If to is NaN, NaN is returned. An optional Promise parameter allows for numerous optimizations.
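For floating point types, stepping to the next representable value amounts to incrementing or decrementing the bit pattern. A minimal sketch for double follows; returning the input unchanged for NaN and infinities is an assumption about the behavior hinted at above, and the names are hypothetical.

```csharp
using System;

public static class NextGreaterSketch
{
    // Hypothetical sketch, not MaxMath code: next representable double above x.
    public static double NextGreater(double x)
    {
        if (double.IsNaN(x) || double.IsInfinity(x))
        {
            return x;                   // assumption: special values pass through
        }
        if (x == 0.0)
        {
            return double.Epsilon;      // smallest positive subnormal, for +0 and -0
        }

        long bits = BitConverter.DoubleToInt64Bits(x);
        // Positive values grow with their bit pattern; negative values move
        // toward zero when the magnitude bits decrease.
        bits += x > 0.0 ? 1 : -1;
        return BitConverter.Int64BitsToDouble(bits);
    }
}
```

NextGreater(1.0) returns 1 + 2^-52, and NextGreater(double.MaxValue) returns positive infinity.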

Improvements

improved performance of 64bit vectorized division thanks to a newly implemented and further optimized algorithm from a July 13th 2022 research paper, which replaces a vectorized loop (rather slow; up to 64 iterations; no instruction level parallelism outside the loop possible until the loop finished executing, following an almost certainly mispredicted branch) with straight line code. Due to "recent" improvements to divider circuits, this code path is inferior to hardware supported scalar division via element extraction for (u)long2, specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops, and is thus only implemented for (u)long3/4 types and only if compiling for AVX2
improved performance and reduced code size of up to (s)byte8 and every (u)short vector division if not compiling with FloatMode.Fast. Reduced constants possibly read from RAM in either case.
fixed performance regression of SIMD register <-> software abstraction conversions for types using up the entirety of a hardware register
lcm for (s)byte vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
verified and increased the (u)long scalar- and vector intcbrt Promise.Unsafe0 range from [0, 1ul << 40] to [0, 1ul << 46], the code path of which is also possibly chosen at compile time
implemented optimized quarter{X} IEEE-754 comparison operators (without having to cast to float{X}). Vectorized half{X} comparisons are implemented in MaxMath.Intrinsics.Xse as well and used where appropriate. compareto function overloads for quarter{X} and half{X} were implemented.
reduced latency of add/subsaturated for scalar Int128s, scalar and vector longs as well as vector ints by about a third
replaced the call to BigInteger.ToString() in (U)Int128.ToString(null, null), and thus unnecessary heap allocations, with an optimized implementation
(u)short8 / and % operators now correctly check for SSE2 support rather than AVX2
removed aliased fixed size buffers from all types, also improving indexer operator performance if the index is a compile time constant (in some cases)

Changes

Burst compiled code that uses a Promise argument which is not a compile time constant will throw an exception in DEBUG, as it represents significant overhead instead of an optimization. The exception currently reports the Burst compiled job/function that threw it rather than the name of the function that received the Promise argument.

Fixed Oversights

added explicit type conversion operators for scalar floats and doubles to half8 and all quarter vectors (as well as scalar halfs to quarter vectors)


18 Aug 06:18

MrUnbelievable92

2.3.0

981f38f

v2.3.0

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
optimized (U)Int128 comparison operators didn't make it into this release
using bool vectors generated from 256 bit input vectors like so: long4 x = select(a, b, >>> myLong4a < myLong4b <<<) (as an example) does not generate the most efficient machine code possible
unit tests for 64-bit bits_zerohigh functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
unit tests for intrinsics code paths for all functions that use "(mm256_)shuffle_ps" or "(mm256_)blendv_ps" can fail semi-randomly due to a bug which changes the bit content of ints which would be NaN if dereferenced as a float and written back to memory (reported)
most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths selected at compile-time-only with compile-time-only value range checks.
(s)byte32 all_dif lookup tables are currently way too large (kilobytes)

Fixes

fixed quarter rounding behavior when casting a wider floating point type to a quarter to round towards the nearest representable value instead of truncating the mantissa

Additions

added namespace `MaxMath.Intrinsics` for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. `if (Burst.Intrinsics.X86.Sse2.IsSse2Supported)` blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.

added flags enum `Promise`, with values `Nothing`, `Everything`, `NoOverflow`, `ZeroOrGreater`, `ZeroOrLess`, `NonZero` and `Unsafe0` through `Unsafe3`, as well as the composites `Positive` and `Negative`. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise ... of never introducing breaking changes in this regard)
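
The idea of an optional promise parameter can be sketched as follows. This is a hypothetical stand-in, not the library's `Promise` enum: the enum name `PromiseSketch`, its bit values and the `Factorial` helper are assumptions made purely for illustration.

```csharp
using System;

// Hypothetical stand-in for illustration; NOT the library's Promise enum.
// The member values below are assumptions.
[Flags]
public enum PromiseSketch
{
    Nothing    = 0,
    NoOverflow = 1 << 0,
    NonZero    = 1 << 1,
}

public static class PromiseDemo
{
    // Computes n! as a uint. Without a promise the result is clamped to
    // uint.MaxValue when it would overflow. If the caller promises
    // NoOverflow, the range check is skipped and the caller owns the risk.
    public static uint Factorial(uint n, PromiseSketch promises = PromiseSketch.Nothing)
    {
        // 12! = 479001600 is the largest factorial that fits in 32 bits.
        if ((promises & PromiseSketch.NoOverflow) == 0 && n > 12)
            return uint.MaxValue;

        uint result = 1;
        for (uint i = 2; i <= n; i++)
            result = unchecked(result * i); // wraps silently if the promise was false
        return result;
    }
}
```

A broken promise does not throw here; it simply yields a wrapped result, which is the trade-off of "faster, yet more unsafe" code.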

Other Additions

added factorial (for integer types) and gamma (floating point types) functions. factorial, when called without a Promise parameter, clamps the result to type.MaxValue in case of overflow
added erf(c), the (complementary) error function for floating point types
added (c)minmag and (c)maxmag functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to abs(x) > abs(y) ? x : y (maxmag) or abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c) (cmaxmag)
added (c)minmax and (c)minmaxmag functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as out parameters
added bitfield functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
added copysign(x, y) functions for signed types, which is equivalent to return y < 0 ? nabs(x) : abs(x)
added (naive?) implementation for scalar- and vector float/double inverse hyperbolic functions asinh, acosh and atanh
added intlog10 functions (integer base ten logarithm)
added the bit test/bt family of functions for scalar and vector integer types. A testbit(POST_ACTION)((ref)x, i) function returns a boolean (vector), indicating whether the bit in x at index i is 1 and may (or may not) flip, set, or reset that bit afterwards
added a new category of type conversion functions with the suffix "unsafe". Added to(u)longunsafe and todoubleunsafe with a Promise parameter, allowing for up to two levels of optimization (vectorized 64bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default double <-> (u)long conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
added scalar/vectorized toquarterunsafe allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
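
A scalar sketch of three of the additions listed above (maxmag, copysign and a bit test with a post action) may help. The class and method names are hypothetical and only mirror the expressions quoted in the notes; the library's actual overloads and naming may differ.

```csharp
using System;

// Hypothetical scalar illustrations; not the library's implementations.
public static class OtherAdditionsSketch
{
    // maxmag: abs(x) > abs(y) ? x : y
    public static float MaxMag(float x, float y) => Math.Abs(x) > Math.Abs(y) ? x : y;

    // copysign for a signed integer: y < 0 ? nabs(x) : abs(x)
    public static int CopySign(int x, int y)
    {
        // The negative absolute value never overflows, unlike abs(int.MinValue).
        int negAbs = x > 0 ? -x : x;
        // Negating int.MinValue wraps back to int.MinValue in unchecked arithmetic.
        return y < 0 ? negAbs : unchecked(-negAbs);
    }

    // Plain bit test: is bit i of x set? Expects i in [0, 31].
    public static bool TestBit(uint x, int i) => ((x >> i) & 1u) != 0;

    // Bit test followed by flipping that bit (assumed post action name).
    public static bool TestBitFlip(ref uint x, int i)
    {
        bool wasSet = TestBit(x, i);
        x ^= 1u << i;
        return wasSet;
    }
}
```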

Improvements

improved performance of several vector operators and function overloads for types that use up an entire hardware register while having to be up-cast to a wider type considerably - surrounding boilerplate code uses a new "in-house" faster-than-hardware algorithm with its dependency chain latency having been reduced from x [0 <= x <= 3] + (9 or 10) clock cycles down to x + (0 or 1 or 3) + (1 or 3) clock cycles

massive performance improvements for all vector types that are not a total of 128 or 256 bits wide, respectively, either through the `Avx.[...]undefined[...]` compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like `movq xmm0, xmm0` (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to `Unity.Mathematics` types when passed to `maxmath` functions.

improved performance throughout the library by effectively adding hundreds of thousands more `Unity.Burst.CompilerServices.Constant.IsConstantExpression` condition checks to many functions within the library. Most notably, algorithms whose total latency is dependent on the byte size of arguments may now perform much faster. Some but not yet all of these constant checks are exposed through a `Promise` parameter

Other Improvements

improved performance of scalar (u)short to (u)short2/3/4 conversion
reduced latency of all, any, first, last, count and bitmask functions for bool8/16/32 when used with an expression as the argument, such as all(x != y) - a way to force the compiler to omit unnecessary instructions was found
reduced latency of addsaturated for scalar unsigned integer types
reduced latency of float/double to (U)Int128 conversion
reduced latency of shl, shrl and shra and thus all functions using those - especially for: shl for (s)byte vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; shl for (u)short vectors of 4 or more elements if compiling for at least SSE4; shra for (u)long vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
reduced long2/3/4 shra code size and latency by another 2 clock cycles if compiling for AVX2
reduced latency of variable rol/r vector functions beyond shl/r improvements and added an optional Promise parameter, allowing the caller to promise the rotation values are in a specific range
reduced latency of long2/3/4 "is negative checks" - mylong4 < 0/0 > mylong4 by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
reduced latency of (u)long2/3/4 isinrange functions
reduced latency of unsigned byte and ushort vector to float vector conversion. This also affects performance of (s)byte and (u)short vector intsqrt functions, as well as the respective % and / operators (byte2/3/4/8, all ushort vectors)
reduced (u)long vector intcbrt latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
added hidden and retroactively improved exp2 scalar and vector integer argument function overloads. These return exp2((float/double)x) or (float/double)(1 << x) in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The (u)int overloads convert to floats, the (u)long overloads convert to doubles; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional 'Promise' parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's exp2 implementation. If you ever used the standard exp2 function by implicitly converting an int type to a float type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts (u)long types to float types implicitly if the result is expected to be a float type. It is recommended to explicitly cast the (u)long type to a (u)int type in such a case
added ==, !=, <, >, <= and >= operators for UInt128 and signed long/int comparisons, as the expensive float conversion and comparison was previously used when, for instance, compar...
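
The integer exp2 overloads described above can avoid a float conversion plus a transcendental call by writing the exponent field directly. The following is a minimal sketch under stated assumptions, not the library's code: the names are hypothetical, it needs `BitConverter.Int32BitsToSingle`, and it flushes results below 2^-126 to zero instead of producing subnormals.

```csharp
using System;

// Hypothetical illustration; not the library's implementation.
public static class Exp2Sketch
{
    // Returns 2^x as a float by placing the biased exponent (x + 127)
    // into bits 23..30 of an IEEE 754 binary32 value.
    public static float Exp2(int x)
    {
        // Clamping keeps the exponent field in range:
        // -127 maps to a zero exponent field (+0), 128 maps to +infinity.
        // A caller who can promise x is within [-126, 127] could skip this.
        int clamped = Math.Max(-127, Math.Min(128, x));
        return BitConverter.Int32BitsToSingle((clamped + 127) << 23);
    }
}
```

The clamp is the part that a promise about the input range would allow a caller to omit.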


13 Sep 00:18

MrUnbelievable92

2.2.0

e08e11e

MaxMath v2.2.0

Known Issues

half8 == and != operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal. For (U)Int128, it requires a new Burst feature à la T Constant.ForceCompileTimeEvaluation<T, U>(Func<U, T> code)(proposed); Currently work is being done on (s)byte and (u)short vectors in this regard, which will beat any compiler. The current (tested) state of all optimizations possible is included in this version.
pow functions with compile time constant exponents currently do not handle many decimal numbers - math.rsqrt would often be used in those cases for optimal performance but it is actually slower when the Unity.Burst.FloatMode is set to anything but FloatMode.Fast. To guarantee optimal performance, compile time access to the current FloatMode would be needed (proposed)
double (r)cbrt functions are currently not optimized

Fixes

linked float8 rcp and rsqrt functions to Burst's FloatMode and FloatPrecision
short.MinValue / -1 now correctly overflows to short.MinValue when dividing a short16 vector by another short16 vector when compiling for AVX or higher
fixed scalar quarter to double conversion for when the quarter value is negative
fixed scalar half to quarter conversion for when the half value is negative
fixed vector quarter to ulong conversion for when a quarter value is negative
fixed (u)short8 to quarter8 conversion

Additions

Added saturation arithmetic to the library for all scalar- and vector types. Saturation arithmetic clamps the result of an operation to `type.MinValue` and `type.MaxValue` if under- or overflow occurs, respectively and has single-instruction hardware support for `(s)bytes` and `(u)shorts`. The included functions are:

addsaturated
subsaturated
mulsaturated
divsaturated (only clamps division of floating point types and signed division of, for instance, sbyte.MinValue (= -128) / -1 to sbyte.MaxValue (= 127), which would cause a hardware exception for ints and longs)
castsaturated (all types to all other types with a smaller range)
csumsaturated
cprodsaturated
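
As a scalar illustration of the clamping described above, here is a minimal sketch for 8 bit types. It is not the library's code; the class and method names are hypothetical and the vectorized versions would use hardware instructions instead.

```csharp
using System;

// Hypothetical scalar illustration of saturation arithmetic.
public static class SaturationSketch
{
    // Signed add: compute in int (cannot overflow), then clamp to the sbyte range.
    public static sbyte AddSaturated(sbyte a, sbyte b)
    {
        int sum = a + b;
        return (sbyte)Math.Max((int)sbyte.MinValue, Math.Min((int)sbyte.MaxValue, sum));
    }

    // Unsigned subtract: clamp at zero instead of wrapping around.
    public static byte SubSaturated(byte a, byte b) => a > b ? (byte)(a - b) : (byte)0;

    // Signed divide: the only overflowing case is MinValue / -1,
    // whose true result (128) is clamped to sbyte.MaxValue (127).
    public static sbyte DivSaturated(sbyte a, sbyte b)
        => (a == sbyte.MinValue && b == -1) ? sbyte.MaxValue : (sbyte)(a / b);
}
```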

(U)Int128

added high performance (U)Int128 types with full library support, meaning: all operators and type conversions as well as all functions support these types. Most operations of both types, in Burst code, compile down to optimal machine code. Exceptions: 1) signed 64x64 bit to 128 bit multiplication 2) *, /, % and divrem functions with a scalar compile time constant argument (See: Known Issues 2)
added Random128 XOR-Shift pseudo random number generator for generating (U)Int128s

Cube Root

added high performance & accuracy (r)cbrt - (reciprocal) cube root functions for scalar and vector float- and double types based on a research paper from 2021. An optional bool parameter, set to false by default, allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f))
added high performance intcbrt - integer cube root functions for all scalar and vector integer types. For signed integer types, an optional bool parameter, set to false by default, allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f))
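
For readers unfamiliar with integer cube roots, the following sketch shows the expected results using a plain binary search. This is a swapped-in reference technique, not the library's optimized algorithm, and all names are hypothetical. The signed variant always handles negative inputs, whereas the library makes that opt-in.

```csharp
using System;

// Hypothetical reference implementation via binary search.
public static class IntCbrtSketch
{
    // Largest r with r^3 <= x for a 32 bit unsigned integer.
    public static uint IntCbrt(uint x)
    {
        // 1625^3 = 4291015625 <= uint.MaxValue < 1626^3, so 1625 bounds the result.
        uint lo = 0, hi = 1625;
        while (lo < hi)
        {
            uint mid = (lo + hi + 1) / 2;
            // The cube is computed in 64 bits so it cannot overflow.
            if ((ulong)mid * mid * mid <= x) lo = mid;
            else hi = mid - 1;
        }
        return lo;
    }

    // Signed variant that handles negative inputs by symmetry:
    // cbrt(-x) = -cbrt(x). The result is rounded toward zero.
    public static int IntCbrtSigned(int x)
    {
        if (x >= 0) return (int)IntCbrt((uint)x);
        uint magnitude = (uint)(-(long)x); // safe even for int.MinValue
        return -(int)IntCbrt(magnitude);
    }
}
```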

Other Additions

added a log function to all scalar and vector float- and double types with a second parameter b, which is the logarithm's base
added reversebytes functions for all scalar- and vector types, which convert back and forth between big endian and little endian byte order, respectively. All of them (scalar, vector) compile down to single hardware instructions
added pow functions with scalar exponents for float and double scalars and vectors, with optimizations for selected constant exponents (not necessarily whole exponents)
added function overloads to all functions for scalar (s)bytes and (u)shorts in order to resolve function call resolution ambiguity which was already present in Unity.Mathematics, which may also improve performance in some cases
added a static readonly New property to RandomX XOR-Shift pseudo random generators. It calls Environment.TickCount internally (and is thus seeded somewhat randomly), makes sure it is non-zero and can be called from Burst native code
added fastrcp functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
added fastrsqrt functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
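
As an example of one of the additions above, a byte order reversal can be written portably with shifts and masks. This sketch uses hypothetical names and is not the library's code; in Burst the library reports that such functions compile down to single instructions, and .NET offers a comparable `BinaryPrimitives.ReverseEndianness` in `System.Buffers.Binary`.

```csharp
using System;

// Hypothetical portable illustration of byte order reversal.
public static class ReverseBytesSketch
{
    // Swaps the four bytes of a 32 bit value: 0xAABBCCDD -> 0xDDCCBBAA.
    public static uint ReverseBytes(uint x) =>
        (x >> 24)
        | ((x >> 8) & 0x0000FF00u)
        | ((x << 8) & 0x00FF0000u)
        | (x << 24);

    // Swaps the two bytes of a 16 bit value. The operands are promoted to int,
    // and the cast back to ushort discards the bits shifted above bit 15.
    public static ushort ReverseBytes(ushort x) => (ushort)((x >> 8) | (x << 8));
}
```

Applying the function twice returns the original value, which is why the same routine converts in both directions between big endian and little endian.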

Improvements

added AVX and AVX2 code for float8 sin, cos, tan, sincos, asin, acos, atan, atan2, sinh, cosh, tanh, pow, exp, exp2, exp10, log, log2, log10 and fmod (and the % operator)
optimized many /, %, * and divrem operations with a scalar compile time constant argument for (s)byte vectors (see 'Known Issues 2'), which were previously not optimized (...optimally/at all) by Burst.
added SSE2 fallback code for converting AVX vector types to SSE vector types and vice versa (for example: short16 (256 bit) to byte16 (128 bit))
scalar (s)byte and (u)short rol and ror functions now compile down to single hardware instructions
improved performance and/or reduced code size of nearly all vector comparison operations (==, > etc.)
improved performance of, and added SSE2 fallback code for, bitfield to boolean vector conversion (toboolX and thus also select(vector a, vector b, bitmask c))
improved performance of intpow functions in general and for when the exponent is a compile time constant
improved performance and reduced code size of compareto vector functions (especially for unsigned types)
added more optimizations to isdivisible
improved performance of intsqrt functions for (u)long and (s)byte scalar and vector types considerably
reduced code size of ispow2 vector functions
reduced code size of (s)byte vector-by-vector division
improved performance of Random64's (u)long4 generation if compiling for AVX2
improved performance of (s)byte matrix multiplication
reduced code size of (u)short- and up to (s)byte8 vector by vector division and divrem functions (and improved performance if compiling for SSE2 only)
reduced code size and improved performance of isinrange functions for (u)long vector types
reduced code size of ushort vector >= and <= operators for SSE2 fallback code by ~75%
improved performance and reduced code size of SSE2 down-casting fallback code

Changes

API BREAKING CHANGE: The various boolean to integer/floating point conversion functions (touint8/tof32 etc.) are now renamed to contain C# types in their names (tobyte/tofloat etc.)
API BREAKING CHANGE: If you use this library as intended, meaning you import it and Unity.Mathematics.math statically (using static MaxMath.maxmath;) and you use the pow functions with scalar bases and scalar exponents in those scripts, you will encounter the first ever function call resolution ambiguity. It is strongly recommended to always use the maxmath.pow function, because it optimizes any pow call enormously if the exponent is a compile time constant, which does NOT necessarily mean that such a call must declare the exponent as a literal value - the exponent may become a compile time constant due to constant propagation
quarter is now a readonly struct
quarter to sbyte, short, int and long conversions are now required to be declared explicitly
removed countbits(void* ptr, ulong bytes) from the library and added it to https://github.com/MrUnbelievable92/SIMD-Algorithms with more options

Fixed Oversights

(Issue #3) added constructor wrappers to the maxmath class analogous to Unity.Mathematics (byte4 myByte4 = (maxmath.)byte4(1, 2, 3, 4);)
added dsub - fused divide-subtract function for scalar and vector float types
added an optional bool fast = false parameter to dad, dsub, dadsub and dsubadd functions
added andnot function overloads for scalar and vector bool types
added implicit type conversions of scalar quarter values to half, float and double vectors
added all_eq and all_dif functions for vectors of size 2
added all_eq and all_dif functions for float and double vectors


24 Mar 17:09

MrUnbelievable92

2.1.2

bec6dde

MaxMath v2.1.2

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed undefined behavior of "vshr" functions for vector types smaller than 128 bits
fixed SSE2 implementations of "vrol" and "vror" functions for the (u)short16 type

Additions

implemented Bmi1 and Bmi2 intrinsics as functions with a "bits_" prefix (except for "andn", which has already been implemented as "andnot")
added high performance and/or SIMD "isdivisible" functions for all integer vector types and scalar value types
added high performance and/or SIMD "intpow" - integer exponentiation - functions for (u)int, (u)long and all integer vector types
added high performance and/or SIMD "floorpow2" functions for all integer vector types
added "nabs" - negative absolute value functions for all non-boolean vector- and single value types
added "indexof(vector v, value x)" functions for all non-boolean vector types
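
Scalar reference versions of three of the functions listed above are sketched here. They use textbook techniques (exponentiation by squaring, bit smearing) and hypothetical names; they are not the library's SIMD implementations, and returning 0 from `FloorPow2(0)` is an assumption.

```csharp
using System;

// Hypothetical scalar reference implementations.
public static class IntegerUtilitySketch
{
    // intpow: exponentiation by squaring, O(log e) multiplications.
    // Results wrap around on overflow.
    public static uint IntPow(uint b, uint e)
    {
        uint result = 1;
        while (e != 0)
        {
            if ((e & 1) != 0) result = unchecked(result * b);
            b = unchecked(b * b);
            e >>= 1;
        }
        return result;
    }

    // floorpow2: largest power of two that is <= x.
    public static uint FloorPow2(uint x)
    {
        // Smear the highest set bit into every lower position...
        x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16;
        // ...then keep only the highest bit.
        return x - (x >> 1);
    }

    // nabs: negative absolute value; unlike abs it cannot overflow.
    public static int NAbs(int x) => x > 0 ? -x : x;
}
```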

Improvements

aggressively optimized away global variables (shuffle masks) and thus memory access and usage where appropriate
improved performance of 256 bit vector subvector getters
added Sse2 fallback code for all (u)long2/3/4 operators
improved performance of multiplication, division and modulo operations for all (s)byte- and (u)short vector- and matrix types when dividing by a single non-compile time constant value
added overloads for (s)byte- and (u)short vectors' "divrem" functions with a scalar value as the divisor parameter, improving performance when it is a compile time constant
improved performance of "intsqrt" functions for most types

Changes

bump com.unity.burst to version 1.5

Fixed Oversights

added bitmask8 and bitmask16 functions for (s)byte and (u)short vector types, respectively


01 Mar 00:42

MrUnbelievable92

2.1.1

9da6887

MaxMath v2.1.1

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2 due to fallback code not using a constant value for "imm8"
fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
fixed "tzcnt" implementations (were completely broken)
fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors

Additions

added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
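
The contracts described above (unsigned return types for gcd and lcm, an exception for negative intsqrt input) can be illustrated with scalar reference code. This sketch uses Euclid's algorithm and a binary search, with hypothetical names; it is not the library's high performance or SIMD implementation.

```csharp
using System;

// Hypothetical scalar reference implementations.
public static class NumberTheorySketch
{
    // gcd of two signed values, returned as an unsigned value.
    public static uint Gcd(int a, int b)
    {
        // Widening to long makes Abs safe for int.MinValue.
        uint x = (uint)Math.Abs((long)a);
        uint y = (uint)Math.Abs((long)b);
        while (y != 0)
        {
            uint t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    // lcm via |a| / gcd * |b|; wraps if the result exceeds 32 bits.
    public static uint Lcm(int a, int b)
    {
        uint g = Gcd(a, b);
        if (g == 0) return 0; // both inputs are zero
        uint absA = (uint)Math.Abs((long)a);
        uint absB = (uint)Math.Abs((long)b);
        return unchecked(absA / g * absB);
    }

    // floor(sqrt(x)); throws for negative input as described above.
    public static uint IntSqrt(int x)
    {
        if (x < 0) throw new ArgumentOutOfRangeException(nameof(x));
        // 46340^2 = 2147395600 <= int.MaxValue < 46341^2.
        uint lo = 0, hi = 46340;
        while (lo < hi)
        {
            uint mid = (lo + hi + 1) / 2;
            if (mid * mid <= (uint)x) lo = mid;
            else hi = mid - 1;
        }
        return lo;
    }
}
```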

Improvements

performance improvements of "avg" functions for signed integer vectors
added SIMD implementations of the "transpose" functions for all matrix types
added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
added SSE2 fallback code for typecasting, propagating through the entire library
added SSE2 fallback code for "addsub" and "subadd" functions
bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively

Changes

renamed "BurstCompilerException" to "CPUFeatureCheckException"
"shl", "shrl" and "shra" now have undefined behavior when bitshifting any value outside of the interval [0, 8 * sizeof(integer_type) - 1] for performance reasons and because of differences between SSE, AVX and managed C#

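The out-of-range shift behavior mentioned under Changes differs between environments, which is one reason to leave it undefined. Managed C# uses only the low 5 bits of the shift count for 32 bit operands, whereas x86 SIMD logical shifts zero the element once the count reaches the element width. The sketch below contrasts the two behaviors in scalar code; the names are hypothetical and neither function is the library's implementation.

```csharp
using System;

// Hypothetical illustration of two different out-of-range shift behaviors.
public static class ShiftRangeSketch
{
    // Managed C# semantics: for int operands the count is masked with 0x1F,
    // so shifting by 32 is the same as shifting by 0.
    public static int ShlManaged(int x, int n) => x << n;

    // Zeroing semantics, similar to SIMD logical shifts:
    // any count outside [0, 31] produces 0.
    public static int ShlZeroing(int x, int n) => (uint)n < 32u ? x << n : 0;
}
```

Code that stays within [0, 8 * sizeof(integer_type) - 1] gets the same result from both variants.
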
Fixed Oversights

added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
added "compareto" functions for all vector types except half- and quarter vectors
added "all_dif" functions for (s)byte32 vectors
added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors

2.1.1 Hotfix

Fixes

fixed SSE2 "shl", "shrl" and "shra" implementations
fixed SSE2 "intsqrt" implementations

Improvements

improved performance of (s)byte2, -3, -4, -8, -16 and (u)short2, -3, -4, -8 "gcd" functions (and thus "lcm") when compiling for Avx2
improved performance of "tzcnt" and "lzcnt" implementations for all vector types if compiling for SSE4 or higher, propagating through a lot of the library

Fixed Oversights

Added documentation for RandomX methods


26 Feb 06:37

MrUnbelievable92

2.1.0

a6b5c64

MaxMath v2.1.0

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2 due to fallback code not using a constant value for "imm8"
fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
fixed "tzcnt" implementations (were completely broken)
fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors

Additions

added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative

Improvements

performance improvements of "avg" functions for signed integer vectors
added SIMD implementations of the "transpose" functions for all matrix types
added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
added SSE2 fallback code for typecasting, propagating through the entire library
added SSE2 fallback code for "addsub" and "subadd" functions
bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively

Changes

renamed "BurstCompilerException" to "CPUFeatureCheckException"
"shl", "shrl" and "shra" now have undefined behavior when bitshifting any value outside of the interval [0, 8 * sizeof(integer_type) - 1] for performance reasons and because of differences between SSE, AVX and managed C#

Fixed Oversights

added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
added "compareto" functions for all vector types except half- and quarter vectors
added "all_dif" functions for (s)byte32 vectors
added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors


12 Feb 16:51

MrUnbelievable92

2.0.0

7ec7a5b

MaxMath v2.0.0

Re-Release Notes

Version 2.0.0 adds - for the first time - fallback procedures from Avx2 to Sse4, Sse2 and platform independent instruction sets, respectively, with some major optimizations for all of them
ARM and other instruction sets do NOT have optimized fallback procedures written for them, and there are no plans for it at this time. Burst/LLVM are good at recognizing the patterns in the code, though, and some of the code will be vectorized for other platforms (confirmed)

Known Issues

half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation

Fixes

fixed incorrect bool4 subvector getters of the bool8 type

Improvements

removed "fixed" vector element access to improve performance in managed C#

Additions

added "shuffle(vector, vector, ShuffleComponent(, ShuffleComponent)(, ShuffleComponent)(, ShuffleComponent))" functions for (s)byte, (u)short, (u)long, quarter and half vectors

Changes

Bump com.unity.burst to version 1.4.4

Fixed Oversights

Added "addsub" function for floating point types, complementary to "subadd"
Added "addsub" and "subadd" functions for integer types


Releases: MrUnbelievable92/MaxMath
