Releases: MrUnbelievable92/MaxMath
v2.9.99
Known Issues
- half8/16 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- scalar and vector half conversion operators and related functions differ from the slightly incorrect Unity.Mathematics implementation. Specifically, if a float or double value is converted to a half, and if the value to be converted is exactly halfway between two adjacent representable half values, Unity's implementation rounds up while this library's implementation truncates, which matches the hardware default rounding direction when converting a double to a float. The same behavior occurs when converting any integer type to half values, because Unity.Mathematics does not implement an optimized integer to half conversion operator/function but rather first converts the integer to a float implicitly, before casting to half
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
- LLVM, in many cases, generates very poor code for all vectors with small fields (16 bit and below). This can be fixed by, for instance, byte16 only having two ulongs as fields and exposing the individual bytes as properties. This is API breaking, since you cannot take the address of properties, affecting unsafe pointer code; in and ref code referencing those fields can be fixed with those fields becoming ref properties, while out used on those fields may only be used if the vector is already initialized. This will have to be changed for much better performance but this change is reserved for version 3.0
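The layout change described in the last item can be pictured with a small sketch in plain C#. PackedByte16 and its members are hypothetical names chosen for illustration, not MaxMath types, and the sketch makes no claim about how version 3.0 will actually be implemented.

```csharp
using System;

// Hypothetical illustration only: NOT the MaxMath byte16 type.
public struct PackedByte16
{
    // Two 64-bit fields hold all sixteen 8-bit elements.
    private ulong lo; // elements 0..7
    private ulong hi; // elements 8..15

    // Element access through an accessor instead of sixteen byte fields.
    public byte this[int index]
    {
        get
        {
            if ((uint)index > 15u) throw new IndexOutOfRangeException();
            ulong half = index < 8 ? lo : hi;
            return unchecked((byte)(half >> ((index & 7) * 8)));
        }
        set
        {
            if ((uint)index > 15u) throw new IndexOutOfRangeException();
            int shift = (index & 7) * 8;
            ulong cleared = (index < 8 ? lo : hi) & ~(0xFFUL << shift);
            ulong updated = cleared | ((ulong)value << shift);
            if (index < 8) lo = updated; else hi = updated;
        }
    }

    // A named element exposed as a property, as the note describes.
    public byte x0
    {
        get => this[0];
        set => this[0] = value;
    }
}
```

Because x0 is a property rather than a field, an expression such as &v.x0 no longer compiles, which is the API break mentioned above.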
Fixes
- Fixed managed fallback of roundto[all integer types] for float, double and quadruple arguments for values near 0
- Fixed SIMD div and rem for Unity.Mathematics integer vector types
- Fixed float8 != comparison if compiling for AVX(2)
- Fixed toboolsafe for double4 and all signed integer vector types
- Fixed SSE2 fallback for converting a quarter vector to any integer vector
- Fixed software implemented floating point conversion from wider types to narrower types sometimes not rounding to the nearest representable value
- Fixed int, long and Int128 minmag and maxmag always (minmag) or never (maxmag) returning the respective MinValue, if either argument is equal to MinValue
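The minmag/maxmag fix touches a well-known pitfall: the magnitude of MinValue is not representable in the same signed type, so a wrapping absolute value maps MinValue back to itself. The sketch below shows that pitfall and one way around it, comparing magnitudes as unsigned values. It is plain C# with hypothetical names; it is not the library's code, the release notes do not state the actual cause of the bug, and the tie-breaking rule for equal magnitudes is an assumption.

```csharp
using System;

public static class MagnitudeSketch
{
    // Magnitude as an unsigned value, so that |int.MinValue| = 2147483648 is representable.
    private static uint Magnitude(int x) => unchecked(x < 0 ? (uint)(-(long)x) : (uint)x);

    // Wrapping absolute value: returns int.MinValue for int.MinValue.
    public static int WrappingAbs(int x) => unchecked(x < 0 ? -x : x);

    // Ties (equal magnitudes) return a here; that choice is an assumption.
    public static int MinMag(int a, int b) => Magnitude(a) <= Magnitude(b) ? a : b;
    public static int MaxMag(int a, int b) => Magnitude(a) >= Magnitude(b) ? a : b;
}
```

With this approach MinMag(int.MinValue, 5) yields 5 and MaxMag(int.MinValue, 5) yields int.MinValue, whereas a comparison built on WrappingAbs would treat int.MinValue as the smallest magnitude.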
Additions
- Added scoped registry support
- Added half16, quarter16 and quarter32 due to more and more quarter and half function overloads having been implemented that are faster when not casting to a float scalar/vector
- Added bits_select(a, b, c) for each scalar- and vector integer type. This performs the same operation as select, just for each bit, and thus takes in a c that is of the same type as a and b
- Added cand for each vector integer type. These reduce vectors to a scalar integer of that type by applying bitwise AND operations between each element
- Added reverse to reverse the element order in a vector of any type
- Added a shuffle overload for all vector types that does not take a Unity.Mathematics.ShuffleComponent as an argument. The second parameter is a vector with the same amount of not necessarily identical elements as the first parameter, holding the indices pointing to elements in the first parameter, determining the order of elements in the returned vector. Example: shuffle(new int4(9, 99, 999, 9999), new byte4(0, 3, 3, 2)); // result: int4(9, 9999, 9999, 999)
- Added mulwide for scalar and vector integer types, which performs full precision multiplication and returns the respective low and high halves of the result as out parameters of the same type as the input parameters
- Added floortoint, ceiltoint and trunctoint for all combinations of floating point types (input parameter) and integer types (return type). trunctoint wraps default floating point to integer casting, while offering an optional Promise parameter when quarter, half or quadruple are involved
- Added tohalfunsafe, converting any other scalar or vector type to a half scalar or vector type, utilizing faster and smaller code paths via a Promise parameter
- Added tofloatunsafe and todoubleunsafe with quarter, half and quadruple parameter types, utilizing faster and smaller code paths via a Promise parameter
- Added the "trivial" quadruple overloads lerp, unlerp, remap, clamp, saturate, dot, frac, sign, modf, length, lengthsq, distance, distancesq, smoothstep, step, avg, fastrcp, div, divrem, dad, dsub, addsaturated, subsaturated, mulsaturated, divsaturated, tobytesaturated, tosbytesaturated, toushortsaturated, toshortsaturated, touintsaturated, tointsaturated, toulongsaturated, tolongsaturated, touint128saturated, toint128saturated, toquartersaturated, tohalfsaturated, tofloatsaturated, todoublesaturated, ceilmultiple, truncmultiple, roundmultiple, floormultiple, reversebytes, negate, maxmag, minmag, minmaxmag, minmax, angledelta, angledeltasgn, angledeltadeg, angledeltasgndeg, smoothlerp, pingpong, repeat, tobool, toboolsafe, toquarterunsafe, tohalfunsafe, toquadruple, toquadruplesafe, exp2 (integer parameters)
Missing quadruple overloads are: tan, tanh, atan, atan2, cos, cosh, acos, sin, sinh, asin, sincos, asinh, acosh, atanh, pow, exp, exp2, exp10, log, log2, log10, erf, erfc, gamma, hypot
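The descriptions of bits_select and cand can be read as the following scalar sketch in plain C#. The class and method names are hypothetical, and the operand order of BitsSelect is an assumption that follows Unity.Mathematics' select(a, b, c), which returns b where c is true and a otherwise; the MaxMath signatures may differ.

```csharp
using System;

public static class BitSketch
{
    // Per-bit select: takes the bit from b where the corresponding bit of c is 1, else from a.
    public static uint BitsSelect(uint a, uint b, uint c) => (a & ~c) | (b & c);

    // AND-reduction over all elements, here on an array standing in for a vector.
    public static uint CAnd(uint[] elements)
    {
        uint result = uint.MaxValue; // identity element of bitwise AND
        foreach (uint e in elements) result &= e;
        return result;
    }
}
```

For example, BitsSelect(0x0000FFFF, 0xFFFF0000, 0xFF00FF00) gives 0xFF0000FF, and CAnd over { 0xE, 0x7, 0xF } gives 0x6.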
Improvements
- Added (u)long vector /, % and divrem overloads for (s)byte, (u)short and (u)int divisors, with a latency of ~5 fewer cycles and while using 4 fewer instructions
- Optimized (u)long and (U)Int128 intsqrt algorithms by replacing a loop based algorithm with straight line code. For 64-bit integers, this implementation is up to two times faster. For 128-bit integers it is up to 14 times faster, yet a little slower if the argument is below ~2^57. These intsqrt versions can now be constant-evaluated at compile time. If the global compilation option for OptimizeFor is set to OptimizeFor.Size, the much smaller loop based algorithm is chosen at compile time.
- Optimized (s)byte vector square and myByteVector_0 * myByteVector_0 when compiling for x86 with SSE4 or higher, having derived and implemented a novel 8-bit integer-square algorithm due to the lack of a native SIMD 8 bit multiplication instruction on x86. The implementation of the algorithm has a latency of 5 or 6 cycles instead of the 8 or 9 cycles (CPU specific) associated with generalized 8 bit integer multiplication implemented in software, at the cost of 17 (highly parallel) instructions and 4 constants instead of 6 instructions and 1 constant, and is thus only used if COMPILATION_OPTIONS.OPTIMIZE_FOR is set to OptimizeFor.Performance. Most notably, this algorithm reduces the latency of (s)byte intpow, since squaring is part of its loop
- Optimized vector (s)byte intsqrt, having derived and implemented a novel 8-bit integer-square-root algorithm, reducing the latency by 2 to 5 (unsigned) or 5 to 8 (signed) cycles, respectively, and removing up to 8 instructions, except for non 16/32-element (s)byte vectors, where a respective 9 or 4 instructions (unsigned), or 7 or 2 instructions (signed) were added instead. The previously implemented algorithm is selected for those vectors if the global compilation option for OptimizeFor is set to OptimizeFor.Size
- Optimized vector [] operator to avoid repeated memory reads and writes when multiple scalar values are sequentially assigned to non-constant indices of a vector
- Optimized count for bool2 and bool4 inputs when compiling for an architecture that supports SIMD
- Optimized scalar fallback- and vectorized versions of bits_depositparallel and bits_extractparallel by utilizing an O(log2(n)) algorithm over the previously implemented O(n) algorithm, where n is the bit-width of the respective scalar datatype. Additionally, this algorithm is ~50x to ~100x faster when mask is constant
- Reduced latency, code size and constant data for (U)Int128 bits_depositparallel by varying (but substantial) amounts if compiling for X86.BMI2 (= AVX2) by utilizing the 64-bit hardware-supported variants, analogous to how bits_extractparallel has already been implemented for (U)Int128
- Reduced latency (up to 1 cycle), code size (up to 2 instructions) and constant data (up to 48 bytes) of most vector constructors that combine multiple vectors together, if compiling for ARM or x86.SSE4 or higher
- Reduced latency of vector (u)long % operator by 9 to 11 cycles and removed 15 instructions if only the remainder is used; using the / operator in the same context or using divrem results in 8 fewer instructions instead
- Reduced latency of int8 / and % operators by 3 cycles and removed at least 2 instructions
- Reduced latency of vector Divider<(u)int> initialization by 3 to 4 cycles and removed 3 or 5 instructions
- Reduced latency of vector double to (u)long conversion by 2 cycles if compiling for AVX2
- Reduced latency of scalar quarter to float and double conversion by 4 cycles and removed 7 instructions. If compiling for BMI2 (i.e. AVX2), 3 further instructions are replaced by 2 instructions with the same latency, saving another 1 byte in code size
- Reduced latency of vector quarter to float and double conversion by up to 3 cycles and removed up to 2 instructions
- Reduced latency of scalar- and vector software implemented narrowing floating point to floating point type conversion by up to 3 cycles and removed up to 4 i...
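For readers unfamiliar with bits_depositparallel and bits_extractparallel from the list above: the notes tie them to the BMI2 hardware instructions, so the sketch below assumes they follow PDEP/PEXT semantics. It is a plain C# reference version using a straightforward O(n) loop over the bit positions, written for clarity. It is not the O(log2(n)) algorithm the release introduces, and the names and parameter order are hypothetical.

```csharp
using System;

public static class ParallelBitsSketch
{
    // PEXT-style: gather the bits of value selected by mask into the low bits of the result.
    public static ulong ExtractParallel(ulong value, ulong mask)
    {
        ulong result = 0;
        int k = 0; // next free bit position in the result
        for (int i = 0; i < 64; i++)
        {
            if (((mask >> i) & 1UL) != 0)
            {
                result |= ((value >> i) & 1UL) << k;
                k++;
            }
        }
        return result;
    }

    // PDEP-style: scatter the low bits of value to the positions of the set bits in mask.
    public static ulong DepositParallel(ulong value, ulong mask)
    {
        ulong result = 0;
        int k = 0; // next bit of value to consume
        for (int i = 0; i < 64; i++)
        {
            if (((mask >> i) & 1UL) != 0)
            {
                result |= ((value >> k) & 1UL) << i;
                k++;
            }
        }
        return result;
    }
}
```

With mask 0xF0F0, ExtractParallel(0xABCD, mask) gives 0xAC, and DepositParallel(0xAC, mask) gives 0xA0C0.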
v2.9.9
Known Issues
- half8/16 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- scalar and vector half conversion operators and related functions differ from the slightly incorrect Unity.Mathematics implementation. Specifically, if a float or double value is converted to a half, and if the value to be converted is exactly halfway between two adjacent representable half values, Unity's implementation rounds up while this library's implementation truncates, which matches the hardware default rounding direction when converting a double to a float. The same behavior occurs when converting any integer type to half values, because Unity.Mathematics does not implement an optimized integer to half conversion operator/function but rather first converts the integer to a float implicitly, before casting to half
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
- LLVM, in many cases, generates very poor code for all vectors with small fields (16 bit and below). This can be fixed by, for instance, byte16 only having two ulongs as fields and exposing the individual bytes as properties. This is API breaking, since you cannot take the address of properties, affecting unsafe pointer code; in and ref code referencing those fields can be fixed with those fields becoming ref properties, while out used on those fields may only be used if the vector is already initialized. This will have to be changed for much better performance but this change is reserved for version 3.0
Fixes
- Fixed managed fallback of roundto[all integer types] for float, double and quadruple arguments for values near 0
- Fixed SIMD div and rem for Unity.Mathematics integer vector types
- Fixed float8 != comparison if compiling for AVX(2)
- Fixed toboolsafe for double4 and all signed integer vector types
- Fixed SSE2 fallback for converting a quarter vector to any integer vector
- Fixed software implemented floating point conversion from wider types to narrower types sometimes not rounding to the nearest representable value
- Fixed int, long and Int128 minmag and maxmag always (minmag) or never (maxmag) returning the respective MinValue, if either argument is equal to MinValue
Additions
- Added half16, quarter16 and quarter32 due to more and more quarter and half function overloads having been implemented that are faster when not casting to a float scalar/vector
- Added bits_select(a, b, c) for each scalar- and vector integer type. This performs the same operation as select, just for each bit, and thus takes in a c that is of the same type as a and b
- Added cand for each vector integer type. These reduce vectors to a scalar integer of that type by applying bitwise AND operations between each element
- Added reverse to reverse the element order in a vector of any type
- Added a shuffle overload for all vector types that does not take a Unity.Mathematics.ShuffleComponent as an argument. The second parameter is a vector with the same amount of not necessarily identical elements as the first parameter, holding the indices pointing to elements in the first parameter, determining the order of elements in the returned vector. Example: shuffle(new int4(9, 99, 999, 9999), new byte4(0, 3, 3, 2)); // result: int4(9, 9999, 9999, 999)
- Added mulwide for scalar and vector integer types, which performs full precision multiplication and returns the respective low and high halves of the result as out parameters of the same type as the input parameters
- Added floortoint, ceiltoint and trunctoint for all combinations of floating point types (input parameter) and integer types (return type). trunctoint wraps default floating point to integer casting, while offering an optional Promise parameter when quarter, half or quadruple are involved
- Added tohalfunsafe, converting any other scalar or vector type to a half scalar or vector type, utilizing faster and smaller code paths via a Promise parameter
- Added tofloatunsafe and todoubleunsafe with quarter, half and quadruple parameter types, utilizing faster and smaller code paths via a Promise parameter
- Added the "trivial" quadruple overloads lerp, unlerp, remap, clamp, saturate, dot, frac, sign, modf, length, lengthsq, distance, distancesq, smoothstep, step, avg, fastrcp, div, divrem, dad, dsub, addsaturated, subsaturated, mulsaturated, divsaturated, tobytesaturated, tosbytesaturated, toushortsaturated, toshortsaturated, touintsaturated, tointsaturated, toulongsaturated, tolongsaturated, touint128saturated, toint128saturated, toquartersaturated, tohalfsaturated, tofloatsaturated, todoublesaturated, ceilmultiple, truncmultiple, roundmultiple, floormultiple, reversebytes, negate, maxmag, minmag, minmaxmag, minmax, angledelta, angledeltasgn, angledeltadeg, angledeltasgndeg, smoothlerp, pingpong, repeat, tobool, toboolsafe, toquarterunsafe, tohalfunsafe, toquadruple, toquadruplesafe, exp2 (integer parameters)
Missing quadruple overloads are: tan, tanh, atan, atan2, cos, cosh, acos, sin, sinh, asin, sincos, asinh, acosh, atanh, pow, exp, exp2, exp10, log, log2, log10, erf, erfc, gamma, hypot
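The mulwide and index-vector shuffle entries above can be illustrated with scalar and array stand-ins in plain C#. The names, the order of the out parameters and the use of arrays in place of vector types are all assumptions made for the sketch; only the described behavior is taken from the notes.

```csharp
using System;

public static class WideSketch
{
    // Full 32 x 32 -> 64 bit multiplication, returning both halves via out parameters
    // of the same type as the inputs.
    public static void MulWide(uint x, uint y, out uint low, out uint high)
    {
        ulong product = (ulong)x * y;
        low = unchecked((uint)product);
        high = (uint)(product >> 32);
    }

    // Index-vector shuffle on arrays: result[i] = source[indices[i]].
    public static int[] Shuffle(int[] source, byte[] indices)
    {
        var result = new int[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            result[i] = source[indices[i]];
        }
        return result;
    }
}
```

Shuffle(new[] { 9, 99, 999, 9999 }, new byte[] { 0, 3, 3, 2 }) reproduces the result from the release note's own example, { 9, 9999, 9999, 999 }, and MulWide(0xFFFFFFFF, 0xFFFFFFFF, ...) yields low = 1 and high = 0xFFFFFFFE.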
Improvements
- Added (u)long vector /, % and divrem overloads for (s)byte, (u)short and (u)int divisors, with a latency of ~5 fewer cycles and while using 4 fewer instructions
- Optimized (u)long and (U)Int128 intsqrt algorithms by replacing a loop based algorithm with straight line code. For 64-bit integers, this implementation is up to two times faster. For 128-bit integers it is up to 14 times faster, yet a little slower if the argument is below ~2^57. These intsqrt versions can now be constant-evaluated at compile time. If the global compilation option for OptimizeFor is set to OptimizeFor.Size, the much smaller loop based algorithm is chosen at compile time.
- Optimized (s)byte vector square and myByteVector_0 * myByteVector_0 when compiling for x86 with SSE4 or higher, having derived and implemented a novel 8-bit integer-square algorithm due to the lack of a native SIMD 8 bit multiplication instruction on x86. The implementation of the algorithm has a latency of 5 or 6 cycles instead of the 8 or 9 cycles (CPU specific) associated with generalized 8 bit integer multiplication implemented in software, at the cost of 17 (highly parallel) instructions and 4 constants instead of 6 instructions and 1 constant, and is thus only used if COMPILATION_OPTIONS.OPTIMIZE_FOR is set to OptimizeFor.Performance. Most notably, this algorithm reduces the latency of (s)byte intpow, since squaring is part of its loop
- Optimized vector (s)byte intsqrt, having derived and implemented a novel 8-bit integer-square-root algorithm, reducing the latency by 2 to 5 (unsigned) or 5 to 8 (signed) cycles, respectively, and removing up to 8 instructions, except for non 16/32-element (s)byte vectors, where a respective 9 or 4 instructions (unsigned), or 7 or 2 instructions (signed) were added instead. The previously implemented algorithm is selected for those vectors if the global compilation option for OptimizeFor is set to OptimizeFor.Size
- Optimized vector [] operator to avoid repeated memory reads and writes when multiple scalar values are sequentially assigned to non-constant indices of a vector
- Optimized count for bool2 and bool4 inputs when compiling for an architecture that supports SIMD
- Optimized scalar fallback- and vectorized versions of bits_depositparallel and bits_extractparallel by utilizing an O(log2(n)) algorithm over the previously implemented O(n) algorithm, where n is the bit-width of the respective scalar datatype. Additionally, this algorithm is ~50x to ~100x faster when mask is constant
- Reduced latency, code size and constant data for (U)Int128 bits_depositparallel by varying (but substantial) amounts if compiling for X86.BMI2 (= AVX2) by utilizing the 64-bit hardware-supported variants, analogous to how bits_extractparallel has already been implemented for (U)Int128
- Reduced latency (up to 1 cycle), code size (up to 2 instructions) and constant data (up to 48 bytes) of most vector constructors that combine multiple vectors together, if compiling for ARM or x86.SSE4 or higher
- Reduced latency of vector (u)long % operator by 9 to 11 cycles and removed 15 instructions if only the remainder is used; using the / operator in the same context or using divrem results in 8 fewer instructions instead
- Reduced latency of int8 / and % operators by 3 cycles and removed at least 2 instructions
- Reduced latency of vector Divider<(u)int> initialization by 3 to 4 cycles and removed 3 or 5 instructions
- Reduced latency of vector double to (u)long conversion by 2 cycles if compiling for AVX2
- Reduced latency of scalar quarter to float and double conversion by 4 cycles and removed 7 instructions. If compiling for BMI2 (i.e. AVX2), 3 further instructions are replaced by 2 instructions with the same latency, saving another 1 byte in code size
- Reduced latency of vector quarter to float and double conversion by up to 3 cycles and removed up to 2 instructions
- Reduced latency of scalar- and vector software implemented narrowing floating point to floating point type conversion by up to 3 cycles and removed up to 4 instructions
- Reduced laten...
v2.9.0
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
Fixes
- Fixed XML documentation not showing descriptions for valid Promise flags
- Fixed cminmax documentation
- bitmask64 with numBits equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
- Fixed uint8 to float8 type conversion if compiling for AVX2
- Fixed incorrect mod implementations
- (ISSUE #16) Fixed float and double (r)cbrt edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector float implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid Promise flags allow for call-site selection of faster code paths
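The bitmask64 entry is a reminder of how easily a 64-bit mask goes wrong at numBits equal to 64. The sketch below shows one common way this happens in C# and one way to avoid it. It uses hypothetical names and a simplified single-parameter signature; the notes do not say that this was the actual cause of the library's bug.

```csharp
using System;

public static class BitmaskSketch
{
    // Naive mask of the lowest numBits bits. C# only uses the low 6 bits of the shift
    // count for 64-bit operands, so for numBits == 64 this computes (1 << 0) - 1 == 0.
    public static ulong NaiveMask(int numBits) => unchecked((1UL << numBits) - 1UL);

    // Handles the full range [0, 64] by shifting an all-ones value right instead.
    public static ulong Mask(int numBits) => numBits == 0 ? 0UL : ulong.MaxValue >> (64 - numBits);
}
```

NaiveMask(64) returns 0, while Mask(64) returns a value with all 64 bits set; both agree for smaller widths, for example 0xFF for numBits equal to 8.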
Additions
Divider<T>
Divider<T> is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type T, a Divider<T> instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms, typically used by compilers for compile time constant divisors.
Divider<T> was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
Divider<T> is NOT meant to replace division operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable i is the divisor in a for-loop).
Numerous Promise flags allow for faster operations, provided that the Divider<T> instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Burst's inability to compile-time evaluate typeof(T) often requires explicit initialization (example: new Divider<byte>((byte)42)). DEBUG only validity checks ensure correct initialization and usage.
The current Divider API consists of...:
- / and % operators:
  - LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of that type
  - LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of that type
  - LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
  - LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- DivRem member methods
- EvenlyDivides member methods
- T Divisor as a readonly property
- public const Promises within Divider<T>, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
- Get/SetInnerDivider<U> methods: get or set a scalar- or vector Divider<U> within a Divider<T>
- Component shuffles: Divider<T>.wzxy swizzle "operators" as properties.
NOTE: Get/SetInnerDivider<U> methods and Divider<T>.[a][b][c][d] properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a Divider<int2> from a Divider<int4>. Get/SetInnerDivider<U> are placeholders both for these operations as well as for the v[a]_[b] properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.
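To make the idea of replacing division by multiplication and shift operations concrete, here is a minimal sketch for 16-bit unsigned operands in plain C#. UShortDivider is a hypothetical type, not the MaxMath Divider<T>, and it shows only one textbook multiply-and-shift scheme (a precomputed ceil(2^32 / d) constant). The notes do not specify which two algorithms the library chooses between.

```csharp
using System;

// Hypothetical illustration for 16-bit operands; NOT the MaxMath Divider<T> type.
public readonly struct UShortDivider
{
    private readonly ulong magic; // ceil(2^32 / divisor), computed once per divisor

    public ushort Divisor { get; }

    public UShortDivider(ushort divisor)
    {
        if (divisor == 0) throw new DivideByZeroException();
        Divisor = divisor;
        magic = 0xFFFFFFFFUL / divisor + 1UL;
    }

    // Quotient via one multiplication and one shift; exact for all 16-bit n.
    public ushort Divide(ushort n) => (ushort)((magic * n) >> 32);

    // Remainder derived from the quotient: n - q * d.
    public ushort Remainder(ushort n) => (ushort)(n - Divide(n) * Divisor);

    public bool EvenlyDivides(ushort n) => Remainder(n) == 0;
}
```

The precomputation only pays off when the same UShortDivider instance is reused for many dividends, which mirrors the usage advice given for Divider<T> above.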
quadruple (PREVIEW)
Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE754 compliant and in the typical 1 sign bit, 15 exponent bits, 112 mantissa bits format.
NOTE: quadruple is in preview for an unforeseeable amount of time. This means that it is neither completely optimized, nor are all MaxMath functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, -(unary), +(binary), -(binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign
Functions
- Added isnormal and issubnormal functions for floating point types
- Added hypot and inthypot functions for calculating [int]sqrt(a * a + b * b) without overflow, unless an optional Promise parameter with its NoOverflow flag set is passed as a compile time constant argument
- Added roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
- Added cor and cxor. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
- Split approx into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
- Added roundmultiple(x, m), floormultiple(x, m), ceilmultiple(x, m) and truncmultiple(x, m) for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x to the nearest greater multiple of m)
- Added a whole stack of bit manipulation functions for all scalar- and vector integer types: parityodd, parityeven, countzerobits, l1cnt, t1cnt, lzmask, tzmask, l1mask, t1mask, bits_extractlowest0, bits_masktolowest, bits_masktolowest0, bits_maskfromlowest, bits_maskfromlowest0, bits_setlowest, bits_surroundlowest and bits_surroundlowest0
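The rounding modes of the *multiple functions listed above can be sketched for a signed integer type as follows. This is plain C# with hypothetical names. It assumes m is positive, ignores overflow near the limits of long, and assumes that a value which already is a multiple of m is returned unchanged, which the notes do not state explicitly.

```csharp
using System;

public static class MultipleSketch
{
    // Floor-style remainder in [0, m) for positive m, also for negative x.
    private static long FloorMod(long x, long m)
    {
        long r = x % m;
        return r < 0 ? r + m : r;
    }

    // Largest multiple of m that is less than or equal to x.
    public static long FloorMultiple(long x, long m) => x - FloorMod(x, m);

    // Smallest multiple of m that is greater than or equal to x.
    public static long CeilMultiple(long x, long m)
    {
        long r = FloorMod(x, m);
        return r == 0 ? x : x + (m - r);
    }

    // Multiple of m closest to x in the direction of zero.
    public static long TruncMultiple(long x, long m) => x - x % m;
}
```

For x = -17 and m = 5 the three modes give -20 (floor), -15 (ceil) and -15 (trunc); for x = 17 they give 15, 20 and 15.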
Global Compilation Options
- Added Global Compilation Options for OptimizeFor, FloatMode and FloatPrecision. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to OptimizeFor.Performance, FloatMode.Default and FloatPrecision.Standard, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs
Improvements
Meta
- This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes SSE2NEON and SIMDe to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. Because of this, generated ARM code will sometimes remain slightly unoptimized, because the author is unable to verify correctness of ARM specific optimizations with unit tests in most cases.
Performance
- Implemented optimized (u)long vector to float vector type conversion operators
- Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
- Implemented an AssumeRangeAttribute equivalent for all vectorized functions with known return value ranges
- Implemented more optimal (U)Int128 comparison operators
- Implemented optimal (U)Int128 multiplication operations with- and division and modulo operations by compile time constants
- Implemented optimal (U)Int128 division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any variation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()
- Increased valid Promise.Unsafe0 range for (u)long intcbrt from [0, 2^46] to [0, 2^48]
- Added an optional Promise parameter to gamma
- Added an optional Promise parameter to erf(c)
- Added an optional Promise parameter to gcd and lcm
- Added quarter and half scalar- and vector function overloads for min, max, minmax, clamp, saturate, isinrange, trunc, round, ceil, floor and sign
- Removed the only non-optimizing branch in vector code in the entire library within the long2/3/4 >> operator if the shift amount is not a comp...
v2.3.5
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- (s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
- optimized (U)Int128 comparison operators didn't make it into this release
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths selected at compile-time-only with compile-time-only value range checks.
- AVX2 (s)byte32 all_dif lookup tables are currently way too large (kiloBytes)
Fixes
- (Issue #10) bool8/16/32 are now blittable when not used within an IJob
Additions
- added comb(n, k) for scalar- and vector integer types. This is known as the binomial coefficient or "n choose k". An optional Promise parameter can select a O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows (which is not true for most solutions found online that claim it), uses a O(min(k, n - k)) algorithm with respect to time
- added perm(n, k) for scalar- and vector integer types. This is known as "k-permutations of n". An optional Promise parameter can select a O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows, uses a O(k) algorithm with respect to time
- added nextgreater(x) for all types. For integer types, it is a wrapper function for addsaturated(x, 1). For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations.
- added nextsmaller(x) for all types. For integer types, it is a wrapper function for subsaturated(x, 1). For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations
- added nexttoward(from, to) for all types, returning the next representable integer/floating point value(s) in a given direction, unless from is equal to to. For floating point types, from is returned if from is NaN or infinite. If to is NaN, NaN is returned. An optional Promise parameter allows for numerous optimizations.
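The overflow claim made for comb(n, k) can be illustrated with one well-known technique: dividing by the greatest common divisor before each multiplication, so that every intermediate value is itself a smaller binomial coefficient. The sketch below is plain C# with hypothetical names. It is not taken from the library and may differ from how MaxMath achieves the same property.

```csharp
using System;

public static class CombSketch
{
    private static ulong Gcd(ulong a, ulong b)
    {
        while (b != 0) { ulong t = a % b; a = b; b = t; }
        return a;
    }

    // n choose k in O(min(k, n - k)) loop iterations. After step i the running value
    // equals C(n - k + i, i), which never exceeds the final result.
    public static ulong Comb(ulong n, ulong k)
    {
        if (k > n) return 0;
        if (k > n - k) k = n - k; // symmetry: C(n, k) == C(n, n - k)

        ulong result = 1;
        for (ulong i = 1; i <= k; i++)
        {
            ulong factor = n - k + i;
            ulong g = Gcd(result, i);
            // Once g is removed, the remaining part of i divides factor exactly,
            // so the product below is the next binomial coefficient.
            result = checked((result / g) * (factor / (i / g)));
        }
        return result;
    }
}
```

The checked multiplication only throws if the true result does not fit into a ulong. The simpler result * factor / i form can overflow in the multiplication even when the final value would fit.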
Improvements
- improved performance of 64bit vectorized division thanks to a newly implemented and further optimized algorithm from a July 13th 2022 research paper, which replaces a vectorized loop (rather slow; up to 64 iterations; no instruction level parallelism outside the loop possible until the loop finished executing, following an almost certainly mispredicted branch) with straight line code. Due to "recent" improvements to divider circuits, this code path is inferior to hardware supported scalar division via element extraction for (u)long2, specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops, and is thus only implemented for (u)long3/4 types and only if compiling for AVX2
- improved performance and reduced code size of up to (s)byte8 and every (u)short vector division if not compiling with FloatMode.Fast. Reduced constants possibly read from RAM in either case.
- fixed performance regression of SIMD register <-> software abstraction conversions for types using up the entirety of a hardware register
- lcm for (s)byte vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
- verified and increased the (u)long scalar- and vector intcbrt Promise.Unsafe0 range from [0, 1ul << 40] to [0, 1ul << 46], the code path of which is also possibly chosen at compile time
- implemented optimized quarter{X} IEEE-754 comparison operators (without having to cast to float{X}). Vectorized halfX comparisons are implemented in MaxMath.Intrinsics.Xse as well and used where appropriate. compareto with quarter{X} and half{X} function overloads were implemented.
- reduced latency of add/subsaturated for scalar Int128s, scalar and vector longs as well as vector ints by about a third
- replaced (U)Int128.ToString(null, null)'s call to BigInteger.ToString() and thus unnecessary heap allocations with an optimized implementation
- (u)short8 / and % operators now correctly check for SSE2 support rather than AVX2
- removed aliased fixed size buffers from all types, also improving indexer operator performance if the index is a compile time constant (in some cases)
Changes
- Burst compiled code that uses a `Promise` argument which is not a compile time constant will throw an exception in `DEBUG`, as it represents significant overhead instead of an optimization. This will currently not inform users of the name of the function but rather the Burst compiled job/function that threw it.
Fixed Oversights
- added `explicit` type conversion operators for scalar `float`s and `double`s to `half8` and all `quarter` vectors (as well as scalar `half`s to `quarter` vectors)
v2.3.0
Known Issues
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- `(s)byte`, `(u)short` vector and `(U)Int128` multiplication, division and modulo operations by compile time constants are not optimal
- optimized `(U)Int128` comparison operators didn't make it into this release
- using `bool` vectors generated from 256 bit input vectors like so: `long4 x = select(a, b, >>> myLong4a < myLong4b <<<)` (as an example) does not generate the most efficient machine code possible
- unit tests for 64-bit `bits_zerohigh` functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
- unit tests for intrinsics code paths for all functions that use "(mm256_)shuffle_ps" or "(mm256_)blendv_ps" can fail semi-randomly due to a bug which changes the bit content of `int`s which would be NaN if dereferenced as a `float` and written back to memory (reported)
- most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths that could be selected at compile time through compile-time-only value range checks.
- `(s)byte32` `all_dif` lookup tables are currently way too large (kiloBytes)
Fixes
- fixed `quarter` rounding behavior when casting a wider floating point type to a `quarter` to round towards the nearest representable value instead of truncating the mantissa
Additions
added namespace `MaxMath.Intrinsics` for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. `if (Burst.Intrinsics.X86.Sse2.IsSse2Supported)` blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.
added flags enum `Promise`, with values `Nothing`, `Everything`, `NoOverflow`, `ZeroOrGreater`, `ZeroOrLess`, `NonZero` and `Unsafe0` through `Unsafe3`, as well as the composites `Positive` and `Negative`. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise ... of never introducing breaking changes in this regard)
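To illustrate how an optional flags parameter can select a faster, less safe code path, here is a minimal sketch. Everything in it is a hypothetical stand-in: the enum only reuses some of the names listed above, the bit values and the composition of `Positive`/`Negative` are assumptions, and the clamping factorial merely mirrors the behavior these notes describe for `factorial`.

```csharp
using System;

// Hypothetical stand-in, NOT the library's definition: bit values and
// composite definitions are assumptions made for this example only.
[Flags]
enum PromiseSketch
{
    Nothing       = 0,
    NoOverflow    = 1 << 0,
    ZeroOrGreater = 1 << 1,
    ZeroOrLess    = 1 << 2,
    NonZero       = 1 << 3,
    Positive      = ZeroOrGreater | NonZero,
    Negative      = ZeroOrLess | NonZero,
}

static class FactorialSketch
{
    // Without a promise the result is clamped to uint.MaxValue on overflow.
    // With NoOverflow the caller guarantees n <= 12, so the check is skipped.
    public static uint Factorial(uint n, PromiseSketch promise = PromiseSketch.Nothing)
    {
        // 12! = 479001600 is the largest factorial that fits into 32 bits.
        if ((promise & PromiseSketch.NoOverflow) == 0 && n > 12)
        {
            return uint.MaxValue;
        }

        uint result = 1;
        for (uint i = 2; i <= n; i++)
        {
            result = unchecked(result * i); // wraps silently if the promise was broken
        }
        return result;
    }
}
```

A call such as `FactorialSketch.Factorial(13)` returns `uint.MaxValue`, while passing `PromiseSketch.NoOverflow` with the same argument breaks the promise and yields a wrapped value, which is exactly the trade-off the flag represents.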
Other Additions
- added `factorial` (for integer types) and `gamma` (floating point types) functions. `factorial`, when called without a `Promise` parameter, clamps the result to `type.MaxValue` in case of overflow
- added `erf(c)`, the (complementary) error function for floating point types
- added `(c)minmag` and `(c)maxmag` functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to `abs(x) > abs(y) ? x : y` (`maxmag`) or `abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c)` (`cmaxmag`)
- added `(c)minmax` and `(c)minmaxmag` functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as `out` parameters
- added `bitfield` functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
- added `copysign(x, y)` functions for signed types, which is equivalent to `return y < 0 ? nabs(x) : abs(x)`
- added (naive?) implementation for scalar- and vector `float`/`double` inverse hyperbolic functions `asinh`, `acosh` and `atanh`
- added `intlog10` functions (integer base ten logarithm)
- added the `bit test`/`bt` family of functions for scalar and vector integer types. A `testbit(POST_ACTION)((ref)x, i)` function returns a boolean (vector), indicating whether the bit in `x` at index `i` is 1 and may (or may not) flip, set, or reset that bit afterwards
- added a new category of type conversion functions with the suffix "unsafe". Added `to(u)longunsafe` and `todoubleunsafe` with a `Promise` parameter, allowing for up to two levels of optimization (vectorized 64bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default `double` <-> `(u)long` conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
- added scalar/vectorized `toquarterunsafe` allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
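The equivalences quoted above for `maxmag` and `copysign` can be written out for scalar `int`s as a quick reference. This is a sketch under assumptions: the class and method names are hypothetical, `MinMag` is inferred by symmetry (the notes only spell out `maxmag`), and magnitudes are compared through the negative absolute value so that `int.MinValue` does not overflow.

```csharp
using System;

static class MagnitudeSketch
{
    // Negative absolute value; unlike abs, this cannot overflow for int.MinValue.
    public static int Nabs(int x) => x < 0 ? x : -x;

    // abs(x) > abs(y) ? x : y, expressed via Nabs (larger magnitude == smaller Nabs).
    public static int MaxMag(int x, int y) => Nabs(x) < Nabs(y) ? x : y;

    // Assumed mirror image of MaxMag: abs(x) < abs(y) ? x : y.
    public static int MinMag(int x, int y) => Nabs(x) > Nabs(y) ? x : y;

    // y < 0 ? nabs(x) : abs(x); abs(int.MinValue) wraps back to int.MinValue.
    public static int CopySign(int x, int y) => y < 0 ? Nabs(x) : unchecked(-Nabs(x));
}
```

For example, `MagnitudeSketch.MaxMag(-5, 3)` yields `-5` because the sign is preserved while only the magnitudes are compared, and `MagnitudeSketch.CopySign(7, -1)` yields `-7`.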
Improvements
considerably improved performance of several vector operators and function overloads for types that use up an entire hardware register while having to be up-cast to a wider type - the surrounding boilerplate code uses a new "in-house" faster-than-hardware algorithm, reducing its dependency chain latency from x [0 <= x <= 3] + (9 or 10) clock cycles down to x + (0 or 1 or 3) + (1 or 3) clock cycles
massive performance improvements for all vector types that are not a total of 128 or 256 bits wide, respectively, either through the `Avx.[...]undefined[...]` compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like `movq xmm0, xmm0` (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to Unity.Mathematics types when passed to maxmath functions.
improved performance throughout the library by effectively adding hundreds of thousands more `Unity.Burst.CompilerServices.Constant.IsConstantExpression` condition checks to many functions within the library. Most notably, algorithms whose total latency depends on the byte size of their arguments may now perform much faster. Some, but not yet all, of these constant checks are exposed through a `Promise` parameter
Other Improvements
- improved performance of scalar `(u)short` to `(u)short2/3/4` conversion
- reduced latency of `all`, `any`, `first`, `last`, `count` and `bitmask` functions for `bool8/16/32` when used with an expression as the argument, such as `all(x != y)` - a way to force the compiler to omit unnecessary instructions was found
- reduced latency of `addsaturated` for scalar unsigned integer types
- reduced latency of `float`/`double` to `(U)Int128` conversion
- reduced latency of `shl`, `shrl` and `shra` and thus all functions using those - especially for: `shl` for `(s)byte` vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; `shl` for `(u)short` vectors of 4 or more elements if compiling for at least SSE4; `shra` for `(u)long` vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
- reduced `long2/3/4` `shra` code size and latency by another 2 clock cycles if compiling for AVX2
- reduced latency of variable `rol/r` vector functions beyond `shl/r` improvements and added an optional `Promise` parameter, allowing the caller to promise the rotation values are in a specific range
- reduced latency of `long2/3/4` "is negative checks" - `mylong4 < 0` / `0 > mylong4` by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
- reduced latency of `(u)long2/3/4` `isinrange` functions
- reduced latency of unsigned `byte` and `ushort` vector to float vector conversion. This also affects performance of `(s)byte` and `(u)short` vector `intsqrt` functions, as well as the respective `%` and `/` operators (`byte2/3/4/8`, all `ushort` vectors)
- reduced `(u)long` vector `intcbrt` latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
- added hidden and retroactively improved `exp2` scalar and vector integer argument function overloads. These return `exp2((float/double)x)` or `(float/double)(1 << x)` in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The `(u)int` overloads convert to `float`s, the `(u)long` overloads convert to `double`s; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional `Promise` parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's `exp2` implementation. If you ever used the standard `exp2` function by implicitly converting an `int` type to a `float` type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts `(u)long` types to `float` types implicitly if the result is expected to be a `float` type. It is recommended to explicitly cast the `(u)long` type to a `(u)int` type in such a case
- added `==`, `!=`, `<`, `>`, `<=` and `>=` operators for `UInt128` and signed `long`/`int` comparisons, as the expensive float conversion and comparison was previously used when, for instance, compar...
MaxMath v2.2.0
Known Issues
- `half8` `==` and `!=` operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
- `(s)byte`, `(u)short` vector and `(U)Int128` multiplication, division and modulo operations by compile time constants are not optimal. For `(U)Int128`, it requires a new Burst feature à la `T Constant.ForceCompileTimeEvaluation<T, U>(Func<U, T> code)` (proposed); Currently work is being done on `(s)byte` and `(u)short` vectors in this regard, which will beat any compiler. The current (tested) state of all optimizations possible is included in this version.
- `pow` functions with compile time constant exponents currently do not handle many decimal numbers - `math.rsqrt` would often be used in those cases for optimal performance but it is actually slower when the `Unity.Burst.FloatMode` is set to anything but `FloatMode.Fast`. To guarantee optimal performance, compile time access to the current `FloatMode` would be needed (proposed)
- `double` `(r)cbrt` functions are currently not optimized
Fixes
- linked `float8` `rcp` and `rsqrt` functions to Burst's `FloatMode` and `FloatPrecision`
- `short.MinValue / -1` now correctly overflows to `short.MinValue` when dividing a `short16` vector by another `short16` vector when compiling for AVX or higher
- fixed scalar `quarter` to `double` conversion for when the `quarter` value is negative
- fixed scalar `half` to `quarter` conversion for when the `half` value is negative
- fixed vector `quarter` to `ulong` conversion for when a `quarter` value is negative
- fixed `(u)short8` to `quarter8` conversion
Additions
Added saturation arithmetic to the library for all scalar- and vector types. Saturation arithmetic clamps the result of an operation to `type.MinValue` and `type.MaxValue` if under- or overflow occurs, respectively, and has single-instruction hardware support for `(s)byte`s and `(u)short`s. The included functions are:
- `addsaturated`
- `subsaturated`
- `mulsaturated`
- `divsaturated` (only clamps division of floating point types and signed division of, for instance, `sbyte.MinValue` (= -128) `/ -1` to `sbyte.MaxValue` (= 127), which would cause a hardware exception for `int`s and `long`s)
- `castsaturated` (all types to all other types with a smaller range)
- `csumsaturated`
- `cprodsaturated`
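The clamping behavior described above can be sketched for scalars in plain C#. This is a reference illustration only: the class and method names are hypothetical, and widening to `long` is simply the easiest way to show the semantics, not a claim about how the library computes it.

```csharp
using System;

static class SaturationSketch
{
    // Clamp a widened intermediate result back into the int range.
    private static int ClampToInt(long wide)
        => (int)Math.Max((long)int.MinValue, Math.Min((long)int.MaxValue, wide));

    // Addition that sticks to int.MinValue/int.MaxValue instead of wrapping.
    public static int AddSaturated(int x, int y) => ClampToInt((long)x + y);

    // Subtraction with the same clamping.
    public static int SubSaturated(int x, int y) => ClampToInt((long)x - y);

    // The only overflowing signed division is MinValue / -1, which is clamped to MaxValue.
    public static sbyte DivSaturated(sbyte x, sbyte y)
        => (x == sbyte.MinValue && y == -1) ? sbyte.MaxValue : (sbyte)(x / y);

    // Narrowing cast that clamps instead of truncating bits.
    public static byte CastSaturated(long x)
        => (byte)Math.Max(0L, Math.Min(255L, x));
}
```

With these definitions, `SaturationSketch.AddSaturated(int.MaxValue, 1)` stays at `int.MaxValue` where the `+` operator would wrap around to `int.MinValue`.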
(U)Int128
- added high performance `(U)Int128` types with full library support, meaning: all operators and type conversions as well as all functions support these types. Most operations of both types, in Burst code, compile down to optimal machine code. Exceptions: 1) signed 64x64 bit to 128 bit multiplication 2) `*`, `/`, `%` and `divrem` functions with a scalar compile time constant argument (See: Known Issues 2)
- added `Random128` XOR-Shift pseudo random number generator for generating `(U)Int128`s
Cube Root
- added high performance & accuracy `(r)cbrt` - (reciprocal) cube root functions for scalar and vector `float`- and `double` types based on a research paper from 2021. An optional `bool` parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with `math.pow(x, 1f/3f)`), which is set to `false` by default
- added high performance `intcbrt` - integer cube root functions for all scalar and vector integer types. For signed integer types, an optional `bool` parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with `math.pow(x, 1f/3f)`), which is set to `false` by default
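To show why negative inputs need special handling and what an integer cube root computes, here is a minimal sketch. It is not the algorithm from the research paper or from the library: the names are hypothetical, the floating point version simply mirrors the sign around `Math.Pow`, and the integer version uses a plain binary search.

```csharp
using System;

static class CubeRootSketch
{
    // Math.Pow returns NaN for a negative base with a fractional exponent,
    // so the sign has to be peeled off first if negatives should work.
    public static double Cbrt(double x, bool handleNegative = false)
        => handleNegative && x < 0.0
            ? -Math.Pow(-x, 1.0 / 3.0)
            : Math.Pow(x, 1.0 / 3.0);

    // floor(cbrt(x)) via binary search over the possible results.
    public static uint IntCbrt(ulong x)
    {
        ulong lo = 0;
        ulong hi = 2642245; // floor(cbrt(2^64 - 1)), so mid * mid * mid cannot overflow
        while (lo < hi)
        {
            ulong mid = lo + (hi - lo + 1) / 2;
            if (mid * mid * mid <= x)
            {
                lo = mid;       // mid is still a valid lower bound
            }
            else
            {
                hi = mid - 1;   // mid is too large
            }
        }
        return (uint)lo;
    }
}
```

The default of `false` for the flag matches the trade-off described above: callers who know their inputs are non-negative do not pay for the extra sign handling.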
Other Additions
- added a `log` function to all scalar and vector `float`- and `double` types with a second parameter `b`, which is the logarithm's base
- added `reversebytes` functions for all scalar- and vector types, which convert back and forth between big endian and little endian byte order, respectively. All of them (scalar, vector) compile down to single hardware instructions
- added `pow` functions with scalar exponents for `float` and `double` scalars and vectors, with optimizations for selected constant exponents (not necessarily whole exponents)
- added function overloads to all functions for scalar `(s)byte`s and `(u)short`s in order to resolve function call resolution ambiguity which was already present in `Unity.Mathematics`, which may also improve performance in some cases
- added a static readonly `New` property to `RandomX` XOR-Shift pseudo random generators. It calls `Environment.TickCount` internally (and is thus seeded somewhat randomly), makes sure it is non-zero and can be called from Burst native code
- added `fastrcp` functions for `float` scalars and vectors, faster (and substantially less accurate) than `FloatPrecision.Low`, `FloatMode.Fast` Burst implementations
- added `fastrsqrt` functions for `float` scalars and vectors, faster (and substantially less accurate) than `FloatPrecision.Low`, `FloatMode.Fast` Burst implementations
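As a reference for what a byte reversal does to a value, the following sketch swaps endianness for 16 and 32 bit integers with shifts and masks. The names are hypothetical and this is only the portable formulation; the release notes state that the library's versions compile down to single hardware instructions.

```csharp
using System;

static class ReverseBytesSketch
{
    // Swap the two bytes of a 16 bit value.
    public static ushort ReverseBytes(ushort x)
        => (ushort)((x >> 8) | (x << 8));

    // Mirror the four bytes of a 32 bit value: byte 0 <-> byte 3, byte 1 <-> byte 2.
    public static uint ReverseBytes(uint x)
        => (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```

Applying the function twice returns the original value, which is why the same call converts in both directions between big endian and little endian.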
Improvements
- added AVX and AVX2 code for `float8` `sin`, `cos`, `tan`, `sincos`, `asin`, `acos`, `atan`, `atan2`, `sinh`, `cosh`, `tanh`, `pow`, `exp`, `exp2`, `exp10`, `log`, `log2`, `log10` and `fmod` (and the `%` operator)
- optimized many `/`, `%`, `*` and `divrem` operations with a scalar compile time constant argument for `(s)byte` vectors (see 'Known Issues 2'), which were previously not optimized (...optimally/at all) by Burst.
- added SSE2 fallback code for converting AVX vector types to SSE vector types and vice versa (for example: `short16` (256 bit) to `byte16` (128 bit))
- scalar `(s)byte` and `(u)short` `rol` and `ror` functions now compile down to single hardware instructions
- improved performance and/or reduced code size of nearly all vector comparison operations (`==`, `>` etc.)
- improved performance of - and added SSE2 fallback code for - bitfield to boolean vector conversion (`toboolX` and thus also `select(vector a, vector b, bitmask c)`)
- improved performance of `intpow` functions in general and for when the exponent is a compile time constant
- improved performance and reduced code size of `compareto` vector functions (especially for unsigned types)
- added more optimizations to `isdivisible`
- improved performance of `intsqrt` functions for `(u)long` and `(s)byte` scalar and vector types considerably
- reduced code size of `ispow2` vector functions
- reduced code size of `(s)byte` vector-by-vector division
- improved performance of `Random64`'s `(u)long4` generation if compiling for AVX2
- improved performance of `(s)byte` matrix multiplication
- reduced code size of `(u)short`- and up to `(s)byte8` vector by vector division and `divrem` functions (and improved performance if compiling for SSE2 only)
- reduced code size and improved performance of `isinrange` functions for `(u)long` vector types
- reduced code size of `ushort` vector `>=` and `<=` operators for SSE2 fallback code by ~75%
- improved performance and reduced code size of SSE2 down-casting fallback code
Changes
- API BREAKING CHANGE: The various boolean to integer/floating point conversion functions (`touint8`/`tof32` etc.) are now renamed to contain C# types in their names (`tobyte`/`tofloat` etc.)
- API BREAKING CHANGE: If you use this library as intended, meaning you import it and `Unity.Mathematics.math` statically (`using static MaxMath.maxmath;`) and you use the `pow` functions with scalar bases and scalar exponents in those scripts, you will encounter the first ever function call resolution ambiguity. It is strongly recommended to always use the `maxmath.pow` function, because it optimizes any `pow` call enormously if the exponent is a compile time constant, which does NOT necessarily mean that such a call must declare the exponent as a literal value - the exponent may become a compile time constant due to constant propagation
- `quarter` is now a `readonly struct`
- `quarter` to `sbyte`, `short`, `int` and `long` conversions are now required to be declared explicitly
- removed `countbits(void* ptr, ulong bytes)` from the library and added it to https://github.com/MrUnbelievable92/SIMD-Algorithms with more options
Fixed Oversights
- (Issue #3) added constructor wrappers to the maxmath class analogous to `Unity.Mathematics` (`byte4 myByte4 = (maxmath.)byte4(1, 2, 3, 4);`)
- added `dsub` - fused divide-subtract function for scalar and vector `float` types
- added an optional `bool fast = false` parameter to `dad`, `dsub`, `dadsub` and `dsubadd` functions
- added `andnot` function overloads for scalar and vector `bool` types
- added implicit type conversions of scalar `quarter` values to `half`, `float` and `double` vectors
- added `all_eq` and `all_dif` functions for vectors of size 2
- added `all_eq` and `all_dif` functions for `float` and `double` vectors
MaxMath v2.1.2
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed undefined behavior of "vshr" functions for vector types smaller than 128 bits
- fixed SSE2 implementations of "vrol" and "vror" functions for the (u)short16 type
Additions
- implemented Bmi1 and Bmi2 intrinsics as functions with a "bits_" prefix (except for "andn", which has already been implemented as "andnot")
- added high performance and/or SIMD "isdivisible" functions for all integer vector types and scalar value types
- added high performance and/or SIMD "intpow" - integer exponentiation - functions for (u)int, (u)long and all integer vector types
- added high performance and/or SIMD "floorpow2" functions for all integer vector types
- added "nabs" - negative absolute value functions for all non-boolean vector- and single value types
- added "indexof(vector v, value x)" functions for all non-boolean vector types
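Scalar reference versions of two of the additions above, "intpow" and "floorpow2", might look like the following. This is a sketch under assumptions: the names are hypothetical, overflow simply wraps, the result for an input of 0 in `FloorPow2` is whatever the bit trick produces (0), and none of this claims to match the library's SIMD code paths.

```csharp
using System;

static class IntegerHelpersSketch
{
    // Integer exponentiation by squaring: O(log exponent) multiplications.
    public static ulong IntPow(ulong b, uint exponent)
    {
        ulong result = 1;
        while (exponent != 0)
        {
            if ((exponent & 1u) != 0)
            {
                result = unchecked(result * b); // multiply in the current bit's power
            }
            b = unchecked(b * b);               // square for the next bit
            exponent >>= 1;
        }
        return result;
    }

    // Largest power of two <= x (for x > 0): smear the highest set bit
    // into all lower positions, then keep only the top bit.
    public static uint FloorPow2(uint x)
    {
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        return x - (x >> 1);
    }
}
```

Both routines are branch-light and map naturally onto per-lane vector operations, which is presumably why such functions are good candidates for SIMD versions.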
Improvements
- aggressively optimized away global variables (shuffle masks) and thus memory access and usage where appropriate
- improved performance of 256 bit vector subvector getters
- added Sse2 fallback code for all (u)long2/3/4 operators
- improved performance of multiplication, division and modulo operations for all (s)byte- and (u)short vector- and matrix types when dividing by a single non-compile time constant value
- added overloads for (s)byte- and (u)short vectors' "divrem" functions with a scalar value as the divisor parameter, improving performance when it is a compile time constant
- improved performance of "intsqrt" functions for most types
Changes
- bump com.unity.burst to version 1.5
Fixed Oversights
- added bitmask8 and bitmask16 functions for (s)byte and (u)short vector types, respectively
MaxMath v2.1.1
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2, caused by fallback code not using a constant value for "imm8"
- fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
- fixed "tzcnt" implementations (were completely broken)
- fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors
Additions
- added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
- added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
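The contracts stated above (unsigned results for "gcd" and "lcm", an ArgumentOutOfRangeException for negative "intsqrt" inputs) can be sketched for scalar `int`s like this. The names are hypothetical, Euclid's algorithm and `Math.Sqrt` stand in for whatever the library actually uses, and returning 0 for `Lcm` when both inputs are 0 is an assumption.

```csharp
using System;

static class NumberTheorySketch
{
    // Greatest common divisor on magnitudes, so the result is always unsigned.
    public static uint Gcd(int x, int y)
    {
        uint a = (uint)Math.Abs((long)x);
        uint b = (uint)Math.Abs((long)y);
        while (b != 0)
        {
            uint t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Least common multiple; divides before multiplying to delay overflow (which wraps).
    public static uint Lcm(int x, int y)
    {
        uint g = Gcd(x, y);
        if (g == 0) return 0; // both inputs are 0 (assumed convention)
        return unchecked((uint)(Math.Abs((long)x) / g * Math.Abs((long)y)));
    }

    // floor(sqrt(x)); negative inputs are rejected as described in the notes.
    public static int IntSqrt(int x)
    {
        if (x < 0) throw new ArgumentOutOfRangeException(nameof(x));
        int r = (int)Math.Sqrt(x);
        // Guard against any floating point rounding at the boundaries.
        while ((long)r * r > x) r--;
        while ((long)(r + 1) * (r + 1) <= x) r++;
        return r;
    }
}
```

Returning unsigned values means `Gcd(12, -18)` is `6u`, and it also lets `Gcd(int.MinValue, 0)` be represented, which would not fit into a signed `int`.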
Improvements
- performance improvements of "avg" functions for signed integer vectors
- added SIMD implementations of the "transpose" functions for all matrix types
- added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
- added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
- added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
- added SSE2 fallback code for typecasting, propagating through the entire library
- added SSE2 fallback code for "addsub" and "subadd" functions
- bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively
Changes
- renamed "BurstCompilerException" to "CPUFeatureCheckException"
- "shl", "shrl" and "shra" now have undefined behavior when bitshifting any value outside of the interval [0, 8 * sizeof(integer_type) - 1] for performance reasons and because of differences between SSE, AVX and managed C#
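The difference in out-of-range shift behavior mentioned above can be demonstrated in managed C#. In this sketch (hypothetical names), `ManagedShl` shows what the C# `<<` operator does with a 32 bit operand, while `ZeroingShl` emulates the behavior of x86 SIMD logical shifts, which produce 0 once the shift amount reaches the element width.

```csharp
using System;

static class ShiftSketch
{
    // C# masks the shift count of a 32 bit operand to its low 5 bits,
    // so shifting by 32 behaves like shifting by 0.
    public static int ManagedShl(int x, int n) => x << n;

    // Emulation of SIMD-style logical left shifts: any count outside [0, 31] yields 0.
    public static int ZeroingShl(int x, int n) => (uint)n > 31u ? 0 : x << n;
}
```

Both functions agree for every shift amount inside [0, 8 * sizeof(int) - 1], which is exactly the interval the changelog entry declares as defined; `ManagedShl(1, 32)` gives 1, whereas `ZeroingShl(1, 32)` gives 0.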
Fixed Oversights
- added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
- added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
- added "compareto" functions for all vector types except half- and quarter vectors
- added "all_dif" functions for (s)byte32 vectors
- added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors
2.1.1 Hotfix
Fixes
- fixed SSE2 "shl", "shrl" and "shra" implementations
- fixed SSE2 "intsqrt" implementations
Improvements
- improved performance of (s)byte2, -3, -4, -8, -16 and (u)short2, -3, -4, -8 "gcd" functions (and thus "lcm") when compiling for Avx2
- improved performance of "tzcnt" and "lzcnt" implementations for all vector types if compiling for SSE4 or higher, propagating through a lot of the library
Fixed Oversights
- added documentation for RandomX methods
MaxMath v2.1.0
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed a Burst compilation error triggered by "Sse4_1.blend_epi16" when compiling for SSE2, caused by fallback code not using a constant value for "imm8"
- fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
- fixed "tzcnt" implementations (were completely broken)
- fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors
Additions
- added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
- added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
Improvements
- performance improvements of "avg" functions for signed integer vectors
- added SIMD implementations of the "transpose" functions for all matrix types
- added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
- added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
- added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
- added SSE2 fallback code for typecasting, propagating through the entire library
- added SSE2 fallback code for "addsub" and "subadd" functions
- bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively
Changes
- renamed "BurstCompilerException" to "CPUFeatureCheckException"
- "shl", "shrl" and "shra" now have undefined behavior when bitshifting any value outside of the interval [0, 8 * sizeof(integer_type) - 1] for performance reasons and because of differences between SSE, AVX and managed C#
Fixed Oversights
- added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
- added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
- added "compareto" functions for all vector types except half- and quarter vectors
- added "all_dif" functions for (s)byte32 vectors
- added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors
MaxMath v2.0.0
Re-Release Notes
- Version 2.0.0 adds - for the first time - fallback procedures from Avx2 to Sse4, Sse2 and platform independent instruction sets, respectively, with some major optimizations for all of them
- ARM and other instruction sets do NOT have optimized fallback procedures written for them, and there are no plans for it at this time. Burst/LLVM are good at recognizing the patterns in the code, though, and some of the code will be vectorized for other platforms (confirmed)
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed incorrect bool4 subvector getters of the bool8 type
Improvements
- removed "fixed" vector element access to improve performance in managed C#
Additions
- added "shuffle(vector, vector, ShuffleComponent(, ShuffleComponent)(, ShuffleComponent)(, ShuffleComponent))" functions for (s)byte, (u)short, (u)long, quarter and half vectors
Changes
- Bump com.unity.burst to version 1.4.4
Fixed Oversights
- added "addsub" function for floating point types, complementary to "subadd"
- added "addsub" and "subadd" functions for integer types