Summary
With #49397 we approved and exposed cross-platform APIs on `Vector64<T>`/`Vector128<T>`/`Vector256<T>` to help developers more easily support multiple platforms.
This was done by mirroring the surface area exposed by `Vector<T>`. However, due to their fixed sizes, there are some additional APIs that would be beneficial to expose. Likewise, there are a few APIs for loading/storing vectors, commonly used with hardware intrinsics, for which cross-platform helpers would be beneficial.
The APIs exposed would include the following:
`ExtractMostSignificantBits`
- On x86/x64 this would be emitted as `MoveMask` and performs exactly as expected
- On ARM64, this would be emulated via `and`, element-wise shift-right, 64-bit pairwise add, and extract. The JIT could optionally detect if the input is the result of a `Compare` instruction and elide the shift-right.
- On WASM, this is called `bitmask` and works identically to `MoveMask`
- This API and its emulation are used throughout the BCL
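To illustrate the motivation, here is a hedged sketch of the classic "compare then mask" search loop written against the proposed APIs (the `IndexOf` helper and its shape are illustrative, not an existing BCL method):

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static int IndexOf(ReadOnlySpan<byte> data, byte value)
{
    ref byte start = ref MemoryMarshal.GetReference(data);
    Vector128<byte> target = Vector128.Create(value);
    nuint i = 0;

    for (; i + (nuint)Vector128<byte>.Count <= (nuint)data.Length; i += (nuint)Vector128<byte>.Count)
    {
        // Equal lanes become all-ones, other lanes all-zeros
        Vector128<byte> eq = Vector128.Equals(Vector128.LoadUnsafe(ref start, i), target);

        // One bit per lane: MoveMask on x86/x64, emulated on ARM64, bitmask on WASM
        uint mask = Vector128.ExtractMostSignificantBits(eq);

        if (mask != 0)
        {
            return (int)i + BitOperations.TrailingZeroCount(mask);
        }
    }

    // Scalar tail for the remaining elements
    for (; i < (nuint)data.Length; i++)
    {
        if (data[(int)i] == value)
        {
            return (int)i;
        }
    }

    return -1;
}
```

Because `eq` is the result of a compare, this is exactly the pattern where the JIT could elide the ARM64 shift-right.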
`Load`/`Store`
- These are the basic load/store operations already in use for x86, x64, and ARM64
`LoadAligned`/`StoreAligned`
- This works exactly like the same-named APIs on x86/x64
- When optimizations are disabled, the alignment is verified
- When optimizations are enabled, the alignment checking may be skipped due to the load being folded into the consuming instruction on modern hardware
- This enables efficient usage of the instruction on both older (pre-AVX) hardware as well as newer (post-AVX) or ARM64 hardware (where no aligned load/store instructions exist)
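A hedged usage sketch (using `NativeMemory.AlignedAlloc` to obtain a suitably aligned buffer, and assuming the parameter order proposed below for `StoreAligned`):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static unsafe byte DoubleAlignedBytes()
{
    // 16 bytes on a 16-byte boundary, as LoadAligned/StoreAligned require
    byte* p = (byte*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);

    try
    {
        new Span<byte>(p, 16).Fill(0x01);

        // Alignment is verified when optimizations are disabled; with them
        // enabled, the load may fold into the consuming add instruction
        Vector128<byte> v = Vector128.LoadAligned(p);
        Vector128.StoreAligned(p, v + v);

        return p[0]; // each byte is now 0x02
    }
    finally
    {
        NativeMemory.AlignedFree(p);
    }
}
```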
`LoadAlignedNonTemporal`/`StoreAlignedNonTemporal`
- This behaves as `LoadAligned`/`StoreAligned` but may optionally treat the memory access as non-temporal and avoid polluting the cache
`LoadUnsafe`/`StoreUnsafe`
- These are "new" APIs; they cover a "gap" in the API surface that has been encountered and worked around in the BCL and which is semi-regularly requested by the community
- The API that just takes a `ref T` behaves exactly like the version that takes a pointer, just without requiring pinning
- The API that additionally takes an `nuint index` behaves like `ref Unsafe.Add(ref value, index)` and avoids needing to further bloat the IL and hinder readability
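A hedged sketch of how these overloads remove both the pinning and the `Unsafe.Add` boilerplate (the `Negate` helper is illustrative, and the `StoreUnsafe` call follows the parameter order proposed below):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static void Negate(Span<int> values)
{
    ref int start = ref MemoryMarshal.GetReference(values);
    nuint i = 0;

    for (; i + (nuint)Vector128<int>.Count <= (nuint)values.Length; i += (nuint)Vector128<int>.Count)
    {
        // Reads from ref Unsafe.Add(ref start, i) without any pinning
        Vector128<int> v = Vector128.LoadUnsafe(ref start, i);
        Vector128.StoreUnsafe(ref start, i, -v);
    }

    // Scalar tail for the remaining elements
    for (; i < (nuint)values.Length; i++)
    {
        values[(int)i] = -values[(int)i];
    }
}
```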
API Proposal
```csharp
namespace System.Runtime.Intrinsics
{
    public static partial class Vector64
    {
        public static uint ExtractMostSignificantBits<T>(Vector64<T> vector);

        public static Vector64<T> Load<T>(T* address);
        public static Vector64<T> LoadAligned<T>(T* address);
        public static Vector64<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector64<T> LoadUnsafe<T>(ref T address);
        public static Vector64<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector64<T> source);
        public static void StoreAligned<T>(T* address, Vector64<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector64<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector64<T> source);
    }

    public static partial class Vector128
    {
        public static uint ExtractMostSignificantBits<T>(Vector128<T> vector);

        public static Vector128<T> Load<T>(T* address);
        public static Vector128<T> LoadAligned<T>(T* address);
        public static Vector128<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector128<T> LoadUnsafe<T>(ref T address);
        public static Vector128<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector128<T> source);
        public static void StoreAligned<T>(T* address, Vector128<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector128<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector128<T> source);
    }

    public static partial class Vector256
    {
        public static uint ExtractMostSignificantBits<T>(Vector256<T> vector);

        public static Vector256<T> Load<T>(T* address);
        public static Vector256<T> LoadAligned<T>(T* address);
        public static Vector256<T> LoadAlignedNonTemporal<T>(T* address);
        public static Vector256<T> LoadUnsafe<T>(ref T address);
        public static Vector256<T> LoadUnsafe<T>(ref T address, nuint index);

        public static void Store<T>(T* address, Vector256<T> source);
        public static void StoreAligned<T>(T* address, Vector256<T> source);
        public static void StoreAlignedNonTemporal<T>(T* address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, Vector256<T> source);
        public static void StoreUnsafe<T>(ref T address, nuint index, Vector256<T> source);
    }
}
```
Additional Notes
Ideally we would also expose "shuffle" APIs allowing the elements of a single or multiple vectors to be reordered:
- On x86/x64, these are referred to as `Shuffle` or `Permute` (generally taking two elements and one element, respectively; but that isn't always the case)
- On ARM64, these are referred to as `VectorTableLookup` (only takes two elements)
- On WASM, these are referred to as `Shuffle` (takes two elements) and `Swizzle` (takes one element)
- On LLVM, these are referred to as `VectorShuffle` and only take two elements
Due to the complexities of these APIs, they can't trivially be exposed as a "single" generic API. Likewise, while the behavior for `Vector128<T>` is consistent on all platforms, `Vector64<T>` is ARM64-specific and `Vector256<T>` is x86/x64-specific. The former behaves like `Vector128<T>`, while the latter generally behaves like 2x `Vector128<T>` (outside a few APIs called `Permute#x#`). For consistency, the `Vector256<T>` APIs exposed here would behave identically to `Vector128<T>` and allow "cross-lane permutation".
For the single-vector reordering, the APIs are "trivial":
```csharp
public static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> vector, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> vector, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> vector, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> vector, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> vector, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> vector, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> vector, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> vector, Vector128<long> indices)
```
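For example, a hedged sketch of reversing a vector with the proposed single-vector overload, where each `indices[i]` names the source element that lands in `result[i]`:

```csharp
using System.Runtime.Intrinsics;

Vector128<byte> v = Vector128.Create(
    (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

// Descending indices reverse the element order
Vector128<byte> indices = Vector128.Create(
    (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

Vector128<byte> reversed = Vector128.Shuffle(v, indices);
// reversed is { 15, 14, ..., 1, 0 }
```

With constant indices like these, the JIT can emit a single `pshufb` (x86/x64) or `tbl` (ARM64).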
For the two-vector reordering, the APIs are generally the same:
```csharp
public static Vector128<byte> Shuffle(Vector128<byte> lower, Vector128<byte> upper, Vector128<byte> indices)
public static Vector128<sbyte> Shuffle(Vector128<sbyte> lower, Vector128<sbyte> upper, Vector128<sbyte> indices)
public static Vector128<short> Shuffle(Vector128<short> lower, Vector128<short> upper, Vector128<short> indices)
public static Vector128<ushort> Shuffle(Vector128<ushort> lower, Vector128<ushort> upper, Vector128<ushort> indices)
public static Vector128<int> Shuffle(Vector128<int> lower, Vector128<int> upper, Vector128<int> indices)
public static Vector128<uint> Shuffle(Vector128<uint> lower, Vector128<uint> upper, Vector128<uint> indices)
public static Vector128<float> Shuffle(Vector128<float> lower, Vector128<float> upper, Vector128<int> indices)
public static Vector128<long> Shuffle(Vector128<long> lower, Vector128<long> upper, Vector128<long> indices)
public static Vector128<ulong> Shuffle(Vector128<ulong> lower, Vector128<ulong> upper, Vector128<ulong> indices)
public static Vector128<double> Shuffle(Vector128<double> lower, Vector128<double> upper, Vector128<long> indices)
```
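A semantics sketch for the two-vector overload, purely against the proposed signature (the assumption here, matching ARM64 `TBL2` and the WASM shuffle, is that indices 0 through 15 select from `lower` and 16 through 31 from `upper`):

```csharp
using System.Runtime.Intrinsics;

Vector128<byte> lower = Vector128.Create(
    (byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
Vector128<byte> upper = Vector128.Create(
    (byte)16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);

// Extract a 16-byte window straddling the two inputs, similar to what
// ARM64 EXT/TBL2 would do for these constant indices
Vector128<byte> indices = Vector128.Create(
    (byte)8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23);

Vector128<byte> window = Vector128.Shuffle(lower, upper, indices);
// window would be { 8, 9, ..., 22, 23 }
```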
An upside of these APIs is that for common input scenarios involving constant indices, these can be massively simplified.
A downside of these APIs is that non-constant indices on older hardware, or certain `Vector256<T>` shuffles involving `byte`, `sbyte`, `short`, or `ushort` that cross the 128-bit lane boundary, can take a couple of instructions rather than a single instruction.
This is ultimately no worse than a few other scenarios on each platform where one platform may have slightly better instruction generation due to the instructions it provides.