Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions src/coreclr/jit/lowerxarch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -11124,6 +11124,14 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
uint32_t maskSize = genTypeSize(simdBaseType);
uint32_t operSize = genTypeSize(op2->AsHWIntrinsic()->GetSimdBaseType());

// Avx512F.Insert/ExtractVector256 accepts inputs with base type smaller that 64
// bits. it makes operSize wrong in those case.
if (op2->AsHWIntrinsic()->GetHWIntrinsicId() == NI_AVX512F_InsertVector256 ||
op2->AsHWIntrinsic()->GetHWIntrinsicId() == NI_AVX512F_ExtractVector256)
{
operSize = 8;
}
Copy link
Copy Markdown
Member

@tannergooding tannergooding Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the issue, but I'm not sure its entirely the "right" fix (and may miss some other similar cases).

We have a few instructions (namely those listed here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/instr.cpp#L110) where they effectively operate "bitwise" without masking but which need to consider the actual operand size when masking (or broadcast, if supported) is used.

So we probably need to ensure all of these are handled and are picking the "right" operand size (generally 4 when the input is of a "small type").

In the particular case of Insert/ExtractVector256 we should probably be using vinserti32x8 instead of vinserti64x4 for the small types and for int/uint. Those are technically AVX512DQ, but the JIT relies on all of F+BW+CD+DQ+VL being available together, so it should be fine to just adjust them in hwintrinsiclistxarch.h and get the "better" codegen. Some of the other intrinsics that need special handling may be in the same boat.
-- Alternatively we could just explicitly lower cases like NI_AVX512F_InsertVector256 to be NI_AVX512DQ_InsertVector256 and change the base type to int/uint for byte/ubyte, short/ushort, and int/uint. We could similarly lower cases like NI_AVX512F_And to normalize the base type to int/uint, which would likely also fix the issue

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the issue, but I'm not sure its entirely the "right" fix (and may miss some other similar cases).

We have a few instructions (namely those listed here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/instr.cpp#L110) where they effectively operate "bitwise" without masking but which need to consider the actual operand size when masking (or broadcast, if supported) is used.

Thanks for the inputs,

At a quick glance on the instructions on the list, and/andn/or/xor/insert/extract may hit the same bug, movudqx/round/broadcast does not have EmbMask support yet. Do we consider fix them all? or leave those unsupported instructions for now (those instructions intrinsically have the support for embedded masking).

So we probably need to ensure all of these are handled and are picking the "right" operand size (generally 4 when the input is of a "small type").

In the particular case of Insert/ExtractVector256 we should probably be using vinserti32x8 instead of vinserti64x4 for the small types and for int/uint. Those are technically AVX512DQ, but the JIT relies on all of F+BW+CD+DQ+VL being available together, so it should be fine to just adjust them in hwintrinsiclistxarch.h and get the "better" codegen. Some of the other intrinsics that need special handling may be in the same boat. -- Alternatively we could just explicitly lower cases like NI_AVX512F_InsertVector256 to be NI_AVX512DQ_InsertVector256 and change the base type to int/uint for byte/ubyte, short/ushort, and int/uint. We could similarly lower cases like NI_AVX512F_And to normalize the base type to int/uint, which would likely also fix the issue

I would try to use the first option, as this is how those logical instructions (And/Or/AndN/Xor) are designed as well and based on how Avx512 ISA family is being checked, it should safe to use DQ instructions. These might still need to be handled together during lowering to normalize the base data type to int/uint unless it is a long type. (did some extensive testing, And with base type to be byte will have the same bug.)


if ((maskSize == operSize) && IsInvariantInRange(op2, node))
{
MakeSrcContained(node, op2);
Expand Down
43 changes: 43 additions & 0 deletions src/tests/JIT/Regression/JitBlue/Runtime_105969/Runtime_105969.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

using System;
using System.Runtime.CompilerServices;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using Xunit;

// Generated by Fuzzlyn v2.2 on 2024-08-04 15:50:11
// Run on X64 Linux
// Seed: 9513455124659677293-vectort,vector128,vector256,vector512,x86aes,x86avx,x86avx2,x86avx512bw,x86avx512bwvl,x86avx512cd,x86avx512cdvl,x86avx512dq,x86avx512dqvl,x86avx512f,x86avx512fvl,x86avx512fx64,x86avx512vbmi,x86avx512vbmivl,x86bmi1,x86bmi1x64,x86bmi2,x86bmi2x64,x86fma,x86lzcnt,x86lzcntx64,x86pclmulqdq,x86popcnt,x86popcntx64,x86sse,x86ssex64,x86sse2,x86sse2x64,x86sse3,x86sse41,x86sse41x64,x86sse42,x86sse42x64,x86ssse3,x86x86base
// Reduced from 26.0 KiB to 1.1 KiB in 00:00:56
// Debug: Outputs <0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
// Release: Outputs <0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>

public class Runtime_105969
{
public static byte s_5;

[Fact]
public static void TestEntryPoint()
{
if (Avx512BW.IsSupported)
{
var vr16 = Vector512.CreateScalar(s_5);
var vr17 = Vector512.Create<byte>(1);
var vr18 = (byte)0;
var vr19 = Vector512.CreateScalar(vr18);
var vr20 = Vector128.Create<byte>(0);
var vr21 = Avx512BW.BroadcastScalarToVector512(vr20);
var vr22 = Vector256.Create<byte>(1);
var vr23 = Avx512F.InsertVector256(vr21, vr22, 0);
var vr24 = Vector512.Create(249, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0);
var vr25 = Avx512BW.BlendVariable(vr19, vr23, vr24);
var vr26 = Avx512BW.Min(vr17, vr25);
Vector512<byte> vr27 = Avx512BW.UnpackLow(vr16, vr26);
Vector512<byte> expected = Vector512.Create(0, (byte)1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
Assert.Equal(expected, vr27);
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<Optimize>True</Optimize>
</PropertyGroup>
<ItemGroup>
<Compile Include="$(MSBuildProjectName).cs" />
</ItemGroup>
</Project>