Skip to content

Craneline vector shift count semantics does not match docs #4424

@uweigand

Description

@uweigand

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me. In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:

    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0

However, looking at the documentation, we have this statement:

        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.

Now, admittedly, this does not explicitly refer to vector types either way. However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction does implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics. And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:

      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }

It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it. This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugIncorrect behavior in the current implementation that needs fixingcraneliftIssues related to the Cranelift code generatorcranelift:area:clifcranelift:docscranelift:documentationAreas that need better documentation.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions