Craneline vector shift count semantics does not match docs

In implementing SIMD support for s390x, the precise semantics of handling too-large shift counts for Cranelift vector shift instructions seemed unclear to me.   In the existing aarch64 and x86_64 targets, the Cranelift instructions are mapped directly to the native instructions, which happen to treat shift counts larger than the lane width by shifting everything off the end of the lane, i.e. resulting in an all-zero (or all-sign-bit-copies) lane element.

This behavior is actually explicitly verified in the simd-bitwise-run.clif runtest:
```
    v0 = iconst.i32 17 ; note that this will shift off the end of each lane
    v1 = vconst.i16x8 [1 2 4 8 16 32 64 128]
    v2 = ishl v1, v0
```

However, looking at the documentation, we have this statement:
```
        Integer shift left. Shift the bits in ``x`` towards the MSB by ``y``
        places. Shift in zero bits to the LSB.

        The shift amount is masked to the size of ``x``.

        When shifting a B-bits integer type, this instruction computes:

            s &:= y \pmod B,
            a &:= x \cdot 2^s \pmod{2^B}.
```
Now, admittedly, this does not explicitly refer to vector types either way.  However, usually the semantics for the vector operation matches the one for the scalar operation, so I would have expected that the shift count would be considered modulo the lane size.

And since the s390x vector shift instruction *does* implement that exact modulo semantics, I've simply implemented the Cranelift instruction directly with the native s390x instruction - only to see a failure in the runtest.

Looking at the Wasmtime as main user of Cranelift, the WebAssembly vector shifts are also specified to use modulo semantics.  And indeed, the WebAssembly to Cranelift mapping adds an explicit modulo operation there:
```
      Operator::I8x16Shl | Operator::I16x8Shl | Operator::I32x4Shl | Operator::I64x2Shl => {
            let (a, b) = state.pop2();
            let bitcast_a = optionally_bitcast_vector(a, type_of(op), builder);
            let bitwidth = i64::from(type_of(op).lane_bits());
            // The spec expects to shift with `b mod lanewidth`; so, e.g., for 16 bit lane-width
            // we do `b AND 15`; this means fewer instructions than `iconst + urem`.
            let b_mod_bitwidth = builder.ins().band_imm(b, bitwidth - 1);
            state.push1(builder.ins().ishl(bitcast_a, b_mod_bitwidth))
        }
```

It seems to me it would be preferable to have the Cranelift IR semantics match the WebAssembly semantics, remove that explicit modulo operation during the WebAssembly->Cranelift translation, and add it to those back ends that need it.   This would allow efficient code generation on s390x without affecting performance on the other platforms.

In any case, the documentation should be updated to explicitly define the Cranelift shift count semantics for vector types.

FYI @cfallin @afonso360 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Craneline vector shift count semantics does not match docs #4424

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Craneline vector shift count semantics does not match docs #4424

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions