if (funcOp.getName() ==
    "main$async_dispatch_1_matmul_transpose_b_1x1200x400_f64") {
  l1Tiles[0] = 0;
  l1Tiles[1] = 88;
  l1Tiles[2] = 50;
}
<eval_with_key>.0 from /home/hoppip/Quidditch/venv/lib/python3.11/site-packages/torch/fx/experimental/proxy_tensor.py:551 in wrapped:19:0: warning: Let's look at ALL the allocOps before doging ANYTHING!
allocOp with memref shape 1 1200
allocOp with memref shape 1 1232
allocOp with memref shape 1 50
allocOp with memref shape 88 50
allocOp with memref shape 88 50
allocOp with memref shape 1 1200
allocOp with memref shape 1 1200
Well, those were all the allocOps... =_=
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 9600
allocOp with memref shape 1 1232
memref size is 8
allocElements is 1232
NOW memref size is 9856
offset is 19456
allocOp with memref shape 1 50
memref size is 8
allocElements is 50
NOW memref size is 400
offset is 19856
allocOp with memref shape 88 50
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 55104
allocOp with memref shape 88 50
memref size is 8
allocElements is 4400
NOW memref size is 35200
offset is 90304
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 99904
allocOp with memref shape 1 1200
memref size is 8
allocElements is 1200
NOW memref size is 9600
offset is 109504
allocElements is 1200
memref size is 9600
offset is 109504
l1MemoryBytes is 100000, so 9504 too much
kernel does not fit into L1 memory and cannot be compiled
// allocate a buffer of 1x1200 elements
%25 = "arith.constant"() <{value = 0 : index}> : () -> index
%26 = "memref.view"(%0, %25) : (memref<100000xi8>, index) -> memref<1200xf64>
%27 = "memref.reinterpret_cast"(%26) <{operandSegmentSizes = array<i32: 1, 0, 0, 0>, static_offsets = array<i64: 0>, static_sizes = array<i64: 1, 1200>, static_strides = array<i64: 1200, 1>}> : (memref<1200xf64>) -> memref<1x1200xf64>
%28 = "memref.alloca"() <{alignment = 64 : i64, operandSegmentSizes = array<i32: 0, 0>}>
: () -> memref<1x1200xf64, #quidditch_snitch.l1_encoding>
// set this buffer to all zeroes
%29 = "quidditch_snitch.compute_core_index"() : () -> index
%30 = "affine.apply"(%29) <{map = affine_map<()[s0] -> (s0 * 150)>}> : (index) -> index
"scf.for"(%30, %1, %1) ({
^bb0(%arg25: index):
%94 = "memref.subview"(%27, %arg25) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: 0, -9223372036854775808>, static_sizes = array<i64: 1, 150>, static_strides = array<i64: 1, 1>}> : (memref<1x1200xf64>, index) -> memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>
"quidditch_snitch.memref.microkernel"(%94) ({
^bb0(%arg26: memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>):
%95 = "arith.constant"() <{value = 0.000000e+00 : f64}> : () -> f64
"linalg.fill"(%95, %arg26) <{operandSegmentSizes = array<i32: 1, 1>}> ({
^bb0(%arg27: f64, %arg28: f64):
"linalg.yield"(%arg27) : (f64) -> ()
}) : (f64, memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
}) : (memref<1x150xf64, strided<[1200, 1], offset: ?>, #quidditch_snitch.l1_encoding>) -> ()
"quidditch_snitch.microkernel_fence"() : () -> ()
"scf.yield"() : () -> ()
}) : (index, index, index) -> ()
// after this point, the 1x1200 buffer allocated to %28 or equivalently %26 never gets used again!
Changes made to ConfigureForSnitch.cpp:
Replaced [0,40,100] tiling configuration with [0,88,50]:
Changes made to LowerL1Allocations.cpp:
offset >= l1MemoryBytes is met.
Attached LowerL1Allocations.cpp as a .txt (LowerL1Allocations.txt) in case this makes my modifications clearer.
The kernel does not fit in L1:
When the kernel does not fit in L1, its IR gets dumped to stderr, and we can see that an unnecessary buffer gets allocated (lines 99-124 of 0-88-56-build-failure-dispatch-1.mlir):
The buffer assigned to %28 and %26 gets set to zero and is then never used again.
Build output attached as 0-88-56-build-failure.txt
Annotated MLIR for the dumped dispatch attached as 0-88-56-build-failure-dispatch-1.txt
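
To make the offsets in the build log easier to check by hand, here is a minimal sketch that replays the allocation arithmetic. It is not the code from LowerL1Allocations.cpp: the helper names (alignUp, l1BytesNeeded, loggedShapes) are hypothetical, and the 64-byte alignment is an assumption taken from the `alignment = 64` attribute on the dumped `memref.alloca`. That assumption does reproduce every logged offset, including the jump from 19856 to 55104.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Shape = std::pair<std::int64_t, std::int64_t>; // (rows, cols)

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
constexpr std::int64_t alignUp(std::int64_t offset, std::int64_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Hypothetical replay of the logged bump allocation: align the running
// offset (assumed 64 bytes), then add the buffer size in bytes.
// elementBytes = 8 matches the "memref size is 8" lines (f64 elements).
std::int64_t l1BytesNeeded(const std::vector<Shape> &shapes,
                           std::int64_t elementBytes = 8,
                           std::int64_t alignment = 64) {
  assert(elementBytes > 0 && alignment > 0);
  std::int64_t offset = 0;
  for (const Shape &shape : shapes) {
    offset = alignUp(offset, alignment);
    offset += shape.first * shape.second * elementBytes;
  }
  return offset;
}

// The seven allocOp shapes, in the order they appear in the build log.
std::vector<Shape> loggedShapes() {
  return {{1, 1200}, {1, 1232}, {1, 50}, {88, 50},
          {88, 50},  {1, 1200}, {1, 1200}};
}
```

Under these assumptions, `l1BytesNeeded(loggedShapes())` comes out to 109504 bytes, which is 9504 more than the l1MemoryBytes of 100000 reported in the log. Running the same calculation with the first 1x1200 entry removed gives 99904 bytes, which would be below the limit if the allocator really does work this way.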