
Conversation

@pixelherodev
Contributor

This primarily consists of reducing allocations.

E.g. 03be6d3 adds a specific case to TypeEqual for ListView types, which avoids reflect.DeepEqual: l.elem.Equal does not allocate, while reflect.DeepEqual does, so this reduces allocations when comparing a ListViewType.
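
A minimal sketch of the idea (simplified; the real TypeEqual case also has to respect field metadata): comparing the element types directly avoids the interface boxing and reflection work that reflect.DeepEqual does.

package sketch

import "github.com/apache/arrow-go/v18/arrow"

// listViewEqual is an illustrative helper, not the actual TypeEqual case.
func listViewEqual(l, r *arrow.ListViewType) bool {
	// Comparing the element DataTypes is allocation-free, unlike
	// reflect.DeepEqual on the whole ListViewType value.
	return arrow.TypeEqual(l.Elem(), r.Elem())
}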

8c7729f changes references to flatbuf internal types from by-reference to by-value, which avoids letting them escape to the heap.
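
A minimal sketch of the by-value pattern (the type and field names are illustrative stand-ins, not the actual flatbuf accessors): a struct returned by pointer usually escapes to the heap, while the same struct returned by value can stay on the caller's stack.

package sketch

import "encoding/binary"

// header is an illustrative stand-in for a small flatbuffer-style struct.
type header struct {
	off, length uint32
}

// parsePtr returns a pointer, so the header typically escapes to the heap.
func parsePtr(b []byte) *header {
	return &header{off: binary.LittleEndian.Uint32(b), length: binary.LittleEndian.Uint32(b[4:])}
}

// parseVal returns the struct by value; escape analysis keeps it on the stack.
func parseVal(b []byte) header {
	return header{off: binary.LittleEndian.Uint32(b), length: binary.LittleEndian.Uint32(b[4:])}
}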

e340658 switches from reflect.DeepEqual to slices.Equal when comparing the child IDs of a Union type, which is functionally equivalent for slices and, again, does not allocate.

9eba6d2 does the same for an arrowflight cookie test, replacing reflect.DeepEqual with maps.Equal; I noticed it while grepping for reflect.DeepEqual.
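
Both commits follow the same pattern; a sketch with illustrative element types (not the actual union or cookie code): the typed stdlib helpers compare element-by-element without boxing their arguments in interfaces, so they don't allocate and are functionally equivalent here.

package sketch

import (
	"maps"
	"slices"
)

// childIDsEqual mirrors the union child-ID comparison: element-by-element,
// no reflection, no allocation.
func childIDsEqual(a, b []int8) bool {
	return slices.Equal(a, b)
}

// cookiesEqual mirrors the arrowflight cookie-map comparison the same way.
func cookiesEqual(a, b map[string]string) bool {
	return maps.Equal(a, b)
}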

7557d15 is probably the most objectionable; it replaces a *float64 with a plain float64 for minSpaceSavings, since the current nil value behaves identically to zero anyway. It's a tiny cleanup IMO, but doesn't really have any practical value.

2d3593f is a bit ugly; it inlines copies of NewReaderFromMessageReader and NewMessageReader into NewReader. By implementing the two directly, it avoids allocating two copies of the config structure and avoids looping over and invoking the config functions twice. It's still not pretty, though, and the // types: make(dictTypeMap) comment that gets copied over suggests there's further room for improvement.

I didn't drop the two APIs, as they are exposed to users, but IMO the performance win is probably worth it, and it can be cleaned up later if needed.
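
Roughly the shape of the before/after, with simplified stand-in types rather than the real ipc signatures: the win is building the config once and filling in both structs directly.

package sketch

import "io"

// Illustrative stand-ins for the ipc package internals.
type config struct{ ensureNativeEndian bool }

type Option func(*config)

type messageReader struct {
	r   io.Reader
	cfg *config
}

type Reader struct {
	mr  *messageReader
	cfg *config
}

func applyOpts(opts []Option) *config {
	cfg := &config{}
	for _, o := range opts {
		o(cfg)
	}
	return cfg
}

// newReaderOld chains the two constructors: two config allocations,
// two passes over the option functions.
func newReaderOld(r io.Reader, opts ...Option) *Reader {
	mr := &messageReader{r: r, cfg: applyOpts(opts)}
	return &Reader{mr: mr, cfg: applyOpts(opts)}
}

// newReaderNew builds the config once and shares it between both structs.
func newReaderNew(r io.Reader, opts ...Option) *Reader {
	cfg := applyOpts(opts)
	return &Reader{mr: &messageReader{r: r, cfg: cfg}, cfg: cfg}
}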

0273ffe is a small optimization to the allocator: the Go allocator branches on each allocation, but both branches are functionally equivalent, so I just merged the two. I doubt this has a major impact, since the cost of a branch is negligible compared to the cost of a heap allocation, but performance is often won through a lot of small improvements IMO :)
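
A sketch of the merged allocation path (the alignment constant and helper shape are illustrative, following the commit message further down): when the buffer is already aligned, shift is 0 and the shifted slice expression reduces to buf[:size:size], so one statement handles both cases.

package sketch

import "unsafe"

const alignment = 64

// allocateAligned is an illustrative version of the merged branch.
func allocateAligned(size int) []byte {
	buf := make([]byte, size+alignment)
	addr := uintptr(unsafe.Pointer(&buf[0]))
	next := (addr + alignment - 1) &^ (alignment - 1) // round up to the alignment
	shift := int(next - addr)                         // 0 when addr is already aligned
	// Previously one branch handled shift == 0 and another shift > 0;
	// this slice expression is correct for both.
	return buf[shift : size+shift : size+shift]
}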

Lastly, for now at least, f1a6a33 statically allocates 4 bytes per IPC messageReader (I did not touch the ArrowFlight implementation), and uses it on every message read, rather than heap-allocating 4 bytes each time Message is called.
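
Roughly the shape of the change (field and method names here are illustrative): the 4-byte length prefix is read into a fixed-size array embedded in the reader, instead of a fresh 4-byte slice heap-allocated on every Message call.

package sketch

import (
	"encoding/binary"
	"io"
)

type messageReader struct {
	r      io.Reader
	lenBuf [4]byte // reused for every message's length prefix; never heap-allocated
}

// readLength reads the next 4-byte prefix without allocating.
func (m *messageReader) readLength() (int32, error) {
	if _, err := io.ReadFull(m.r, m.lenBuf[:]); err != nil {
		return 0, err
	}
	return int32(binary.LittleEndian.Uint32(m.lenBuf[:])), nil
}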

There's more to be done, but this should be a much more practically sized chunk to review, and I can add more later :)

Noam Preil added 8 commits October 29, 2025 21:58
This avoids an unnecessary heap allocation, reducing GC pressure and
improving the latency and throughput of anything that processes messages.
This is a drive-by patch; the test was failing for the other
optimization, and I noticed this when I went to take a look and figured
I'd just send the patch.
This is a bit bloated, but I wasn't sure if the API for NewMessageReader
and NewReaderFromMessageReader could be changed to take in the config
struct directly.

It would be a lot cleaner to expose both APIs with Config directly...

This drops another allocation from every call to NewReader, which - if a
new reader must be created for every API call - can be noticeable...
In branch 1, shift is defined as next-addr. We then slice buf with shift
as the offset, and with size+shift as length and capacity.

In branch 2, addr == next, and thus shift == 0. size+shift == size.

The slice statements are thus equivalent; there's no reason to be
branching here.
@pixelherodev
Contributor Author

There's a lot of variance in the run times for the existing benchmarks for me, but the before and after for allocations are consistent:

# branch 'main'
goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/arrow/ipc
cpu: AMD Ryzen 9 7900X3D 12-Core Processor          
BenchmarkIPC/Writer/codec=plain-24         	 207901	     5824 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 247482	     4812 ns/op	   6132 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3618	   332506 ns/op	2338754 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  40870	    24626 ns/op	  25887 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 112738	    10490 ns/op	  12864 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 133892	     9154 ns/op	  10115 B/op	    124 allocs/op


# branch 'small-optimizations'
BenchmarkIPC/Writer/codec=plain-24         	 168954	     7469 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 259921	     4970 ns/op	   5993 B/op	     76 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3633	   327594 ns/op	2338740 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  40442	    26658 ns/op	  25739 B/op	    187 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 113894	    13151 ns/op	  13098 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 134320	     8952 ns/op	   9951 B/op	    117 allocs/op

@pixelherodev
Contributor Author

I ran go test -bench . -benchmem -benchtime 30s before and after to highlight the variance better without running the test 20 times.

Before:

BenchmarkIPC/Writer/codec=plain-24         	6277569	     6240 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	6876433	     5247 ns/op	   6131 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	 114764	   324831 ns/op	2338736 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	1310571	    28743 ns/op	  25869 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	2930214	    12186 ns/op	  12905 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	3445965	    10790 ns/op	  10174 B/op	    124 allocs/op

After:

BenchmarkIPC/Writer/codec=plain-24         	6302840	     5482 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	7418485	     4689 ns/op	   5995 B/op	     76 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	 102850	   346718 ns/op	2338731 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	1422286	    25704 ns/op	  25742 B/op	    187 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	2921695	    11956 ns/op	  12880 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	3545550	    10179 ns/op	   9986 B/op	    117 allocs/op

Both sets of numbers are a bit worse like this, but more consistent; readers look to be consistently ~5-10% faster.

@zeroshade
Member

Very nice! Thanks for this! Just got a linting issue to deal with before I can merge this 😄

@pixelherodev
Contributor Author

Okay, go fmt'd :)

@pixelherodev
Contributor Author

...so, both Verify jobs passed on the prior run before go fmt. I'm assuming those CI tests are not reliable?

@pixelherodev
Contributor Author

Yeah, the Ubuntu one passed on the second run with no changes, and the macOS image is failing an entirely different test.

The macOS memory leak looks correct, by the way - in that it's leaking intentionally, because the memory is pooled. When the memory is returned to the pool, buffer.ResizeNoShrink is used; since that resizes with shrink=false, Free() is never called.
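
A small sketch of the interaction (the pool wiring is illustrative; ResizeNoShrink is the call mentioned above): the buffer keeps its backing memory when it goes back into the pool, so the checked allocator still counts those bytes as outstanding at the end of the test.

package sketch

import (
	"sync"

	"github.com/apache/arrow-go/v18/arrow/memory"
)

// returnToPool hands a buffer back without releasing its backing memory.
func returnToPool(pool *sync.Pool, buf *memory.Buffer) {
	// ResizeNoShrink sets the length to 0 but keeps the capacity allocated,
	// so Free() is never called and the checked allocator reports a "leak".
	buf.ResizeNoShrink(0)
	pool.Put(buf)
}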

@pixelherodev
Contributor Author

My suspicion is that the test is flaky because of timing: if the GC triggers twice after the memory is returned to the pool but before the checked allocator's check runs, then the pool's cleanup runs, the Buffer's finalizer fires, and the memory gets Release()d.

Using the checked allocator here is going to be very timing dependent, I think.

@pixelherodev
Contributor Author

The buffer pool used in that test has

New: func() interface{} {
  buf := memory.NewResizableBuffer(suite.mem)
  runtime.SetFinalizer(buf, func(obj *memory.Buffer) { obj.Release() })
  return buf
}

This is not reliable. It depends upon the pool getting GCed before the test finishes.

@pixelherodev
Contributor Author

...the test runs runtime.GC() manually twice, but AFAIK there is no guarantee that the GC will run the finalizers on the same pass that frees them.

@pixelherodev
Contributor Author

Moreover, I'm not sure the second run would even free the memory of an individual buffer.

Two GC passes should be sufficient for the pool to become empty - but if the pool only becomes empty during a GC pass, I don't think the objects that are removed will themselves get freed on that same pass?

It's possible a third GC pass would be sufficient... which would also explain why it sometimes works, as it's possible for a GC pass to begin in between the two calls to runtime.GC?

@pixelherodev
Contributor Author

Yeah, reading through SetFinalizer, it's not actually guaranteed that the finalizers will run at all, effectively. They are also explicitly run one at a time, so if there are, say, 100 buffers being cleaned up, they'll be cleaned up serially while runtime.GC finishes and returns.

I'm not even sure how that middleware test can even be flaky tbh.

@pixelherodev
Contributor Author

Pushed a small fix to the arrow flight tests; I didn't have localhost configured locally, so the test failed with a nil dereference on s.lis.Addr() because s.lis was nil.

With localhost configured correctly locally, those tests are passing...

@zeroshade
Member

I'm not even sure how that middleware test can even be flaky tbh.

Yeah, that's the issue; I've looked at it several times and still haven't been able to figure out what is flaky about it.

@pixelherodev
Contributor Author

hahahaha and once again the verification succeeded before running go fmt. 🤞 that it stays that way..

@zeroshade
Member

🤞

Member

@zeroshade zeroshade left a comment


All passed, looks great! Thanks so much for these!

@zeroshade zeroshade merged commit a9b0be4 into apache:main Oct 31, 2025
16 checks passed