
Conversation

@pixelherodev
Contributor

This primarily consists of reducing allocations.

E.g. 03be6d3 adds a specific case to TypeEqual for ListView types, which avoids reflect.DeepEqual: l.elem.Equal does not allocate, while reflect.DeepEqual does, so this reduces allocations when comparing a ListViewType.
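
A minimal sketch of the idea (simplified; the real TypeEqual case also has to respect field metadata): comparing the element types directly avoids the interface boxing and reflection work that reflect.DeepEqual does.

package sketch

import "github.com/apache/arrow-go/v18/arrow"

// listViewEqual is an illustrative helper, not the actual TypeEqual case.
func listViewEqual(l, r *arrow.ListViewType) bool {
	// Comparing the element DataTypes is allocation-free, unlike
	// reflect.DeepEqual on the whole ListViewType value.
	return arrow.TypeEqual(l.Elem(), r.Elem())
}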

8c7729f changes references to flatbuf internal types from by-reference to by-value, which avoids letting them escape to the heap.
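
A minimal sketch of the by-value pattern (the type and field names are illustrative stand-ins, not the actual flatbuf accessors): a struct returned by pointer usually escapes to the heap, while the same struct returned by value can stay on the caller's stack.

package sketch

import "encoding/binary"

// header is an illustrative stand-in for a small flatbuffer-style struct.
type header struct {
	off, length uint32
}

// parsePtr returns a pointer, so the header typically escapes to the heap.
func parsePtr(b []byte) *header {
	return &header{off: binary.LittleEndian.Uint32(b), length: binary.LittleEndian.Uint32(b[4:])}
}

// parseVal returns the struct by value; escape analysis keeps it on the stack.
func parseVal(b []byte) header {
	return header{off: binary.LittleEndian.Uint32(b), length: binary.LittleEndian.Uint32(b[4:])}
}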

e340658 switches from reflect.DeepEqual to slices.Equal when comparing the child IDs of a Union type, which is functionally equivalent for slices and, again, does not allocate.

9eba6d2 does the same for an arrowflight cookie test, replacing reflect.DeepEqual with maps.Equal; I noticed it while grepping for reflect.DeepEqual.
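
Both commits follow the same pattern; a sketch with illustrative element types (not the actual union or cookie code): the typed stdlib helpers compare element-by-element without boxing their arguments in interfaces, so they don't allocate and are functionally equivalent here.

package sketch

import (
	"maps"
	"slices"
)

// childIDsEqual mirrors the union child-ID comparison: element-by-element,
// no reflection, no allocation.
func childIDsEqual(a, b []int8) bool {
	return slices.Equal(a, b)
}

// cookiesEqual mirrors the arrowflight cookie-map comparison the same way.
func cookiesEqual(a, b map[string]string) bool {
	return maps.Equal(a, b)
}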

7557d15 is probably the most objectionable; it replaces a *float64 with a plain float64 for minSpaceSavings, since the current nil value behaves identically to zero anyway. It's a tiny cleanup IMO, but doesn't really have any practical value.

2d3593f is a bit ugly; it inlines copies of NewReaderFromMessageReader and NewMessageReader into NewReader. By implementing the two directly, it avoids allocating two copies of the config structure and avoids looping over and invoking the config functions twice. It's still not pretty, though, and the // types: make(dictTypeMap) comment that gets copied over suggests there's further room for improvement.

I didn't drop the two APIs, as they are exposed to users, but IMO the performance win is probably worth it, and it can be cleaned up later if needed.
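
Roughly the shape of the before/after, with simplified stand-in types rather than the real ipc signatures: the win is building the config once and filling in both structs directly.

package sketch

import "io"

// Illustrative stand-ins for the ipc package internals.
type config struct{ ensureNativeEndian bool }

type Option func(*config)

type messageReader struct {
	r   io.Reader
	cfg *config
}

type Reader struct {
	mr  *messageReader
	cfg *config
}

func applyOpts(opts []Option) *config {
	cfg := &config{}
	for _, o := range opts {
		o(cfg)
	}
	return cfg
}

// newReaderOld chains the two constructors: two config allocations,
// two passes over the option functions.
func newReaderOld(r io.Reader, opts ...Option) *Reader {
	mr := &messageReader{r: r, cfg: applyOpts(opts)}
	return &Reader{mr: mr, cfg: applyOpts(opts)}
}

// newReaderNew builds the config once and shares it between both structs.
func newReaderNew(r io.Reader, opts ...Option) *Reader {
	cfg := applyOpts(opts)
	return &Reader{mr: &messageReader{r: r, cfg: cfg}, cfg: cfg}
}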

0273ffe is a small optimization to the allocator: the Go allocator branches on each allocation, but both branches are functionally equivalent, so I just merged the two. I doubt this has a major impact, since the cost of a branch is negligible compared to the cost of a heap allocation, but performance is often won through a lot of small improvements IMO :)
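
A sketch of the merged allocation path (the alignment constant and helper shape are illustrative, following the commit message further down): when the buffer is already aligned, shift is 0 and the shifted slice expression reduces to buf[:size:size], so one statement handles both cases.

package sketch

import "unsafe"

const alignment = 64

// allocateAligned is an illustrative version of the merged branch.
func allocateAligned(size int) []byte {
	buf := make([]byte, size+alignment)
	addr := uintptr(unsafe.Pointer(&buf[0]))
	next := (addr + alignment - 1) &^ (alignment - 1) // round up to the alignment
	shift := int(next - addr)                         // 0 when addr is already aligned
	// Previously one branch handled shift == 0 and another shift > 0;
	// this slice expression is correct for both.
	return buf[shift : size+shift : size+shift]
}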

Lastly, for now at least, f1a6a33 statically allocates 4 bytes per IPC messageReader (I did not touch the ArrowFlight implementation), and uses it on every message read, rather than heap-allocating 4 bytes each time Message is called.
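
Roughly the shape of the change (field and method names here are illustrative): the 4-byte length prefix is read into a fixed-size array embedded in the reader, instead of a fresh 4-byte slice heap-allocated on every Message call.

package sketch

import (
	"encoding/binary"
	"io"
)

type messageReader struct {
	r      io.Reader
	lenBuf [4]byte // reused for every message's length prefix; never heap-allocated
}

// readLength reads the next 4-byte prefix without allocating.
func (m *messageReader) readLength() (int32, error) {
	if _, err := io.ReadFull(m.r, m.lenBuf[:]); err != nil {
		return 0, err
	}
	return int32(binary.LittleEndian.Uint32(m.lenBuf[:])), nil
}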

There's more to be done, but this should be a much more practically sized chunk to review, and I can add more later :)

Noam Preil added 8 commits October 29, 2025 21:58
This avoids an unnecessary heap allocation, reducing GC pressure and
improving the latency and throughput of anything that processes messages.
This is a drive-by patch; the test was failing for the other
optimization, and I noticed this when I went to take a look and figured
I'd just send the patch.
This is a bit bloated, but I wasn't sure if the API for NewMessageReader
and NewReaderFromMessageReader could be changed to take in the config
struct directly.

It would be a lot cleaner to expose both APIs with Config directly...

This drops another allocation from every call to NewReader, which - if a
new reader must be created for every API call - can be noticeable...
In branch 1, shift is defined as next-addr. We then slice buf with shift
as the offset, and with size+shift as length and capacity.

In branch 2, addr == next, and thus shift == 0. size+shift == size.

The slice statements are thus equivalent; there's no reason to be
branching here.
@pixelherodev
Contributor Author

There's a lot of variance in the run times for the existing benchmarks for me, but the before and after for allocations are consistent:

# branch 'main'
goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/arrow/ipc
cpu: AMD Ryzen 9 7900X3D 12-Core Processor          
BenchmarkIPC/Writer/codec=plain-24         	 207901	     5824 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 247482	     4812 ns/op	   6132 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3618	   332506 ns/op	2338754 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  40870	    24626 ns/op	  25887 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 112738	    10490 ns/op	  12864 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 133892	     9154 ns/op	  10115 B/op	    124 allocs/op


# branch 'small-optimizations'
BenchmarkIPC/Writer/codec=plain-24         	 168954	     7469 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 259921	     4970 ns/op	   5993 B/op	     76 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3633	   327594 ns/op	2338740 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  40442	    26658 ns/op	  25739 B/op	    187 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 113894	    13151 ns/op	  13098 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 134320	     8952 ns/op	   9951 B/op	    117 allocs/op

@pixelherodev
Contributor Author

I ran go test -bench . -benchmem -benchtime 30s before and after to highlight the variance better without running the test 20 times.

Before:

BenchmarkIPC/Writer/codec=plain-24         	6277569	     6240 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	6876433	     5247 ns/op	   6131 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	 114764	   324831 ns/op	2338736 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	1310571	    28743 ns/op	  25869 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	2930214	    12186 ns/op	  12905 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	3445965	    10790 ns/op	  10174 B/op	    124 allocs/op

After:

BenchmarkIPC/Writer/codec=plain-24         	6302840	     5482 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	7418485	     4689 ns/op	   5995 B/op	     76 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	 102850	   346718 ns/op	2338731 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	1422286	    25704 ns/op	  25742 B/op	    187 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	2921695	    11956 ns/op	  12880 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	3545550	    10179 ns/op	   9986 B/op	    117 allocs/op

Both sets of numbers are a bit worse like this, but more consistent; readers look to be consistently ~5-10% faster.

@zeroshade
Member

Very nice! Thanks for this! Just got a linting issue to deal with before I can merge this 😄

@pixelherodev
Contributor Author

Okay, go fmt'd :)

@pixelherodev
Contributor Author

...so, both Verify jobs passed on the prior run before go fmt. I'm assuming those CI tests are not reliable?

@pixelherodev
Contributor Author

Yeah, the Ubuntu one passed on the second run with no changes, and the macOS image is failing an entirely different test.

The macOS memory leak looks correct, by the way - in that it's leaking intentionally, because the memory is pooled. When the memory is returned to the pool, buffer.ResizeNoShrink is used; since that resizes with shrink=false, Free() is never called.
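
A small sketch of the interaction (the pool wiring is illustrative; ResizeNoShrink is the call mentioned above): the buffer keeps its backing memory when it goes back into the pool, so the checked allocator still counts those bytes as outstanding at the end of the test.

package sketch

import (
	"sync"

	"github.com/apache/arrow-go/v18/arrow/memory"
)

// returnToPool hands a buffer back without releasing its backing memory.
func returnToPool(pool *sync.Pool, buf *memory.Buffer) {
	// ResizeNoShrink sets the length to 0 but keeps the capacity allocated,
	// so Free() is never called and the checked allocator reports a "leak".
	buf.ResizeNoShrink(0)
	pool.Put(buf)
}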

@pixelherodev
Contributor Author

My suspicion is that the test is flaky because of timing: if the GC triggers twice after the memory is returned to the pool but before the checked allocator's check runs, then the pool's cleanup runs, the Buffer's finalizer fires, and the memory gets Release()d.

Using the checked allocator here is going to be very timing dependent, I think.

@pixelherodev
Contributor Author

The buffer pool used in that test has

New: func() interface{} {
  buf := memory.NewResizableBuffer(suite.mem)
  runtime.SetFinalizer(buf, func(obj *memory.Buffer) { obj.Release() })
  return buf
}

This is not reliable. It depends upon the pool getting GCed before the test finishes.

@pixelherodev
Contributor Author

...the test runs runtime.GC() manually twice, but AFAIK there is no guarantee that the GC will run the finalizers on the same pass that frees them.

@pixelherodev
Contributor Author

Moreover, I'm not sure the second run would even free the memory of an individual buffer.

Two GC passes should be sufficient for the pool to become empty - but if the pool only becomes empty during a GC pass, I don't think the objects that are removed will themselves get freed on that same pass?

It's possible a third GC pass would be sufficient... which would also explain why it sometimes works, as it's possible for a GC pass to begin in between the two calls to runtime.GC?

@pixelherodev
Contributor Author

Yeah, reading through SetFinalizer, it's not actually guaranteed that the finalizers will run at all, effectively. They are also explicitly run one at a time, so if there are, say, 100 buffers being cleaned up, they'll be cleaned up serially while runtime.GC finishes and returns.

I'm not even sure how that middleware test can even be flaky tbh.

@pixelherodev
Contributor Author

Pushed a small fix to the arrow flight tests; I didn't have localhost configured locally, so the test failed with a nil dereference on s.lis.Addr() because s.lis was nil.

With localhost configured correctly locally, those tests are passing...

@zeroshade
Member

I'm not even sure how that middleware test can even be flaky tbh.

Yeah, that's the issue; I've looked at it several times and still haven't been able to figure out what is flaky about it.

@pixelherodev
Contributor Author

hahahaha and once again the verification succeeded before running go fmt. 🤞 that it stays that way..

@zeroshade
Member

🤞

Member

@zeroshade zeroshade left a comment


All passed, looks great! Thanks so much for these!

@zeroshade zeroshade merged commit a9b0be4 into apache:main Oct 31, 2025
16 checks passed