Batch of small optimizations #556
Conversation
This avoids an unnecessary heap allocation, reducing GC pressure and improving the latency and throughput of anything that processes messages.
This is a drive-by patch; the test was failing for the other optimization, and I noticed this when I went to take a look and figured I'd just send the patch.
This is a bit bloated, but I wasn't sure whether the API for NewMessageReader and NewReaderFromMessageReader could be changed to take the config struct directly. It would be a lot cleaner to expose both APIs with Config directly... This drops another allocation from every call to NewReader, which can be noticeable if a new reader must be created for every API call.
In branch 1, shift is defined as next-addr. We then slice buf with shift as the offset and size+shift as the length and capacity. In branch 2, addr == next, so shift == 0 and size+shift == size. The slice expressions are thus equivalent; there's no reason to branch here.
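As a sketch of the merged path (the helper names and 64-byte alignment constant here are illustrative, not the exact arrow-go source), a single full-slice expression covers both cases, because shift is 0 whenever the buffer is already aligned:

```go
package main

import (
	"fmt"
	"unsafe"
)

const alignment = 64

func roundUpToMultipleOf64(v int) int {
	return (v + alignment - 1) &^ (alignment - 1)
}

// allocateAligned over-allocates by one alignment unit, then shifts the
// slice start so the first byte is 64-byte aligned. When the allocation
// happens to be aligned already, shift == 0 and the full-slice expression
// reduces to buf[0:size:size] - no branch required.
func allocateAligned(size int) []byte {
	buf := make([]byte, size+alignment)
	addr := int(uintptr(unsafe.Pointer(&buf[0])))
	shift := roundUpToMultipleOf64(addr) - addr // 0 if already aligned
	return buf[shift : size+shift : size+shift]
}

func main() {
	b := allocateAligned(100)
	fmt.Println(len(b), cap(b)) // 100 100
}
```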
Avoids an allocation per Message call.
There's a lot of variance in the run times for the existing benchmarks for me, but the before and after for allocations are consistent:

```
# branch 'main'
goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/arrow/ipc
cpu: AMD Ryzen 9 7900X3D 12-Core Processor
BenchmarkIPC/Writer/codec=plain-24 207901 5824 ns/op 8160 B/op 91 allocs/op
BenchmarkIPC/Reader/codec=plain-24 247482 4812 ns/op 6132 B/op 83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24 3618 332506 ns/op 2338754 B/op 150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24 40870 24626 ns/op 25887 B/op 194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24 112738 10490 ns/op 12864 B/op 134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24 133892 9154 ns/op 10115 B/op 124 allocs/op

# branch 'small-optimizations'
BenchmarkIPC/Writer/codec=plain-24 168954 7469 ns/op 8160 B/op 91 allocs/op
BenchmarkIPC/Reader/codec=plain-24 259921 4970 ns/op 5993 B/op 76 allocs/op
BenchmarkIPC/Writer/codec=zstd-24 3633 327594 ns/op 2338740 B/op 150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24 40442 26658 ns/op 25739 B/op 187 allocs/op
BenchmarkIPC/Writer/codec=lz4-24 113894 13151 ns/op 13098 B/op 134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24 134320 8952 ns/op 9951 B/op 117 allocs/op
```
I ran the benchmarks again. Both numbers are a bit worse like this, but more consistent; readers look to be consistently ~5-10% faster.
Very nice! Thanks for this! Just got a linting issue to deal with before I can merge this 😄
Okay, go fmt'ed :)
...so, both Verify jobs passed on the prior run before go fmt. I'm assuming those CI tests are not reliable?
Yeah, the Ubuntu one passed on the second run with no changes, and the macOS image is failing an entirely different test. The macOS memory leak looks correct, by the way - in that it's leaking intentionally, because the memory is pooled. When returning the memory to the pool, a finalizer is set on the buffer so that it Release()s the memory once the pooled object is eventually collected.
My suspicion is that the test is flaky because of timing: if the GC triggers twice after the memory is returned to the pool, before the checked allocator runs, then the cleanup function runs, the Buffer's finalizer runs, and Release() frees the memory. Using the checked allocator here is going to be very timing-dependent, I think.
The buffer pool used in that test has:

```go
New: func() interface{} {
	buf := memory.NewResizableBuffer(suite.mem)
	runtime.SetFinalizer(buf, func(obj *memory.Buffer) { obj.Release() })
	return buf
}
```

This is not reliable. It depends upon the pool getting GCed before the test finishes.
...the test runs runtime.GC() manually twice, but AFAIK there is no guarantee that the GC will run the finalizers on the same pass that frees them.
Moreover, I'm not sure the second run would even free the memory of an individual buffer. Two GC passes should be sufficient for the pool to become empty - but if the pool becomes empty during a GC pass, I don't think the objects that are removed will themselves be freed on that same pass. It's possible a third GC pass would be sufficient... which would also explain why it sometimes works, since a GC pass can begin in between the two calls to runtime.GC.
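Here's a standalone sketch of that timing dependence (illustrative code, not the test itself): finalizers run on a single dedicated goroutine after the collecting pass, so back-to-back runtime.GC() calls don't guarantee the finalizer has executed by the time they return:

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

type buffer struct{ data []byte }

// allocate creates a buffer whose finalizer records that it ran; the
// buffer becomes unreachable as soon as this function returns.
func allocate(finalized *atomic.Bool) {
	b := &buffer{data: make([]byte, 1<<20)}
	runtime.SetFinalizer(b, func(*buffer) { finalized.Store(true) })
}

func main() {
	var finalized atomic.Bool
	allocate(&finalized)

	runtime.GC() // finds the buffer unreachable and queues its finalizer
	runtime.GC() // a second pass still doesn't wait for the finalizer goroutine
	fmt.Println("after two GCs:", finalized.Load()) // may well be false

	time.Sleep(10 * time.Millisecond) // give the finalizer goroutine a chance
	fmt.Println("after waiting:", finalized.Load()) // eventually true
}
```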
Yeah, reading through SetFinalizer, it's not actually guaranteed that the finalizers will run at all, effectively. They're also explicitly run one at a time, on a single goroutine, so if there are, say, 100 buffers being cleaned up, they'll be finalized serially while runtime.GC finishes and returns. I'm not even sure how that middleware test can even be flaky, tbh.
Pushed a small fix to the arrow flight tests; I didn't have localhost configured correctly. With localhost configured correctly locally, those tests are passing...
Yeah, that's the issue; I've looked at it several times and still haven't been able to figure out what is flaky with it.
hahahaha and once again the verification succeeded before running
🤞
zeroshade left a comment:
All passed, looks great! Thanks so much for these!
This primarily consists of reducing allocations.
e.g. 03be6d3 adds a specific case to TypeEqual for ListView types, which avoids reflect.DeepEqual. l.elem.Equal does not allocate; reflect.DeepEqual does, so this reduces allocations when comparing a ListViewType.
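To see the allocation difference in isolation, here's a toy comparison (the types below are stand-ins, not the actual arrow.DataType implementations); testing.AllocsPerRun makes the difference visible, with the direct method comparison staying allocation-free:

```go
package main

import (
	"fmt"
	"reflect"
	"testing"
)

// Toy stand-ins for arrow.DataType and ListViewType; illustrative only.
type dataType interface{ equal(dataType) bool }

type int32Type struct{}

func (int32Type) equal(o dataType) bool { _, ok := o.(int32Type); return ok }

type listViewType struct{ elem dataType }

func (l listViewType) equal(o dataType) bool {
	other, ok := o.(listViewType)
	return ok && l.elem.equal(other.elem)
}

func main() {
	var a, b dataType = listViewType{elem: int32Type{}}, listViewType{elem: int32Type{}}

	// reflect.DeepEqual walks the values via reflection; the direct
	// method comparison just does a type assertion and recurses.
	deep := testing.AllocsPerRun(1000, func() { _ = reflect.DeepEqual(a, b) })
	direct := testing.AllocsPerRun(1000, func() { _ = a.equal(b) })
	fmt.Printf("reflect.DeepEqual: %v allocs/op, direct equal: %v allocs/op\n", deep, direct)
}
```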
8c7729f changes referencing flatbuf internal types from by-reference to by-value, which avoids letting them escape to the heap.
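The effect is the usual Go escape-analysis one. In this hypothetical sketch (fieldNode stands in for the flatbuf types involved), the pointer-returning version heap-allocates while the by-value version stays on the stack, as go build -gcflags=-m will confirm:

```go
package main

import "fmt"

// fieldNode is a hypothetical stand-in for a small flatbuf struct.
type fieldNode struct{ length, nullCount int64 }

// byPointer forces the struct to escape: the pointed-to value outlives the
// function, so the compiler heap-allocates it ("escapes to heap" in -m output).
func byPointer() *fieldNode { return &fieldNode{length: 1} }

// byValue copies the struct to the caller; no heap allocation is needed.
func byValue() fieldNode { return fieldNode{length: 1} }

func main() {
	fmt.Println(byPointer().length, byValue().length)
}
```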
e340658 switches from reflect.DeepEqual to slices.Equal when comparing the child IDs of a Union type, which is functionally equivalent for slices and, again, does not allocate.
9eba6d2 does the same for an arrowflight cookie test, replacing reflect.DeepEqual with maps.Equal; I did a grep for reflect.DeepEqual and noticed it.
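Both swaps lean on the generic helpers in the standard library's slices and maps packages (Go 1.21+), which compare comparable elements directly, with no boxing and no reflection walk; the values here are made up for illustration:

```go
package main

import (
	"fmt"
	"maps"
	"slices"
)

func main() {
	// Element-by-element comparison of a slice of comparable values.
	childIDs := []int{0, 1, 2}
	fmt.Println(slices.Equal(childIDs, []int{0, 1, 2})) // true

	// Key/value comparison of maps with comparable values.
	cookies := map[string]string{"session": "abc123"}
	fmt.Println(maps.Equal(cookies, map[string]string{"session": "abc123"})) // true
}
```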
7557d15 is probably the most objectionable; it replaces a *float64 with a plain float64 for minSpaceSavings, since the current nil value behaves identically to zero anyway. It's a tiny cleanup IMO, but doesn't really have any practical value.
2d3593f is a bit ugly; it inlines a copy of NewReaderFromMessageReader and a copy of NewMessageReader into NewReader. By implementing the two directly, it avoids allocating two copies of the config structure and avoids looping over and invoking the config functions twice. It's kinda ugly, though, and the // types: make(dictTypeMap) comment that gets copied over suggests there's further room for improvement. I didn't drop the two APIs, as they are exposed to users, but IMO the performance win is probably worth it, and it can be cleaned up later if needed.
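To make the shape of that change concrete, here's a simplified sketch of the pattern (names and fields are illustrative, not the actual ipc internals): each constructor in the old chain built its own config and re-applied the option functions, so inlining saves one config allocation and one pass over the options:

```go
package main

import "fmt"

// config and Option are simplified stand-ins for the ipc package's
// internal config struct and its functional options.
type config struct {
	footer bool
	types  map[int64]string // stand-in for the dictTypeMap field
}

type Option func(*config)

func WithFooter() Option { return func(c *config) { c.footer = true } }

func newConfig(opts ...Option) *config {
	cfg := &config{types: make(map[int64]string)}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

// newReaderChained mirrors the old shape: two helpers each build their own
// config, so two configs are allocated and the options run twice.
func newReaderChained(opts ...Option) (*config, *config) {
	msgCfg := newConfig(opts...) // what NewMessageReader did
	rdrCfg := newConfig(opts...) // what NewReaderFromMessageReader did
	return msgCfg, rdrCfg
}

// newReaderInlined mirrors the new shape: one config, options applied once.
func newReaderInlined(opts ...Option) *config {
	return newConfig(opts...)
}

func main() {
	a, b := newReaderChained(WithFooter())
	c := newReaderInlined(WithFooter())
	fmt.Println(a.footer, b.footer, c.footer) // true true true
}
```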
0273ffe is a small optimization to the allocator: the Go allocator branches on each allocation, but both branches are functionally equivalent, so I just merged the two. I doubt this will have a major impact, since the cost of a branch is negligible compared to the cost of a heap allocation, but performance is often won through a lot of small wins IMO :)
Lastly, for now at least, f1a6a33 statically allocates 4 bytes per IPC messageReader (I did not touch the Arrow Flight implementation) and reuses them on every message read, rather than heap-allocating 4 bytes each time Message is called.
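A sketch of the scratch-buffer idea (simplified; field and method names here are illustrative, not the actual messageReader): the 4-byte length prefix is read into a fixed array stored on the reader, so there is no per-call make([]byte, 4):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type messageReader struct {
	r       io.Reader
	sizeBuf [4]byte // lives inside the struct: no per-call heap allocation
}

// readLength reuses the embedded scratch array for every length prefix
// instead of allocating a fresh 4-byte slice on each call.
func (m *messageReader) readLength() (int32, error) {
	if _, err := io.ReadFull(m.r, m.sizeBuf[:]); err != nil {
		return 0, err
	}
	return int32(binary.LittleEndian.Uint32(m.sizeBuf[:])), nil
}

func main() {
	var raw [4]byte
	binary.LittleEndian.PutUint32(raw[:], 128)
	mr := &messageReader{r: bytes.NewReader(raw[:])}
	n, err := mr.readLength()
	fmt.Println(n, err) // 128 <nil>
}
```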
There's more to be done, but this should be a much more practically sized chunk to review, and I can add more later :)