
fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

@ericonr

Description

EDIT: Ah, happy new year!

Describe the bug you encountered:

When comparing the speed of fd with musl vs. glibc, the two are very comparable with -j1, but once the thread count goes above a certain threshold, musl sees diminishing returns, to the point that with -j12, on a 6-core system (with hyperthreading), it is actually slower than with -j4.

With fd + musl:

~ ➜ time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    3.17 secs   fish           external
   usr time    1.81 secs    0.00 micros    1.81 secs
   sys time    1.36 secs  599.00 micros    1.36 secs

~ took 3s ➜ time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.11 secs   fish           external
   usr time    5.40 secs  621.00 micros    5.40 secs
   sys time    2.47 secs   46.00 micros    2.47 secs

~ took 2s ➜ time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.62 secs   fish           external
   usr time   11.58 secs  476.00 micros   11.58 secs
   sys time    7.39 secs   35.00 micros    7.39 secs

~ took 2s ➜ time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.98 secs   fish           external
   usr time   16.22 secs    0.00 micros   16.22 secs
   sys time   15.78 secs  559.00 micros   15.78 secs

With fd + musl + pre-loaded jemalloc:

~ took 2s ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.75 secs   fish           external
   usr time    1.43 secs    0.00 micros    1.43 secs
   sys time    1.32 secs  913.00 micros    1.32 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  790.87 millis    fish           external
   usr time    1.75 secs  930.00 micros    1.75 secs
   sys time    1.37 secs    0.00 micros    1.37 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  552.14 millis    fish           external
   usr time    2.40 secs    0.00 micros    2.40 secs
   sys time    1.89 secs  882.00 micros    1.89 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  539.76 millis    fish           external
   usr time    3.42 secs  828.00 micros    3.42 secs
   sys time    2.76 secs    0.00 micros    2.76 secs

With fd + glibc:

[chroot] /home/ericonr > time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2,73 secs   fish           external
   usr time    1,28 secs  488,00 micros    1,28 secs
   sys time    1,35 secs  141,00 micros    1,35 secs

[chroot] /home/ericonr > time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  854,58 millis    fish           external
   usr time    1,75 secs    0,00 micros    1,75 secs
   sys time    1,61 secs  585,00 micros    1,61 secs

[chroot] /home/ericonr > time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  601,37 millis    fish           external
   usr time    2,25 secs  439,00 micros    2,25 secs
   sys time    2,35 secs  144,00 micros    2,35 secs

[chroot] /home/ericonr > time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  558,01 millis    fish           external
   usr time    3,10 secs  612,00 micros    3,10 secs
   sys time    3,15 secs    0,00 micros    3,15 secs

With perf, I can (apparently) track the issue down to musl's __lock:

  Children      Self  Command  Shared Object     Symbol
+   29.83%    29.58%  fd       libc.so           [.] __lock
+   27.29%     0.00%  fd       libc.so           [.] a_cas (inlined)

This matches what was observed: with more threads you get more lock contention, and the program can end up spending far too much time waiting on locks. I'd expect a similar effect to be observable with hardened_malloc from GrapheneOS and possibly OpenBSD's malloc. Theoretically speaking, Rust might not require a hardened malloc, but you usually don't want to carry your own malloc for each application either, so it'd be nice if this could perform better "out of the box", simply by making fewer allocations (which helps by removing most of the dependency on malloc performance). That said, I don't know how feasible fixing this is.

Describe what you expected to happen:

I expected all versions of the test to scale similarly in execution speed as the thread count increases. Note that this runs in my ~/ directory, because otherwise execution is short enough that you can't spot the issue (the numbers with a tarball of the Linux kernel were too close).

What version of fd are you using?

fd 8.2.1

Which operating system / distribution are you on?

Linux (kernel 5.10.2), with musl 1.2.1

glibc used for testing was 2.30

jemalloc used for LD_PRELOAD was 5.2.1
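Since the LD_PRELOAD runs above show jemalloc resolving the contention, one option for a Rust binary is to bundle jemalloc as the global allocator instead of relying on the libc allocator. This is a hedged sketch only, using the jemallocator crate; the crate choice and version are assumptions on my part, not something fd is confirmed to use:

```rust
// Cargo.toml (sketch):
// [dependencies]
// jemallocator = "0.3"

// Route all Rust heap allocations through jemalloc, whose per-thread
// arenas avoid a single global malloc lock like musl's.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn main() {
    // Allocations below now go through jemalloc rather than libc malloc.
    let v: Vec<u32> = (0..1000).collect();
    println!("allocated {} elements", v.len());
}
```

The trade-off is exactly the one mentioned above: the binary now carries its own allocator, which hardened distributions may not want.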
