Skip to content

scanner: fix heredoc serialization buffer overflows#287

Merged
amaanq merged 1 commit intotree-sitter:masterfrom
look:look/fix-heredoc-crash
Mar 10, 2026
Merged

scanner: fix heredoc serialization buffer overflows#287
amaanq merged 1 commit intotree-sitter:masterfrom
look:look/fix-heredoc-crash

Conversation

@look
Copy link
Contributor

@look look commented Feb 10, 2026

I identified a crash in GitHub's symbol extraction system related to tree-sitter-ruby. Investigation revealed it was caused by very long HEREDOC identifiers (these are legal in Ruby, I checked...).

The heredoc word length was serialized as a single byte, which silently truncated identifiers of >= 256 characters. The deserializer would then read the wrong length, leaving unread bytes in the buffer, and hit the assert(size == length) check.

To fix, I adopted the approach used in tree-sitter-bash and tree-sitter-php: store the HEREDOC identifier length in a uint32_t.

This PR includes a reproduction test. If you run tree-sitter test on master with test/corpus/literals.txt you will get:

Assertion failed: (size == length), function deserialize, file scanner.c, line 160.
Abort trap: 6              tree-sitter test

On this branch, it is fixed.

Disclaimer: I used GitHub Copilot to help diagnose and fix this bug.

The heredoc word length was serialized as a single byte, which silently
truncated identifiers of >= 256 characters. The deserializer would then
read the wrong length, leaving unread bytes in the buffer, and hit the
assert(size == length) check.
for (uint32_t i = 0; i < scanner->open_heredocs.size; i++) {
Heredoc *heredoc = array_get(&scanner->open_heredocs, i);
if (size + 2 + heredoc->word.size >= TREE_SITTER_SERIALIZATION_BUFFER_SIZE) {
if (size + 3 + sizeof(uint32_t) + heredoc->word.size >= TREE_SITTER_SERIALIZATION_BUFFER_SIZE) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 2 is incorrect in the old code: we write 3 bools below, not 2.

@look
Copy link
Contributor Author

look commented Mar 5, 2026

@amaanq any feedback on this PR? I'd love to get it merged as the bug is causing crashes in GitHub's extraction pipeline.

@WillLillis
Copy link
Member

WillLillis commented Mar 8, 2026

@look Sorry this slipped through the cracks! I'll try to get CI passing so we can take a look soon. Thanks!

@WillLillis WillLillis force-pushed the look/fix-heredoc-crash branch from eb2f0bb to 66e964b Compare March 8, 2026 07:20
@look
Copy link
Contributor Author

look commented Mar 9, 2026

@WillLillis thanks for taking a look! I also tried to fix CI in #288 -- you've made the equivalent change in this branch so that PR can be closed.

@WillLillis WillLillis force-pushed the look/fix-heredoc-crash branch from f5090ba to de8da6e Compare March 10, 2026 06:07
@WillLillis
Copy link
Member

We'll address the CI issues a little later, that's blocked by fixes to several of the bindings repos. This fix is fine to go in now. (Passed in CI, and I also verified locally)

@amaanq
Copy link
Member

amaanq commented Mar 10, 2026

Thanks @look, and sorry for the delay (also appreciate the disclaimer a lot :) )

@amaanq amaanq changed the title fix: crash when parsing heredocs with identifiers >= 256 chars scanner: fix heredoc serialization buffer overflows Mar 10, 2026
@amaanq amaanq merged commit ad907a6 into tree-sitter:master Mar 10, 2026
1 of 4 checks passed
@look
Copy link
Contributor Author

look commented Mar 11, 2026

No problem. Thanks for accepting the contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants