Skip to content

Conversation

@fanfuxiaoran
Copy link
Contributor

In CI pipeline there were occassional test failures due to ORCA fallback with following stacktrace.

```
+INFO:  GPORCA failed to produce a plan, falling back to planner
+DETAIL:  CSyncHashtable.h:109: Failed assertion: IsValid(key) && "Invalid key is inaccessible"
+Stack trace:
+1    gpos::CException::Raise + 227
+2    <symbol not found> + 15235666
+3    gpos::CMemoryPoolManager::CreateMemoryPool + 653
+4    gpos::CAutoMemoryPool::CAutoMemoryPool + 34
+5    gpopt::CColumnFactory::CColumnFactory + 80
+6    gpopt::COptCtxt::PoctxtCreate + 77
+7    gpopt::CAutoOptCtxt::CAutoOptCtxt + 54
+8    gpopt::COptimizer::PdxlnOptimize + 411
+9    COptTasks::OptimizeTask + 850
+10   gpos::CTask::Execute + 52
+11   gpos::CWorker::Execute + 36
+12   gpos::CAutoTaskProxy::Execute + 97
+13   gpos_exec + 557
```

Core dump of failure showed CMemoryPool::m_hash_key had invalid key value 0xffffffff. Hence, the query raised an assertion error and fell back to PLANNER.

Issue is that CMemoryPool::m_hash_key was never directly initialized. This suggests that it was using uninitialized memory to produce randomness in the key. When that memory contains 0xffffffff in just the right place, then the value of the CMemoryPool::m_hash_key is an invalid key and ORCA falls back.

Following is patch that demonstrates the issue:
```
diff src/backend/utils/mmgr/aset.c
@@ -989,6 +989,8 @@ AllocSetAlloc(MemoryContext context, Size size)

                MEMORY_ACCOUNT_INC_ALLOCATED(set, chunk->size);

+               memset((char *) AllocChunkGetPointer(chunk), 0xFFFFFFFF, size);
+
                return AllocChunkGetPointer(chunk);
        }
```

A few lines above that patch, you can see that when compiled with RANDOMIZE_ALLOCATED_MEMORY the memory is randomly initialied. So we can make no assumptions about the uninitialied memory; meaning that 0xffffff is valid.

Note: Seemed this failure manifested more commonly with JIT ICW runs. (cherry picked from commit 2c7152f46aced9328d86dc1025d0395fcf467455)

fix #ISSUE_Number


Change logs

Describe your change clearly, including what problem is being solved or what feature is being added.

If it has some breaking backward or forward compatibility, please clary.

Why are the changes needed?

Describe why the changes are necessary.

Does this PR introduce any user-facing change?

If yes, please clarify the previous behavior and the change this PR proposes.

How was this patch tested?

Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.

Contributor's Checklist

Here are some reminders and checklists before/when submitting your pull request, please check them:

  • Make sure your Pull Request has a clear title and commit message. You can take git-commit template as a reference.
  • Sign the Contributor License Agreement as prompted for your first-time contribution(One-time setup).
  • Learn the coding contribution guide, including our code conventions, workflow and more.
  • List your communication in the GitHub Issues or Discussions (if has or needed).
  • Document changes.
  • Add tests for the change
  • Pass make installcheck
  • Pass make -C src/test installcheck-cbdb-parallel
  • Feel free to request cloudberrydb/dev team for review and approval when your PR is ready🥳

In CI pipeline there were occassional test failures due to ORCA fallback
with following stacktrace.

    ```
    +INFO:  GPORCA failed to produce a plan, falling back to planner
    +DETAIL:  CSyncHashtable.h:109: Failed assertion: IsValid(key) && "Invalid key is inaccessible"
    +Stack trace:
    +1    gpos::CException::Raise + 227
    +2    <symbol not found> + 15235666
    +3    gpos::CMemoryPoolManager::CreateMemoryPool + 653
    +4    gpos::CAutoMemoryPool::CAutoMemoryPool + 34
    +5    gpopt::CColumnFactory::CColumnFactory + 80
    +6    gpopt::COptCtxt::PoctxtCreate + 77
    +7    gpopt::CAutoOptCtxt::CAutoOptCtxt + 54
    +8    gpopt::COptimizer::PdxlnOptimize + 411
    +9    COptTasks::OptimizeTask + 850
    +10   gpos::CTask::Execute + 52
    +11   gpos::CWorker::Execute + 36
    +12   gpos::CAutoTaskProxy::Execute + 97
    +13   gpos_exec + 557
    ```

Core dump of failure showed CMemoryPool::m_hash_key had invalid key
value 0xffffffff. Hence, the query raised an assertion error and fell
back to PLANNER.

Issue is that CMemoryPool::m_hash_key was never directly initialized.
This suggests that it was using uninitialized memory to produce
randomness in the key. When that memory contains 0xffffffff in just the
right place, then the value of the CMemoryPool::m_hash_key is an invalid
key and ORCA falls back.

Following is patch that demonstrates the issue:
    ```
    diff src/backend/utils/mmgr/aset.c
    @@ -989,6 +989,8 @@ AllocSetAlloc(MemoryContext context, Size size)

                    MEMORY_ACCOUNT_INC_ALLOCATED(set, chunk->size);

    +               memset((char *) AllocChunkGetPointer(chunk), 0xFFFFFFFF, size);
    +
                    return AllocChunkGetPointer(chunk);
            }
    ```

A few lines above that patch, you can see that when compiled with
RANDOMIZE_ALLOCATED_MEMORY the memory is randomly initialied. So we can
make no assumptions about the uninitialied memory; meaning that 0xffffff
is valid.

Note: Seemed this failure manifested more commonly with JIT ICW runs.
(cherry picked from commit 2c7152f46aced9328d86dc1025d0395fcf467455)
@fanfuxiaoran fanfuxiaoran requested a review from my-ship-it May 15, 2024 06:58
@CLAassistant
Copy link

CLAassistant commented May 15, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@fanfuxiaoran fanfuxiaoran added the cherry-pick cherry-pick upstream commts label May 15, 2024
@my-ship-it
Copy link
Contributor

Seems icw-parallel-test still fail, so need to fix

@fanfuxiaoran
Copy link
Contributor Author

Seems icw-parallel-test still fail, so need to fix

Done

@my-ship-it my-ship-it merged commit 9a63039 into apache:main May 16, 2024
@fanfuxiaoran fanfuxiaoran deleted the fix_orca_crash branch July 2, 2024 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick cherry-pick upstream commts

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants