
Modernize dependencies and CI infrastructure + minor refactor#218

Merged
Mbompr merged 41 commits into criteo:master from satishlokkoju:master on Nov 4, 2025

Conversation

@satishlokkoju
Contributor

This PR includes all the changes from PR #216, along with fixes for the transient test failures.

rom1504 and others added 30 commits August 9, 2025 22:57
- Add Python 3.12 support, drop Python 3.6/3.7
- Update dependencies: numpy <3, pyarrow >=16.0.0, fire <0.7.0
- Upgrade GitHub Actions: ubuntu-22.04, actions v2→v4
- Sync pyspark version in Makefile with requirements-test.txt

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Keep pyarrow <16 to maintain compatibility with embedding_reader dependency
while still supporting newer versions than before.

Add explicit numpy constraint to prevent numpy 2.x conflicts with
pyarrow and embedding_reader dependencies in PEX builds.

Allow all Python versions to complete testing even if one fails,
providing better visibility into compatibility across versions.

Move from Python 3.8 to 3.10 for releases to align with
supported Python versions and ensure better compatibility.

PySpark requires Java runtime. Set up Java 17 (LTS) using Temurin
distribution for compatibility with PySpark dependencies.
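As a sketch, this kind of CI step typically uses `actions/setup-java` with the Temurin distribution; the step name and placement within the job are assumptions, not the PR's exact workflow:

```yaml
# Illustrative GitHub Actions step (names/ordering are assumptions);
# actions/setup-java with Temurin is the standard way to provide the
# Java runtime PySpark requires.
- name: Set up Java 17 (Temurin)
  uses: actions/setup-java@v4
  with:
    distribution: temurin
    java-version: "17"
```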

- pyarrow: >=6.0.1,<16 → >=16.0.0,<18 (modern version)
- embedding_reader: >=1.5.1,<2 → >=1.8.0,<2 (supports new pyarrow)

Tested and confirmed all functionality works with updated dependencies.

Change pyarrow constraint from >=16.0.0,<18 to >=6.0.1,<30
to allow broader compatibility with different environments while
maintaining support for modern versions.

- Add NumpyEncoder class to handle serialization of numpy types
- Update json.dump calls to use the custom encoder
- Resolves "TypeError: Object of type float32 is not JSON serializable"
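A minimal sketch of the kind of encoder this describes (the class body is illustrative, not the PR's actual code; it follows the standard `json.JSONEncoder.default` override pattern):

```python
# Hypothetical NumPy-aware JSON encoder, mirroring the fix described above.
# Parameter name `o` matches the parent json.JSONEncoder.default signature.
import json

import numpy as np


class NumpyEncoder(json.JSONEncoder):
    """Convert NumPy scalars and arrays to plain Python types for json.dump."""

    def default(self, o):
        if isinstance(o, np.integer):
            return int(o)
        if isinstance(o, np.floating):
            return float(o)
        if isinstance(o, np.ndarray):
            return o.tolist()
        return super().default(o)
```

Usage would then be `json.dump(metrics, f, cls=NumpyEncoder)`, which avoids the `float32` serialization TypeError.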

- Add explicit float() conversions for NumPy scalars to fix mypy type errors
- Fix NumpyEncoder parameter naming to match parent class
- Update JSON encoder to use modern super() syntax
- These changes ensure compatibility with NumPy 2.x while maintaining backward compatibility

- Drop Python 3.8 and 3.9 support, require Python ≥3.10
- Upgrade PySpark from 3.2.2 to 4.x for Java 17 compatibility
- Update CI matrix to test Python 3.10, 3.11, 3.12 only
- This resolves Java 17 module system compatibility issues with PySpark
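The resulting test matrix can be sketched as follows (file layout and step names are assumptions; the version list and `fail-fast: false` behavior come from the commits above):

```yaml
# Illustrative CI matrix; exact workflow structure is an assumption.
jobs:
  test:
    runs-on: ubuntu-22.04
    strategy:
      fail-fast: false          # let every Python version finish even if one fails
      matrix:
        python-version: ["3.10", "3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
```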

- Update lint job to use Python 3.10 instead of 3.8 (dropped support)
- Fix PEX build shell escaping for PySpark version constraint
- Both issues were caused by PySpark 4.x compatibility requirements

- Add validation checks before faiss.merge_into() operations
- Validate index compatibility (nlist, dimensions, ntotal)
- Add proper error handling and logging for merge failures
- Fixes potential race conditions causing "Invalid key" FAISS exceptions
- Both test and production distributed merging code improved

This addresses CI-specific test failures while maintaining local functionality.
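The validation described above can be sketched independently of FAISS itself. The attribute names (`nlist`, `d`, `ntotal`) match FAISS IVF index fields, but the function below is an illustrative sketch, not the PR's actual code:

```python
# Hypothetical pre-merge compatibility check mirroring the validations
# described above; works on any objects exposing nlist, d, and ntotal.
import logging

logger = logging.getLogger(__name__)


def can_merge(target, source) -> bool:
    """Return True if `source` looks safe to merge into `target`."""
    if target.nlist != source.nlist:
        logger.error("nlist mismatch: %d vs %d", target.nlist, source.nlist)
        return False
    if target.d != source.d:
        logger.error("dimension mismatch: %d vs %d", target.d, source.d)
        return False
    if source.ntotal < 0:
        logger.error("invalid ntotal: %d", source.ntotal)
        return False
    return True
```

A caller would run this check before invoking `faiss.merge_into()` and skip or fail loudly on mismatch instead of triggering an "Invalid key" exception inside FAISS.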

Fix string quote style to match black's formatting requirements after
FAISS robustness improvements.

Fix documentation CI failure by updating Python version from 3.9 to 3.10
to match the new minimum Python requirement after dependency modernization.

The distributed test was failing in CI due to memory corruption when using
NumPy 2.x with PySpark 4.0 and FAISS 1.11. The error manifested as:
'Invalid key=94143314170815 nlist=1' where the large key appears to be
a corrupted 64-bit memory address (0x559f72cc93bf).

Pinning to numpy<2 (1.26.4) resolves the memory management incompatibility
between NumPy 2.x's new memory layout and FAISS/PySpark serialization.

This is a temporary fix until the upstream compatibility issues are resolved.

Add detailed logging to understand CI failures:
- Version information (Python, NumPy, FAISS, PySpark)
- Distributed build success/failure tracking
- Search operation results and shapes
- Special detection for empty search results (PySpark worker failure)

This will help diagnose the Python 3.10 CI failure where search returns
empty results due to PySpark worker crashes.

Add comprehensive logging to pinpoint the exact crash location:
- _merge_to_n_indices: Entry parameters, batch creation, RDD operations
- _merge_index: File processing, FAISS operations in each worker
- _merge_from_local: Individual FAISS read_index and merge_into calls

This targets the actual crash point in distributed.py:246 where
metrics_rdd.collect() fails with PySpark worker crashes in Python 3.10 CI.

Convert all debugging print() calls to proper logger.debug/error calls:
- Better integration with existing logging infrastructure
- Avoids lint issues (trailing whitespace, import outside toplevel)
- Follows logging best practices with parameterized messages
- Maintains same debugging capability with proper log levels

This resolves lint failures while preserving debugging functionality
for future PySpark/FAISS compatibility issues.

- Move traceback import to module level to avoid 'import-outside-toplevel'
- Remove trailing whitespace
- Organize imports properly
- Achieve 10.00/10 pylint score

All debugging functionality preserved while maintaining code quality standards.

Previous crashes were fixed by reverting problematic validation code.
Now testing if NumPy 2.x works with our logging infrastructure in place.

NumPy 2.x still causes PySpark worker crashes in CI. Reverting to
numpy<2 which is known to work. Also cleaned up all debugging logs
that were added for troubleshooting.

@satishlokkoju
Contributor Author

@rom1504 Can you take a look at this?

@rom1504
Contributor

rom1504 commented Oct 7, 2025

Since you changed the kind of index, I think you just need to change the test assumptions.

@satishlokkoju
Contributor Author

satishlokkoju commented Oct 9, 2025

@rom1504 fixed the test cases. Variance in the nprobe hyperparameter was causing transient test failures.
I’ve now forced nprobe via the input_param argument, making the test cases deterministic.
Let me know if you have any other comments.

@rom1504
Contributor

rom1504 commented Oct 9, 2025

Nice! I think this is ready

@rom1504
Contributor

rom1504 commented Oct 9, 2025

@Mbompr could you please review or send to other folks to review? Looks like I don't have write access to autofaiss anymore

@satishlokkoju
Contributor Author

@hitchhicker Could you review this, or point me to someone who can?

@hitchhicker
Contributor

@satishlokkoju Sorry, I can't help, as I don't have write access to it either.

@satishlokkoju satishlokkoju marked this pull request as draft October 18, 2025 16:19
@satishlokkoju satishlokkoju deleted the branch criteo:master October 18, 2025 16:26
@satishlokkoju satishlokkoju deleted the master branch October 18, 2025 16:26
@satishlokkoju satishlokkoju restored the master branch October 18, 2025 16:27
@rom1504
Contributor

rom1504 commented Oct 18, 2025

Can you please keep this open? We will find someone that can merge.

@Mbompr
Contributor

Mbompr commented Oct 19, 2025

Hello,
Sorry for the late answer. I strangely don't have rights either, but will figure out how to get them. I'll tell you as soon as it's ready.

@Mbompr Mbompr reopened this Oct 19, 2025
@Mbompr Mbompr marked this pull request as ready for review October 19, 2025 19:52
Comment thread on requirements-test.txt:

pytest==8.0.1
-pyspark==3.2.2; python_version < "3.11"
-pyspark<3.6.0; python_version >= "3.11"
+pyspark>=4.0.1,<5.0.0
Contributor


We sadly need to ensure compatibility with PySpark 3.4; do you think it would break anything?

Contributor


I think it'd be fine to change this to >=3.4,<5.0.0.

Contributor


Actually this is only for tests, it should not matter much.

@Mbompr Mbompr merged commit 131fc2a into criteo:master Nov 4, 2025
10 checks passed


6 participants