fix: add oov_method parameter to IntegerLookup for uniform OOV bucketing #22327

Merged
jeffcarp merged 2 commits into keras-team:master from tanguyguyot:master
Mar 5, 2026

Conversation

@tanguyguyot
Contributor

Summary

  • Adds an oov_method parameter ("floormod" | "farmhash") to IntegerLookup and the base IndexLookup class. Users can opt into FarmHash64-based OOV bucketing for integer inputs instead of the default floormod behavior, matching what the Hashing layer does.
  • Backward compatibility: the default is oov_method="floormod", which preserves existing behavior exactly.
  • Test files added.
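
To make the backward-compatibility point concrete, here is a minimal pure-Python sketch of the constructor behavior described above. It is not the Keras code: the class name LookupConfigSketch and the error message are hypothetical, and only the parameter name, its two allowed values, and the "floormod" default come from this PR.

```python
VALID_OOV_METHODS = ("floormod", "farmhash")


class LookupConfigSketch:
    """Hypothetical stand-in for a lookup layer's constructor logic."""

    def __init__(self, num_oov_indices=1, oov_method="floormod"):
        # Reject anything outside the two options named in the PR.
        if oov_method not in VALID_OOV_METHODS:
            raise ValueError(
                f"`oov_method` must be one of {VALID_OOV_METHODS}. "
                f"Received: oov_method={oov_method!r}"
            )
        self.num_oov_indices = num_oov_indices
        self.oov_method = oov_method

    def get_config(self):
        # Serializing oov_method lets a restored layer keep its bucketing.
        return {
            "num_oov_indices": self.num_oov_indices,
            "oov_method": self.oov_method,
        }

    @classmethod
    def from_config(cls, config):
        return cls(**config)


# Callers that never pass oov_method keep the old behavior.
default_layer = LookupConfigSketch(num_oov_indices=4)

# Opting in is explicit and survives a config round trip.
hashed_layer = LookupConfigSketch(num_oov_indices=4, oov_method="farmhash")
restored = LookupConfigSketch.from_config(hashed_layer.get_config())

print(default_layer.oov_method, restored.oov_method)
```

Because the default stays "floormod", existing call sites and previously saved configs that lack the key would behave as before under this sketch.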

Testing

Ran pytest keras/src/layers/preprocessing locally.

Fixes #22325

@google-cla

google-cla Bot commented Mar 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the IntegerLookup layer by introducing a new oov_method parameter, which provides a more robust and uniform way to handle out-of-vocabulary integer tokens. By offering a FarmHash64-based bucketing strategy alongside the existing modulo operation, it resolves potential data distribution biases, particularly when input IDs exhibit arithmetic patterns. This change improves the reliability of OOV token assignment while maintaining backward compatibility.
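
The bias from arithmetic patterns can be reproduced without Keras. The sketch below is illustrative only: it uses BLAKE2b from hashlib as a stand-in hash (an assumption, since FarmHash64 is not in the Python standard library), and the ID range and bucket count are made up for the demonstration.

```python
import hashlib
from collections import Counter

NUM_OOV_BUCKETS = 4


def floormod_bucket(value, num_buckets):
    # Python's % on ints is a floor modulo, like the default behavior.
    return value % num_buckets


def hashed_bucket(value, num_buckets):
    # Stand-in hash (BLAKE2b), NOT FarmHash64. The bucket numbers differ
    # from what Keras would produce; only the spreading effect is shown.
    digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


# Hypothetical OOV IDs with stride 4, e.g. IDs allocated in blocks of four.
oov_ids = [1000 + 4 * i for i in range(1000)]

floormod_counts = Counter(floormod_bucket(v, NUM_OOV_BUCKETS) for v in oov_ids)
hashed_counts = Counter(hashed_bucket(v, NUM_OOV_BUCKETS) for v in oov_ids)

print("floormod:", dict(floormod_counts))
print("hashed:  ", dict(sorted(hashed_counts.items())))
```

With a stride equal to the bucket count, every ID has the same remainder, so floormod sends all 1000 IDs to a single bucket, while the hashed variant uses all four.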

Highlights

  • New oov_method parameter: Introduced oov_method to IntegerLookup and IndexLookup classes, allowing selection between "floormod" and "farmhash" for out-of-vocabulary (OOV) bucketing.
  • Improved OOV bucketing: The "farmhash" method uses FarmHash64 to uniformly distribute integer OOV tokens, addressing potential bucket imbalance issues seen with the default "floormod" method.
  • Backward compatibility: The default oov_method remains "floormod" to ensure existing behavior is preserved.
  • Comprehensive testing: New test cases were added to validate the functionality and behavior of the oov_method parameter, including scenarios for integer and string inputs, and invalid parameter values.

Changelog
  • keras/src/layers/preprocessing/index_lookup.py
    • Added oov_method parameter to __init__ with "floormod" as default.
    • Included oov_method in get_config for serialization.
    • Implemented conditional logic in _lookup_dense to use tf.math.floormod or tf.strings.to_hash_bucket_fast based on oov_method for integer inputs.
    • Added validation for oov_method parameter.
    • Updated docstrings to describe the new oov_method parameter and its options.
  • keras/src/layers/preprocessing/index_lookup_test.py
    • Added test_oov_method_ignored_for_string_dtype to verify that oov_method is not applied to string inputs, which always use FarmHash64.
  • keras/src/layers/preprocessing/integer_lookup.py
    • Updated docstrings for num_oov_indices and added a new docstring for oov_method.
    • Modified examples to demonstrate the uniform OOV distribution achieved with oov_method="farmhash".
    • Passed the new oov_method parameter to the super().__init__ call.
  • keras/src/layers/preprocessing/integer_lookup_test.py
    • Added test_oov_method_farmhash to verify correct OOV bucketing with FarmHash.
    • Added test_oov_method_invalid_value to ensure ValueError is raised for unsupported oov_method values.
    • Added test_oov_method_ignored_when_single_oov_index to confirm oov_method has no effect when num_oov_indices is 1.
    • Added test_oov_method_farmhash_output_is_correct to assert the deterministic output of FarmHash64 for specific inputs.
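
The dispatch rules that the changelog and tests describe can be sketched in plain Python as follows. This is an approximation under stated assumptions, not the code in index_lookup.py: oov_bucket and _stand_in_hash64 are hypothetical names, BLAKE2b stands in for FarmHash64, and converting integers to strings before hashing is assumed because tf.strings.to_hash_bucket_fast operates on strings.

```python
import hashlib


def _stand_in_hash64(text):
    # Stand-in for FarmHash64 (assumption): a 64-bit BLAKE2b digest.
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def oov_bucket(token, num_oov_indices, oov_method="floormod"):
    """Return an OOV bucket in [0, num_oov_indices) for an unknown token."""
    if oov_method not in ("floormod", "farmhash"):
        raise ValueError(f"Unsupported oov_method: {oov_method!r}")
    if isinstance(token, str):
        # String tokens are always hashed; oov_method is ignored for them.
        return _stand_in_hash64(token) % num_oov_indices
    if oov_method == "floormod":
        # Default path for integers: floor modulo of the raw value.
        return token % num_oov_indices
    # Opt-in path for integers: hash the decimal string form.
    return _stand_in_hash64(str(token)) % num_oov_indices


# With a single OOV index both methods necessarily agree.
print(oov_bucket(7, 1, "floormod"), oov_bucket(7, 1, "farmhash"))

# For strings the method makes no difference.
print(oov_bucket("cat", 4, "floormod") == oov_bucket("cat", 4, "farmhash"))
```

The single-index case needs no special handling in this sketch: any value modulo 1 is 0, which matches the test asserting that oov_method has no effect when num_oov_indices is 1.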
Activity
  • No human activity (comments, reviews, etc.) has been recorded for this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an oov_method parameter to the IntegerLookup and base IndexLookup layers, providing an option for FarmHash64-based out-of-vocabulary (OOV) bucketing. This is a valuable addition that addresses potential non-uniformity in OOV token distribution from the default modulo-based method. The implementation is sound, maintains backward compatibility, and is well-supported by updated documentation and comprehensive tests. I have one minor suggestion to improve the clarity of the docstring in IntegerLookup.

@codecov-commenter

codecov-commenter commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.94%. Comparing base (4f85917) to head (08eaa4c).
⚠️ Report is 34 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #22327      +/-   ##
==========================================
+ Coverage   82.73%   82.94%   +0.20%     
==========================================
  Files         594      595       +1     
  Lines       65732    66087     +355     
  Branches    10266    10313      +47     
==========================================
+ Hits        54385    54816     +431     
+ Misses       8711     8657      -54     
+ Partials     2636     2614      -22     
Flag               Coverage Δ
keras              82.77% <100.00%> (+0.20%) ⬆️
keras-jax          60.81% <100.00%> (-0.11%) ⬇️
keras-numpy        55.00% <100.00%> (-0.11%) ⬇️
keras-openvino     49.09% <0.00%> (+9.75%) ⬆️
keras-tensorflow   62.04% <100.00%> (-0.12%) ⬇️
keras-torch        60.85% <100.00%> (-0.17%) ⬇️

Comment thread keras/src/layers/preprocessing/index_lookup.py Outdated
@google-ml-butler google-ml-butler Bot added the kokoro:force-run and ready to pull labels Mar 5, 2026
@jeffcarp
Member

jeffcarp commented Mar 5, 2026

Thanks! Lgtm with one style nit

@google-ml-butler google-ml-butler Bot removed the ready to pull label Mar 5, 2026
@tanguyguyot tanguyguyot requested a review from jeffcarp March 5, 2026 23:28
@jeffcarp jeffcarp merged commit cfff1d9 into keras-team:master Mar 5, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues.

OOV bucketing uses floormod for integers in IntegerLookup, inconsistent with Hashing

5 participants