Add support for file paths for providing entity rows during batch retrieval#365

Closed
voonhous wants to merge 10 commits into master from batch-retrieval-file-path-dev

@voonhous
Collaborator

Users should be able to provide a large number of entity rows when retrieving batch features, but they are currently blocked by the memory limits of pandas DataFrames.

For batch retrieval, Avro files are already supported as the format for sending entity rows, but only through the Feast Serving API. The Python SDK hides this detail by running the following pipeline:

entity_rows pandas DataFrame → .avro (local) → .avro (GCS) → BigQuery

This pull request adds the ability for users to provide any of the following:

  • A pandas DataFrame with the "datetime" column
  • A local Avro file with the "event_timestamp" column
  • A GCS Avro file
  • A GCS wildcard path

Examples:

  • entity_rows = [pandas DataFrame]
  • entity_rows = subfolder/entities.avro
  • entity_rows = /data/subfolder/entities.avro
  • entity_rows = gs://food-recsys/folder/customer_entity_rows.avro
  • entity_rows = gs://food-recsys/folder/customer_entity_rows_*.avro
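
The string forms listed above could be told apart with a small helper along these lines. This is a minimal sketch, not the code in this pull request: the function name classify_entity_rows_path and the labels it returns are hypothetical, and it assumes that a GCS wildcard is signalled by a "*" in the object path.

```python
from urllib.parse import urlparse


def classify_entity_rows_path(path: str) -> str:
    """Return a label for the kind of entity-rows path (hypothetical helper)."""
    parsed = urlparse(path)
    if parsed.scheme == "gs":
        # Assumption: any "*" in the object path marks a wildcard path.
        return "gcs_wildcard" if "*" in parsed.path else "gcs_file"
    if parsed.scheme in ("", "file"):
        # Relative and absolute local paths have no URL scheme.
        return "local_file"
    raise ValueError(f"Unsupported entity_rows path: {path}")
```

With this sketch, both subfolder/entities.avro and /data/subfolder/entities.avro are labelled as local files, while the two gs:// examples are labelled as a single GCS file and a GCS wildcard respectively.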

datetime and event_timestamp are currently used interchangeably; the SDK needs to standardize on one of them.

As of now:

  • datetime is enforced for pandas DataFrames.
  • event_timestamp is enforced for local Avro files.
  • Nothing is enforced for files living in GCS; GCS file paths are not validated.
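
The enforcement rules above can be summarized as a lookup from input kind to required timestamp column. The sketch below is illustrative only: the names REQUIRED_TIMESTAMP_COLUMN and check_timestamp_column, and the kind labels, are hypothetical and do not come from the SDK.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from input kind to the timestamp column it must contain.
REQUIRED_TIMESTAMP_COLUMN: Dict[str, Optional[str]] = {
    "dataframe": "datetime",
    "local_avro": "event_timestamp",
    "gcs": None,  # GCS paths are passed through without validation
}


def check_timestamp_column(kind: str, columns: List[str]) -> None:
    """Raise ValueError if the column required for this input kind is missing."""
    required = REQUIRED_TIMESTAMP_COLUMN[kind]
    if required is not None and required not in columns:
        raise ValueError(
            f"{kind} entity rows must contain a '{required}' column, "
            f"got columns: {columns}"
        )
```

Under these assumptions, a local Avro file that only has a datetime column is rejected, which is exactly the inconsistency that standardizing on a single column name would remove.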

Member

@woop woop left a comment

@voonhous this looks pretty good, thanks!

Would you mind updating the tests as well? I think it's important to ensure that the code actually works.

Comment thread sdk/python/feast/client.py
@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: voonhous
To complete the pull request process, please assign woop
You can assign the PR to them by writing /assign @woop in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@feast-ci-bot
Collaborator

@voonhous: The following tests failed, say /retest to rerun them all:

Test name              Commit   Details  Rerun command
test-end-to-end-batch  b2a946e  link     /test test-end-to-end-batch
test-end-to-end        b2a946e  link     /test test-end-to-end

Full PR test history

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

    entity_rows["datetime"] = pd.DatetimeIndex(
        entity_rows["datetime"]
    ).tz_localize(None)
elif isinstance(entity_rows, str):
Member

Should there be an else statement here?

Collaborator Author

Fixed
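
One way to read the question above is that an explicit else branch should reject unsupported input types instead of falling through silently. The sketch below is an assumption about the shape of such a fix, not the change that was actually committed. resolve_entity_rows_kind is a hypothetical name, and it detects DataFrame-like input by the presence of a columns attribute only to stay dependency-free; real code would more likely use isinstance(entity_rows, pd.DataFrame).

```python
def resolve_entity_rows_kind(entity_rows) -> str:
    """Return which branch would handle entity_rows (hypothetical helper)."""
    if hasattr(entity_rows, "columns"):
        # DataFrame-like input: would be written to a local Avro file first.
        return "dataframe"
    elif isinstance(entity_rows, str):
        # String input: treated as a local or GCS file path.
        return "path"
    else:
        # Without this branch, unsupported types would pass through unnoticed.
        raise TypeError(
            "entity_rows must be a pandas DataFrame or a file path string, "
            f"got {type(entity_rows).__name__}"
        )
```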
