Skip to content

add date filter to contributer resolution logic queries#3253

Merged
sgoggins merged 10 commits intomainfrom
partial-recollection-facade
Oct 9, 2025
Merged

add date filter to contributer resolution logic queries#3253
sgoggins merged 10 commits intomainfrom
partial-recollection-facade

Conversation

@IsaacMilarky
Copy link
Contributor

Description

  • Add a date filter to the Facade contributer resolution logic
  • Now, we will only use commits that have appeared since last time facade was ran in contributer resolution
  • I have also removed the start_date setting because it wasn't doing anything
  • Added a helper function for if we need to get the date that the last commit was added

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change noted with some additional comments on how getting the since_date bulletproofed against dufus contributors who make commit_date random could affect some other parts of the code. LOOKS VERY GOOD! Let me know when this is ready to test.

@sgoggins sgoggins added the workers Related to data workers label Sep 2, 2025
@IsaacMilarky IsaacMilarky marked this pull request as ready for review September 9, 2025 22:31
Copy link
Contributor

@Ulincsys Ulincsys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question about the change to the makefile.

On many systems, that should fail, because installing packages systemwide is not supported anymore (on EG: Ubuntu), and I believe with UV, we are no longer intended to run Augur in a venv (as UV supposedly manages those automatically).

sgoggins
sgoggins previously approved these changes Sep 30, 2025
Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Lets let it fly!

@sgoggins
Copy link
Member

The test would be:

  1. New Augur instance
  2. Identify 3-5 very large repositories
  3. Do collection for them (note that for Facade collection it typically requires an Augur restart)
  4. Truncate the augur_operations.collection_status table
  5. Restart collection after commits were collected the first time and see if it goes insanely fast, which it should if this branch does what we claim it does.

@sgoggins sgoggins requested review from MoralCode and removed request for Ulincsys September 30, 2025 14:28
Copy link
Contributor

@ABrain7710 ABrain7710 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ensure we can still support full collection on demand I think it would good to pass a full_collection flag and if it is true then grab all commits, otherwise do this new logic to only get new commits

@MoralCode
Copy link
Contributor

in poking through the code, im actually not sure this even affects analyze_commits_in_parallel beyond removing some date related things

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@IsaacMilarky
Copy link
Contributor Author

in poking through the code, im actually not sure this even affects analyze_commits_in_parallel beyond removing some date related things

The partial commit collection was already supported previously in the body of the analyze_commits_in_parallel method:

    repo = get_repo_by_repo_id(repo_id)

    #Get the huge list of commits to process.
    absolute_path = get_absolute_repo_path(facade_helper.repo_base_directory, repo.repo_id, repo.repo_path, repo.repo_name)
    repo_loc = (f"{absolute_path}/.git")
    # Grab the parents of HEAD

    parent_commits = get_parent_commits_set(repo_loc)

    # Grab the existing commits from the database
    existing_commits = get_existing_commits_set(repo_id)

    # Find missing commits and add them
    missing_commits = parent_commits - existing_commits

    facade_helper.log_activity('Debug',f"Commits missing from repo {repo_id}: {len(missing_commits)}")

These changes just effect the contributor resolution logic to only go through commits that we haven't before by default.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@MoralCode
Copy link
Contributor

What would the behavior of this PR be in the case of a repo deciding to rewrite the history of their git repo? Is that something we can detect and maybe just fall back to the full recollection?

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@@ -8,6 +8,7 @@
from augur.application.db.models import Contributor
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pylint] reported by reviewdog 🐶
W0611: Unused Contributor imported from augur.application.db.models (unused-import)

@IsaacMilarky
Copy link
Contributor Author

What would the behavior of this PR be in the case of a repo deciding to rewrite the history of their git repo? Is that something we can detect and maybe just fall back to the full recollection?

The logic is based on git commit hashes for the main analysis logic and for the contributor resolution logic it's based on the date that the commit record was inserted.

So, if a repository is rewriting their history the hashes would presumably be different and would be collected by the main logic into the commits table. Then that new commit would be collected later and therefore not be filtered by the new logic. The date that it is filtering by is the date that the commit record was collected not the date the commit record was committed.

Comment on lines +25 to +28
conn.execute(text(f"""
INSERT INTO "augur_operations"."config" ("section_name", "setting_name", "value", "type") VALUES ('Facade', 'facade_contributor_full_recollect', '{0}', 'int');
"""))
# ### end Alembic commands ###
Copy link
Contributor

@MoralCode MoralCode Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A migration is not needed if you are simply adding a value to the database - its really only needed for changing the structure of the DB tables/columns themselves. I suspect you could add this to augur/application/config.py:~71 or so and it would get included in the database for anyone who has augur set to insert that config on startup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gotcha, I just wasn't sure where we create the config in the code. Have changed to reflect.

Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
@IsaacMilarky IsaacMilarky requested a review from MoralCode October 8, 2025 16:01
@IsaacMilarky IsaacMilarky requested a review from sgoggins October 8, 2025 16:01
sgoggins
sgoggins previously approved these changes Oct 8, 2025
Copy link
Member

@sgoggins sgoggins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ready for testing to me ... maybe @Ulincsys or @MoralCode can review this as well.

"run_analysis": 1,
"run_facade_contributors": 1
"run_facade_contributors": 1,
"facade_contributor_full_recollect": 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added at @ABrain7710 's request. The goal is to have a specific option to ensure that previously uncollected contributors are gathered. Probably the default here should be 0 ...

# Grab the parents of HEAD

parent_commits = get_parent_commits_set(repo_loc, start_date)
parent_commits = get_parent_commits_set(repo_loc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @MoralCode 's prior comment; The analyze commits in parallel method would go from the last time facade ran, but this doesn't really get used. RIght now it gets the parent commits first, and gets the existing commits in the database already, and only goes through the commit hashes for commits that are "missing commits" =parent commits - existing commits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not intended to affect the main contributor analysis process. It is less about 'commits' per se.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the start_date is never used in the function called with it, which is another reason to simplify/remove it ...

Ulincsys
Ulincsys previously approved these changes Oct 9, 2025
Copy link
Contributor

@Ulincsys Ulincsys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the makefile reverted

Set default to 0. 

Signed-off-by: Sean P. Goggins <outdoors@acm.org>
@sgoggins sgoggins dismissed stale reviews from Ulincsys and themself via a87c6a0 October 9, 2025 14:14
@sgoggins sgoggins merged commit 473673c into main Oct 9, 2025
11 of 13 checks passed
@sgoggins sgoggins deleted the partial-recollection-facade branch November 12, 2025 00:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

workers Related to data workers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants