add date filter to contributer resolution logic queries#3253
Conversation
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
sgoggins
left a comment
There was a problem hiding this comment.
One minor change noted with some additional comments on how getting the since_date bulletproofed against dufus contributors who make commit_date random could affect some other parts of the code. LOOKS VERY GOOD! Let me know when this is ready to test.
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
Ulincsys
left a comment
There was a problem hiding this comment.
Just a question about the change to the makefile.
On many systems, that should fail, because installing packages systemwide is not supported anymore (on EG: Ubuntu), and I believe with UV, we are no longer intended to run Augur in a venv (as UV supposedly manages those automatically).
|
The test would be:
|
ABrain7710
left a comment
There was a problem hiding this comment.
To ensure we can still support full collection on demand I think it would good to pass a full_collection flag and if it is true then grab all commits, otherwise do this new logic to only get new commits
|
in poking through the code, im actually not sure this even affects analyze_commits_in_parallel beyond removing some date related things |
The partial commit collection was already supported previously in the body of the repo = get_repo_by_repo_id(repo_id)
#Get the huge list of commits to process.
absolute_path = get_absolute_repo_path(facade_helper.repo_base_directory, repo.repo_id, repo.repo_path, repo.repo_name)
repo_loc = (f"{absolute_path}/.git")
# Grab the parents of HEAD
parent_commits = get_parent_commits_set(repo_loc)
# Grab the existing commits from the database
existing_commits = get_existing_commits_set(repo_id)
# Find missing commits and add them
missing_commits = parent_commits - existing_commits
facade_helper.log_activity('Debug',f"Commits missing from repo {repo_id}: {len(missing_commits)}")These changes just effect the contributor resolution logic to only go through commits that we haven't before by default. |
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
|
What would the behavior of this PR be in the case of a repo deciding to rewrite the history of their git repo? Is that something we can detect and maybe just fall back to the full recollection? |
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
| @@ -8,6 +8,7 @@ | |||
| from augur.application.db.models import Contributor | |||
There was a problem hiding this comment.
[pylint] reported by reviewdog 🐶
W0611: Unused Contributor imported from augur.application.db.models (unused-import)
The logic is based on git commit hashes for the main analysis logic and for the contributor resolution logic it's based on the date that the commit record was inserted. So, if a repository is rewriting their history the hashes would presumably be different and would be collected by the main logic into the commits table. Then that new commit would be collected later and therefore not be filtered by the new logic. The date that it is filtering by is the date that the commit record was collected not the date the commit record was committed. |
| conn.execute(text(f""" | ||
| INSERT INTO "augur_operations"."config" ("section_name", "setting_name", "value", "type") VALUES ('Facade', 'facade_contributor_full_recollect', '{0}', 'int'); | ||
| """)) | ||
| # ### end Alembic commands ### |
There was a problem hiding this comment.
A migration is not needed if you are simply adding a value to the database - its really only needed for changing the structure of the DB tables/columns themselves. I suspect you could add this to augur/application/config.py:~71 or so and it would get included in the database for anyone who has augur set to insert that config on startup
There was a problem hiding this comment.
I gotcha, I just wasn't sure where we create the config in the code. Have changed to reflect.
Signed-off-by: Isaac Milarsky <imilarsky@gmail.com>
sgoggins
left a comment
There was a problem hiding this comment.
This looks ready for testing to me ... maybe @Ulincsys or @MoralCode can review this as well.
augur/application/config.py
Outdated
| "run_analysis": 1, | ||
| "run_facade_contributors": 1 | ||
| "run_facade_contributors": 1, | ||
| "facade_contributor_full_recollect": 1 |
There was a problem hiding this comment.
Added at @ABrain7710 's request. The goal is to have a specific option to ensure that previously uncollected contributors are gathered. Probably the default here should be 0 ...
| # Grab the parents of HEAD | ||
|
|
||
| parent_commits = get_parent_commits_set(repo_loc, start_date) | ||
| parent_commits = get_parent_commits_set(repo_loc) |
There was a problem hiding this comment.
Per @MoralCode 's prior comment; The analyze commits in parallel method would go from the last time facade ran, but this doesn't really get used. RIght now it gets the parent commits first, and gets the existing commits in the database already, and only goes through the commit hashes for commits that are "missing commits" =parent commits - existing commits
There was a problem hiding this comment.
This is not intended to affect the main contributor analysis process. It is less about 'commits' per se.
There was a problem hiding this comment.
Also note that the start_date is never used in the function called with it, which is another reason to simplify/remove it ...
Ulincsys
left a comment
There was a problem hiding this comment.
LGTM with the makefile reverted
Set default to 0. Signed-off-by: Sean P. Goggins <outdoors@acm.org>
Description
Signed commits