Avoid duplicate image indexing when scanning folders containing symlinks or hardlinks #1224
Riddhi8077 wants to merge 16 commits into AOSSIE-Org:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
Walkthrough: Set sqlite3.Row on DB connections and commit in the connection context; added direct image lookup and safer favourite-toggle; introduced filesystem-based duplicate-image and path deduplication and simplified metadata parsing; replaced global test fixture with a per-test temp DB fixture and added connection/metadata/dedup tests; minor .gitignore whitespace change.
Sequence Diagram(s): omitted. Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/routes/images.py (1)
109-114: ⚠️ Potential issue | 🟠 Major — Guard against `None` from `db_get_image_by_id` before using `.get()`.
If the record is deleted/changed between toggle and fetch, `image` can be `None`, and `image.get(...)` will raise an exception.
💡 Suggested hardening

```diff
- image = db_get_image_by_id(image_id)
+ image = db_get_image_by_id(image_id)
+ if image is None:
+     raise HTTPException(
+         status_code=404, detail="Image not found after toggle"
+     )
  return {
      "success": True,
      "image_id": image_id,
      "isFavourite": image.get("isFavourite", False),
  }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/routes/images.py` around lines 109 - 114, The code calls db_get_image_by_id(...) and immediately uses image.get(...), which will raise if db_get_image_by_id returns None; update the handler to check whether image is None after calling db_get_image_by_id (the local variable image) and handle that case explicitly (e.g., return a 404/appropriate error response or set isFavourite to False and success False), rather than calling image.get; ensure the response uses the guarded value and include the image_id and a clear error path so the route (images.py) won't crash if the record was deleted between toggle and fetch.
🧹 Nitpick comments (4)
backend/app/routes/images.py (1)
9-9: Remove duplicate import of `db_get_all_images`.
This symbol is already imported at Line 3, so this line introduces unnecessary duplication.
As per coding guidelines, "Look for code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/routes/images.py` at line 9, The file imports db_get_all_images twice; remove the duplicate import and consolidate imports from app.database.images so only one import statement remains (keep db_get_all_images and db_get_image_by_id in the single import). Locate the duplicate reference to db_get_all_images in the images.py imports and delete the redundant entry so there is a single import of db_get_all_images alongside db_get_image_by_id.
backend/app/database/images.py (1)
9-21: Remove the repeated `get_db_connection` import.
`from app.database.connection import get_db_connection` appears twice in the module. As per coding guidelines, "Look for code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/database/images.py` around lines 9 - 21, The module imports get_db_connection twice; remove the duplicate import statement so only a single "from app.database.connection import get_db_connection" remains; verify the remaining import is placed with the other imports (above logger initialization) and run linting to ensure no unused-import warnings; references to get_db_connection in the module should continue to work unchanged.
backend/tests/database/test_metadata.py (1)
8-23: Consider reusing the shared DB fixture from `conftest.py`.
This local fixture duplicates temporary DB setup logic already centralized in `temp_db_path`, which can drift over time. As per coding guidelines, "Look for code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/tests/database/test_metadata.py` around lines 8 - 23, The local temp_db fixture duplicates setup already provided by the conftest.py temp_db_path fixture; remove this temp_db fixture and update tests to accept temp_db_path instead, then replace uses of temp_db with temp_db_path and call monkeypatch.setattr("app.database.metadata.DATABASE_PATH", str(temp_db_path)) in tests (or a small per-test setup) so the metadata module uses the shared temporary DB; ensure any tests that previously relied on temp_db's return value use the same path or Path object provided by temp_db_path.
backend/app/database/connection.py (1)
11-12: Replace the emphatic comment with a specific reason.
`# VERY IMPORTANT` does not add actionable context. A short reason (e.g., why `sqlite3.Row` is required downstream) will age better. As per coding guidelines, "Point out redundant obvious comments that do not add clarity to the code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/database/connection.py` around lines 11 - 12, Replace the vague comment above conn.row_factory = sqlite3.Row with a concise, actionable reason explaining why sqlite3.Row is required (for example: "Use sqlite3.Row so query results support dict-style access by column name required by downstream code that expects row['col_name']"). Update the comment adjacent to conn.row_factory = sqlite3.Row to reference the specific downstream expectation (e.g., dict-style access or JSON serialization) and remove "VERY IMPORTANT".
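For context on the comment above, here is a minimal standalone sketch of what `sqlite3.Row` buys downstream code; the schema and values are illustrative, not the project's:

```python
import sqlite3

# Without row_factory, fetchone() returns a plain tuple and the
# row["col_name"] access pattern used downstream raises TypeError.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # enable dict-style access by column name
conn.execute("CREATE TABLE images (id TEXT, isFavourite INTEGER)")
conn.execute("INSERT INTO images VALUES ('img-1', 1)")

row = conn.execute("SELECT * FROM images WHERE id = ?", ("img-1",)).fetchone()
print(row["isFavourite"])  # column-name access works -> 1
print(dict(row))           # rows also convert cleanly to dicts
conn.close()
```

A comment along the lines of "sqlite3.Row so callers can do `row['col']` and `dict(row)`" captures the actual downstream dependency.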
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/utils/images.py`:
- Around line 25-45: Add focused pytest cases for
image_util_remove_duplicate_images to validate deduplication by filesystem
identity and handling of unreadable paths: create tests that (1) create two
records pointing to the same physical file via a hardlink (use os.link) and
assert the function returns a single record, (2) create a symlink to the same
file and assert dedup also collapses symlink+target into one record, and (3)
simulate an unreadable/missing path (either by changing file permissions or
forcing os.stat to raise OSError via monkeypatch) and assert that the unreadable
record is skipped. Use pytest tmp_path fixtures to create files and reference
image_util_remove_duplicate_images to locate the implementation.
- Around line 25-45: The dedup helper image_util_remove_duplicate_images is
implemented but never applied in the ingestion flow; update
image_util_process_folder_images to call image_util_remove_duplicate_images on
the list of records returned or built before the bulk insert step so that only
unique_records are passed to whatever bulk insert function (e.g., the existing
bulk_insert or similar call) — locate where image records are collected in
image_util_process_folder_images, run them through
image_util_remove_duplicate_images, and use that filtered list for
indexing/insertion.
- Around line 516-528: image_util_parse_metadata currently returns raw DB
strings unchanged, causing callers that expect a dict to break; update
image_util_parse_metadata to detect when metadata is a JSON string and
json.loads it (import json), catch and log JSONDecodeError, and then normalize
the resulting object to a dict with the expected keys
("name","date_created","width","height","file_location","file_size","item_type")
filling any missing keys with the same default values currently used; keep the
original default-return behavior for falsy metadata and ensure the function
always returns a dict-like object for downstream consumers.
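A standalone sketch of the normalization this prompt asks for; the function and constant names here are illustrative, not the project's actual API:

```python
import json

DEFAULT_METADATA = {
    "name": "unknown",
    "date_created": "unknown",
    "width": 0,
    "height": 0,
    "file_location": "unknown",
    "file_size": 0,
    "item_type": "image",
}

def parse_metadata(metadata):
    """Always return a dict: parse JSON strings, pass dicts through,
    and fall back to defaults on falsy or malformed input."""
    if not metadata:
        return dict(DEFAULT_METADATA)
    if isinstance(metadata, dict):
        return {**DEFAULT_METADATA, **metadata}
    if isinstance(metadata, str):
        try:
            parsed = json.loads(metadata)
        except json.JSONDecodeError:
            return dict(DEFAULT_METADATA)
        if isinstance(parsed, dict):
            return {**DEFAULT_METADATA, **parsed}
    return dict(DEFAULT_METADATA)

print(parse_metadata('{"width": 800, "name": "cat.jpg"}')["width"])  # -> 800
print(parse_metadata("not json")["item_type"])                       # -> image
print(parse_metadata(None)["file_size"])                             # -> 0
```

Merging over a defaults dict guarantees every expected key is present, so callers never need per-field `None` checks.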
---
Outside diff comments:
In `@backend/app/routes/images.py`:
- Around line 109-114: The code calls db_get_image_by_id(...) and immediately
uses image.get(...), which will raise if db_get_image_by_id returns None; update
the handler to check whether image is None after calling db_get_image_by_id (the
local variable image) and handle that case explicitly (e.g., return a
404/appropriate error response or set isFavourite to False and success False),
rather than calling image.get; ensure the response uses the guarded value and
include the image_id and a clear error path so the route (images.py) won't crash
if the record was deleted between toggle and fetch.
---
Nitpick comments:
In `@backend/app/database/connection.py`:
- Around line 11-12: Replace the vague comment above conn.row_factory =
sqlite3.Row with a concise, actionable reason explaining why sqlite3.Row is
required (for example: "Use sqlite3.Row so query results support dict-style
access by column name required by downstream code that expects
row['col_name']"). Update the comment adjacent to conn.row_factory = sqlite3.Row
to reference the specific downstream expectation (e.g., dict-style access or
JSON serialization) and remove "VERY IMPORTANT".
In `@backend/app/database/images.py`:
- Around line 9-21: The module imports get_db_connection twice; remove the
duplicate import statement so only a single "from app.database.connection import
get_db_connection" remains; verify the remaining import is placed with the other
imports (above logger initialization) and run linting to ensure no unused-import
warnings; references to get_db_connection in the module should continue to work
unchanged.
In `@backend/app/routes/images.py`:
- Line 9: The file imports db_get_all_images twice; remove the duplicate import
and consolidate imports from app.database.images so only one import statement
remains (keep db_get_all_images and db_get_image_by_id in the single import).
Locate the duplicate reference to db_get_all_images in the images.py imports and
delete the redundant entry so there is a single import of db_get_all_images
alongside db_get_image_by_id.
In `@backend/tests/database/test_metadata.py`:
- Around line 8-23: The local temp_db fixture duplicates setup already provided
by the conftest.py temp_db_path fixture; remove this temp_db fixture and update
tests to accept temp_db_path instead, then replace uses of temp_db with
temp_db_path and call monkeypatch.setattr("app.database.metadata.DATABASE_PATH",
str(temp_db_path)) in tests (or a small per-test setup) so the metadata module
uses the shared temporary DB; ensure any tests that previously relied on
temp_db's return value use the same path or Path object provided by
temp_db_path.
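A rough standalone illustration of the redirection `monkeypatch.setattr` performs in the suggestion above; the module object here is a stand-in, not the real `app.database.metadata`:

```python
import os
import sqlite3
import tempfile
from types import SimpleNamespace

# Hypothetical stand-in for a module holding a DATABASE_PATH constant.
metadata_module = SimpleNamespace(DATABASE_PATH="/prod/pictopy.db")

def point_at_temp_db(module):
    """Redirect the module to a fresh temp DB, as monkeypatch.setattr
    would do per test against the shared temp_db_path fixture."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    module.DATABASE_PATH = path
    return path

path = point_at_temp_db(metadata_module)
conn = sqlite3.connect(metadata_module.DATABASE_PATH)
conn.execute("CREATE TABLE metadata (key TEXT, value TEXT)")
conn.close()
print(metadata_module.DATABASE_PATH == path)  # -> True
os.remove(path)
```

The benefit of centralizing this in one fixture is that every test patches the same attribute the same way, instead of each test file reinventing the temp-DB plumbing.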
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3ec899a0-2f61-4ce8-a9bf-3a686c26f118
📒 Files selected for processing (8)
backend/.gitignore, backend/app/database/connection.py, backend/app/database/images.py, backend/app/routes/images.py, backend/app/utils/images.py, backend/tests/conftest.py, backend/tests/database/test_connection.py, backend/tests/database/test_metadata.py
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/app/utils/images.py (2)
26-45: ⚠️ Potential issue | 🔴 Critical — This dedup helper never affects indexing.
`image_util_process_folder_images()` still bulk-inserts the raw `all_image_records` list on Line 100, so symlink/hardlink aliases still create duplicate rows and duplicate thumbnail work. Please run the filesystem-identity check in the ingest path before `image_util_prepare_image_records()`, otherwise #1181 remains unresolved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 26 - 45, The deduplication by filesystem identity in image_util_remove_duplicate_images is never applied to indexing because image_util_process_folder_images still bulk-inserts the raw all_image_records; modify image_util_process_folder_images to run image_util_remove_duplicate_images on all_image_records immediately after collecting them (and before calling image_util_prepare_image_records and before the bulk insert) so that all_image_records is replaced with the deduplicated list; ensure subsequent calls (image_util_prepare_image_records, thumbnail generation, and the DB bulk-insert) use the deduplicated all_image_records to avoid duplicate rows and duplicate thumbnail work.
554-566: ⚠️ Potential issue | 🔴 Critical — `image_util_parse_metadata` must still deserialize DB JSON.
`metadata` is stored as JSON `TEXT`. Callers at `backend/app/database/images.py` Line 213 and Line 579, plus `backend/app/routes/images.py` Line 67, expect a dict; returning the raw string here will break response-model construction and downstream metadata access. Please keep dicts as-is and `json.loads` string values with a default fallback.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 554 - 566, image_util_parse_metadata currently returns raw metadata which may be a JSON string; update it so if metadata is already a dict return it unchanged, and if it's a str attempt json.loads(metadata) inside a try/except and return the parsed dict (on exception or falsy input return the existing default dict with keys name,date_created,width,height,file_location,file_size,item_type); ensure numeric fields remain present (width,height,file_size) and types handled conservatively so callers expecting a dict (e.g., the image metadata consumers) receive a dict fallback rather than a raw string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/database/images.py`:
- Around line 485-507: Remove the stray token "fix-remove-duplicate-images" from
the body of db_get_image_by_id and delete the orphaned/unreachable
except/finally blocks that follow the early return; either wrap the DB
operations in a proper try/except/finally around get_db_connection() to call
conn.commit() and close/rollback as needed, or keep the simple with
get_db_connection() context usage and just return dict(row) after fetchone() so
there are no unmatched except/finally blocks left in the module.
---
Duplicate comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: The deduplication by filesystem identity in
image_util_remove_duplicate_images is never applied to indexing because
image_util_process_folder_images still bulk-inserts the raw all_image_records;
modify image_util_process_folder_images to run
image_util_remove_duplicate_images on all_image_records immediately after
collecting them (and before calling image_util_prepare_image_records and before
the bulk insert) so that all_image_records is replaced with the deduplicated
list; ensure subsequent calls (image_util_prepare_image_records, thumbnail
generation, and the DB bulk-insert) use the deduplicated all_image_records to
avoid duplicate rows and duplicate thumbnail work.
- Around line 554-566: image_util_parse_metadata currently returns raw metadata
which may be a JSON string; update it so if metadata is already a dict return it
unchanged, and if it's a str attempt json.loads(metadata) inside a try/except
and return the parsed dict (on exception or falsy input return the existing
default dict with keys
name,date_created,width,height,file_location,file_size,item_type); ensure
numeric fields remain present (width,height,file_size) and types handled
conservatively so callers expecting a dict (e.g., the image metadata consumers)
receive a dict fallback rather than a raw string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d440ed75-489b-4a51-9535-0ec0ef26048e
📒 Files selected for processing (2)
backend/app/database/images.py, backend/app/utils/images.py
Actionable comments posted: 2
♻️ Duplicate comments (2)
backend/app/utils/images.py (2)
554-566: ⚠️ Potential issue | 🔴 Critical — Parse the stored JSON before returning metadata.
Line 196 stores metadata as a JSON string. Returning DB values unchanged here hands callers a raw `str` instead of a dict, which breaks downstream field access and response serialization. Please restore JSON parsing and normalize the result to the expected shape.
🛠️ Minimal fix
```diff
 def image_util_parse_metadata(metadata):
+    default = {
+        "name": "unknown",
+        "date_created": "unknown",
+        "width": 0,
+        "height": 0,
+        "file_location": "unknown",
+        "file_size": 0,
+        "item_type": "image",
+    }
     if not metadata:
-        return {
-            "name": "unknown",
-            "date_created": "unknown",
-            "width": 0,
-            "height": 0,
-            "file_location": "unknown",
-            "file_size": 0,
-            "item_type": "image"
-        }
-
-    return metadata
+        return default
+    if isinstance(metadata, dict):
+        return {**default, **metadata}
+    if isinstance(metadata, str):
+        try:
+            parsed = json.loads(metadata)
+        except json.JSONDecodeError:
+            logger.warning("Invalid image metadata JSON")
+            return default
+        if isinstance(parsed, dict):
+            return {**default, **parsed}
+    return default
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 554 - 566, The function image_util_parse_metadata currently returns the DB value unchanged, which can be a JSON string; update image_util_parse_metadata to detect if metadata is a string and parse it with json.loads, then normalize the resulting dict to always include keys name, date_created, width, height, file_location, file_size, and item_type (defaulting to "unknown" or 0 as in the current fallback), and return that dict; ensure you import json if needed and keep the existing early-return when metadata is falsy.
26-45: ⚠️ Potential issue | 🟠 Major — Still missing focused regression coverage for the dedup path.
This is the core bugfix, but the provided changes still do not exercise hardlink collapse, symlink collapse, or the OSError/empty-scan paths. Please add automated tests around `image_util_remove_duplicate_images()` and `image_util_process_folder_images()` so this behavior does not regress again. As per coding guidelines, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."
Also applies to: 99-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 26 - 45, Add automated pytest unit tests for image_util_remove_duplicate_images and image_util_process_folder_images to cover hardlink collapse, symlink collapse, OSError handling, and empty-scan behavior: create temporary files in a tmp_path fixture, make a hardlink to the same inode and a symlink to the file and assert duplicate entries are collapsed by image_util_remove_duplicate_images; exercise image_util_process_folder_images on an empty directory to assert it returns an empty result and on a directory with hardlinked/symlinked files to assert deduplication; also add a test that simulates os.stat raising OSError for a path (either by passing a non-existent path or monkeypatching os.stat to raise) and assert the function skips that record without crashing. Ensure tests import image_util_remove_duplicate_images and image_util_process_folder_images and follow existing test patterns and cleanup rules.
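The filesystem-identity behavior those tests should pin down can be sketched standalone; the record shape (`{"path": ...}`) and function name are assumptions, not the project's exact API:

```python
import os
import tempfile

def remove_duplicate_images(records):
    """Collapse records whose 'path' values resolve to the same physical file.

    Identity is the (st_dev, st_ino) pair, which is shared by hardlinks
    and by symlink targets because os.stat() follows symlinks by default.
    """
    seen = set()
    unique = []
    for record in records:
        try:
            st = os.stat(record["path"])
        except OSError:
            continue  # unreadable or missing path: skip the record
        identity = (st.st_dev, st.st_ino)
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique

# Demo: original + hardlink + symlink + a missing path collapse to one record.
with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "photo.jpg")
    open(original, "wb").close()
    os.link(original, os.path.join(d, "hardlink.jpg"))
    os.symlink(original, os.path.join(d, "symlink.jpg"))
    paths = [
        original,
        os.path.join(d, "hardlink.jpg"),
        os.path.join(d, "symlink.jpg"),
        os.path.join(d, "missing.jpg"),  # stat raises FileNotFoundError -> skipped
    ]
    unique_count = len(remove_duplicate_images([{"path": p} for p in paths]))

print(unique_count)  # -> 1
```

A pytest version would build the same tree under `tmp_path` and assert on the returned list, plus one case that monkeypatches `os.stat` to raise `OSError`.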
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/utils/images.py`:
- Around line 99-101: The current flow generates thumbnails in
image_util_prepare_image_records then calls image_util_remove_duplicate_images
before db_bulk_insert_images, causing orphaned thumbnail files; fix by
deduplicating earlier or cleaning up discarded thumbnails: either run
image_util_remove_duplicate_images on the raw input file paths before calling
image_util_prepare_image_records so thumbnails are only created for unique
images, or (if you must keep current order) collect the list of discarded image
records after image_util_remove_duplicate_images and call
image_util_remove_obsolete_images (or an explicit thumbnail-deletion helper) to
remove their thumbnail_{uuid}.jpg files before returning to
db_bulk_insert_images; update the code around image_util_prepare_image_records,
image_util_remove_duplicate_images, and image_util_remove_obsolete_images
accordingly.
- Around line 99-101: The function currently only sets unique_records and calls
db_bulk_insert_images when all_image_records is truthy, but then returns its
result outside the branch causing undefined behavior when all_image_records is
empty; modify the logic so the call to db_bulk_insert_images(unique_records)
remains inside the if all_image_records branch (using unique_records =
image_util_remove_duplicate_images(all_image_records) then return
db_bulk_insert_images(unique_records)), and add an explicit return True
immediately after the branch to ensure empty scans are treated as a successful
no-op; reference symbols: all_image_records, unique_records,
image_util_remove_duplicate_images, db_bulk_insert_images.
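The corrected control flow that prompt describes can be sketched standalone; `dedup` and `bulk_insert` below are stand-ins for the project helpers, not their real signatures:

```python
def process_folder_images(all_image_records, dedup, bulk_insert):
    """Dedup and insert only when records exist; treat an empty scan
    as a successful no-op rather than an error."""
    if all_image_records:
        unique_records = dedup(all_image_records)
        return bulk_insert(unique_records)  # only reached when records exist
    return True  # empty scan is not an error

inserted = []
ok = process_folder_images(
    [{"path": "/tmp/a.jpg"}, {"path": "/tmp/a.jpg"}],
    dedup=lambda records: [dict(t) for t in {tuple(r.items()) for r in records}],
    bulk_insert=lambda records: inserted.extend(records) or True,
)
print(ok, len(inserted))  # -> True 1

# Empty scan: neither helper is called, and the result is still success.
print(process_folder_images([], dedup=None, bulk_insert=None))  # -> True
```

Keeping the `return bulk_insert(...)` inside the `if` branch is what prevents the `NameError` on `unique_records` when the scan finds nothing.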
---
Duplicate comments:
In `@backend/app/utils/images.py`:
- Around line 554-566: The function image_util_parse_metadata currently returns
the DB value unchanged, which can be a JSON string; update
image_util_parse_metadata to detect if metadata is a string and parse it with
json.loads, then normalize the resulting dict to always include keys name,
date_created, width, height, file_location, file_size, and item_type (defaulting
to "unknown" or 0 as in the current fallback), and return that dict; ensure you
import json if needed and keep the existing early-return when metadata is falsy.
- Around line 26-45: Add automated pytest unit tests for
image_util_remove_duplicate_images and image_util_process_folder_images to cover
hardlink collapse, symlink collapse, OSError handling, and empty-scan behavior:
create temporary files in a tmp_path fixture, make a hardlink to the same inode
and a symlink to the file and assert duplicate entries are collapsed by
image_util_remove_duplicate_images; exercise image_util_process_folder_images on
an empty directory to assert it returns an empty result and on a directory with
hardlinked/symlinked files to assert deduplication; also add a test that
simulates os.stat raising OSError for a path (either by passing a non-existent
path or monkeypatching os.stat to raise) and assert the function skips that
record without crashing. Ensure tests import image_util_remove_duplicate_images
and image_util_process_folder_images and follow existing test patterns and
cleanup rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 76c5cafc-3bef-41c2-a58a-8b0a0c06fcd0
📒 Files selected for processing (1)
backend/app/utils/images.py
Actionable comments posted: 2
♻️ Duplicate comments (2)
backend/app/database/images.py (1)
485-513: ⚠️ Potential issue | 🔴 Critical — `db_get_image_by_id` has orphaned code that causes a SyntaxError.
The function returns at line 505 with `return dict(row)`, making lines 506–513 unreachable. Additionally, the `except` and `finally` blocks have no matching `try`, causing a `SyntaxError` that prevents the module from importing.
🐛 Proposed fix

```diff
 def db_get_image_by_id(image_id: str) -> dict | None:
     """
     Retrieve a single image by its ID.
     """
     with get_db_connection() as conn:
         cursor = conn.cursor()
         cursor.execute(
             """
             SELECT * FROM images WHERE id = ?
             """,
             (image_id,),
         )
         row = cursor.fetchone()
         if row is None:
             return None
         return dict(row)
-    conn.commit()
-    return cursor.rowcount > 0
-    except Exception as e:
-        logger.error(f"Database error: {e}")
-        conn.rollback()
-        return False
-    finally:
-        conn.close()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/database/images.py` around lines 485 - 513, The function db_get_image_by_id contains unreachable lines after an early return and orphaned except/finally blocks (no try) causing a SyntaxError; fix it by wrapping the DB operations in a try block (use with get_db_connection() as conn: then try: cursor = conn.cursor(); execute and fetch), remove the unreachable commits/returns after return dict(row), place the existing except Exception as e: logger.error(...); conn.rollback(); return None (or False per project convention) and the finally block to close the connection (or rely on the context manager), and ensure the function returns the dict(row) when found and None when not found without dangling code; reference db_get_image_by_id, get_db_connection, cursor, and logger when making changes.
backend/app/utils/images.py (1)
100-104: ⚠️ Potential issue | 🔴 Critical — Indentation error causes `NameError` and unreachable code.
The indentation is inconsistent: `unique_records` is defined only inside the `if all_image_records:` block (lines 100-101), but `return db_bulk_insert_images(unique_records)` at line 102 appears to be at the outer indentation level. When `all_image_records` is empty:
- `unique_records` is never defined
- Line 102 raises `NameError: name 'unique_records' is not defined`
- Line 104 `return True` becomes unreachable
🐛 Proposed fix

```diff
 # Step 5: Bulk insert all new records if any exist
-if all_image_records:
-    unique_records = image_util_remove_duplicate_images(all_image_records)
-return db_bulk_insert_images(unique_records)
-
-return True  # No images to process is not an error
+if all_image_records:
+    unique_records = image_util_remove_duplicate_images(all_image_records)
+    return db_bulk_insert_images(unique_records)
+
+return True  # No images to process is not an error
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 100 - 104, The current block mis-indents handling of all_image_records so unique_records can be undefined; fix by moving the return that calls db_bulk_insert_images(unique_records) inside the same if all_image_records: block (after computing unique_records = image_util_remove_duplicate_images(all_image_records)) and ensure the function returns True only when there are no images to process; reference the variables and helpers all_image_records, unique_records, image_util_remove_duplicate_images(), and db_bulk_insert_images() to locate and correct the indentation so the db call is only made when unique_records is defined.
🧹 Nitpick comments (3)
backend/app/utils/images.py (3)
26-45: Add type hints for clarity and static analysis.
The function lacks type annotations for its parameter and return value. As per coding guidelines for Python files, ensure proper use of type hints.
♻️ Proposed fix
```diff
-def image_util_remove_duplicate_images(records):
+def image_util_remove_duplicate_images(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
     """Remove duplicate physical image files using file system identity."""
```
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 26 - 45, The function image_util_remove_duplicate_images lacks type annotations; update its signature to include precise typing (e.g., records: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]] and add the necessary typing imports (List, Dict, Any, Sequence) at the top of the module so static analysis and editors can validate that each record is a dict containing a "path" key; keep the implementation logic unchanged.
585-601: Add type hints; approach is sound.
The early path deduplication correctly prevents redundant thumbnail generation (addressing the concern about thumbnails being created before dedup). Add type hints per coding guidelines.
♻️ Proposed fix
♻️ Proposed fix
```diff
-def image_util_remove_duplicate_paths(image_paths):
+def image_util_remove_duplicate_paths(image_paths: List[str]) -> List[str]:
     """Remove duplicate physical image files before processing."""
```
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 585 - 601, The function image_util_remove_duplicate_paths currently lacks type annotations; update its signature to include input and return type hints (e.g., image_paths: Iterable[str] or Sequence[str] and return List[str]) and annotate any local collections if you follow local var typing (seen: Set[Tuple[int,int]], unique_paths: List[str]) to satisfy project typing guidelines, ensuring you also import the required typing symbols (Iterable, List, Set, Tuple) where necessary.
558-583: Add type hints to match docstring intent.
The function lacks type annotations. Based on usage patterns in `backend/app/database/images.py` (see context snippets 2-4), callers expect a `Mapping[str, Any]` return.
♻️ Proposed fix
```diff
-def image_util_parse_metadata(metadata):
+def image_util_parse_metadata(metadata: Any) -> Dict[str, Any]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 558 - 583, The function image_util_parse_metadata lacks type annotations; update its signature to accept metadata: Optional[Union[str, Mapping[str, Any]]]) and return Mapping[str, Any] (or Dict[str, Any]) and import required typing names (Mapping, Any, Optional, Union, Dict) at the top; keep the same logic but ensure the default return is typed as a Dict[str, Any] and the JSON branch returns a Mapping[str, Any] so callers in backend/app/database/images.py get the expected Mapping[str, Any] type from image_util_parse_metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/database/images.py`:
- Line 10: There is a duplicate import of get_db_connection in images.py; remove
the redundant import statement so only a single "from app.database.connection
import get_db_connection" remains (keep one and delete the other) to satisfy
PEP8 and avoid unnecessary duplication; locate the two occurrences near the top
of images.py and remove one, ensuring any references to get_db_connection in
functions within this module continue to work.
In `@backend/app/utils/images.py`:
- Line 556: Remove the duplicate module import by deleting the redundant "import
json" statement found in the file (the one shown in the diff) so only the single
existing "import json" at the top of backend.app.utils.images remains; search
for the "import json" occurrences in utils/images.py and remove the later
duplicate to satisfy PEP8.
---
Duplicate comments:
In `@backend/app/database/images.py`:
- Around line 485-513: The function db_get_image_by_id contains unreachable
lines after an early return and orphaned except/finally blocks (no try) causing
a SyntaxError; fix it by wrapping the DB operations in a try block (use with
get_db_connection() as conn: then try: cursor = conn.cursor(); execute and
fetch), remove the unreachable commits/returns after return dict(row), place the
existing except Exception as e: logger.error(...); conn.rollback(); return None
(or False per project convention) and the finally block to close the connection
(or rely on the context manager), and ensure the function returns the dict(row)
when found and None when not found without dangling code; reference
db_get_image_by_id, get_db_connection, cursor, and logger when making changes.
In `@backend/app/utils/images.py`:
- Around line 100-104: The current block mis-indents handling of
all_image_records so unique_records can be undefined; fix by moving the return
that calls db_bulk_insert_images(unique_records) inside the same if
all_image_records: block (after computing unique_records =
image_util_remove_duplicate_images(all_image_records)) and ensure the function
returns True only when there are no images to process; reference the variables
and helpers all_image_records, unique_records,
image_util_remove_duplicate_images(), and db_bulk_insert_images() to locate and
correct the indentation so the db call is only made when unique_records is
defined.
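The restructure described above for `db_get_image_by_id` can be sketched as follows; `get_db_connection`, the `images` table, and the throwaway DB path are stand-ins for the project's own helpers rather than its actual code:

```python
import sqlite3
import tempfile
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Optional

# Assumption: a throwaway file stands in for the project's configured DB path.
DB_PATH = tempfile.mkstemp(suffix=".db")[1]


@contextmanager
def get_db_connection() -> Iterator[sqlite3.Connection]:
    """Minimal stand-in: name-based row access, commit on clean exit."""
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    try:
        yield conn
        conn.commit()
    finally:
        conn.close()


def db_get_image_by_id(image_id: int) -> Optional[Dict[str, Any]]:
    """Return the image row as a dict, or None when missing or on error."""
    try:
        with get_db_connection() as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT * FROM images WHERE id = ?", (image_id,))
            row = cursor.fetchone()
            return dict(row) if row is not None else None
    except Exception:
        # Logging/rollback would go here in the real helper.
        return None
```

There is no code after the `return`, so no unreachable lines, and the context manager owns commit/close, avoiding the orphaned `except`/`finally` blocks.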
---
Nitpick comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: The function image_util_remove_duplicate_images lacks type
annotations; update its signature to include precise typing (e.g., records:
Sequence[Dict[str, Any]]) -> List[Dict[str, Any]] and add the necessary typing
imports (List, Dict, Any, Sequence) at the top of the module so static analysis
and editors can validate that each record is a dict containing a "path" key;
keep the implementation logic unchanged.
- Around line 585-601: The function image_util_remove_duplicate_paths currently
lacks type annotations; update its signature to include input and return type
hints (e.g., image_paths: Iterable[str] or Sequence[str] and return List[str])
and annotate any local collections if you follow local var typing (seen:
Set[Tuple[int,int]], unique_paths: List[str]) to satisfy project typing
guidelines, ensuring you also import the required typing symbols (Iterable,
List, Set, Tuple) where necessary.
- Around line 558-583: The function image_util_parse_metadata lacks type
annotations; update its signature to accept metadata: Optional[Union[str,
Mapping[str, Any]]]) and return Mapping[str, Any] (or Dict[str, Any]) and import
required typing names (Mapping, Any, Optional, Union, Dict) at the top; keep the
same logic but ensure the default return is typed as a Dict[str, Any] and the
JSON branch returns a Mapping[str, Any] so callers in
backend/app/database/images.py get the expected Mapping[str, Any] type from
image_util_parse_metadata.
📒 Files selected for processing (2): `backend/app/database/images.py`, `backend/app/utils/images.py`
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/utils/images.py (1)
50-50: ⚠️ Potential issue | 🟡 Minor. Duplicate logger shadows the custom logger.
Line 24 already defines `logger = get_logger(__name__)`. This reassignment at line 50 shadows it with the standard library logger, potentially losing custom logging configuration.
🧹 Proposed fix: remove the duplicate

```diff
-logger = logging.getLogger(__name__)
-
-
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` at line 50, The file reassigns logger from the custom get_logger(__name__) to the stdlib logging.getLogger(__name__) (the symbol logger is defined twice), which shadows and discards custom configuration; remove the second assignment so that the module keeps the original logger returned by get_logger and ensure any references use the existing logger symbol rather than calling logging.getLogger again.
🧹 Nitpick comments (1)
backend/app/utils/images.py (1)
26-45: Add type hints for better maintainability. Both `image_util_remove_duplicate_images` and `image_util_remove_duplicate_paths` lack type annotations. Consider adding them for consistency with the rest of the codebase.
♻️ Suggested type hints

```diff
-def image_util_remove_duplicate_images(records):
+def image_util_remove_duplicate_images(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
     """Remove duplicate physical image files using file system identity."""
```

And for `image_util_remove_duplicate_paths`:

```diff
-def image_util_remove_duplicate_paths(image_paths):
+def image_util_remove_duplicate_paths(image_paths: List[str]) -> List[str]:
     """Remove duplicate physical image files before processing."""
```

As per coding guidelines, "Ensure proper use of type hints."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils/images.py` around lines 26 - 45, Add explicit type hints to both functions: annotate image_util_remove_duplicate_images as def image_util_remove_duplicate_images(records: Sequence[Mapping[str, Any]]) -> List[Mapping[str, Any]] and annotate image_util_remove_duplicate_paths as def image_util_remove_duplicate_paths(paths: Sequence[str]) -> List[str]; import the necessary typing symbols (Sequence, Mapping, Any, List) at the top of the module and ensure the return and parameter types match how the functions are used (records being mappings with a "path" key and paths being strings).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/utils/images.py`:
- Around line 100-104: The current indentation around the all_image_records
handling causes unique_records to be referenced when undefined and makes the
final return unreachable; fix the block in the function that uses
all_image_records so that: check if all_image_records (if all_image_records:)
then compute unique_records =
image_util_remove_duplicate_images(all_image_records) and return
db_bulk_insert_images(unique_records) inside that if-block, and after the
if-block return True for the no-images case; ensure references to
unique_records, image_util_remove_duplicate_images, and db_bulk_insert_images
are only used within the corrected conditional so no NameError or unreachable
code remains.
In `@backend/tests/util/test_dedup.py`:
- Around line 4-16: The current test only covers identical-path deduplication;
add tests that exercise deduplication by physical file identity using
image_util_remove_duplicate_images: create a hardlink pair (os.link) with
different paths and assert only one record is returned, create a symlink
(Path.symlink_to) paired with its target and assert collapse to one record, and
add a test with a non-existent/unreadable path to ensure such records are
skipped gracefully; reference image_util_remove_duplicate_images in all new
tests and ensure they use tmp_path to create files/links and assert lengths and
preserved valid path content.
---
Outside diff comments:
In `@backend/app/utils/images.py`:
- Line 50: The file reassigns logger from the custom get_logger(__name__) to the
stdlib logging.getLogger(__name__) (the symbol logger is defined twice), which
shadows and discards custom configuration; remove the second assignment so that
the module keeps the original logger returned by get_logger and ensure any
references use the existing logger symbol rather than calling logging.getLogger
again.
---
Nitpick comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: Add explicit type hints to both functions: annotate
image_util_remove_duplicate_images as def
image_util_remove_duplicate_images(records: Sequence[Mapping[str, Any]]) ->
List[Mapping[str, Any]] and annotate image_util_remove_duplicate_paths as def
image_util_remove_duplicate_paths(paths: Sequence[str]) -> List[str]; import the
necessary typing symbols (Sequence, Mapping, Any, List) at the top of the module
and ensure the return and parameter types match how the functions are used
(records being mappings with a "path" key and paths being strings).
📒 Files selected for processing (2): `backend/app/utils/images.py`, `backend/tests/util/test_dedup.py`
```python
     if all_image_records:
  unique_records = image_util_remove_duplicate_images(all_image_records)
    return db_bulk_insert_images(unique_records)

    return True  # No images to process is not an error
```
Critical indentation bug causes NameError and unreachable code.
Lines 100-102 have inconsistent indentation:
- Line 100 has an extra leading space
- Line 101 has only 2-space indent instead of 4
- Line 102's
returnis outside theifblock, so whenall_image_recordsis empty,unique_recordsis undefined →NameError - Line 104 is unreachable
🐛 Proposed fix
```diff
     # Step 5: Bulk insert all new records if any exist
-     if all_image_records:
-  unique_records = image_util_remove_duplicate_images(all_image_records)
-    return db_bulk_insert_images(unique_records)
-
-    return True  # No images to process is not an error
+    if all_image_records:
+        unique_records = image_util_remove_duplicate_images(all_image_records)
+        return db_bulk_insert_images(unique_records)
+
+    return True  # No images to process is not an error
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/utils/images.py` around lines 100 - 104, The current indentation
around the all_image_records handling causes unique_records to be referenced
when undefined and makes the final return unreachable; fix the block in the
function that uses all_image_records so that: check if all_image_records (if
all_image_records:) then compute unique_records =
image_util_remove_duplicate_images(all_image_records) and return
db_bulk_insert_images(unique_records) inside that if-block, and after the
if-block return True for the no-images case; ensure references to
unique_records, image_util_remove_duplicate_images, and db_bulk_insert_images
are only used within the corrected conditional so no NameError or unreachable
code remains.
```python
def test_remove_duplicate_images(tmp_path):
    file1 = tmp_path / "img1.jpg"
    file1.write_text("data")

    # simulate duplicate reference
    records = [
        {"path": str(file1)},
        {"path": str(file1)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1
```
Test doesn't validate the symlink/hardlink dedup scenario.
This test creates two records with identical paths, but the PR objective is to fix duplicate indexing when different paths (symlinks/hardlinks) reference the same physical file. The current test passes even without the `(st_dev, st_ino)` logic, since identical paths would naturally have the same identity.
Consider adding tests for:
- Hardlinks: two different paths, same inode
- Symlinks: symlink path vs. target path
- Unreadable paths: verify graceful handling
💚 Suggested additional tests

```python
import os

import pytest

from app.utils.images import image_util_remove_duplicate_images


def test_remove_duplicate_images(tmp_path):
    file1 = tmp_path / "img1.jpg"
    file1.write_text("data")

    # simulate duplicate reference
    records = [
        {"path": str(file1)},
        {"path": str(file1)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1


def test_dedup_hardlinks(tmp_path):
    """Two different paths pointing to the same physical file via hardlink."""
    original = tmp_path / "original.jpg"
    original.write_text("image data")
    hardlink = tmp_path / "hardlink.jpg"
    os.link(str(original), str(hardlink))

    records = [
        {"path": str(original)},
        {"path": str(hardlink)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1


def test_dedup_symlinks(tmp_path):
    """Symlink and target should collapse to one record."""
    original = tmp_path / "original.jpg"
    original.write_text("image data")
    symlink = tmp_path / "symlink.jpg"
    symlink.symlink_to(original)

    records = [
        {"path": str(original)},
        {"path": str(symlink)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1


def test_dedup_skips_unreadable_paths(tmp_path):
    """Records with non-existent paths should be skipped."""
    valid = tmp_path / "valid.jpg"
    valid.write_text("data")

    records = [
        {"path": str(valid)},
        {"path": "/nonexistent/path.jpg"},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1
    assert result[0]["path"] == str(valid)
```

As per coding guidelines, "Verify that all critical functionality is covered by tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/tests/util/test_dedup.py` around lines 4 - 16, The current test only
covers identical-path deduplication; add tests that exercise deduplication by
physical file identity using image_util_remove_duplicate_images: create a
hardlink pair (os.link) with different paths and assert only one record is
returned, create a symlink (Path.symlink_to) paired with its target and assert
collapse to one record, and add a test with a non-existent/unreadable path to
ensure such records are skipped gracefully; reference
image_util_remove_duplicate_images in all new tests and ensure they use tmp_path
to create files/links and assert lengths and preserved valid path content.
Closing this as an unreviewed issue.
Addressed Issues:
Fixes #1181
Screenshots/Recordings:
Not applicable for this change since it improves backend ingestion logic rather than the user interface.
Behavior difference summary:

Before: the same physical file could be indexed multiple times when it was reachable through several paths (symlinks or hardlinks).
After: each physical file is indexed exactly once, regardless of how many paths reference it.
Additional Notes:
During folder scanning, image paths are collected and converted into records before being inserted using `db_bulk_insert_images`. Previously, the pipeline treated files as unique based only on their path, which allowed the same physical file to be indexed multiple times when symbolic links or hardlinks existed.

This PR introduces a small deduplication step in `image_util_process_folder_images` before the bulk insert stage. The deduplication works by:
- calling `os.stat()` to obtain each file's filesystem identity
- using `(st_dev, st_ino)` pairs to uniquely identify the physical file
- passing only one record per physical file to `db_bulk_insert_images`

This ensures that each physical file is indexed only once.
Benefits of this change:
The change is intentionally minimal and does not require any database schema changes, keeping the existing ingestion pipeline intact.
AI Usage Disclosure:
I have used the following AI models and tools: ChatGPT (used for reviewing the approach and verifying edge cases). All code changes were tested locally before submitting the PR.
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores