
Avoid duplicate image indexing when scanning folders containing symlinks or hardlinks#1224

Closed
Riddhi8077 wants to merge 16 commits into AOSSIE-Org:main from Riddhi8077:fix-remove-duplicate-images

Conversation


@Riddhi8077 Riddhi8077 commented Mar 16, 2026

Addressed Issues:

Fixes #1181

Screenshots/Recordings:

Not applicable for this change since it improves backend ingestion logic rather than the user interface.

Behavior difference summary:

Before

  • When scanning folders, symbolic links or hardlinks pointing to the same physical file were treated as different images.
  • This caused duplicate database entries, repeated thumbnail generation, and redundant AI tagging.

After

  • The ingestion pipeline now detects duplicate physical files and prevents them from being inserted multiple times.
  • Each physical image is indexed only once even if referenced by multiple paths.

Additional Notes:

During folder scanning, image paths are collected and converted into records before being inserted using db_bulk_insert_images.

Previously, the pipeline treated files as unique based only on their path, which allowed the same physical file to be indexed multiple times when symbolic links or hardlinks existed.

This PR introduces a small deduplication step in image_util_process_folder_images before the bulk insert stage.

The deduplication works by:

  • Using os.stat() to obtain filesystem identity
  • Comparing (st_dev, st_ino) pairs to uniquely identify the physical file
  • Filtering duplicate entries before calling db_bulk_insert_images

This ensures that each physical file is indexed only once.
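The dedup step described above can be sketched as follows. This is a hedged illustration, not the PR's exact code: the real helper is named image_util_remove_duplicate_images, and the record shape (a dict with a "path" key) is an assumption here.

```python
import os

def remove_duplicate_images(records):
    """Drop records whose paths resolve to the same physical file.

    Assumes each record is a dict with a "path" key; the real helper
    in backend/app/utils/images.py may use a different record shape.
    """
    seen = set()
    unique = []
    for record in records:
        try:
            stat_result = os.stat(record["path"])  # follows symlinks to the target file
        except OSError:
            continue  # unreadable or missing path: skip instead of crashing the scan
        identity = (stat_result.st_dev, stat_result.st_ino)  # one pair per physical file
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique
```

Because os.stat() follows symlinks and hardlinks share an inode, a target, its hardlink, and its symlink all produce the same (st_dev, st_ino) pair and collapse to one record.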

Benefits of this change:

  • Prevents duplicate records in the database
  • Avoids redundant thumbnail generation
  • Prevents repeated AI tagging for the same file
  • Reduces unnecessary indexing overhead

The change is intentionally minimal and does not require any database schema changes, keeping the existing ingestion pipeline intact.

AI Usage Disclosure:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: ChatGPT (used for reviewing the approach and verifying edge cases). All code changes were tested locally before submitting the PR.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features

    • Duplicate-image detection and safer metadata fallback to improve media import reliability.
  • Bug Fixes

    • More reliable favorite toggling with input validation and direct image lookup.
    • Hardened database connection behavior to enforce PRAGMAs and commit on success/rollback on errors.
  • Tests

    • Added tests for DB connection, metadata, and deduplication; introduced ephemeral test DB fixture for isolation.
  • Chores

    • Minor repository housekeeping and formatting tweaks.


coderabbitai bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Set sqlite3.Row on DB connections and commit in the connection context; added direct image lookup and safer favourite-toggle; introduced filesystem-based duplicate-image and path deduplication and simplified metadata parsing; replaced global test fixture with a per-test temp DB fixture and added connection/metadata/dedup tests; minor .gitignore whitespace change.

Changes

  • Database connection (backend/app/database/connection.py): Context manager now sets row_factory = sqlite3.Row and calls conn.commit() before closing; PRAGMA settings preserved.
  • Image DB & routes (backend/app/database/images.py, backend/app/routes/images.py): Added db_get_image_by_id(image_id); db_toggle_image_favourite_status validates input, checks existence before toggling, uses context-managed connections and proper error handling; the route uses the direct lookup.
  • Image utilities (backend/app/utils/images.py): Added image_util_remove_duplicate_paths and image_util_remove_duplicate_images (dedupe by device+inode, skip unreadable paths); image_util_parse_metadata signature simplified and returns a default mapping for falsy input; processing now deduplicates before insert.
  • Tests & fixtures (backend/tests/conftest.py, backend/tests/database/test_connection.py, backend/tests/database/test_metadata.py, backend/tests/util/test_dedup.py): Replaced the session-scoped autouse fixture with temp_db_path(monkeypatch), which patches DATABASE_PATH; added tests for get_db_connection (PRAGMAs, commit/rollback/closure), metadata table operations (including corrupted JSON), and dedup behavior.
  • Misc (backend/.gitignore): Whitespace-only change: inserted a blank line before the "Distribution / packaging" section.
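The connection behavior summarized above (sqlite3.Row, commit on success, rollback on error) could look roughly like the following sketch. The db_path parameter, module layout, and the specific PRAGMA shown are illustrative assumptions, not the repository's exact code.

```python
import sqlite3
from contextlib import contextmanager

DATABASE_PATH = "app.db"  # placeholder; tests patch this via the temp_db_path fixture

@contextmanager
def get_db_connection(db_path=None):
    conn = sqlite3.connect(db_path or DATABASE_PATH)
    # sqlite3.Row lets downstream code read columns by name: row["col"].
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA foreign_keys = ON")  # stand-in; the repo's PRAGMA set may differ
    try:
        yield conn
        conn.commit()    # commit once on success
    except Exception:
        conn.rollback()  # roll back on any error, then re-raise
        raise
    finally:
        conn.close()
```

Committing in the context manager means callers no longer need their own commit/rollback/close boilerplate around every query.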

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

Python

Suggested reviewers

  • rahulharpal1603

Poem

🐰 I hop through rows and inode trails,
I skip the twins with tiny tails,
I flip the heart and close the lid,
Tests nod gently—no duplicates hid,
A nibble, a commit, then down the rails.

🚥 Pre-merge checks (3 passed, 1 warning)

❌ Failed checks (1 warning)

  • Out of Scope Changes check — ⚠️ Warning. The PR contains several changes beyond direct deduplication: database connection modifications, test infrastructure changes, and refactored database queries that appear tangential to the core deduplication objective. Resolution: review whether the database connection refactoring, row_factory changes, and explicit commits are necessary for the deduplication fix, or should be separated into a distinct PR.

✅ Passed checks (3 passed)

  • Description check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed. The title accurately describes the main change: deduplication of images when scanning folders with symlinks/hardlinks, which is the core focus of the PR.
  • Linked Issues check — ✅ Passed. The PR implements deduplication to prevent duplicate image indexing when scanning folders with symlinks/hardlinks, directly addressing the requirements in issue #1181.
✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/routes/images.py (1)

109-114: ⚠️ Potential issue | 🟠 Major

Guard against None from db_get_image_by_id before using .get().

If the record is deleted/changed between toggle and fetch, image can be None, and image.get(...) will raise an exception.

💡 Suggested hardening
-        image = db_get_image_by_id(image_id)
+        image = db_get_image_by_id(image_id)
+        if image is None:
+            raise HTTPException(
+                status_code=404, detail="Image not found after toggle"
+            )
         return {
             "success": True,
             "image_id": image_id,
             "isFavourite": image.get("isFavourite", False),
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/routes/images.py` around lines 109 - 114, The code calls
db_get_image_by_id(...) and immediately uses image.get(...), which will raise if
db_get_image_by_id returns None; update the handler to check whether image is
None after calling db_get_image_by_id (the local variable image) and handle that
case explicitly (e.g., return a 404/appropriate error response or set
isFavourite to False and success False), rather than calling image.get; ensure
the response uses the guarded value and include the image_id and a clear error
path so the route (images.py) won't crash if the record was deleted between
toggle and fetch.
🧹 Nitpick comments (4)
backend/app/routes/images.py (1)

9-9: Remove duplicate import of db_get_all_images.

This symbol is already imported at Line 3, so this line introduces unnecessary duplication.

As per coding guidelines, "Look for code duplication."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/routes/images.py` at line 9, The file imports db_get_all_images
twice; remove the duplicate import and consolidate imports from
app.database.images so only one import statement remains (keep db_get_all_images
and db_get_image_by_id in the single import). Locate the duplicate reference to
db_get_all_images in the images.py imports and delete the redundant entry so
there is a single import of db_get_all_images alongside db_get_image_by_id.
backend/app/database/images.py (1)

9-21: Remove the repeated get_db_connection import.

from app.database.connection import get_db_connection appears twice in the module.

As per coding guidelines, "Look for code duplication."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/database/images.py` around lines 9 - 21, The module imports
get_db_connection twice; remove the duplicate import statement so only a single
"from app.database.connection import get_db_connection" remains; verify the
remaining import is placed with the other imports (above logger initialization)
and run linting to ensure no unused-import warnings; references to
get_db_connection in the module should continue to work unchanged.
backend/tests/database/test_metadata.py (1)

8-23: Consider reusing the shared DB fixture from conftest.py.

This local fixture duplicates temporary DB setup logic already centralized in temp_db_path, which can drift over time.

As per coding guidelines, "Look for code duplication."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/database/test_metadata.py` around lines 8 - 23, The local
temp_db fixture duplicates setup already provided by the conftest.py
temp_db_path fixture; remove this temp_db fixture and update tests to accept
temp_db_path instead, then replace uses of temp_db with temp_db_path and call
monkeypatch.setattr("app.database.metadata.DATABASE_PATH", str(temp_db_path)) in
tests (or a small per-test setup) so the metadata module uses the shared
temporary DB; ensure any tests that previously relied on temp_db's return value
use the same path or Path object provided by temp_db_path.
backend/app/database/connection.py (1)

11-12: Replace the emphatic comment with a specific reason.

# VERY IMPORTANT does not add actionable context. A short reason (e.g., why sqlite3.Row is required downstream) will age better.

As per coding guidelines, "Point out redundant obvious comments that do not add clarity to the code."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/database/connection.py` around lines 11 - 12, Replace the vague
comment above conn.row_factory = sqlite3.Row with a concise, actionable reason
explaining why sqlite3.Row is required (for example: "Use sqlite3.Row so query
results support dict-style access by column name required by downstream code
that expects row['col_name']"). Update the comment adjacent to conn.row_factory
= sqlite3.Row to reference the specific downstream expectation (e.g., dict-style
access or JSON serialization) and remove "VERY IMPORTANT".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/utils/images.py`:
- Around line 25-45: Add focused pytest cases for
image_util_remove_duplicate_images to validate deduplication by filesystem
identity and handling of unreadable paths: create tests that (1) create two
records pointing to the same physical file via a hardlink (use os.link) and
assert the function returns a single record, (2) create a symlink to the same
file and assert dedup also collapses symlink+target into one record, and (3)
simulate an unreadable/missing path (either by changing file permissions or
forcing os.stat to raise OSError via monkeypatch) and assert that the unreadable
record is skipped. Use pytest tmp_path fixtures to create files and reference
image_util_remove_duplicate_images to locate the implementation.
- Around line 25-45: The dedup helper image_util_remove_duplicate_images is
implemented but never applied in the ingestion flow; update
image_util_process_folder_images to call image_util_remove_duplicate_images on
the list of records returned or built before the bulk insert step so that only
unique_records are passed to whatever bulk insert function (e.g., the existing
bulk_insert or similar call) — locate where image records are collected in
image_util_process_folder_images, run them through
image_util_remove_duplicate_images, and use that filtered list for
indexing/insertion.
- Around line 516-528: image_util_parse_metadata currently returns raw DB
strings unchanged, causing callers that expect a dict to break; update
image_util_parse_metadata to detect when metadata is a JSON string and
json.loads it (import json), catch and log JSONDecodeError, and then normalize
the resulting object to a dict with the expected keys
("name","date_created","width","height","file_location","file_size","item_type")
filling any missing keys with the same default values currently used; keep the
original default-return behavior for falsy metadata and ensure the function
always returns a dict-like object for downstream consumers.

---

Outside diff comments:
In `@backend/app/routes/images.py`:
- Around line 109-114: The code calls db_get_image_by_id(...) and immediately
uses image.get(...), which will raise if db_get_image_by_id returns None; update
the handler to check whether image is None after calling db_get_image_by_id (the
local variable image) and handle that case explicitly (e.g., return a
404/appropriate error response or set isFavourite to False and success False),
rather than calling image.get; ensure the response uses the guarded value and
include the image_id and a clear error path so the route (images.py) won't crash
if the record was deleted between toggle and fetch.

---

Nitpick comments:
In `@backend/app/database/connection.py`:
- Around line 11-12: Replace the vague comment above conn.row_factory =
sqlite3.Row with a concise, actionable reason explaining why sqlite3.Row is
required (for example: "Use sqlite3.Row so query results support dict-style
access by column name required by downstream code that expects
row['col_name']"). Update the comment adjacent to conn.row_factory = sqlite3.Row
to reference the specific downstream expectation (e.g., dict-style access or
JSON serialization) and remove "VERY IMPORTANT".

In `@backend/app/database/images.py`:
- Around line 9-21: The module imports get_db_connection twice; remove the
duplicate import statement so only a single "from app.database.connection import
get_db_connection" remains; verify the remaining import is placed with the other
imports (above logger initialization) and run linting to ensure no unused-import
warnings; references to get_db_connection in the module should continue to work
unchanged.

In `@backend/app/routes/images.py`:
- Line 9: The file imports db_get_all_images twice; remove the duplicate import
and consolidate imports from app.database.images so only one import statement
remains (keep db_get_all_images and db_get_image_by_id in the single import).
Locate the duplicate reference to db_get_all_images in the images.py imports and
delete the redundant entry so there is a single import of db_get_all_images
alongside db_get_image_by_id.

In `@backend/tests/database/test_metadata.py`:
- Around line 8-23: The local temp_db fixture duplicates setup already provided
by the conftest.py temp_db_path fixture; remove this temp_db fixture and update
tests to accept temp_db_path instead, then replace uses of temp_db with
temp_db_path and call monkeypatch.setattr("app.database.metadata.DATABASE_PATH",
str(temp_db_path)) in tests (or a small per-test setup) so the metadata module
uses the shared temporary DB; ensure any tests that previously relied on
temp_db's return value use the same path or Path object provided by
temp_db_path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ec899a0-2f61-4ce8-a9bf-3a686c26f118

📥 Commits

Reviewing files that changed from the base of the PR and between 8274ebd and 1a368cb.

📒 Files selected for processing (8)
  • backend/.gitignore
  • backend/app/database/connection.py
  • backend/app/database/images.py
  • backend/app/routes/images.py
  • backend/app/utils/images.py
  • backend/tests/conftest.py
  • backend/tests/database/test_connection.py
  • backend/tests/database/test_metadata.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/app/utils/images.py (2)

26-45: ⚠️ Potential issue | 🔴 Critical

This dedup helper never affects indexing.

image_util_process_folder_images() still bulk-inserts the raw all_image_records list on Line 100, so symlink/hardlink aliases still create duplicate rows and duplicate thumbnail work. Please run the filesystem-identity check in the ingest path before image_util_prepare_image_records(), otherwise #1181 remains unresolved.
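The wiring the review asks for can be sketched with stand-in callables. The names below are placeholders for image_util_remove_duplicate_images, image_util_prepare_image_records, and db_bulk_insert_images; the real ingest function's signature differs.

```python
def process_folder_images(image_paths, dedupe, prepare, bulk_insert):
    """Sketch: dedupe by filesystem identity BEFORE records (and thumbnails)
    are prepared, so symlink/hardlink aliases never reach the bulk insert."""
    unique_paths = dedupe(image_paths)
    if not unique_paths:
        return True  # an empty scan is a successful no-op, not a failure
    records = prepare(unique_paths)  # thumbnails are generated only for unique files
    return bulk_insert(records)
```

Deduplicating before the prepare step also avoids the orphaned-thumbnail problem flagged later in this review: thumbnails for discarded aliases are never created in the first place.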

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 26 - 45, The deduplication by
filesystem identity in image_util_remove_duplicate_images is never applied to
indexing because image_util_process_folder_images still bulk-inserts the raw
all_image_records; modify image_util_process_folder_images to run
image_util_remove_duplicate_images on all_image_records immediately after
collecting them (and before calling image_util_prepare_image_records and before
the bulk insert) so that all_image_records is replaced with the deduplicated
list; ensure subsequent calls (image_util_prepare_image_records, thumbnail
generation, and the DB bulk-insert) use the deduplicated all_image_records to
avoid duplicate rows and duplicate thumbnail work.

554-566: ⚠️ Potential issue | 🔴 Critical

image_util_parse_metadata must still deserialize DB JSON.

metadata is stored as JSON TEXT. Callers at backend/app/database/images.py Line 213 and Line 579, plus backend/app/routes/images.py Line 67, expect a dict; returning the raw string here will break response-model construction and downstream metadata access. Please keep dicts as-is and json.loads string values with a default fallback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 554 - 566,
image_util_parse_metadata currently returns raw metadata which may be a JSON
string; update it so if metadata is already a dict return it unchanged, and if
it's a str attempt json.loads(metadata) inside a try/except and return the
parsed dict (on exception or falsy input return the existing default dict with
keys name,date_created,width,height,file_location,file_size,item_type); ensure
numeric fields remain present (width,height,file_size) and types handled
conservatively so callers expecting a dict (e.g., the image metadata consumers)
receive a dict fallback rather than a raw string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/database/images.py`:
- Around line 485-507: Remove the stray token "fix-remove-duplicate-images" from
the body of db_get_image_by_id and delete the orphaned/ unreachable
exception/finally blocks that follow the early return; either wrap the DB
operations in a proper try/except/finally around get_db_connection() to call
conn.commit() and close/rollback as needed, or keep the simple with
get_db_connection() context usage and just return dict(row) after fetchone() so
there are no unmatched except/finally blocks left in the module.

---

Duplicate comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: The deduplication by filesystem identity in
image_util_remove_duplicate_images is never applied to indexing because
image_util_process_folder_images still bulk-inserts the raw all_image_records;
modify image_util_process_folder_images to run
image_util_remove_duplicate_images on all_image_records immediately after
collecting them (and before calling image_util_prepare_image_records and before
the bulk insert) so that all_image_records is replaced with the deduplicated
list; ensure subsequent calls (image_util_prepare_image_records, thumbnail
generation, and the DB bulk-insert) use the deduplicated all_image_records to
avoid duplicate rows and duplicate thumbnail work.
- Around line 554-566: image_util_parse_metadata currently returns raw metadata
which may be a JSON string; update it so if metadata is already a dict return it
unchanged, and if it's a str attempt json.loads(metadata) inside a try/except
and return the parsed dict (on exception or falsy input return the existing
default dict with keys
name,date_created,width,height,file_location,file_size,item_type); ensure
numeric fields remain present (width,height,file_size) and types handled
conservatively so callers expecting a dict (e.g., the image metadata consumers)
receive a dict fallback rather than a raw string.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d440ed75-489b-4a51-9535-0ec0ef26048e

📥 Commits

Reviewing files that changed from the base of the PR and between 1a368cb and 445556b.

📒 Files selected for processing (2)
  • backend/app/database/images.py
  • backend/app/utils/images.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
backend/app/utils/images.py (2)

554-566: ⚠️ Potential issue | 🔴 Critical

Parse the stored JSON before returning metadata.

Line 196 stores metadata as a JSON string. Returning DB values unchanged here hands callers a raw str instead of a dict, which breaks downstream field access and response serialization. Please restore JSON parsing and normalize the result to the expected shape.

🛠️ Minimal fix
 def image_util_parse_metadata(metadata):
+    default = {
+        "name": "unknown",
+        "date_created": "unknown",
+        "width": 0,
+        "height": 0,
+        "file_location": "unknown",
+        "file_size": 0,
+        "item_type": "image",
+    }
     if not metadata:
-        return {
-            "name": "unknown",
-            "date_created": "unknown",
-            "width": 0,
-            "height": 0,
-            "file_location": "unknown",
-            "file_size": 0,
-            "item_type": "image"
-        }
-
-    return metadata
+        return default
+    if isinstance(metadata, dict):
+        return {**default, **metadata}
+    if isinstance(metadata, str):
+        try:
+            parsed = json.loads(metadata)
+        except json.JSONDecodeError:
+            logger.warning("Invalid image metadata JSON")
+            return default
+        if isinstance(parsed, dict):
+            return {**default, **parsed}
+    return default
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 554 - 566, The function
image_util_parse_metadata currently returns the DB value unchanged, which can be
a JSON string; update image_util_parse_metadata to detect if metadata is a
string and parse it with json.loads, then normalize the resulting dict to always
include keys name, date_created, width, height, file_location, file_size, and
item_type (defaulting to "unknown" or 0 as in the current fallback), and return
that dict; ensure you import json if needed and keep the existing early-return
when metadata is falsy.

26-45: ⚠️ Potential issue | 🟠 Major

Still missing focused regression coverage for the dedup path.

This is the core bugfix, but the provided changes still do not exercise hardlink collapse, symlink collapse, or the OSError/empty-scan paths. Please add automated tests around image_util_remove_duplicate_images() and image_util_process_folder_images() so this behavior does not regress again.

As per coding guidelines, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."

Also applies to: 99-101
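The regression tests requested above could take roughly this shape. The dedupe_by_inode helper below is a local stand-in for image_util_remove_duplicate_images (which operates on records, not bare paths), so the assertions here illustrate the intended coverage rather than the project's actual test code.

```python
import os

def dedupe_by_inode(paths):
    """Stand-in for the real helper: dedupe bare paths by (st_dev, st_ino)."""
    seen, unique = set(), []
    for p in paths:
        try:
            st = os.stat(p)
        except OSError:
            continue  # unreadable path is skipped, matching the helper's contract
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

def check_hardlink_and_symlink_collapse(tmp):
    target = os.path.join(tmp, "img.jpg")
    open(target, "w").close()
    hard = os.path.join(tmp, "hard.jpg")
    sym = os.path.join(tmp, "sym.jpg")
    os.link(target, hard)    # hardlink: same device and inode
    os.symlink(target, sym)  # symlink: os.stat follows it to the target
    assert dedupe_by_inode([target, hard, sym]) == [target]

def check_unreadable_path_is_skipped(tmp):
    real = os.path.join(tmp, "real.jpg")
    open(real, "w").close()
    missing = os.path.join(tmp, "missing.jpg")  # os.stat raises OSError
    assert dedupe_by_inode([missing, real]) == [real]
```

In the project's test suite these would become pytest tests using the tmp_path fixture, plus an empty-scan case for image_util_process_folder_images.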

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 26 - 45, Add automated pytest unit
tests for image_util_remove_duplicate_images and
image_util_process_folder_images to cover hardlink collapse, symlink collapse,
OSError handling, and empty-scan behavior: create temporary files in a tmp_path
fixture, make a hardlink to the same inode and a symlink to the file and assert
duplicate entries are collapsed by image_util_remove_duplicate_images; exercise
image_util_process_folder_images on an empty directory to assert it returns an
empty result and on a directory with hardlinked/symlinked files to assert
deduplication; also add a test that simulates os.stat raising OSError for a path
(either by passing a non-existent path or monkeypatching os.stat to raise) and
assert the function skips that record without crashing. Ensure tests import
image_util_remove_duplicate_images and image_util_process_folder_images and
follow existing test patterns and cleanup rules.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/utils/images.py`:
- Around line 99-101: The current flow generates thumbnails in
image_util_prepare_image_records then calls image_util_remove_duplicate_images
before db_bulk_insert_images, causing orphaned thumbnail files; fix by
deduplicating earlier or cleaning up discarded thumbnails: either run
image_util_remove_duplicate_images on the raw input file paths before calling
image_util_prepare_image_records so thumbnails are only created for unique
images, or (if you must keep current order) collect the list of discarded image
records after image_util_remove_duplicate_images and call
image_util_remove_obsolete_images (or an explicit thumbnail-deletion helper) to
remove their thumbnail_{uuid}.jpg files before returning to
db_bulk_insert_images; update the code around image_util_prepare_image_records,
image_util_remove_duplicate_images, and image_util_remove_obsolete_images
accordingly.
- Around line 99-101: The function currently only sets unique_records and calls
db_bulk_insert_images when all_image_records is truthy, but then returns its
result outside the branch causing undefined behavior when all_image_records is
empty; modify the logic so the call to db_bulk_insert_images(unique_records)
remains inside the if all_image_records branch (using unique_records =
image_util_remove_duplicate_images(all_image_records) then return
db_bulk_insert_images(unique_records)), and add an explicit return True
immediately after the branch to ensure empty scans are treated as a successful
no-op; reference symbols: all_image_records, unique_records,
image_util_remove_duplicate_images, db_bulk_insert_images.

---

Duplicate comments:
In `@backend/app/utils/images.py`:
- Around line 554-566: The function image_util_parse_metadata currently returns
the DB value unchanged, which can be a JSON string; update
image_util_parse_metadata to detect if metadata is a string and parse it with
json.loads, then normalize the resulting dict to always include keys name,
date_created, width, height, file_location, file_size, and item_type (defaulting
to "unknown" or 0 as in the current fallback), and return that dict; ensure you
import json if needed and keep the existing early-return when metadata is falsy.
- Around line 26-45: Add automated pytest unit tests for
image_util_remove_duplicate_images and image_util_process_folder_images to cover
hardlink collapse, symlink collapse, OSError handling, and empty-scan behavior:
create temporary files in a tmp_path fixture, make a hardlink to the same inode
and a symlink to the file and assert duplicate entries are collapsed by
image_util_remove_duplicate_images; exercise image_util_process_folder_images on
an empty directory to assert it returns an empty result and on a directory with
hardlinked/symlinked files to assert deduplication; also add a test that
simulates os.stat raising OSError for a path (either by passing a non-existent
path or monkeypatching os.stat to raise) and assert the function skips that
record without crashing. Ensure tests import image_util_remove_duplicate_images
and image_util_process_folder_images and follow existing test patterns and
cleanup rules.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 76c5cafc-3bef-41c2-a58a-8b0a0c06fcd0

📥 Commits

Reviewing files that changed from the base of the PR and between 445556b and 09bb87c.

📒 Files selected for processing (1)
  • backend/app/utils/images.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
backend/app/database/images.py (1)

485-513: ⚠️ Potential issue | 🔴 Critical

Critical: db_get_image_by_id has orphaned code that causes a SyntaxError.

The function returns at line 505 with return dict(row), making lines 506–513 unreachable. Additionally, the except and finally blocks have no matching try, causing a SyntaxError that prevents the module from importing.

🐛 Proposed fix
 def db_get_image_by_id(image_id: str) -> dict | None:
     """
     Retrieve a single image by its ID.
     """
     with get_db_connection() as conn:
         cursor = conn.cursor()

         cursor.execute(
             """
             SELECT * FROM images
             WHERE id = ?
             """,
             (image_id,),
         )

         row = cursor.fetchone()

         if row is None:
             return None

         return dict(row)
-        conn.commit()
-        return cursor.rowcount > 0
-    except Exception as e:
-        logger.error(f"Database error: {e}")
-        conn.rollback()
-        return False
-    finally:
-        conn.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/database/images.py` around lines 485 - 513, The function
db_get_image_by_id contains unreachable lines after an early return and orphaned
except/finally blocks (no try) causing a SyntaxError; fix it by wrapping the DB
operations in a try block (use with get_db_connection() as conn: then try:
cursor = conn.cursor(); execute and fetch), remove the unreachable
commits/returns after return dict(row), place the existing except Exception as
e: logger.error(...); conn.rollback(); return None (or False per project
convention) and the finally block to close the connection (or rely on the
context manager), and ensure the function returns the dict(row) when found and
None when not found without dangling code; reference db_get_image_by_id,
get_db_connection, cursor, and logger when making changes.
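The corrected control flow can be shown with a self-contained sketch. It passes a plain sqlite3 connection instead of the project's get_db_connection context manager (an assumed simplification) and uses a toy two-column schema; the point is the single clean exit with no code after the return and no orphaned except/finally.

```python
import sqlite3
from typing import Any, Dict, Optional


def db_get_image_by_id(conn: sqlite3.Connection, image_id: str) -> Optional[Dict[str, Any]]:
    """Return the matching image row as a dict, or None when no row matches."""
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM images WHERE id = ?", (image_id,))
    row = cursor.fetchone()
    # Read-only query: nothing to commit, nothing to roll back, no dangling code.
    return dict(row) if row is not None else None
```

With `conn.row_factory = sqlite3.Row`, each fetched row converts cleanly via `dict(row)`, which is what the function's callers appear to expect.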
backend/app/utils/images.py (1)

100-104: ⚠️ Potential issue | 🔴 Critical

Critical: Indentation error causes NameError and unreachable code.

The indentation is inconsistent: unique_records is defined only inside the if all_image_records: block (lines 100-101), but return db_bulk_insert_images(unique_records) at line 102 appears to be at the outer indentation level. When all_image_records is empty:

  1. unique_records is never defined
  2. Line 102 raises NameError: name 'unique_records' is not defined
  3. Line 104 return True becomes unreachable
🐛 Proposed fix
         # Step 5: Bulk insert all new records if any exist
-         if all_image_records:
-          unique_records = image_util_remove_duplicate_images(all_image_records)
-        return db_bulk_insert_images(unique_records)
-
-        return True  # No images to process is not an error
+        if all_image_records:
+            unique_records = image_util_remove_duplicate_images(all_image_records)
+            return db_bulk_insert_images(unique_records)
+
+        return True  # No images to process is not an error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 100 - 104, The current block
mis-indents handling of all_image_records so unique_records can be undefined;
fix by moving the return that calls db_bulk_insert_images(unique_records) inside
the same if all_image_records: block (after computing unique_records =
image_util_remove_duplicate_images(all_image_records)) and ensure the function
returns True only when there are no images to process; reference the variables
and helpers all_image_records, unique_records,
image_util_remove_duplicate_images(), and db_bulk_insert_images() to locate and
correct the indentation so the db call is only made when unique_records is
defined.
🧹 Nitpick comments (3)
backend/app/utils/images.py (3)

26-45: Add type hints for clarity and static analysis.

The function lacks type annotations for its parameter and return value. As per coding guidelines for Python files, ensure proper use of type hints.

♻️ Proposed fix
-def image_util_remove_duplicate_images(records):
+def image_util_remove_duplicate_images(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
     """Remove duplicate physical image files using file system identity."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 26 - 45, The function
image_util_remove_duplicate_images lacks type annotations; update its signature
to include precise typing (e.g., records: Sequence[Dict[str, Any]]) ->
List[Dict[str, Any]] and add the necessary typing imports (List, Dict, Any,
Sequence) at the top of the module so static analysis and editors can validate
that each record is a dict containing a "path" key; keep the implementation
logic unchanged.

585-601: Add type hints; approach is sound.

The early path deduplication correctly prevents redundant thumbnail generation (addressing the concern about thumbnails being created before dedup). Add type hints per coding guidelines.

♻️ Proposed fix
-def image_util_remove_duplicate_paths(image_paths):
+def image_util_remove_duplicate_paths(image_paths: List[str]) -> List[str]:
     """Remove duplicate physical image files before processing."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 585 - 601, The function
image_util_remove_duplicate_paths currently lacks type annotations; update its
signature to include input and return type hints (e.g., image_paths:
Iterable[str] or Sequence[str] and return List[str]) and annotate any local
collections if you follow local var typing (seen: Set[Tuple[int,int]],
unique_paths: List[str]) to satisfy project typing guidelines, ensuring you also
import the required typing symbols (Iterable, List, Set, Tuple) where necessary.

558-583: Add type hints to match docstring intent.

The function lacks type annotations. Based on usage patterns in backend/app/database/images.py (see context snippets 2-4), callers expect a Mapping[str, Any] return.

♻️ Proposed fix
-def image_util_parse_metadata(metadata):
+def image_util_parse_metadata(metadata: Any) -> Dict[str, Any]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 558 - 583, The function
image_util_parse_metadata lacks type annotations; update its signature to accept
metadata: Optional[Union[str, Mapping[str, Any]]] and return Mapping[str, Any]
(or Dict[str, Any]) and import required typing names (Mapping, Any, Optional,
Union, Dict) at the top; keep the same logic but ensure the default return is
typed as a Dict[str, Any] and the JSON branch returns a Mapping[str, Any] so
callers in backend/app/database/images.py get the expected Mapping[str, Any]
type from image_util_parse_metadata.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/database/images.py`:
- Line 10: There is a duplicate import of get_db_connection in images.py; remove
the redundant import statement so only a single "from app.database.connection
import get_db_connection" remains (keep one and delete the other) to satisfy
PEP8 and avoid unnecessary duplication; locate the two occurrences near the top
of images.py and remove one, ensuring any references to get_db_connection in
functions within this module continue to work.

In `@backend/app/utils/images.py`:
- Line 556: Remove the duplicate module import by deleting the redundant "import
json" statement found in the file (the one shown in the diff) so only the single
existing "import json" at the top of backend.app.utils.images remains; search
for the "import json" occurrences in utils/images.py and remove the later
duplicate to satisfy PEP8.

---

Duplicate comments:
In `@backend/app/database/images.py`:
- Around line 485-513: The function db_get_image_by_id contains unreachable
lines after an early return and orphaned except/finally blocks (no try) causing
a SyntaxError; fix it by wrapping the DB operations in a try block (use with
get_db_connection() as conn: then try: cursor = conn.cursor(); execute and
fetch), remove the unreachable commits/returns after return dict(row), place the
existing except Exception as e: logger.error(...); conn.rollback(); return None
(or False per project convention) and the finally block to close the connection
(or rely on the context manager), and ensure the function returns the dict(row)
when found and None when not found without dangling code; reference
db_get_image_by_id, get_db_connection, cursor, and logger when making changes.

In `@backend/app/utils/images.py`:
- Around line 100-104: The current block mis-indents handling of
all_image_records so unique_records can be undefined; fix by moving the return
that calls db_bulk_insert_images(unique_records) inside the same if
all_image_records: block (after computing unique_records =
image_util_remove_duplicate_images(all_image_records)) and ensure the function
returns True only when there are no images to process; reference the variables
and helpers all_image_records, unique_records,
image_util_remove_duplicate_images(), and db_bulk_insert_images() to locate and
correct the indentation so the db call is only made when unique_records is
defined.

---

Nitpick comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: The function image_util_remove_duplicate_images lacks type
annotations; update its signature to include precise typing (e.g., records:
Sequence[Dict[str, Any]]) -> List[Dict[str, Any]] and add the necessary typing
imports (List, Dict, Any, Sequence) at the top of the module so static analysis
and editors can validate that each record is a dict containing a "path" key;
keep the implementation logic unchanged.
- Around line 585-601: The function image_util_remove_duplicate_paths currently
lacks type annotations; update its signature to include input and return type
hints (e.g., image_paths: Iterable[str] or Sequence[str] and return List[str])
and annotate any local collections if you follow local var typing (seen:
Set[Tuple[int,int]], unique_paths: List[str]) to satisfy project typing
guidelines, ensuring you also import the required typing symbols (Iterable,
List, Set, Tuple) where necessary.
- Around line 558-583: The function image_util_parse_metadata lacks type
annotations; update its signature to accept metadata: Optional[Union[str,
Mapping[str, Any]]] and return Mapping[str, Any] (or Dict[str, Any]) and import
required typing names (Mapping, Any, Optional, Union, Dict) at the top; keep the
same logic but ensure the default return is typed as a Dict[str, Any] and the
JSON branch returns a Mapping[str, Any] so callers in
backend/app/database/images.py get the expected Mapping[str, Any] type from
image_util_parse_metadata.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b8905fdb-944d-426d-b23f-45633fa261a0

📥 Commits

Reviewing files that changed from the base of the PR and between 09bb87c and ae274c4.

📒 Files selected for processing (2)
  • backend/app/database/images.py
  • backend/app/utils/images.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/utils/images.py (1)

50-50: ⚠️ Potential issue | 🟡 Minor

Duplicate logger shadows the custom logger.

Line 24 already defines logger = get_logger(__name__). This reassignment at line 50 shadows it with the standard library logger, potentially losing custom logging configuration.

🧹 Proposed fix: Remove duplicate
-logger = logging.getLogger(__name__)
-
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` at line 50, The file reassigns logger from the
custom get_logger(__name__) to the stdlib logging.getLogger(__name__) (the
symbol logger is defined twice), which shadows and discards custom
configuration; remove the second assignment so that the module keeps the
original logger returned by get_logger and ensure any references use the
existing logger symbol rather than calling logging.getLogger again.
🧹 Nitpick comments (1)
backend/app/utils/images.py (1)

26-45: Add type hints for better maintainability.

Both image_util_remove_duplicate_images and image_util_remove_duplicate_paths lack type annotations. Consider adding them for consistency with the rest of the codebase.

♻️ Suggested type hints
-def image_util_remove_duplicate_images(records):
+def image_util_remove_duplicate_images(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
     """Remove duplicate physical image files using file system identity."""

And for image_util_remove_duplicate_paths:

-def image_util_remove_duplicate_paths(image_paths):
+def image_util_remove_duplicate_paths(image_paths: List[str]) -> List[str]:
     """Remove duplicate physical image files before processing."""

As per coding guidelines, "Ensure proper use of type hints."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 26 - 45, Add explicit type hints to
both functions: annotate image_util_remove_duplicate_images as def
image_util_remove_duplicate_images(records: Sequence[Mapping[str, Any]]) ->
List[Mapping[str, Any]] and annotate image_util_remove_duplicate_paths as def
image_util_remove_duplicate_paths(paths: Sequence[str]) -> List[str]; import the
necessary typing symbols (Sequence, Mapping, Any, List) at the top of the module
and ensure the return and parameter types match how the functions are used
(records being mappings with a "path" key and paths being strings).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/utils/images.py`:
- Around line 100-104: The current indentation around the all_image_records
handling causes unique_records to be referenced when undefined and makes the
final return unreachable; fix the block in the function that uses
all_image_records so that: check if all_image_records (if all_image_records:)
then compute unique_records =
image_util_remove_duplicate_images(all_image_records) and return
db_bulk_insert_images(unique_records) inside that if-block, and after the
if-block return True for the no-images case; ensure references to
unique_records, image_util_remove_duplicate_images, and db_bulk_insert_images
are only used within the corrected conditional so no NameError or unreachable
code remains.

In `@backend/tests/util/test_dedup.py`:
- Around line 4-16: The current test only covers identical-path deduplication;
add tests that exercise deduplication by physical file identity using
image_util_remove_duplicate_images: create a hardlink pair (os.link) with
different paths and assert only one record is returned, create a symlink
(Path.symlink_to) paired with its target and assert collapse to one record, and
add a test with a non-existent/unreadable path to ensure such records are
skipped gracefully; reference image_util_remove_duplicate_images in all new
tests and ensure they use tmp_path to create files/links and assert lengths and
preserved valid path content.

---

Outside diff comments:
In `@backend/app/utils/images.py`:
- Line 50: The file reassigns logger from the custom get_logger(__name__) to the
stdlib logging.getLogger(__name__) (the symbol logger is defined twice), which
shadows and discards custom configuration; remove the second assignment so that
the module keeps the original logger returned by get_logger and ensure any
references use the existing logger symbol rather than calling logging.getLogger
again.

---

Nitpick comments:
In `@backend/app/utils/images.py`:
- Around line 26-45: Add explicit type hints to both functions: annotate
image_util_remove_duplicate_images as def
image_util_remove_duplicate_images(records: Sequence[Mapping[str, Any]]) ->
List[Mapping[str, Any]] and annotate image_util_remove_duplicate_paths as def
image_util_remove_duplicate_paths(paths: Sequence[str]) -> List[str]; import the
necessary typing symbols (Sequence, Mapping, Any, List) at the top of the module
and ensure the return and parameter types match how the functions are used
(records being mappings with a "path" key and paths being strings).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 23db8a34-419a-433f-bbf2-c7e6c2aeb351

📥 Commits

Reviewing files that changed from the base of the PR and between ae274c4 and 5ee64de.

📒 Files selected for processing (2)
  • backend/app/utils/images.py
  • backend/tests/util/test_dedup.py

Comment on lines +100 to 104
         if all_image_records:
          unique_records = image_util_remove_duplicate_images(all_image_records)
        return db_bulk_insert_images(unique_records)

        return True  # No images to process is not an error

⚠️ Potential issue | 🔴 Critical

Critical indentation bug causes NameError and unreachable code.

Lines 100-102 have inconsistent indentation:

  • Line 100 has an extra leading space
  • Line 101 has only 2-space indent instead of 4
  • Line 102's return is outside the if block, so when all_image_records is empty, unique_records is undefined → NameError
  • Line 104 is unreachable
🐛 Proposed fix
         # Step 5: Bulk insert all new records if any exist
-         if all_image_records:
-          unique_records = image_util_remove_duplicate_images(all_image_records)
-        return db_bulk_insert_images(unique_records)
-
-        return True  # No images to process is not an error
+        if all_image_records:
+            unique_records = image_util_remove_duplicate_images(all_image_records)
+            return db_bulk_insert_images(unique_records)
+
+        return True  # No images to process is not an error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils/images.py` around lines 100 - 104, The current indentation
around the all_image_records handling causes unique_records to be referenced
when undefined and makes the final return unreachable; fix the block in the
function that uses all_image_records so that: check if all_image_records (if
all_image_records:) then compute unique_records =
image_util_remove_duplicate_images(all_image_records) and return
db_bulk_insert_images(unique_records) inside that if-block, and after the
if-block return True for the no-images case; ensure references to
unique_records, image_util_remove_duplicate_images, and db_bulk_insert_images
are only used within the corrected conditional so no NameError or unreachable
code remains.

Comment on lines +4 to +16
def test_remove_duplicate_images(tmp_path):
    file1 = tmp_path / "img1.jpg"
    file1.write_text("data")

    # simulate duplicate reference
    records = [
        {"path": str(file1)},
        {"path": str(file1)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1

⚠️ Potential issue | 🟠 Major

Test doesn't validate the symlink/hardlink dedup scenario.

This test creates two records with identical paths, but the PR objective is to fix duplicate indexing when different paths (symlinks/hardlinks) reference the same physical file. The current test passes even without the (st_dev, st_ino) logic since identical paths would naturally have the same identity.

Consider adding tests for:

  1. Hardlinks: Two different paths, same inode
  2. Symlinks: Symlink path vs target path
  3. Unreadable paths: Verify graceful handling
💚 Suggested additional tests
import os
import pytest
from app.utils.images import image_util_remove_duplicate_images

def test_remove_duplicate_images(tmp_path):
    file1 = tmp_path / "img1.jpg"
    file1.write_text("data")

    # simulate duplicate reference
    records = [
        {"path": str(file1)},
        {"path": str(file1)},
    ]

    result = image_util_remove_duplicate_images(records)

    assert len(result) == 1


def test_dedup_hardlinks(tmp_path):
    """Two different paths pointing to the same physical file via hardlink."""
    original = tmp_path / "original.jpg"
    original.write_text("image data")
    
    hardlink = tmp_path / "hardlink.jpg"
    os.link(str(original), str(hardlink))
    
    records = [
        {"path": str(original)},
        {"path": str(hardlink)},
    ]
    
    result = image_util_remove_duplicate_images(records)
    assert len(result) == 1


def test_dedup_symlinks(tmp_path):
    """Symlink and target should collapse to one record."""
    original = tmp_path / "original.jpg"
    original.write_text("image data")
    
    symlink = tmp_path / "symlink.jpg"
    symlink.symlink_to(original)
    
    records = [
        {"path": str(original)},
        {"path": str(symlink)},
    ]
    
    result = image_util_remove_duplicate_images(records)
    assert len(result) == 1


def test_dedup_skips_unreadable_paths(tmp_path):
    """Records with non-existent paths should be skipped."""
    valid = tmp_path / "valid.jpg"
    valid.write_text("data")
    
    records = [
        {"path": str(valid)},
        {"path": "/nonexistent/path.jpg"},
    ]
    
    result = image_util_remove_duplicate_images(records)
    assert len(result) == 1
    assert result[0]["path"] == str(valid)

As per coding guidelines, "Verify that all critical functionality is covered by tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/tests/util/test_dedup.py` around lines 4 - 16, The current test only
covers identical-path deduplication; add tests that exercise deduplication by
physical file identity using image_util_remove_duplicate_images: create a
hardlink pair (os.link) with different paths and assert only one record is
returned, create a symlink (Path.symlink_to) paired with its target and assert
collapse to one record, and add a test with a non-existent/unreadable path to
ensure such records are skipped gracefully; reference
image_util_remove_duplicate_images in all new tests and ensure they use tmp_path
to create files/links and assert lengths and preserved valid path content.

@SIDDHANTCOOKIE

Closing this as unreviewed issue

Development

Successfully merging this pull request may close these issues.

BUG: Same image gets indexed multiple times when directory contains symlinks/hardlinks

2 participants