[Security] Fix directory traversal vulnerability in file upload handling (#502) by SAI-HARISH2007 · Pull Request #660 · AOSSIE-Org/EduAid

SAI-HARISH2007 · 2026-04-10T18:05:47Z

Addressed Issues:

Fixes #502

Screenshots/Recordings:

Not applicable (backend security fix)

Additional Notes:

Summary

This PR fixes a directory traversal vulnerability in the file upload pipeline.

This vulnerability was identified during manual codebase review and reported earlier as Issue #502.

The issue was caused by directly using file.filename when constructing file paths, which allowed malicious inputs such as ../../etc/passwd to escape the intended upload directory.

Changes

Sanitized filenames using secure_filename
Added absolute path validation using os.path.commonpath
Ensured uploaded files remain within the configured upload directory
Added try/finally block to guarantee file cleanup even on failure
Improved file type handling with explicit validation

Security Impact

This prevents:

Arbitrary file write outside upload directory
Potential overwriting of sensitive system/application files

AI Usage Disclosure:

This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I take full responsibility for the changes.

I primarily identified the issue, analyzed the root cause, and designed the solution independently through manual codebase exploration. AI tools (ChatGPT, Claude) were used minimally to refine the implementation and review the final approach.

Checklist

My PR addresses a single issue, fixes a single bug or makes a single improvement.
My code follows the project's code style and conventions
If applicable, I have made corresponding changes or additions to the documentation
If applicable, I have made corresponding changes or additions to tests
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contribution Guidelines
Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

Bug Fixes
- Enhanced file upload security by validating filenames and preventing unauthorized directory access.
- Improved file type validation with explicit error messages for unsupported formats.
- Strengthened error handling to ensure temporary files are properly cleaned up during processing.

)

coderabbitai · 2026-04-10T18:06:02Z

📝 Walkthrough

Walkthrough

The FileProcessor.process_file method was updated to address a directory traversal security vulnerability. Filename sanitization via secure_filename and path validation using os.path.commonpath were implemented to restrict file writes to the intended upload directory. File extension validation was refactored, and cleanup logic was restructured with try...finally to ensure temporary files are always removed.

Changes

Cohort / File(s)	Summary
File Upload Security Hardening `backend/Generator/main.py`	Implemented filename sanitization with `secure_filename`, added directory traversal protection via `os.path.commonpath` validation, refactored extension checking, and introduced `try...finally` block for guaranteed cleanup of temporary files.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit hops through upload gates,
With secure_filename that validates!
No traversal tricks shall pass today,
Path boundaries keep hackers at bay,
Try, finally—cleanup's here to stay! 🔒

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main security fix—preventing directory traversal in file upload handling—which aligns with the primary change in the changeset.
Linked Issues check	✅ Passed	The pull request successfully addresses the security objectives from Issue `#502` by sanitizing filenames with secure_filename, validating absolute paths using os.path.commonpath, ensuring uploads remain within the upload directory, and implementing proper cleanup logic.
Out of Scope Changes check	✅ Passed	All changes are directly related to fixing the directory traversal vulnerability in file upload handling as specified in Issue `#502`. No unrelated modifications are evident.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 376-401: The current code derives file_path from the
client-provided filename (base_dir + filename) which allows concurrent requests
with identical sanitized basenames to overwrite or delete each other's files;
update the upload handling in the block that constructs file_path and calls
file.save(...) to create a unique temp file inside base_dir using
tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.
- Around line 372-398: The current process_file implementation raises ValueError
for invalid filenames, directory traversal, and unsupported extensions, which
breaks the caller's expected contract; instead, in process_file replace those
raise ValueError calls with safe early returns (e.g., return "" or None) and
avoid saving the file when checks fail: if secure_filename yields no filename,
log and return "", if os.path.commonpath check fails, do not call file.save and
return "", and for unsupported ext (checked via ext =
filename.lower().rsplit('.', 1)[-1]) log and return "" rather than raising; keep
normal flows that call extract_text_from_pdf and extract_text_from_docx
unchanged so callers (that check the returned content) can respond with 400.
- Around line 382-401: Move the file.save(file_path) call into the try block so
any exception during saving still runs the finally cleanup; change
extract_text_from_pdf(self, file_path) to open the PDF with a context manager
(e.g., use fitz.open(file_path) in a with block) and return the concatenated
page text, mirroring extract_text_from_docx, so the PDF file handle is closed
before the finally block attempts os.remove(file_path).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fcd7dcaa-cbd7-46c8-8b7a-5d5b721d57be

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and a210283.

📒 Files selected for processing (1)

backend/Generator/main.py

coderabbitai · 2026-04-10T18:14:18Z

+        filename = secure_filename(file.filename)
+        if not filename:
+            raise ValueError("Invalid filename")
+
+        base_dir = os.path.abspath(self.upload_folder)
+        file_path = os.path.abspath(os.path.join(base_dir, filename))
+
+        if os.path.commonpath([base_dir, file_path]) != base_dir:
+            raise ValueError("Directory traversal attempt detected.")
+
        file.save(file_path)
        content = ""

-        if file.filename.endswith('.txt'):
-            with open(file_path, 'r') as f:
-                content = f.read()
-        elif file.filename.endswith('.pdf'):
-            content = self.extract_text_from_pdf(file_path)
-        elif file.filename.endswith('.docx'):
-            content = self.extract_text_from_docx(file_path)
-
-        os.remove(file_path)
+        try:
+            # Isolate the file extension cleanly
+            ext = filename.lower().rsplit('.', 1)[-1]
+
+            if ext == 'txt':
+                with open(file_path, 'r') as f:
+                    content = f.read()
+            elif ext == 'pdf':
+                content = self.extract_text_from_pdf(file_path)
+            elif ext == 'docx':
+                content = self.extract_text_from_docx(file_path)
+            else:
+                # Explicitly bail on unsupported types rather than failing silently
+                raise ValueError("Unsupported file type")


⚠️ Potential issue | 🟠 Major

process_file() now escapes the upload route contract.

backend/server.py Line 498 calls this method without try/except and only checks the returned content afterwards. The new ValueErrors here will turn invalid filenames and unsupported extensions into 500s instead of the existing 400 response path.

Caller-side follow-up

try: content = file_processor.process_file(file) except ValueError: return jsonify({"error": "Invalid upload"}), 400

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 372 - 398, The current process_file implementation raises ValueError for invalid filenames, directory traversal, and unsupported extensions, which breaks the caller's expected contract; instead, in process_file replace those raise ValueError calls with safe early returns (e.g., return "" or None) and avoid saving the file when checks fail: if secure_filename yields no filename, log and return "", if os.path.commonpath check fails, do not call file.save and return "", and for unsupported ext (checked via ext = filename.lower().rsplit('.', 1)[-1]) log and return "" rather than raising; keep normal flows that call extract_text_from_pdf and extract_text_from_docx unchanged so callers (that check the returned content) can respond with 400.

coderabbitai · 2026-04-10T18:14:18Z

+        base_dir = os.path.abspath(self.upload_folder)
+        file_path = os.path.abspath(os.path.join(base_dir, filename))
+
+        if os.path.commonpath([base_dir, file_path]) != base_dir:
+            raise ValueError("Directory traversal attempt detected.")
+
        file.save(file_path)
        content = ""

-        if file.filename.endswith('.txt'):
-            with open(file_path, 'r') as f:
-                content = f.read()
-        elif file.filename.endswith('.pdf'):
-            content = self.extract_text_from_pdf(file_path)
-        elif file.filename.endswith('.docx'):
-            content = self.extract_text_from_docx(file_path)
-
-        os.remove(file_path)
+        try:
+            # Isolate the file extension cleanly
+            ext = filename.lower().rsplit('.', 1)[-1]
+
+            if ext == 'txt':
+                with open(file_path, 'r') as f:
+                    content = f.read()
+            elif ext == 'pdf':
+                content = self.extract_text_from_pdf(file_path)
+            elif ext == 'docx':
+                content = self.extract_text_from_docx(file_path)
+            else:
+                # Explicitly bail on unsupported types rather than failing silently
+                raise ValueError("Unsupported file type")
+        finally:
+            if os.path.exists(file_path):
+                os.remove(file_path)


⚠️ Potential issue | 🔴 Critical

The temporary filename is still shared across requests.

The on-disk path is derived only from the sanitized client filename. With the app-wide FileProcessor in backend/server.py Lines 37-46 and a shared upload directory, two concurrent uploads of the same sanitized basename can overwrite each other or delete the other request's file in finally.

Suggested direction

import tempfile ext = os.path.splitext(filename)[1].lower() fd, file_path = tempfile.mkstemp(dir=base_dir, suffix=ext) os.close(fd) file.save(file_path)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 376 - 401, The current code derives file_path from the client-provided filename (base_dir + filename) which allows concurrent requests with identical sanitized basenames to overwrite or delete each other's files; update the upload handling in the block that constructs file_path and calls file.save(...) to create a unique temp file inside base_dir using tempfile.mkstemp (preserve the extension via os.path.splitext(filename)[1].lower()), close the returned fd, then save to that temp path and keep the existing cleanup in the finally; continue to use or adapt the existing directory traversal check (base_dir, file_path) and the existing downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.

coderabbitai · 2026-04-10T18:14:18Z

        file.save(file_path)
        content = ""

-        if file.filename.endswith('.txt'):
-            with open(file_path, 'r') as f:
-                content = f.read()
-        elif file.filename.endswith('.pdf'):
-            content = self.extract_text_from_pdf(file_path)
-        elif file.filename.endswith('.docx'):
-            content = self.extract_text_from_docx(file_path)
-
-        os.remove(file_path)
+        try:
+            # Isolate the file extension cleanly
+            ext = filename.lower().rsplit('.', 1)[-1]
+
+            if ext == 'txt':
+                with open(file_path, 'r') as f:
+                    content = f.read()
+            elif ext == 'pdf':
+                content = self.extract_text_from_pdf(file_path)
+            elif ext == 'docx':
+                content = self.extract_text_from_docx(file_path)
+            else:
+                # Explicitly bail on unsupported types rather than failing silently
+                raise ValueError("Unsupported file type")
+        finally:
+            if os.path.exists(file_path):
+                os.remove(file_path)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyMuPDF documentation: does fitz.open(file_path)keep the file handle open untilclose()/a context manager, and can that prevent deleting the file on Windows?

💡 Result:

Yes, fitz.open(file_path) (equivalent to pymupdf.open(file_path) or pymupdf.Document(file_path)) keeps the underlying file handle open until the Document is closed via doc.close or the context manager exits (e.g., with fitz.open(file_path) as doc: ...). The official documentation states that Document.close "Release objects and space allocations associated with the document. If created from a file, also closes filename (releasing control to the OS)." It also supports context managers, which automatically close the document on exit. This can prevent deleting or modifying the file on Windows, as confirmed by multiple GitHub issues and Stack Overflow reports where os.remove(file_path) or os.rename fails with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process. after doc.close if pages or other references are still alive, or in older versions/bugs. Fixes in recent versions (e.g., 1.23.18+) address some cases, but proper usage (context manager, doc.close, nulling pages like page = None) is required. Some reports persist even in 1.26.4 on Windows, often not reproducible by maintainers, suggesting environment-specific factors.

Citations:

1: https://pymupdf.readthedocs.io/en/latest/document.html

2: https://documentation.help/pymupdf/document.html

3: Context manager pymupdf/PyMuPDF#680

4: doc.close() not closing the document pymupdf/PyMuPDF#3081

5: PDF document stays in use after closing pymupdf/PyMuPDF#3140

6: https://stackoverflow.com/questions/70573629/pymupdf-fitz-not-properly-closing-files-resulting-in-permissionerror-winerro

7: File not closing pymupdf/PyMuPDF#4715

8: Issue Closing File with PyMuPDF 1.25.3 on Windows 11 pymupdf/PyMuPDF#4319

🏁 Script executed:

# Find the extract_text_from_pdf method implementation in the main.py file rg -A 10 "def extract_text_from_pdf" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 430

🏁 Script executed:

# Also check the extract_text_from_docx method to compare handling patterns rg -A 10 "def extract_text_from_docx" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 462

🏁 Script executed:

# Check if there are any other references to fitz.open in the file rg "fitz\.open|fitz import" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 95

Fix two cleanup failure paths in file processing.

The extract_text_from_pdf() method opens the PDF file handle without closing it, which locks the file on Windows and causes os.remove() on line 400 to fail with PermissionError. Move file.save(file_path) inside the try block so cleanup occurs even if file.save() fails partway through, and use a context manager in extract_text_from_pdf() to match the pattern used in extract_text_from_docx():

Suggested fixes

- file.save(file_path) content = "" try: + file.save(file_path) # Isolate the file extension cleanly ext = filename.lower().rsplit('.', 1)[-1]

def extract_text_from_pdf(self, file_path): with fitz.open(file_path) as doc: return "".join(page.get_text() for page in doc)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

file.save(file_path)

content = ""

if file.filename.endswith('.txt'):

with open(file_path, 'r') as f:

content = f.read()

elif file.filename.endswith('.pdf'):

content = self.extract_text_from_pdf(file_path)

elif file.filename.endswith('.docx'):

content = self.extract_text_from_docx(file_path)

os.remove(file_path)

try:

# Isolate the file extension cleanly

ext = filename.lower().rsplit('.', 1)[-1]

if ext == 'txt':

with open(file_path, 'r') as f:

content = f.read()

elif ext == 'pdf':

content = self.extract_text_from_pdf(file_path)

elif ext == 'docx':

content = self.extract_text_from_docx(file_path)

else:

# Explicitly bail on unsupported types rather than failing silently

raise ValueError("Unsupported file type")

finally:

if os.path.exists(file_path):

os.remove(file_path)

content = ""

try:

file.save(file_path)

# Isolate the file extension cleanly

ext = filename.lower().rsplit('.', 1)[-1]

if ext == 'txt':

with open(file_path, 'r') as f:

content = f.read()

elif ext == 'pdf':

content = self.extract_text_from_pdf(file_path)

elif ext == 'docx':

content = self.extract_text_from_docx(file_path)

else:

# Explicitly bail on unsupported types rather than failing silently

raise ValueError("Unsupported file type")

finally:

if os.path.exists(file_path):

os.remove(file_path)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 382 - 401, Move the file.save(file_path) call into the try block so any exception during saving still runs the finally cleanup; change extract_text_from_pdf(self, file_path) to open the PDF with a context manager (e.g., use fitz.open(file_path) in a with block) and return the concatenated page text, mirroring extract_text_from_docx, so the PDF file handle is closed before the finally block attempts os.remove(file_path).

SAI-HARISH2007 added 2 commits April 10, 2026 23:17

fix: prevent directory traversal in file upload handling (AOSSIE-Org#502

7a4c6c2

)

chore: remove unrelated file from PR

a210283

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Security] Fix directory traversal vulnerability in file upload handling (#502)#660

[Security] Fix directory traversal vulnerability in file upload handling (#502)#660
SAI-HARISH2007 wants to merge 2 commits intoAOSSIE-Org:mainfrom
SAI-HARISH2007:fix-directory-traversal

SAI-HARISH2007 commented Apr 10, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 10, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Uh oh!

coderabbitai Bot Apr 10, 2026

Uh oh!

coderabbitai Bot Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

SAI-HARISH2007 commented Apr 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

Summary

Changes

Security Impact

AI Usage Disclosure:

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

SAI-HARISH2007 commented Apr 10, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 10, 2026 •

edited

Loading