Skip to content

[Security] Fix directory traversal vulnerability in file upload handling (#502)#660

Open
SAI-HARISH2007 wants to merge 2 commits intoAOSSIE-Org:mainfrom
SAI-HARISH2007:fix-directory-traversal
Open

[Security] Fix directory traversal vulnerability in file upload handling (#502)#660
SAI-HARISH2007 wants to merge 2 commits intoAOSSIE-Org:mainfrom
SAI-HARISH2007:fix-directory-traversal

Conversation

@SAI-HARISH2007
Copy link
Copy Markdown

@SAI-HARISH2007 SAI-HARISH2007 commented Apr 10, 2026

Addressed Issues:

Fixes #502


Screenshots/Recordings:

Not applicable (backend security fix)


Additional Notes:

Summary

This PR fixes a directory traversal vulnerability in the file upload pipeline.

This vulnerability was identified during manual codebase review and reported earlier as Issue #502.

The issue was caused by directly using file.filename when constructing file paths, which allowed malicious inputs such as ../../etc/passwd to escape the intended upload directory.

Changes

  • Sanitized filenames using secure_filename
  • Added absolute path validation using os.path.commonpath
  • Ensured uploaded files remain within the configured upload directory
  • Added try/finally block to guarantee file cleanup even on failure
  • Improved file type handling with explicit validation

Security Impact

This prevents:

  • Arbitrary file write outside upload directory
  • Potential overwriting of sensitive system/application files

AI Usage Disclosure:

  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I take full responsibility for the changes.

I primarily identified the issue, analyzed the root cause, and designed the solution independently through manual codebase exploration. AI tools (ChatGPT, Claude) were used minimally to refine the implementation and review the final approach.


Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced file upload security by validating filenames and preventing unauthorized directory access.
    • Improved file type validation with explicit error messages for unsupported formats.
    • Strengthened error handling to ensure temporary files are properly cleaned up during processing.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

Walkthrough

The FileProcessor.process_file method was updated to address a directory traversal security vulnerability. Filename sanitization via secure_filename and path validation using os.path.commonpath were implemented to restrict file writes to the intended upload directory. File extension validation was refactored, and cleanup logic was restructured with try...finally to ensure temporary files are always removed.

Changes

Cohort / File(s) Summary
File Upload Security Hardening
backend/Generator/main.py
Implemented filename sanitization with secure_filename, added directory traversal protection via os.path.commonpath validation, refactored extension checking, and introduced try...finally block for guaranteed cleanup of temporary files.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A rabbit hops through upload gates,
With secure_filename that validates!
No traversal tricks shall pass today,
Path boundaries keep hackers at bay,
Try, finally—cleanup's here to stay! 🔒

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main security fix—preventing directory traversal in file upload handling—which aligns with the primary change in the changeset.
Linked Issues check ✅ Passed The pull request successfully addresses the security objectives from Issue #502 by sanitizing filenames with secure_filename, validating absolute paths using os.path.commonpath, ensuring uploads remain within the upload directory, and implementing proper cleanup logic.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the directory traversal vulnerability in file upload handling as specified in Issue #502. No unrelated modifications are evident.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 376-401: The current code derives file_path from the
client-provided filename (base_dir + filename) which allows concurrent requests
with identical sanitized basenames to overwrite or delete each other's files;
update the upload handling in the block that constructs file_path and calls
file.save(...) to create a unique temp file inside base_dir using
tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.
- Around line 372-398: The current process_file implementation raises ValueError
for invalid filenames, directory traversal, and unsupported extensions, which
breaks the caller's expected contract; instead, in process_file replace those
raise ValueError calls with safe early returns (e.g., return "" or None) and
avoid saving the file when checks fail: if secure_filename yields no filename,
log and return "", if os.path.commonpath check fails, do not call file.save and
return "", and for unsupported ext (checked via ext =
filename.lower().rsplit('.', 1)[-1]) log and return "" rather than raising; keep
normal flows that call extract_text_from_pdf and extract_text_from_docx
unchanged so callers (that check the returned content) can respond with 400.
- Around line 382-401: Move the file.save(file_path) call into the try block so
any exception during saving still runs the finally cleanup; change
extract_text_from_pdf(self, file_path) to open the PDF with a context manager
(e.g., use fitz.open(file_path) in a with block) and return the concatenated
page text, mirroring extract_text_from_docx, so the PDF file handle is closed
before the finally block attempts os.remove(file_path).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fcd7dcaa-cbd7-46c8-8b7a-5d5b721d57be

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and a210283.

📒 Files selected for processing (1)
  • backend/Generator/main.py

Comment thread backend/Generator/main.py
Comment on lines +372 to +398
filename = secure_filename(file.filename)
if not filename:
raise ValueError("Invalid filename")

base_dir = os.path.abspath(self.upload_folder)
file_path = os.path.abspath(os.path.join(base_dir, filename))

if os.path.commonpath([base_dir, file_path]) != base_dir:
raise ValueError("Directory traversal attempt detected.")

file.save(file_path)
content = ""

if file.filename.endswith('.txt'):
with open(file_path, 'r') as f:
content = f.read()
elif file.filename.endswith('.pdf'):
content = self.extract_text_from_pdf(file_path)
elif file.filename.endswith('.docx'):
content = self.extract_text_from_docx(file_path)

os.remove(file_path)
try:
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]

if ext == 'txt':
with open(file_path, 'r') as f:
content = f.read()
elif ext == 'pdf':
content = self.extract_text_from_pdf(file_path)
elif ext == 'docx':
content = self.extract_text_from_docx(file_path)
else:
# Explicitly bail on unsupported types rather than failing silently
raise ValueError("Unsupported file type")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

process_file() now escapes the upload route contract.

backend/server.py Line 498 calls this method without try/except and only checks the returned content afterwards. The new ValueErrors here will turn invalid filenames and unsupported extensions into 500s instead of the existing 400 response path.

Caller-side follow-up
try:
    content = file_processor.process_file(file)
except ValueError:
    return jsonify({"error": "Invalid upload"}), 400
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 372 - 398, The current process_file
implementation raises ValueError for invalid filenames, directory traversal, and
unsupported extensions, which breaks the caller's expected contract; instead, in
process_file replace those raise ValueError calls with safe early returns (e.g.,
return "" or None) and avoid saving the file when checks fail: if
secure_filename yields no filename, log and return "", if os.path.commonpath
check fails, do not call file.save and return "", and for unsupported ext
(checked via ext = filename.lower().rsplit('.', 1)[-1]) log and return "" rather
than raising; keep normal flows that call extract_text_from_pdf and
extract_text_from_docx unchanged so callers (that check the returned content)
can respond with 400.

Comment thread backend/Generator/main.py
Comment on lines +376 to +401
base_dir = os.path.abspath(self.upload_folder)
file_path = os.path.abspath(os.path.join(base_dir, filename))

if os.path.commonpath([base_dir, file_path]) != base_dir:
raise ValueError("Directory traversal attempt detected.")

file.save(file_path)
content = ""

if file.filename.endswith('.txt'):
with open(file_path, 'r') as f:
content = f.read()
elif file.filename.endswith('.pdf'):
content = self.extract_text_from_pdf(file_path)
elif file.filename.endswith('.docx'):
content = self.extract_text_from_docx(file_path)

os.remove(file_path)
try:
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]

if ext == 'txt':
with open(file_path, 'r') as f:
content = f.read()
elif ext == 'pdf':
content = self.extract_text_from_pdf(file_path)
elif ext == 'docx':
content = self.extract_text_from_docx(file_path)
else:
# Explicitly bail on unsupported types rather than failing silently
raise ValueError("Unsupported file type")
finally:
if os.path.exists(file_path):
os.remove(file_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

The temporary filename is still shared across requests.

The on-disk path is derived only from the sanitized client filename. With the app-wide FileProcessor in backend/server.py Lines 37-46 and a shared upload directory, two concurrent uploads of the same sanitized basename can overwrite each other or delete the other request's file in finally.

Suggested direction
import tempfile

ext = os.path.splitext(filename)[1].lower()
fd, file_path = tempfile.mkstemp(dir=base_dir, suffix=ext)
os.close(fd)
file.save(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 376 - 401, The current code derives
file_path from the client-provided filename (base_dir + filename) which allows
concurrent requests with identical sanitized basenames to overwrite or delete
each other's files; update the upload handling in the block that constructs
file_path and calls file.save(...) to create a unique temp file inside base_dir
using tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.

Comment thread backend/Generator/main.py
Comment on lines 382 to +401
file.save(file_path)
content = ""

if file.filename.endswith('.txt'):
with open(file_path, 'r') as f:
content = f.read()
elif file.filename.endswith('.pdf'):
content = self.extract_text_from_pdf(file_path)
elif file.filename.endswith('.docx'):
content = self.extract_text_from_docx(file_path)

os.remove(file_path)
try:
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]

if ext == 'txt':
with open(file_path, 'r') as f:
content = f.read()
elif ext == 'pdf':
content = self.extract_text_from_pdf(file_path)
elif ext == 'docx':
content = self.extract_text_from_docx(file_path)
else:
# Explicitly bail on unsupported types rather than failing silently
raise ValueError("Unsupported file type")
finally:
if os.path.exists(file_path):
os.remove(file_path)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyMuPDF documentation: does fitz.open(file_path)keep the file handle open untilclose()/a context manager, and can that prevent deleting the file on Windows?

💡 Result:

Yes, fitz.open(file_path) (equivalent to pymupdf.open(file_path) or pymupdf.Document(file_path)) keeps the underlying file handle open until the Document is closed via doc.close or the context manager exits (e.g., with fitz.open(file_path) as doc: ...). The official documentation states that Document.close "Release objects and space allocations associated with the document. If created from a file, also closes filename (releasing control to the OS)." It also supports context managers, which automatically close the document on exit. This can prevent deleting or modifying the file on Windows, as confirmed by multiple GitHub issues and Stack Overflow reports where os.remove(file_path) or os.rename fails with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process. after doc.close if pages or other references are still alive, or in older versions/bugs. Fixes in recent versions (e.g., 1.23.18+) address some cases, but proper usage (context manager, doc.close, nulling pages like page = None) is required. Some reports persist even in 1.26.4 on Windows, often not reproducible by maintainers, suggesting environment-specific factors.

Citations:


🏁 Script executed:

# Find the extract_text_from_pdf method implementation in the main.py file
rg -A 10 "def extract_text_from_pdf" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 430


🏁 Script executed:

# Also check the extract_text_from_docx method to compare handling patterns
rg -A 10 "def extract_text_from_docx" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 462


🏁 Script executed:

# Check if there are any other references to fitz.open in the file
rg "fitz\.open|fitz import" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 95


Fix two cleanup failure paths in file processing.

The extract_text_from_pdf() method opens the PDF file handle without closing it, which locks the file on Windows and causes os.remove() on line 400 to fail with PermissionError. Move file.save(file_path) inside the try block so cleanup occurs even if file.save() fails partway through, and use a context manager in extract_text_from_pdf() to match the pattern used in extract_text_from_docx():

Suggested fixes
-        file.save(file_path)
         content = ""

         try:
+            file.save(file_path)
             # Isolate the file extension cleanly
             ext = filename.lower().rsplit('.', 1)[-1]
def extract_text_from_pdf(self, file_path):
    with fitz.open(file_path) as doc:
        return "".join(page.get_text() for page in doc)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
file.save(file_path)
content = ""
if file.filename.endswith('.txt'):
with open(file_path, 'r') as f:
content = f.read()
elif file.filename.endswith('.pdf'):
content = self.extract_text_from_pdf(file_path)
elif file.filename.endswith('.docx'):
content = self.extract_text_from_docx(file_path)
os.remove(file_path)
try:
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]
if ext == 'txt':
with open(file_path, 'r') as f:
content = f.read()
elif ext == 'pdf':
content = self.extract_text_from_pdf(file_path)
elif ext == 'docx':
content = self.extract_text_from_docx(file_path)
else:
# Explicitly bail on unsupported types rather than failing silently
raise ValueError("Unsupported file type")
finally:
if os.path.exists(file_path):
os.remove(file_path)
content = ""
try:
file.save(file_path)
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]
if ext == 'txt':
with open(file_path, 'r') as f:
content = f.read()
elif ext == 'pdf':
content = self.extract_text_from_pdf(file_path)
elif ext == 'docx':
content = self.extract_text_from_docx(file_path)
else:
# Explicitly bail on unsupported types rather than failing silently
raise ValueError("Unsupported file type")
finally:
if os.path.exists(file_path):
os.remove(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 382 - 401, Move the
file.save(file_path) call into the try block so any exception during saving
still runs the finally cleanup; change extract_text_from_pdf(self, file_path) to
open the PDF with a context manager (e.g., use fitz.open(file_path) in a with
block) and return the concatenated page text, mirroring extract_text_from_docx,
so the PDF file handle is closed before the finally block attempts
os.remove(file_path).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] Potential Directory Traversal in File Upload Handling

1 participant