[Security] Fix directory traversal vulnerability in file upload handling (#502)#660
[Security] Fix directory traversal vulnerability in file upload handling (#502)#660SAI-HARISH2007 wants to merge 2 commits intoAOSSIE-Org:mainfrom
Conversation
📝 WalkthroughWalkthroughThe Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 376-401: The current code derives file_path from the
client-provided filename (base_dir + filename) which allows concurrent requests
with identical sanitized basenames to overwrite or delete each other's files;
update the upload handling in the block that constructs file_path and calls
file.save(...) to create a unique temp file inside base_dir using
tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.
- Around line 372-398: The current process_file implementation raises ValueError
for invalid filenames, directory traversal, and unsupported extensions, which
breaks the caller's expected contract; instead, in process_file replace those
raise ValueError calls with safe early returns (e.g., return "" or None) and
avoid saving the file when checks fail: if secure_filename yields no filename,
log and return "", if os.path.commonpath check fails, do not call file.save and
return "", and for unsupported ext (checked via ext =
filename.lower().rsplit('.', 1)[-1]) log and return "" rather than raising; keep
normal flows that call extract_text_from_pdf and extract_text_from_docx
unchanged so callers (that check the returned content) can respond with 400.
- Around line 382-401: Move the file.save(file_path) call into the try block so
any exception during saving still runs the finally cleanup; change
extract_text_from_pdf(self, file_path) to open the PDF with a context manager
(e.g., use fitz.open(file_path) in a with block) and return the concatenated
page text, mirroring extract_text_from_docx, so the PDF file handle is closed
before the finally block attempts os.remove(file_path).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| filename = secure_filename(file.filename) | ||
| if not filename: | ||
| raise ValueError("Invalid filename") | ||
|
|
||
| base_dir = os.path.abspath(self.upload_folder) | ||
| file_path = os.path.abspath(os.path.join(base_dir, filename)) | ||
|
|
||
| if os.path.commonpath([base_dir, file_path]) != base_dir: | ||
| raise ValueError("Directory traversal attempt detected.") | ||
|
|
||
| file.save(file_path) | ||
| content = "" | ||
|
|
||
| if file.filename.endswith('.txt'): | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif file.filename.endswith('.pdf'): | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif file.filename.endswith('.docx'): | ||
| content = self.extract_text_from_docx(file_path) | ||
|
|
||
| os.remove(file_path) | ||
| try: | ||
| # Isolate the file extension cleanly | ||
| ext = filename.lower().rsplit('.', 1)[-1] | ||
|
|
||
| if ext == 'txt': | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif ext == 'pdf': | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif ext == 'docx': | ||
| content = self.extract_text_from_docx(file_path) | ||
| else: | ||
| # Explicitly bail on unsupported types rather than failing silently | ||
| raise ValueError("Unsupported file type") |
There was a problem hiding this comment.
process_file() now escapes the upload route contract.
backend/server.py Line 498 calls this method without try/except and only checks the returned content afterwards. The new ValueErrors here will turn invalid filenames and unsupported extensions into 500s instead of the existing 400 response path.
Caller-side follow-up
try:
content = file_processor.process_file(file)
except ValueError:
return jsonify({"error": "Invalid upload"}), 400🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/main.py` around lines 372 - 398, The current process_file
implementation raises ValueError for invalid filenames, directory traversal, and
unsupported extensions, which breaks the caller's expected contract; instead, in
process_file replace those raise ValueError calls with safe early returns (e.g.,
return "" or None) and avoid saving the file when checks fail: if
secure_filename yields no filename, log and return "", if os.path.commonpath
check fails, do not call file.save and return "", and for unsupported ext
(checked via ext = filename.lower().rsplit('.', 1)[-1]) log and return "" rather
than raising; keep normal flows that call extract_text_from_pdf and
extract_text_from_docx unchanged so callers (that check the returned content)
can respond with 400.
| base_dir = os.path.abspath(self.upload_folder) | ||
| file_path = os.path.abspath(os.path.join(base_dir, filename)) | ||
|
|
||
| if os.path.commonpath([base_dir, file_path]) != base_dir: | ||
| raise ValueError("Directory traversal attempt detected.") | ||
|
|
||
| file.save(file_path) | ||
| content = "" | ||
|
|
||
| if file.filename.endswith('.txt'): | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif file.filename.endswith('.pdf'): | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif file.filename.endswith('.docx'): | ||
| content = self.extract_text_from_docx(file_path) | ||
|
|
||
| os.remove(file_path) | ||
| try: | ||
| # Isolate the file extension cleanly | ||
| ext = filename.lower().rsplit('.', 1)[-1] | ||
|
|
||
| if ext == 'txt': | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif ext == 'pdf': | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif ext == 'docx': | ||
| content = self.extract_text_from_docx(file_path) | ||
| else: | ||
| # Explicitly bail on unsupported types rather than failing silently | ||
| raise ValueError("Unsupported file type") | ||
| finally: | ||
| if os.path.exists(file_path): | ||
| os.remove(file_path) |
There was a problem hiding this comment.
The temporary filename is still shared across requests.
The on-disk path is derived only from the sanitized client filename. With the app-wide FileProcessor in backend/server.py Lines 37-46 and a shared upload directory, two concurrent uploads of the same sanitized basename can overwrite each other or delete the other request's file in finally.
Suggested direction
import tempfile
ext = os.path.splitext(filename)[1].lower()
fd, file_path = tempfile.mkstemp(dir=base_dir, suffix=ext)
os.close(fd)
file.save(file_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/main.py` around lines 376 - 401, The current code derives
file_path from the client-provided filename (base_dir + filename) which allows
concurrent requests with identical sanitized basenames to overwrite or delete
each other's files; update the upload handling in the block that constructs
file_path and calls file.save(...) to create a unique temp file inside base_dir
using tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.
| file.save(file_path) | ||
| content = "" | ||
|
|
||
| if file.filename.endswith('.txt'): | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif file.filename.endswith('.pdf'): | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif file.filename.endswith('.docx'): | ||
| content = self.extract_text_from_docx(file_path) | ||
|
|
||
| os.remove(file_path) | ||
| try: | ||
| # Isolate the file extension cleanly | ||
| ext = filename.lower().rsplit('.', 1)[-1] | ||
|
|
||
| if ext == 'txt': | ||
| with open(file_path, 'r') as f: | ||
| content = f.read() | ||
| elif ext == 'pdf': | ||
| content = self.extract_text_from_pdf(file_path) | ||
| elif ext == 'docx': | ||
| content = self.extract_text_from_docx(file_path) | ||
| else: | ||
| # Explicitly bail on unsupported types rather than failing silently | ||
| raise ValueError("Unsupported file type") | ||
| finally: | ||
| if os.path.exists(file_path): | ||
| os.remove(file_path) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
PyMuPDF documentation: does fitz.open(file_path)keep the file handle open untilclose()/a context manager, and can that prevent deleting the file on Windows?
💡 Result:
Yes, fitz.open(file_path) (equivalent to pymupdf.open(file_path) or pymupdf.Document(file_path)) keeps the underlying file handle open until the Document is closed via doc.close or the context manager exits (e.g., with fitz.open(file_path) as doc: ...). The official documentation states that Document.close "Release objects and space allocations associated with the document. If created from a file, also closes filename (releasing control to the OS)." It also supports context managers, which automatically close the document on exit. This can prevent deleting or modifying the file on Windows, as confirmed by multiple GitHub issues and Stack Overflow reports where os.remove(file_path) or os.rename fails with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process. after doc.close if pages or other references are still alive, or in older versions/bugs. Fixes in recent versions (e.g., 1.23.18+) address some cases, but proper usage (context manager, doc.close, nulling pages like page = None) is required. Some reports persist even in 1.26.4 on Windows, often not reproducible by maintainers, suggesting environment-specific factors.
Citations:
- 1: https://pymupdf.readthedocs.io/en/latest/document.html
- 2: https://documentation.help/pymupdf/document.html
- 3: Context manager pymupdf/PyMuPDF#680
- 4: doc.close() not closing the document pymupdf/PyMuPDF#3081
- 5: PDF document stays in use after closing pymupdf/PyMuPDF#3140
- 6: https://stackoverflow.com/questions/70573629/pymupdf-fitz-not-properly-closing-files-resulting-in-permissionerror-winerro
- 7: File not closing pymupdf/PyMuPDF#4715
- 8: Issue Closing File with PyMuPDF 1.25.3 on Windows 11 pymupdf/PyMuPDF#4319
🏁 Script executed:
# Find the extract_text_from_pdf method implementation in the main.py file
rg -A 10 "def extract_text_from_pdf" backend/Generator/main.pyRepository: AOSSIE-Org/EduAid
Length of output: 430
🏁 Script executed:
# Also check the extract_text_from_docx method to compare handling patterns
rg -A 10 "def extract_text_from_docx" backend/Generator/main.pyRepository: AOSSIE-Org/EduAid
Length of output: 462
🏁 Script executed:
# Check if there are any other references to fitz.open in the file
rg "fitz\.open|fitz import" backend/Generator/main.pyRepository: AOSSIE-Org/EduAid
Length of output: 95
Fix two cleanup failure paths in file processing.
The extract_text_from_pdf() method opens the PDF file handle without closing it, which locks the file on Windows and causes os.remove() on line 400 to fail with PermissionError. Move file.save(file_path) inside the try block so cleanup occurs even if file.save() fails partway through, and use a context manager in extract_text_from_pdf() to match the pattern used in extract_text_from_docx():
Suggested fixes
- file.save(file_path)
content = ""
try:
+ file.save(file_path)
# Isolate the file extension cleanly
ext = filename.lower().rsplit('.', 1)[-1]def extract_text_from_pdf(self, file_path):
with fitz.open(file_path) as doc:
return "".join(page.get_text() for page in doc)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| file.save(file_path) | |
| content = "" | |
| if file.filename.endswith('.txt'): | |
| with open(file_path, 'r') as f: | |
| content = f.read() | |
| elif file.filename.endswith('.pdf'): | |
| content = self.extract_text_from_pdf(file_path) | |
| elif file.filename.endswith('.docx'): | |
| content = self.extract_text_from_docx(file_path) | |
| os.remove(file_path) | |
| try: | |
| # Isolate the file extension cleanly | |
| ext = filename.lower().rsplit('.', 1)[-1] | |
| if ext == 'txt': | |
| with open(file_path, 'r') as f: | |
| content = f.read() | |
| elif ext == 'pdf': | |
| content = self.extract_text_from_pdf(file_path) | |
| elif ext == 'docx': | |
| content = self.extract_text_from_docx(file_path) | |
| else: | |
| # Explicitly bail on unsupported types rather than failing silently | |
| raise ValueError("Unsupported file type") | |
| finally: | |
| if os.path.exists(file_path): | |
| os.remove(file_path) | |
| content = "" | |
| try: | |
| file.save(file_path) | |
| # Isolate the file extension cleanly | |
| ext = filename.lower().rsplit('.', 1)[-1] | |
| if ext == 'txt': | |
| with open(file_path, 'r') as f: | |
| content = f.read() | |
| elif ext == 'pdf': | |
| content = self.extract_text_from_pdf(file_path) | |
| elif ext == 'docx': | |
| content = self.extract_text_from_docx(file_path) | |
| else: | |
| # Explicitly bail on unsupported types rather than failing silently | |
| raise ValueError("Unsupported file type") | |
| finally: | |
| if os.path.exists(file_path): | |
| os.remove(file_path) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/Generator/main.py` around lines 382 - 401, Move the
file.save(file_path) call into the try block so any exception during saving
still runs the finally cleanup; change extract_text_from_pdf(self, file_path) to
open the PDF with a context manager (e.g., use fitz.open(file_path) in a with
block) and return the concatenated page text, mirroring extract_text_from_docx,
so the PDF file handle is closed before the finally block attempts
os.remove(file_path).
Addressed Issues:
Fixes #502
Screenshots/Recordings:
Not applicable (backend security fix)
Additional Notes:
Summary
This PR fixes a directory traversal vulnerability in the file upload pipeline.
This vulnerability was identified during manual codebase review and reported earlier as Issue #502.
The issue was caused by directly using
file.filenamewhen constructing file paths, which allowed malicious inputs such as../../etc/passwdto escape the intended upload directory.Changes
secure_filenameos.path.commonpathSecurity Impact
This prevents:
AI Usage Disclosure:
I primarily identified the issue, analyzed the root cause, and designed the solution independently through manual codebase exploration. AI tools (ChatGPT, Claude) were used minimally to refine the implementation and review the final approach.
Checklist
Summary by CodeRabbit