39 changes: 29 additions & 10 deletions backend/Generator/main.py
@@ -22,6 +22,7 @@
 import os
 import fitz
 import mammoth
+from werkzeug.utils import secure_filename

class MCQGenerator:

@@ -368,19 +369,37 @@ def extract_text_from_docx(self, file_path):
         return result.value
 
     def process_file(self, file):
-        file_path = os.path.join(self.upload_folder, file.filename)
+        filename = secure_filename(file.filename)
+        if not filename:
+            raise ValueError("Invalid filename")
+
+        base_dir = os.path.abspath(self.upload_folder)
+        file_path = os.path.abspath(os.path.join(base_dir, filename))
+
+        if os.path.commonpath([base_dir, file_path]) != base_dir:
+            raise ValueError("Directory traversal attempt detected.")
+
         file.save(file_path)
         content = ""
 
-        if file.filename.endswith('.txt'):
-            with open(file_path, 'r') as f:
-                content = f.read()
-        elif file.filename.endswith('.pdf'):
-            content = self.extract_text_from_pdf(file_path)
-        elif file.filename.endswith('.docx'):
-            content = self.extract_text_from_docx(file_path)
-
-        os.remove(file_path)
+        try:
+            # Isolate the file extension cleanly
+            ext = filename.lower().rsplit('.', 1)[-1]
+
+            if ext == 'txt':
+                with open(file_path, 'r') as f:
+                    content = f.read()
+            elif ext == 'pdf':
+                content = self.extract_text_from_pdf(file_path)
+            elif ext == 'docx':
+                content = self.extract_text_from_docx(file_path)
+            else:
+                # Explicitly bail on unsupported types rather than failing silently
+                raise ValueError("Unsupported file type")
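The commonpath guard in the new code can be exercised on its own. A minimal stdlib-only sketch (the function name `resolve_upload_path` is hypothetical, and `secure_filename` is omitted since only the traversal check is shown):

```python
import os

def resolve_upload_path(upload_folder: str, filename: str) -> str:
    """Resolve filename under upload_folder, rejecting path traversal."""
    base_dir = os.path.abspath(upload_folder)
    file_path = os.path.abspath(os.path.join(base_dir, filename))
    # abspath collapses any ../ segments, so an escape shows up as a
    # resolved path whose common prefix with base_dir is not base_dir.
    if os.path.commonpath([base_dir, file_path]) != base_dir:
        raise ValueError("Directory traversal attempt detected.")
    return file_path
```

A plain basename resolves inside the folder; a `../../etc/passwd` payload collapses outside it and is rejected.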
Comment on lines +372 to +398

⚠️ Potential issue | 🟠 Major

process_file() now breaks the upload route's error contract.

backend/server.py line 498 calls this method without a try/except and only checks the returned content afterwards. The new ValueErrors raised here will turn invalid filenames and unsupported extensions into 500 responses instead of the existing 400 response path.

Caller-side follow-up
try:
    content = file_processor.process_file(file)
except ValueError:
    return jsonify({"error": "Invalid upload"}), 400
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 372 - 398, The current process_file
implementation raises ValueError for invalid filenames, directory traversal, and
unsupported extensions, which breaks the caller's expected contract; instead, in
process_file replace those raise ValueError calls with safe early returns (e.g.,
return "" or None) and avoid saving the file when checks fail: if
secure_filename yields no filename, log and return "", if os.path.commonpath
check fails, do not call file.save and return "", and for unsupported ext
(checked via ext = filename.lower().rsplit('.', 1)[-1]) log and return "" rather
than raising; keep normal flows that call extract_text_from_pdf and
extract_text_from_docx unchanged so callers (that check the returned content)
can respond with 400.
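The caller-side follow-up above can be checked without a running server by factoring the handler body into a plain function (`handle_upload` and the stub processor are illustrative names, not part of the codebase):

```python
def handle_upload(file_processor, file):
    """Map ValueError from process_file to a 400 payload instead of a 500."""
    try:
        content = file_processor.process_file(file)
    except ValueError:
        return {"error": "Invalid upload"}, 400
    if not content:
        return {"error": "Invalid upload"}, 400
    return {"content": content}, 200

class StubProcessor:
    """Stand-in that fails the way the new process_file does."""
    def process_file(self, file):
        raise ValueError("Unsupported file type")

payload, status = handle_upload(StubProcessor(), file=None)
```

In the real route the return value would be `jsonify(...), status`, but the branching logic is the same.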

+        finally:
+            if os.path.exists(file_path):
+                os.remove(file_path)
Comment on lines +376 to +401

⚠️ Potential issue | 🔴 Critical

The temporary filename is still shared across requests.

The on-disk path is derived solely from the sanitized client filename. With the app-wide FileProcessor in backend/server.py lines 37-46 and a shared upload directory, two concurrent uploads with the same sanitized basename can overwrite each other, or one request's finally block can delete the other request's file.

Suggested direction
import tempfile

ext = os.path.splitext(filename)[1].lower()
fd, file_path = tempfile.mkstemp(dir=base_dir, suffix=ext)
os.close(fd)
file.save(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 376 - 401, The current code derives
file_path from the client-provided filename (base_dir + filename) which allows
concurrent requests with identical sanitized basenames to overwrite or delete
each other's files; update the upload handling in the block that constructs
file_path and calls file.save(...) to create a unique temp file inside base_dir
using tempfile.mkstemp (preserve the extension via
os.path.splitext(filename)[1].lower()), close the returned fd, then save to that
temp path and keep the existing cleanup in the finally; continue to use or adapt
the existing directory traversal check (base_dir, file_path) and the existing
downstream calls (extract_text_from_pdf, extract_text_from_docx) unchanged.
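The suggested mkstemp direction can be verified directly: two saves of the same client basename get distinct on-disk paths while keeping the extension (`make_upload_path` is a hypothetical helper wrapping the snippet above):

```python
import os
import tempfile

def make_upload_path(base_dir: str, client_filename: str) -> str:
    """Create a unique temp path in base_dir, preserving only the extension."""
    ext = os.path.splitext(client_filename)[1].lower()
    fd, file_path = tempfile.mkstemp(dir=base_dir, suffix=ext)
    os.close(fd)  # file.save() reopens the path itself, so release the fd
    return file_path

base = tempfile.mkdtemp()
a = make_upload_path(base, "report.pdf")
b = make_upload_path(base, "report.pdf")  # same client name, no collision
```

mkstemp both creates the file atomically and guarantees a fresh name, which is what removes the cross-request race.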

Comment on lines 382 to +401

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

PyMuPDF documentation: does fitz.open(file_path) keep the file handle open until close() / a context manager exits, and can that prevent deleting the file on Windows?

💡 Result:

Yes. fitz.open(file_path) (equivalent to pymupdf.open(file_path) or pymupdf.Document(file_path)) keeps the underlying file handle open until the Document is closed via doc.close() or a context manager exits (e.g., with fitz.open(file_path) as doc: ...). The official documentation states that Document.close will "Release objects and space allocations associated with the document. If created from a file, also closes filename (releasing control to the OS)." Documents also support context managers, which close them automatically on exit. An open handle can prevent deleting or modifying the file on Windows: multiple GitHub issues and Stack Overflow reports show os.remove(file_path) or os.rename failing with PermissionError: [WinError 32] The process cannot access the file because it is being used by another process, typically when pages or other references are still alive after doc.close(), or in older buggy versions. Fixes in recent releases (e.g., 1.23.18+) address some cases, but correct usage (a context manager or doc.close(), and dropping page references such as page = None) is still required; some reports persist even on 1.26.4 on Windows and are often not reproducible by maintainers, suggesting environment-specific factors.


🏁 Script executed:

# Find the extract_text_from_pdf method implementation in the main.py file
rg -A 10 "def extract_text_from_pdf" backend/Generator/main.py

Repository: AOSSIE-Org/EduAid

Length of output: 430


🏁 Script executed:

# Also check the extract_text_from_docx method to compare handling patterns
rg -A 10 "def extract_text_from_docx" backend/Generator/main.py


Length of output: 462


🏁 Script executed:

# Check if there are any other references to fitz.open in the file
rg "fitz\.open|fitz import" backend/Generator/main.py


Length of output: 95


Fix two cleanup failure paths in file processing.

The extract_text_from_pdf() method opens the PDF file handle without closing it, which locks the file on Windows and causes os.remove() on line 400 to fail with PermissionError. Move file.save(file_path) inside the try block so cleanup occurs even if file.save() fails partway through, and use a context manager in extract_text_from_pdf() to match the pattern used in extract_text_from_docx():

Suggested fixes
-        file.save(file_path)
         content = ""

         try:
+            file.save(file_path)
             # Isolate the file extension cleanly
             ext = filename.lower().rsplit('.', 1)[-1]
def extract_text_from_pdf(self, file_path):
    with fitz.open(file_path) as doc:
        return "".join(page.get_text() for page in doc)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        file.save(file_path)
-        content = ""
-        if file.filename.endswith('.txt'):
-            with open(file_path, 'r') as f:
-                content = f.read()
-        elif file.filename.endswith('.pdf'):
-            content = self.extract_text_from_pdf(file_path)
-        elif file.filename.endswith('.docx'):
-            content = self.extract_text_from_docx(file_path)
-        os.remove(file_path)
-        try:
-            # Isolate the file extension cleanly
-            ext = filename.lower().rsplit('.', 1)[-1]
-            if ext == 'txt':
-                with open(file_path, 'r') as f:
-                    content = f.read()
-            elif ext == 'pdf':
-                content = self.extract_text_from_pdf(file_path)
-            elif ext == 'docx':
-                content = self.extract_text_from_docx(file_path)
-            else:
-                # Explicitly bail on unsupported types rather than failing silently
-                raise ValueError("Unsupported file type")
-        finally:
-            if os.path.exists(file_path):
-                os.remove(file_path)
+        content = ""
+        try:
+            file.save(file_path)
+            # Isolate the file extension cleanly
+            ext = filename.lower().rsplit('.', 1)[-1]
+            if ext == 'txt':
+                with open(file_path, 'r') as f:
+                    content = f.read()
+            elif ext == 'pdf':
+                content = self.extract_text_from_pdf(file_path)
+            elif ext == 'docx':
+                content = self.extract_text_from_docx(file_path)
+            else:
+                # Explicitly bail on unsupported types rather than failing silently
+                raise ValueError("Unsupported file type")
+        finally:
+            if os.path.exists(file_path):
+                os.remove(file_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 382 - 401, Move the
file.save(file_path) call into the try block so any exception during saving
still runs the finally cleanup; change extract_text_from_pdf(self, file_path) to
open the PDF with a context manager (e.g., use fitz.open(file_path) in a with
block) and return the concatenated page text, mirroring extract_text_from_docx,
so the PDF file handle is closed before the finally block attempts
os.remove(file_path).
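Taken together, the review's findings amount to: save inside try, context-managed readers, cleanup in finally. A stdlib-only stand-in for process_file (`process_path` is illustrative and only handles .txt) shows the cleanup guarantee on both the success and the failure path:

```python
import os
import tempfile

def process_path(file_path: str) -> str:
    """Read a supported file, always deleting file_path afterwards."""
    try:
        ext = file_path.lower().rsplit('.', 1)[-1]
        if ext != 'txt':
            raise ValueError("Unsupported file type")
        # The context manager closes the handle before os.remove() runs,
        # which matters on Windows where open handles block deletion.
        with open(file_path, 'r') as f:
            return f.read()
    finally:
        if os.path.exists(file_path):
            os.remove(file_path)

fd, good = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, 'w') as f:
    f.write("hello")
text = process_path(good)      # reads, then removes the file

fd, bad = tempfile.mkstemp(suffix=".exe")
os.close(fd)
try:
    process_path(bad)          # raises, but still removes the file
except ValueError:
    pass
```

Either way the temp file is gone afterwards, so the upload directory cannot accumulate orphans.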


        return content

class QuestionGenerator: