Description
I'm seeing a VRAM leak when using SentenceTransformerReranker.
In the _rerank method, a new CrossEncoder is created on every call. The instance should be created once and reused.
Something like:
    if self.sentence_transformer_client is None:
        self.sentence_transformer_client = CrossEncoder(model_name_or_path=self.model, model_kwargs=self.model_kwargs)
Thank you
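To make the suggestion concrete, here is a minimal sketch of the lazy-initialization pattern, not the actual Agno implementation. `CachedReranker`, `_load_cross_encoder`, and the `loader` parameter are hypothetical names introduced for illustration; only `CrossEncoder(model_name_or_path=..., model_kwargs=...)` and the `sentence_transformer_client` attribute come from the snippet above.

```python
from typing import Any, Callable, Dict, List, Optional


def _load_cross_encoder(model: str, model_kwargs: Optional[Dict[str, Any]]) -> Any:
    # Hypothetical helper: imported lazily so the sketch can be read and
    # tested without sentence-transformers installed.
    from sentence_transformers import CrossEncoder

    return CrossEncoder(model_name_or_path=model, model_kwargs=model_kwargs)


class CachedReranker:
    """Hypothetical stand-in for the reranker, reduced to the caching logic."""

    def __init__(
        self,
        model: str,
        model_kwargs: Optional[Dict[str, Any]] = None,
        loader: Callable[[str, Optional[Dict[str, Any]]], Any] = _load_cross_encoder,
    ) -> None:
        self.model = model
        self.model_kwargs = model_kwargs
        self._loader = loader
        # Nothing is loaded until the first rerank call.
        self.sentence_transformer_client: Optional[Any] = None

    def _get_client(self) -> Any:
        # Create the model once, then hand back the same instance every time.
        if self.sentence_transformer_client is None:
            self.sentence_transformer_client = self._loader(self.model, self.model_kwargs)
        return self.sentence_transformer_client

    def rerank(self, query: str, documents: List[str]) -> List[str]:
        client = self._get_client()
        # Score each (query, document) pair and sort by descending score.
        scores = client.predict([(query, doc) for doc in documents])
        ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked]
```

With this shape, repeated calls to `rerank` share one model instance, so the model weights are placed on the GPU only once per reranker object.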
Steps to Reproduce
knowledge = Knowledge(
    max_results=5,
    vector_db=LanceDb(
        uri="tmp/kbase",
        table_name="PP_SMIB",
        search_type=SearchType.hybrid,
        embedder=OllamaEmbedder(id="snowflake-arctic-embed2:latest", dimensions=1024),
        reranker=SentenceTransformerReranker(),
    ),
)
Agent Configuration (if applicable)
agent = Agent(
    model=VLLM(id="Qwen/Qwen3-4B-Instruct-2507", top_k=16, enable_thinking=False, base_url="http://localhost:9999/v1"),
    knowledge=knowledge,
    search_knowledge=True,
    add_knowledge_to_context=True,
    instructions=dedent("""
        You are a knowledge retrieval specialist. Your responsibilities include:
        1. Thoroughly understanding the user’s query;
        2. Retrieving the most relevant information from the knowledge base;
        3. Evaluating whether the retrieved results are pertinent to the query;
        4. Returning the results in a structured format, including content and source information;
    """),
    debug_mode=False,
    markdown=True,
)
Expected Behavior
No VRAM leaks.
Actual Behavior
VRAM leaks.
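One way to put numbers on the leak, as a rough sketch only: record the memory allocated by PyTorch after each repeated run. `track_memory` and `cuda_allocated_bytes` are hypothetical helpers introduced here, and `step` stands for whatever triggers a rerank in your setup. `torch.cuda.memory_allocated()` only reports memory held by PyTorch tensors in the current process, so the figures are a lower bound on what a GPU monitor would show.

```python
from typing import Callable, List


def cuda_allocated_bytes() -> int:
    """Bytes currently allocated by PyTorch on the current CUDA device (0 without CUDA)."""
    import torch  # imported lazily; only needed for real measurements

    return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0


def track_memory(
    step: Callable[[], object],
    read_memory: Callable[[], int] = cuda_allocated_bytes,
    iterations: int = 5,
) -> List[int]:
    """Run `step` repeatedly and record the memory reading after each run."""
    readings = []
    for _ in range(iterations):
        step()
        readings.append(read_memory())
    return readings
```

Readings that keep climbing from one iteration to the next would be consistent with a model being reloaded on every call, while a flat series after the first iteration would suggest the instance is being reused.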
Screenshots or Logs (if applicable)
No response
Environment
- OS: Windows 11, WSL2
- Agno version: 2.4.0
- Python: 3.11.12
Possible Solutions (optional)
No response
Additional Context
No response