
Conversation

@sfallah (Contributor) commented Dec 20, 2025

This PR adds support for LlamaBidirectionalModel architectures, resolving feature request #17478.

The implementation enables bidirectional LLaMA embedding models such as:

  • nvidia/llama-embed-nemotron-8b
  • nvidia/llama-nemotron-embed-1b-v2

while keeping changes minimal by reusing the existing LLaMA implementation wherever possible.


Key Points

  • Support for the LlamaBidirectionalModel architecture
  • Minimal changes: copies and adapts the existing LLaMA implementation in src/models/llama.cpp to handle the bidirectional attention required for embeddings (see the sketch below)
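
For intuition, here is a minimal numpy sketch of the behavioral difference, as a conceptual illustration only and not the actual C++ change: a causal decoder masks out future positions before the attention softmax, while a bidirectional embedding model attends over all positions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_tokens = 4
scores = np.random.randn(n_tokens, n_tokens)   # raw attention scores

# Causal (decoder-style): -inf above the diagonal kills attention to future tokens.
causal_mask = np.triu(np.full((n_tokens, n_tokens), -np.inf), k=1)
causal_attn = softmax(scores + causal_mask)    # row i weights only positions j <= i

# Bidirectional (embedding-style): no mask, every token attends to every token.
bidirectional_attn = softmax(scores)

print(causal_attn)
print(bidirectional_attn)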

Validation

  • Successfully tested with:
    • nvidia/llama-embed-nemotron-8b
    • nvidia/llama-nemotron-embed-1b-v2
  • Embedding outputs match the expected shape and value ranges of the reference implementations

GGUF Models

sabafallah/llama-nemotron-embed-1b-v2-GGUF
sabafallah/llama-embed-nemotron-8b-GGUF
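
To reproduce the tests below, the server needs to be started in embeddings mode; a typical invocation (model path illustrative):

llama-server -m llama-nemotron-embed-1b-v2-Q4_K_M.gguf --embeddings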

Tests

llama-nemotron-embed-1b-v2

curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "llama-nemotron-embed-1b-v2-Q4_K_M.gguf",
  "input": [
    "query: How do neural networks learn patterns from examples?",
    "passage: Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "passage: Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down."
  ]
}'

llama-embed-nemotron-8b

curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "llama-embed-nemotron-8b-Q4_K_M.gguf",
  "input": [
    "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?",
    "Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down."
  ]
}'
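
A quick way to sanity-check the returned vectors is to compare cosine similarities: the query should score noticeably higher against the on-topic passage than the off-topic one. A minimal Python sketch, assuming the server is running on localhost:8080 as in the curl calls above (inputs shortened for brevity):

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    headers={"Authorization": "Bearer no-key"},
    json={
        "model": "llama-nemotron-embed-1b-v2-Q4_K_M.gguf",
        "input": [
            "query: How do neural networks learn patterns from examples?",
            "passage: Deep learning models adjust their weights through backpropagation.",
            "passage: Market prices are determined by supply and demand.",
        ],
    },
)
vecs = [np.asarray(d["embedding"]) for d in resp.json()["data"]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("on-topic :", cosine(vecs[0], vecs[1]))  # expected: higher
print("off-topic:", cosine(vecs[0], vecs[2]))  # expected: lower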

@github-actions github-actions bot added model Model specific python python script changes labels Dec 20, 2025
@sfallah sfallah marked this pull request as ready for review December 22, 2025 12:46
@sfallah sfallah changed the title from "model: support nvidia/llama-embed-nemotron" to "model: support for LlamaBidirectionalModel architecture" Dec 22, 2025
@sfallah sfallah requested a review from CISC December 22, 2025 17:38
@CISC (Collaborator) commented Dec 22, 2025

Instead of duplicating the model code, perhaps just template it, as is done for the SWA/non-SWA models, just with/without cache and output.

Review thread on the conversion script:

raise ValueError(f"Unprocessed experts: {experts}")

@ModelBase.register("LlamaBidirectionalModel")
CISC (Collaborator):

Have you tested whether adding LlamaBidirectionalForSequenceClassification just works as well?

sfallah (Contributor, Author):

I have not tested it, but I have had a look at nvidia/llama-nemotron-rerank-1b-v2.
I think the classifier tensor (score.weight) needs to be added for it to work.
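
For reference, what the missing head computes is small: a sequence-classification/rerank model projects the pooled hidden state through that score.weight tensor to get a single relevance score. A hypothetical numpy sketch with illustrative shapes (the real hidden size and pooling are model-specific):

import numpy as np

n_embd = 2048                              # assumed hidden size, illustrative
pooled = np.random.randn(n_embd)           # pooled last-layer hidden state
score_weight = np.random.randn(1, n_embd)  # the score.weight classifier tensor

relevance = score_weight @ pooled          # shape (1,): the rerank score
print(float(relevance[0]))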

@CISC CISC merged commit 54132f1 into ggml-org:master Dec 24, 2025
74 of 75 checks passed
@CISC CISC linked an issue Dec 24, 2025 that may be closed by this pull request
ppaleja pushed a commit to ppaleja/llama.cpp that referenced this pull request Dec 26, 2025

* model: llama-embed-nemotron

* minor: python lint

* changed arch-name

* templated llm_build_llama to be used for both llama and llama-embed arch