
Conversation

@sfallah (Contributor) commented Dec 20, 2025

This PR adds support for LlamaBidirectionalModel architectures, resolving feature request #17478.

The implementation enables bidirectional LLaMA embedding models such as:

  • nvidia/llama-embed-nemotron-8b
  • nvidia/llama-nemotron-embed-1b-v2

while keeping changes minimal by reusing the existing LLaMA implementation wherever possible.


Key Points

  • Support for the LlamaBidirectionalModel architecture
  • Minimal changes: copies and adapts the existing LLaMA implementation in src/models/llama.cpp to handle the bidirectional attention required for embeddings (see the sketch below)
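
For intuition, here is a minimal numpy sketch of the behavioral difference, as a conceptual illustration only and not the actual C++ change: a causal decoder masks out future positions before the attention softmax, while a bidirectional embedding model attends over all positions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_tokens = 4
scores = np.random.randn(n_tokens, n_tokens)   # raw attention scores

# Causal (decoder-style): -inf above the diagonal kills attention to future tokens.
causal_mask = np.triu(np.full((n_tokens, n_tokens), -np.inf), k=1)
causal_attn = softmax(scores + causal_mask)    # row i weights only positions j <= i

# Bidirectional (embedding-style): no mask, every token attends to every token.
bidirectional_attn = softmax(scores)

print(causal_attn)
print(bidirectional_attn)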

Validation

  • Successfully tested with:
    • nvidia/llama-embed-nemotron-8b
    • nvidia/llama-nemotron-embed-1b-v2
  • Embedding outputs match the expected shape and value ranges of the reference implementations

GGUF Models

sabafallah/llama-nemotron-embed-1b-v2-GGUF
sabafallah/llama-embed-nemotron-8b-GGUF
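
To reproduce the tests below, the server needs to be started in embeddings mode; a typical invocation (model path illustrative):

llama-server -m llama-nemotron-embed-1b-v2-Q4_K_M.gguf --embeddings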

Tests

llama-nemotron-embed-1b-v2

curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "llama-nemotron-embed-1b-v2-Q4_K_M.gguf",
  "input": [
    "query: How do neural networks learn patterns from examples?",
    "passage: Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "passage: Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down."
  ]
}'

llama-embed-nemotron-8b

curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
  "model": "llama-embed-nemotron-8b-Q4_K_M.gguf",
  "input": [
    "Instruct: Given a question, retrieve passages that answer the question\nQuery: How do neural networks learn patterns from examples?",
    "Deep learning models adjust their weights through backpropagation, using gradient descent to minimize error on training data and improve predictions over time.",
    "Market prices are determined by the relationship between how much people want to buy a product and how much is available for sale, with scarcity driving prices up and abundance driving them down."
  ]
}'
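
A quick way to sanity-check the returned vectors is to compare cosine similarities: the query should score noticeably higher against the on-topic passage than the off-topic one. A minimal Python sketch, assuming the server is running on localhost:8080 as in the curl calls above (inputs shortened for brevity):

import numpy as np
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    headers={"Authorization": "Bearer no-key"},
    json={
        "model": "llama-nemotron-embed-1b-v2-Q4_K_M.gguf",
        "input": [
            "query: How do neural networks learn patterns from examples?",
            "passage: Deep learning models adjust their weights through backpropagation.",
            "passage: Market prices are determined by supply and demand.",
        ],
    },
)
vecs = [np.asarray(d["embedding"]) for d in resp.json()["data"]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("on-topic :", cosine(vecs[0], vecs[1]))  # expected: higher
print("off-topic:", cosine(vecs[0], vecs[2]))  # expected: lower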

@github-actions github-actions bot added model Model specific python python script changes labels Dec 20, 2025
@sfallah sfallah marked this pull request as ready for review December 22, 2025 12:46
@sfallah sfallah changed the title from "model: support nvidia/llama-embed-nemotron" to "model: support for LlamaBidirectionalModel architecture" Dec 22, 2025
@sfallah sfallah requested a review from CISC December 22, 2025 17:38
@CISC (Collaborator) commented Dec 22, 2025

Instead of duplicating the model code, perhaps just template it, as is done for the SWA/non-SWA models, just with/without cache and output.

Review thread on the conversion script:

raise ValueError(f"Unprocessed experts: {experts}")

@ModelBase.register("LlamaBidirectionalModel")
CISC (Collaborator):

Have you tested whether adding LlamaBidirectionalForSequenceClassification just works as well?

sfallah (Contributor, Author):

I have not tested it, but I have had a look at nvidia/llama-nemotron-rerank-1b-v2.
I think the classifier tensor (score.weight) needs to be added for it to work.
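
For reference, what the missing head computes is small: a sequence-classification/rerank model projects the pooled hidden state through that score.weight tensor to get a single relevance score. A hypothetical numpy sketch with illustrative shapes (the real hidden size and pooling are model-specific):

import numpy as np

n_embd = 2048                              # assumed hidden size, illustrative
pooled = np.random.randn(n_embd)           # pooled last-layer hidden state
score_weight = np.random.randn(1, n_embd)  # the score.weight classifier tensor

relevance = score_weight @ pooled          # shape (1,): the rerank score
print(float(relevance[0]))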

@CISC CISC merged commit 54132f1 into ggml-org:master Dec 24, 2025
74 of 75 checks passed
@CISC CISC linked an issue Dec 24, 2025 that may be closed by this pull request
ppaleja pushed a commit to ppaleja/llama.cpp that referenced this pull request Dec 26, 2025

* model: llama-embed-nemotron

* minor: python lint

* changed arch-name

* templated llm_build_llama to be used for both llama and llama-embed arch