llama: Attempt to add ModernBert #14014
Conversation
The embedding result seems random and very low. There is something wrong with this.
CISC left a comment
Delete the files you added in models, we don't need them, just make sure test-tokenizer-0 succeeds with the GGUF.
src/llama-model.cpp
Outdated
```cpp
inpL = build_norm(inpL, model.tok_norm, nullptr, LLM_NORM, -1);
cb(inpL, "inp_norm", -1);

auto * inp_attn = build_attn_inp_kv_unified_iswa();
```
This should probably become:
```diff
- auto * inp_attn = build_attn_inp_kv_unified_iswa();
+ auto * inp_attn = build_attn_inp_no_cache_iswa();
```
And add the corresponding mask logic in llama-graph. Special attention should be paid to how the SWA works for this model - i.e. whether it is symmetric or not:
```
# non-symmetric
token i attends to [i - n_swa, i]

# symmetric
token i attends to [i - n_swa/2, i + n_swa/2]
```
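The two window variants above can be sketched as mask predicates. This is an illustrative sketch, not the PR's actual code; the function names and the convention that `true` means "masked out" are assumptions here:

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch: return true when key position p1 should be MASKED for query
// position p0 (i.e. the pair falls outside the attention window).

// non-symmetric: token i attends to [i - n_swa, i]
static bool is_masked_swa(int32_t n_swa, int32_t p0, int32_t p1) {
    return p1 > p0 || p0 - p1 > n_swa;
}

// symmetric: token i attends to [i - n_swa/2, i + n_swa/2]
static bool is_masked_swa_symmetric(int32_t n_swa, int32_t p0, int32_t p1) {
    return std::abs(p1 - p0) > n_swa / 2;
}
```

Note the symmetric variant drops causality entirely: future tokens inside the half-window stay visible, which matters for an encoder model like ModernBERT.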
@huydt-bti Hey, is the issue that you forgot to make this function so that swa is actually never applied?
@CISC No, I made it here:
Lines 349 to 352 in 454d7b7
I use build_attn_inp_no_cache()
#14014 (review)
Yes, but there's no kq_mask_swa, so is this even executed?
@CISC I have just implemented it. Please check again
Yep, I also noticed the same.
@CISC After fixing that, the result is much better now :) but it is still lower than my expectations for ModernBert. Maybe there is a problem somewhere else...
CISC left a comment
Everything else LGTM, so pending finding the output issue.
src/llama-kv-cache-unified.cpp
Outdated
```cpp
            } break;
        case LLAMA_SWA_TYPE_SYMMETRIC:
            {
                if (p1 - p0 <= (int32_t) n_swa / 2 || p0 - p1 >= (int32_t) n_swa / 2) {
```
@CISC I see part of the problem! I'm masking the tokens inside the window, when it should be the ones outside.
No, the function isn't used because it belongs to llama_kv_cache_unified
src/llama-graph.cpp
Outdated
```cpp
}

// Handle symmetric SWA mask separately if it exists
if (kq_mask_swa) {
```
No, this is unnecessary duplication, it should be handled like this once you add llm_graph_input/build_attn_inp_no_cache_iswa:
Lines 356 to 370 in d7da8dc
It is not interleaved SWA, so I still prefer using build_attn_inp_no_cache. I will try to refactor llm_graph_input_attn_no_cache::set_input, but the mechanism is different from llm_graph_input_attn_kv_unified and llm_graph_input_attn_kv_unified_iswa, since they use kv_state.
Sure, the point is just that you can handle it similarly.
I just pushed my code - but it's ugly to copy is_masked_swa and place it inside llm_graph_input_attn_no_cache::set_input. Do you have any suggestions?
I don't understand why you can't split the methods just like in unified, you can have a is_masked_swa implementation for no_cache and a much cleaner set_input.
Personally I don't think that would be cleaner, since the new build_attn_inp_no_cache_swa would be almost the same as the current build_attn_inp_no_cache, and we would end up with an additional build_attn_inp_no_cache variant. But I will try that.
That doesn't matter, build_attn_inp_* are tiny, the important part is that you can reuse the same set_input_kq_mask.
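The suggested factoring could look something like the sketch below. All names here are illustrative assumptions, not llama.cpp's actual API: the point is only that a single free-function predicate can serve both the unified-KV and no-cache `set_input` paths instead of each class duplicating the window logic.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical SWA variants (the real enum in llama.cpp may differ)
enum swa_type { SWA_TYPE_NONE, SWA_TYPE_STANDARD, SWA_TYPE_SYMMETRIC };

// Shared predicate: true when key position p1 falls outside the window
// of query position p0. Causal masking for the standard case is assumed
// to be applied separately by the caller.
static bool is_masked_swa(swa_type type, int32_t n_swa, int32_t p0, int32_t p1) {
    switch (type) {
        case SWA_TYPE_STANDARD:  return p0 - p1 > n_swa;              // attends to [i - n_swa, i]
        case SWA_TYPE_SYMMETRIC: return std::abs(p1 - p0) > n_swa / 2; // attends to [i - n_swa/2, i + n_swa/2]
        case SWA_TYPE_NONE:
        default:                 return false;
    }
}
```

Both `set_input` implementations would then just loop over (p0, p1) pairs and call this when filling their kq_mask tensors, which is the reuse being argued for above.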
@huydt84 How is progress on this? It would be very nice to bring it to completion soon; unfortunately there have been a lot of internal changes lately, so I hope resolving conflicts won't be too much of a chore...
@CISC my desperate attempts "to make it work" led to no results, and my local branches have become a mess now :)) I'm sorry, but I currently don't work on this anymore, so it would be really helpful if other people could add support for this model, whether using the code from this PR or not (that's why I'm keeping this PR open).
I don't know whether my implementation is correct or not.