Fix DenseRetrievalExactSearch evaluation#154
Fix DenseRetrievalExactSearch evaluation#154NouamaneTazi wants to merge 2 commits intobeir-cellar:mainfrom
DenseRetrievalExactSearch evaluation#154Conversation
Muennighoff
left a comment
There was a problem hiding this comment.
I'm not fully understanding yet, maybe you can help me out 😅🧐
| corpus_ids = sorted(corpus, key=lambda k: len(corpus[k].get("title", "") + corpus[k].get("text", "")), reverse=True) | ||
| if ignore_identical_ids: | ||
| # We remove the query from results if it exists in corpus | ||
| corpus_ids = [cid for cid in corpus_ids if cid not in query_ids] |
There was a problem hiding this comment.
Doesn't this make the task "easier" by removing all other queries as options for each query?
I.e. previously, given query1 the model could wrongly retrieve query2 (if it was also in the corpus).
Now the model cannot retrieve any of the other queries which makes it easier assuming the answer is never another query.
There was a problem hiding this comment.
I think thus option was for Quora: You want to find paraphrases of queries, but not the original start query. But this original query will always be ranked first at it is also part of the corpus
There was a problem hiding this comment.
Which is why we have the ignore_identical_ids option I think. This PR only tries to fix ignore_identical_ids=True case
| cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k+1, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted) | ||
| cos_scores_top_k_values, cos_scores_top_k_idx = torch.topk(cos_scores, min(top_k, len(cos_scores[1])), dim=1, largest=True, sorted=return_sorted) |
There was a problem hiding this comment.
You write that Which means some queries would have top_k retrieved documents, while others have top_k-1 retrieved documents., but didn't this +1 ensure that that does not happen cuz we retrieve top_k+1 but then only allow top_k lateron?
There was a problem hiding this comment.
IIUC, the problem comes from this line
So we only keep the top_k (which sometimes include the query inside the retrieved docs)
There was a problem hiding this comment.
I see, I thought the if corpus_id != query_id: would ensure that the query would never be added to result_heaps[query_id] 🧐
There was a problem hiding this comment.
Hmm, then why do we get different results? 🧐
There was a problem hiding this comment.
It's easy to check, we just have to assert that number of results of each query is top_k. Can you check that please @Muennighoff ?
I noticed there was a problem in the way we handled queries that exist in the retrieval corpus. By default we have
ignore_identical_ids=Truewhich pops these duplicated queries from theresults. Which means some queries would havetop_kretrieved documents, while others havetop_k-1retrieved documents.Fixing this behaviour gives a noticeable change in scores. Here's the difference in scores noticed for
"intfloat/e5-large"on ArguAna evaluated using MTEB:Scores before fix:
Scores after fix:
cc @thakur-nandan