Releases: embeddings-benchmark/mteb
2.12.26
2.12.26 (2026-04-21)
Fix
-
fix: HF benchmark result (#4344)
-
init benchmark eval results
-
add get score to benchmark
-
update scoring
-
add method for benchmark card creation
-
fix typing (990c1cf)
Unknown
-
[MVEB] Add fps implementation to Video Sampling (#4441)
-
fix: Reclassify SIBFLEURS as AudioClassification instead of AudioMultilabelClassification
-
fix: Move SIBFLEURS descriptive stats to AudioClassification
-
refactor: add FPS-based video frame sampling to collator
FramesCollator and VideoCollator now support two modes:
- FPS-based (default, fps=2.0): frame count scales with video duration,
with max_frames=128 as a safety cap for long videos
- Fixed-sample (num_frames=N): always selects exactly N frames uniformly,
preserving the previous behavior for models that need it
Existing callers (PE-AV, random baseline) switched to num_frames to
preserve their current fixed-sample behavior.
- refactor: switch PE-AV and random baseline to FPS-based frame sampling
Both models now use the default FPS-based mode (fps=2.0, max_frames=256)
instead of fixed num_frames. This gives duration-proportional frame
coverage across videos of different lengths.
- fix: address PR review - defaults to None, use end_stream_seconds
- Set fps and max_frames defaults to None so models can skip collator
resampling and let their own processors handle frame selection
- Use video.metadata.end_stream_seconds for duration instead of
computing num_frames / average_fps (handles VFR videos correctly)
- When both fps and num_frames are None, return all frames as-is
instead of raising an error
- PE-AV and random baseline explicitly set fps=2.0 to avoid decoding
all frames unnecessarily
- refactor: expose collator params in PE-AV init
Allow fps, max_frames, num_frames, and max_samples to be configured
via the PE-AV wrapper constructor instead of being hardcoded.
Defaults to fps=2.0 matching the standard video understanding rate.
-
fix: address PR review - rename max_frames to max_fps_frames, raise on conflicting args
-
fix: rename max_fps_frames back to max_frames, clarify docstrings
-
fix: pass fps=None, num_frames=16 to 16-frame PE-AV variants
The *-16-frame checkpoints were trained with fixed 16-frame uniform sampling
(processor config has do_sample_frames=true, num_frames=16). Without
explicit loader_kwargs, the collator used the default fps=2.0, producing
~40 frames on typical clips that the processor then re-sampled down to 16 —
a distribution shift from training. Setting num_frames=16 makes the
collator do the sampling directly, and the processor's built-in sample
becomes an identity no-op.
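The sampling rules described in this entry can be sketched roughly as follows. This is a minimal illustration, not the collator's actual code: the function names `uniform_indices` and `select_frame_indices` are hypothetical, the duration is passed in directly rather than read from video metadata, and at least one decoded frame is assumed.

```python
def uniform_indices(total_frames: int, n: int) -> list[int]:
    """Return n indices spread evenly over [0, total_frames - 1]."""
    if n <= 1:
        return [0]
    step = (total_frames - 1) / (n - 1)
    return [round(i * step) for i in range(n)]


def select_frame_indices(total_frames, duration_s, fps=None,
                         num_frames=None, max_frames=None):
    """Hypothetical helper mirroring the two modes described above."""
    if fps is not None and num_frames is not None:
        # Conflicting arguments: the entry says this case raises.
        raise ValueError("Pass either fps or num_frames, not both")
    if fps is None and num_frames is None:
        # Neither set: return every frame untouched.
        return list(range(total_frames))
    if num_frames is not None:
        # Fixed-sample mode: exactly num_frames indices, uniformly spaced.
        return uniform_indices(total_frames, num_frames)
    # FPS mode: frame count scales with duration, capped by max_frames.
    target = int(duration_s * fps)
    if max_frames is not None:
        target = min(target, max_frames)
    # Downsample only: never request more frames than were decoded.
    target = max(1, min(target, total_frames))
    return uniform_indices(total_frames, target)
```

For a 20-second clip decoded at 30 frames per second (600 frames), `fps=2.0` yields 40 indices, while `num_frames=16` yields exactly 16, which matches the behavior described for the 16-frame PE-AV variants.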
-
fix: clarify fps docstrings - downsamples only, no upsampling (9363ea7)
-
Don't display license links in the documentation (#4465)
-
leaderboard: add MTEB(spa, v1) to Language-specific section (#4217)
Add MTEB(spa, v1) to leaderboard language-specific menu
Co-authored-by: Clemente <clemente@Clementes-MacBook-Pro.local> (e5521a6)
-
Add VALOR-32K retrieval tasks (#4453)
-
Add VALOR-32K retrieval tasks (v2t, t2v, va2t, t2va)
Adds four bidirectional multimodal retrieval tasks for the VALOR-32K
dataset (mteb/VALOR-32K), a vision-audio-language benchmark with 3,491
test samples.
Made-with: Cursor
- fix: correct BibTeX field order for VALOR-32K citation
Made-with: Cursor (792f61f)
2.12.25
2.12.25 (2026-04-20)
Fix
-
fix: drop unused modality columns in dataloader for cross-modal tasks (#4440)
-
fix: handle None text/image in multimodal retrieval tasks
Cross-modal retrieval tasks (CIRRIT2IRetrieval, NIGHTSI2IRetrieval,
Fashion200kI2TRetrieval, VisualNewsI2TRetrieval) have corpus/query
items where text or image can be None for single-modality entries.
- _corpus_to_dict: handle None text/title gracefully
- _custom_collate_fn: allow None values in batches instead of raising
- _combine_queries_with_instruction_text: skip string ops on None text
- random_baseline: skip None items when encoding each modality
Closes #4436
- fix: handle None query text in dataloader instead of downstream models
Normalize None text to "" in _combine_queries_with_instruction_text,
matching the existing pattern in _corpus_to_dict. Revert random_baseline
and collation changes as they're no longer needed.
- fix: drop unused modality columns in dataloader to prevent None errors
Cross-modal retrieval tasks have None values for modalities not used by
that side of the retrieval (e.g. text=None in image-only corpus for it2i
tasks). Instead of adding None-guards throughout the collate function and
models, drop columns for modalities not needed for the current prompt
type in _prepare_dataset. The task category (e.g. it2i) already encodes
which modalities each side needs.
Closes #4436
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> (eaf2c9d)
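As a rough illustration of the column-dropping approach in #4440: the task category string encodes which modalities each side of the retrieval needs, so unused modality columns can be removed before collation. The mapping `MODALITY_COLUMNS` and both function names below are assumptions made for this sketch, and rows are plain dicts rather than a real dataset object.

```python
# Hypothetical mapping from category letters to dataset column names.
MODALITY_COLUMNS = {"t": "text", "i": "image", "a": "audio", "v": "video"}


def needed_columns(category: str, side: str) -> set[str]:
    """Columns needed for one side of a category such as 'it2i'."""
    query_codes, corpus_codes = category.split("2")
    codes = query_codes if side == "query" else corpus_codes
    return {MODALITY_COLUMNS[c] for c in codes}


def drop_unused_modalities(rows, category, side):
    """Drop modality columns this side does not use; keep all other columns."""
    keep = needed_columns(category, side)
    modality_cols = set(MODALITY_COLUMNS.values())
    return [
        {k: v for k, v in row.items() if k not in modality_cols or k in keep}
        for row in rows
    ]


# An image-only corpus for an it2i task: the None text column disappears.
corpus = [{"id": "d1", "text": None, "image": "img-bytes"}]
cleaned = drop_unused_modalities(corpus, "it2i", "corpus")
```

Dropping the column up front means the collate function and the models never see the `None` values, which is the motivation the entry gives for preferring this over scattered None-guards.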
2.12.24
2.12.24 (2026-04-20)
Fix
-
fix: remove columns with none (#4446)
-
remove columns with none
-
Apply suggestions from code review
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me>
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Kenneth Enevoldsen <kenevoldsen@pm.me> (63c0af6)
Unknown
- [Mveb] vggsound classification (#4444)
[MVEB] Add VGGSound audio-visual classification tasks
Add VGGSoundVAClassification (video+audio, va2c) and VGGSoundVClassification (video-only, v2c) for the VGGSound audio-visual dataset (Chen et al., ICASSP 2020).
Dataset contains 9,888 test clips across 308 sound classes from YouTube videos. Audio is the primary signal in the original task; the v2c variant serves as a
video-only baseline. Uses 5-fold cross-validation since the released split only contains test. Follows the standard MVEB classification task structure. Addresses
part of #4130 (MVEB Overview - Classification).
Co-authored-by: Yashwanth Devavarapu <yashwanthdevavarapu@Yashwanths-MacBook-Pro.local> (bf113b4)
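The VGGSound entry mentions 5-fold cross-validation because only a test split is released. The changelog does not say how the folds are drawn, so the following is only a generic sketch of a shuffled k-fold split; the function name `kfold_splits`, the shuffling, and the seed are assumptions.

```python
import random


def kfold_splits(n_samples: int, n_folds: int = 5, seed: int = 42):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Fold sizes differ by at most one sample.
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


# 9,888 clips split into 5 folds of 1,978 or 1,977 clips each.
splits = kfold_splits(9888)
```

Each clip appears in exactly one held-out fold, so every sample contributes to the reported score once.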
- add mteb/Shot2Story20K dataset (#4449)
add mteb/Shot2Story20K_test dataset (b597f37)
-
Add YouCook2_val retrieval tasks (#4432)
-
Add YouCook2_val retrieval tasks (V2T, T2V, A2T, T2A)
Made-with: Cursor
- Update mteb/tasks/retrieval/eng/youcook2_retrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
add stats
-
update
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (f0e5ab6)
-
Add VATEX retrieval tasks (#4433)
-
Add VATEX_test_1k retrieval tasks (V2T, T2V, A2T, T2A)
Made-with: Cursor
- Update mteb/tasks/retrieval/eng/vatex_retrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- Update mteb/tasks/retrieval/eng/vatex_retrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- update
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (f93dc75)
-
model: add BidirLM/BidirLM-Omni-2.5B-Embedding (#4370)
-
feat: add BidirLM/BidirLM-Omni-2.5B-Embedding model implementation
-
fix: address reviewer comments on BidirLM-Omni-2.5B-Embedding:
- Load model via SentenceTransformer with trust_remote_code=True
- Remove the _get_instruction() description fallback. Add explicit
prompts for 22 MIEB/MAEB tasks
-
refactor: update to sentence transformers 5.4 and rely on encode function to get embedding
-
Apply suggestions from code review
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
Feat: instruct with prompt args chat template
-
Fix: rely on EncodeKwargs for encoder function
-
Feat: improve readability
-
Refactor: Change how modality are passed to encode
-
Fix: lint error
-
Refactor: args encode
-
Refactor: Import from Bidir
-
comments update
-
Simplify get instruction (no need for _lookup_prompt stripped)
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (9e5c592)
2.12.23
2.12.23 (2026-04-19)
Fix
- fix: query filtering for audio (#4430)
update query filtering (65025bb)
Unknown
-
[MVEB] Add SomethingSomethingV2 video classification task (#4434)
-
[MVEB] Add SomethingSomethingV2 video classification task
-
fix: correct bibtex authors for SomethingSomethingV2
- Fix Peter Yiber -> Peter Yianilos
- Fix Florian Bax -> Ingo Bax
- Remove non-authors Manuel Gallo and Ahmed Mehri
- Add missing authors: Moritz Mueller-Freitag, Florian Hoppe, Christian Thurau
- Fix Materzynska -> Materzy{'n}ska (proper diacritics)
Co-authored-by: zach <zacharie@example.com> (f5775fc)
2.12.22
2.12.22 (2026-04-18)
Fix
- fix: KeyError on aggregated tasks with eval_langs (#4439)
Fix KeyError on aggregated tasks with dict eval_langs
When aggregated tasks (e.g. VisualSTS17Multilingual) have eval_langs
as a dict, hf_subsets_to_langscripts lacks a "default" key. The
aggregated score uses "default" as subset, causing a KeyError in
TaskResult.from_task_results. Fall back to collecting all languages
from the mapping when the subset key is missing.
Closes #4437
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> (5fc2867)
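The fallback described in #4439 can be pictured as follows. This is a sketch under assumptions: `langs_for_subset` is a hypothetical helper, and the mapping is a plain dict from subset name to a list of language-script codes.

```python
def langs_for_subset(hf_subsets_to_langscripts: dict, subset: str) -> list[str]:
    """Look up languages for a subset, falling back to all languages."""
    if subset in hf_subsets_to_langscripts:
        return list(hf_subsets_to_langscripts[subset])
    # Subset key (e.g. "default" on an aggregated task) is missing:
    # collect every language in the mapping, keeping first-seen order.
    seen = {}
    for langs in hf_subsets_to_langscripts.values():
        for lang in langs:
            seen.setdefault(lang, None)
    return list(seen)


mapping = {"en-en": ["eng-Latn"], "es-en": ["spa-Latn", "eng-Latn"]}
all_langs = langs_for_subset(mapping, "default")
```

With this shape, an aggregated score stored under `"default"` resolves to the union of the task's languages instead of raising a KeyError.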
Unknown
-
Fix: apply skip_first_result when computing hit_rate metric (#4427)
-
Fix skip_first_result not applied to hit_rate metric
-
lint
Co-authored-by: Rakshitha Ireddi <rakshithaireddi@Rakshithas-MacBook-Pro.local>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (8570b74)
-
[MVEB] Adding MUSIC-AVQA Task (Clustering) (#4426)
-
[MVEB] Adding MUSIC-AVQA Task (Clustering)
-
simplify description (04f2f4b)
-
[MVEB] Add Breakfast video classification task (#4431)
-
[MVEB] Add Breakfast video classification task
Add BreakfastClassification task for the Breakfast Actions dataset (Kuehne et al., CVPR 2014). The dataset contains 433 videos of 10 breakfast-related activities recorded in 18 kitchens. Uses 5-fold cross-validation since the dataset only has a test split.
Random baseline accuracy: 0.1247 (near-random for 10 classes).
Addresses part of #4130 (MVEB Overview - Classification).
-
lint
Co-authored-by: Yashwanth Devavarapu <yashwanthdevavarapu@Yashwanths-MacBook-Pro.local>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (98d682c)
-
Add ActivityNet_Captions_val2 video retrieval tasks (V2T and T2V) (#4429)
-
Add ActivityNet_Captions_val2 video retrieval tasks (V2T and T2V)
Made-with: Cursor
- Fix __all__ sort order for isort-style linting (RUF033)
Made-with: Cursor
- Update mteb/tasks/retrieval/eng/activitynet_captions_t2v_retrieval.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- Consolidate ActivityNet Captions retrieval tasks into a single file
Made-with: Cursor
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (7539339)
-
add mteb/DiDeMo dataset (#4425)
-
add mteb/MSVD dataset
-
add mteb/DiDeMo dataset
-
add comb
-
update
-
Consolidate DiDeMo retrieval tasks into a single file
Merge 4 separate DiDeMo task files into didemo_retrieval.py with a
shared _load_didemo helper, reducing duplication while preserving
all task names and metadata.
Made-with: Cursor
- Remove unused Dataset import from didemo_retrieval
Made-with: Cursor (2e00e5b)
- add mteb/TUNA-Bench_1K dataset (#4428)
Add TUNA-Bench_1K video retrieval tasks (V2T and T2V)
Made-with: Cursor (b14dcc5)
- Update vllm_wrapper.py (#4418)
fix compatibility with newer vllm versions (5ef64c3)
2.12.21
2.12.21 (2026-04-18)
Ci
- ci: add workflow to auto-update leaderboard model list (#4402)
Adds a standalone script that generates the model list from scratch
and a CI workflow that pushes it to the HF leaderboard space weekly,
on model file changes, or via manual dispatch.
Fix
-
fix: Add required_dependencies to model meta (#4356)
-
add required_dependencies to model meta
-
add extra group name
-
add to model to python
-
update handling dependencies
-
fix deps
-
fix test
-
remove usage of requires_package
-
remove image/audio dependencies
-
fixes after merge
-
add deprecated function
-
fix test
-
skip check for baseline
-
fix test
-
update lock
-
optionally check torchaudio in test (e2e7174)
Unknown
- Remove video folder (#4424)
remove video folder (011bbf5)
update dataset card (43d1b21)
-
tests: Add test to ensure coverage of reference models (#4216)
-
Reference models tests
-
Reference models tests
-
Reference models tests
-
fix: address PR review comments for reference model tests
- Use cache.load_results() instead of manually walking cache directories
- Dynamically compute target benchmarks from all leaderboard benchmarks
minus an exclusion list, so new benchmarks are automatically tested
- Add text-only modality check for task-model compatibility
- Filter retrieval-only models by task type AND text modalities
- fix: use isinstance check for retrieval subtypes
Check isinstance(task, AbsTaskRetrieval) instead of string comparison
with task.metadata.type, so reranking and instruction retrieval tasks
are correctly included for retrieval-only models like bm25s.
- fix: handle empty sim_scores in confidence_scores
Return zero confidence scores when sim_scores list is empty,
which can happen when BM25 returns no results for a query
in reranking tasks.
- fix: address PR review comments for reference model tests
- Remove RTEB variant exclusions to test all RTEB benchmarks
(per Kenneth's feedback to include the full RTEB set)
- fix: use benchmark_selector.py as source of truth for leaderboard benchmarks
Address Kenneth's review comments:
- Use GP_BENCHMARK_ENTRIES + R_BENCHMARK_ENTRIES from benchmark_selector.py
instead of display_on_leaderboard flag (which includes benchmarks not
actually shown on the leaderboard)
- Clean up EXCLUDED_BENCHMARKS to only contain actual leaderboard benchmarks
(multimodal ones that text-only reference models can't run)
- Remove RTEB variant exclusions to test the full RTEB set
- fix: remove all benchmark exclusions, rely on task-level filtering
Task-level filtering (_is_text_only_task, RETRIEVAL_ONLY_MODELS) already
handles model-task compatibility. No need to exclude entire benchmarks —
non-text tasks within multimodal benchmarks are skipped automatically.
- fix: use display_on_leaderboard flag now that PR #4288 is merged
Simplify _get_target_benchmarks to use display_on_leaderboard=True,
which now correctly reflects the actual leaderboard (fixed in #4288).
Remove benchmark_selector imports and exclusion list — task-level
filtering handles model-task compatibility.
- fix: pass Benchmark objects directly instead of names
Address Samoed's review: use Benchmark objects in parametrize
instead of looking up by name twice.
-
speedup test
-
fix issue with aggregate
-
fix: address review - reuse _check_model_modalities, trim workflow triggers
-
fix: restore TARGET_BENCHMARKS definition, remove stale _get_target_benchmarks call
-
fix: inline modality check to avoid private import, filter image-only tasks
-
fix: use strict modality subset check to exclude image/multimodal tasks
-
fix: restore RETRIEVAL_ONLY_MODELS for BM25 task filtering
-
fix: add mteb/benchmarks/** to workflow triggers
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (4a90b28)
-
[MVEB] Adding WorldSense1Min Task (Clustering) (#4393)
-
[MVEB] Adding WorldSense1Min Task (Clustering)
-
remove local test
-
Update mteb/tasks/__init__.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
removing stats
-
moving video clustering tasks to clustering
-
uncomment Video task
-
add results
-
update license
-
remove results
Co-authored-by: wissam-KH <wissam.siblini@komodohealth.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (f46cb7b)
-
[MVEB] Adding AVE-Dataset Task (Clustering) (#4416)
-
[MVEB] Adding AVE-Dataset Task (Clustering)
-
uncomment video clustering task
-
remove results (61e7f3f)
-
tests: add regression test for double loading (#4407)
add regression test (e946e1e)
-
add HMDB51 dataset (#4398)
-
add HMDB51 dataset
-
update
-
Update mteb/abstasks/task_metadata.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- Update mteb/tasks/classification/eng/hmdb51_classification.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
- fix lint
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (22bc680)
2.12.20
2.12.20 (2026-04-16)
Fix
-
fix: handle Transformers v5 BaseModelOutputWithPooling return types i… (#4328)
-
fix: handle Transformers v5 BaseModelOutputWithPooling return types
Transformers v5 changed get_text_features, get_image_features, and
get_audio_features to return BaseModelOutputWithPooling instead of
plain tensors. This caused AttributeError when tensor operations
like .norm() were applied directly to the output.
Added isinstance(output, BaseModelOutputWithPooling) checks to
extract pooler_output when needed, maintaining backward compatibility
with Transformers v4 tensor returns.
Affected model wrappers:
- clap_models.py: text path (audio path already handled)
- align_models.py: text and image paths
- wav2clip_model.py: text path (CLIP encoder)
- llm2clip_models.py: text and image paths
- siglip_models.py: text and image paths (previously accessed
.pooler_output directly without fallback)
Closes #4081
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- lint
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (367d554)
- fix: double retrieval dataset loading (#4399)
fix retrieval dataset loading (316fca3)
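For the Transformers v5 return-type fix (#4328) listed above, the unwrapping step can be sketched as below. The changelog describes an `isinstance(output, BaseModelOutputWithPooling)` check; this sketch swaps in an attribute check so it runs without Transformers installed. `PoolingOutputStandIn` and `unwrap_features` are hypothetical names used only for illustration, and plain lists stand in for tensors.

```python
from dataclasses import dataclass


@dataclass
class PoolingOutputStandIn:
    """Hypothetical stand-in for an output object carrying pooler_output."""
    pooler_output: list


def unwrap_features(output):
    """Return the pooled embedding whether output is wrapped or already plain."""
    pooled = getattr(output, "pooler_output", None)
    # v4-style plain returns have no pooler_output, so pass them through.
    return pooled if pooled is not None else output


plain = unwrap_features([0.1, 0.2])                          # v4-style return
wrapped = unwrap_features(PoolingOutputStandIn([0.3, 0.4]))  # v5-style return
```

Unwrapping once, right after the `get_*_features` call, lets later operations such as normalization work the same way under both major versions.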
2.12.19
2.12.19 (2026-04-16)
Documentation
-
docs: Update adding dataset checklist (#4394)
-
docs: Update adding dataset checklist
fix the checklist to make it less text-specific
- docs: Add score reproduction to PR requirements (#4396)
add score reproduction to description
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> (dc58c76)
Fix
-
fix: Auto add base model to ModelMeta (#4395)
-
fetch source model from hub
-
fix tests
-
check if model has model card attr (ed1833a)
Unknown
-
model: Add Google Gemini embedding 2 (#4247)
-
Adding Google Gemini embedding 2 model
-
feat: add per-task prompt mapping and multimodal support for Gemini Embedding 2
- Create GEMINI_EMBEDDING_2_PROMPTS dict with 132 per-task Google API
task type mappings from issue #4260
- Add GoogleGeminiEmbeddingModel class using google-genai SDK with
support for text, image, and interleaved text+image inputs
- Update ModelMeta to use new class, set modalities=["image", "text"]
- Add google_genai optional dependency to pyproject.toml
- feat: add audio modality support for Gemini Embedding 2
- Add _audio_to_wav_bytes helper to convert numpy audio arrays to WAV
- Handle audio inputs in encode() via Part.from_bytes with audio/wav MIME
- Update modalities to ["audio", "image", "text"]
- fix: strip google/ prefix from model name for Gemini API
The google-genai SDK's embed_content doesn't handle the "google/"
prefix format. Strip it in the constructor like Voyage does.
- fix: add exponential backoff retry for 429 rate limits
Retry up to 10 times with exponential backoff (60s, 120s, 240s...
up to 600s) when hitting API quota limits. Essential for large
multilingual benchmarks like MIRACL.
- refactor: address PR review comments
- Replace 132-entry per-task dict with task-type defaults + 62
per-task overrides (KennethEnevoldsen: use metadata)
- Add embed_dim parameter to GoogleGeminiEmbeddingModel (Samoed)
- Add title formatting for retrieval corpus docs (Samoed)
- Add batch size comment referencing API limits (Samoed)
- Simplify encode() control flow
-
fix: replace print with logger.warning for lint compliance
-
fix: handle audio+text interleaved input and note MRL embed_dim support
- Add audio+text branch in encode() for interleaved content
- Note embed_dim supports [768, 1536, 3072] once PR #4170 is merged
-
fix: use MRL embed_dim list and remove duplicate logger
-
fix unused param
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com> (633e41c)
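The 429 retry schedule mentioned in the Gemini entry (60s, 120s, 240s, capped at 600s, up to 10 retries) can be sketched like this. `RateLimitError`, `backoff_delays`, and `call_with_retry` are hypothetical names for illustration, not the google-genai SDK's or the wrapper's actual API.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error from the API client."""


def backoff_delays(max_retries: int = 10, base_s: float = 60.0,
                   cap_s: float = 600.0) -> list[float]:
    """Delays double from base_s and are capped at cap_s."""
    return [min(base_s * 2**attempt, cap_s) for attempt in range(max_retries)]


def call_with_retry(fn, sleep=time.sleep, max_retries: int = 10):
    """Call fn, sleeping with exponential backoff after each rate-limit error."""
    for delay in backoff_delays(max_retries):
        try:
            return fn()
        except RateLimitError:
            sleep(delay)
    # Final attempt after the last retry delay; let any error propagate.
    return fn()
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting; in real use the default `time.sleep` applies.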
-
Move Kinetics400 out of video and add zeroshot version (#4383)
-
distinguish between AV and V tasks
-
move out of video folder and add zeroshot version
-
fix task type
-
update task metadata based on discussions
-
fix mveb task type mapping
-
fix: Add is_beta to task metadata (#4392)
-
fix: Add is_beta to task metadata
- Added is_beta to the task metadata
- Added a warning on initializing a dataset when is_beta is True
- Added exclude_beta to get_tasks and filter_tasks, for now I set it to False
todo:
- add tests
-
add test and updates metadata
-
format
-
re-enable tests for beta datasets
-
format
-
feat: comment out MVEB task types without existing tasks
VideoClustering, VideoPairClassification, and VideoCentricQA are defined
in task_metadata but have no corresponding task implementations yet,
causing create_available_tasks.py to fail. Comment them out until tasks
are added. Also regenerate available_tasks docs and add qwen_omni_utils
optional dependency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fix: allow beta-only task types in create_available_tasks.py
Change assertion to <= so task types that only have beta tasks don't
break the docs generation. Use .get() with continue to skip task types
with no non-beta tasks.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- fix: skip video tasks missing descriptive_stats in metadata test
Skip Kinetics400 video tasks in test_all_metadata_is_filled_and_valid
until descriptive stats are added. Regenerate available_tasks docs.
-
revert: restore docs/overview/available_tasks to main
-
revert: remove all generated available_tasks changes from branch
Co-authored-by: Kenneth Enevoldsen <kennethcenevoldsen@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> (6ec3f40)
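The is_beta behavior from #4392 (a warning on load, plus an opt-in exclude_beta filter that defaults to False) might look roughly like this. `TaskStub`, `load_task`, and this standalone `filter_tasks` are simplified stand-ins written for illustration, and the task names are made up; they are not mteb's real classes or signatures.

```python
import warnings
from dataclasses import dataclass


@dataclass
class TaskStub:
    """Hypothetical minimal task record with a beta flag."""
    name: str
    is_beta: bool = False


def filter_tasks(tasks, exclude_beta: bool = False):
    """Keep all tasks by default; drop beta tasks only when asked to."""
    if exclude_beta:
        return [t for t in tasks if not t.is_beta]
    return list(tasks)


def load_task(task: TaskStub) -> TaskStub:
    """Warn when a beta task is initialized, then return it unchanged."""
    if task.is_beta:
        warnings.warn(f"{task.name} is a beta task and may change", UserWarning)
    return task


tasks = [TaskStub("StableTask"), TaskStub("BetaTask", is_beta=True)]
stable_only = filter_tasks(tasks, exclude_beta=True)
```

Defaulting `exclude_beta` to False keeps existing calls returning the same task set, so beta tasks are filtered out only on request.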
-
Add nicher92/saga-embed_v1 to MTEB models (#4371)
-
Add nicher92/saga-embed_v1 to MTEB models
-
Update training_datasets in ModelMeta
-
fix: fixed naming
-
Replace custom SagaModel class with standard SentenceTransformerEncoderWrapper and model_prompts dict
-
chore: remove lingering comment
-
Update mteb/models/model_implementations/saga_models.py
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
-
update meta
-
change parameters and memory usage
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> (d7c521c)
2.12.18
2.12.17
2.12.17 (2026-04-16)
Fix
- fix: Corrected incorrect model rename (#4391)
This gave the following incorrect warning:
DeprecationWarning: The model 'mteb/baseline-random-encoder' has been renamed to 'mteb/baseline-random-encoder'. To prevent this warning use the new name.
model = mteb.get_model_meta("mteb/baseline-random-encoder")
([`b65730d`](https://github.com/embeddings-benchmark/mteb/commit/b65730d833b3e321be759b243f62090066faad45))
## Unknown
* model: add BidirLM text embedding family (270M, 0.6B, 1B, 1.7B) (#4374)
* model: add BidirLM text embedding family (270M, 0.6B, 1B, 1.7B)
* Apply suggestions from code review
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com>
* run lint
---------
Co-authored-by: Isaac Chung <chungisaac1217@gmail.com>
Co-authored-by: Roman Solomatin <samoed.roman@gmail.com> ([`e8a4069`](https://github.com/embeddings-benchmark/mteb/commit/e8a40693b00005cb0362bd6c5798d61287a196f3))
* [MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset (#4199)
* fix: Reclassify SIBFLEURS as AudioClassification instead of AudioMultilabelClassification
* fix: Move SIBFLEURS descriptive stats to AudioClassification
* Adding video modality
* Add Kinetics-400 dataset
* Add pe_av model
* fix typo
* fix collator bug
* Edit selecting column in classification abstask
* Properly handle frames in PE_AV
* add self kwarg to method
* Add audio collator
* fix type error
* fix audio_video embeds object handling
* Add Ravdess_av clustering
* fix task metadata
* start video integration
* start video integration
* upd task structure
* upd video input type
* combine video and audio to dict
* fix task side
* fix pe_av model
* lower writer batch size
* fix col labels
* lint
* add pe_av model metadata
* fix datasets metadata
* remove accidentally committed files
* remove nested list structure from datasets
* edit collator to handle one video item
* multimodal collator + fix comments
* lint
* metadata update
* using forward pass to get embeds
* replace forward pass + add audio to msrvtt
* fix category metadata
* edit get embeddings
* add n_embedding_parameters
* change input col name to list
* lint + type check
* add classvar
* add str to classvar
* Change list to sequence
* lint + type check error
* edit dataloader and msrvtt handling of input column
* move sequence out of type checking
* fix random baseline
* add collator to random baseline
* restore previous dict structure + make audio optional
* clean structure
* lint
* safety check
* decrease writer batch size
* match msrvtt format
* type check fix
* refactor: keep video and audio as separate dataset columns
* fix: handle single-string input_column correctly in _prepare_dataset
* review fixes
* lint
* type hints fix
* address review: simplify input_column_name, remove VideoInputItem, fix collator output
- Revert input_column_name from Mapping[str, str] to str | Sequence[str]
- Remove VideoInputItem wrapper, pass frames tensor directly
- Make VideoCollator return BatchedInput (consistent with AudioCollator)
- MultimodalCollator uses static methods instead of chaining collators
* fix: update clustering_evaluator to use Sequence instead of Mapping
* fix: handle Sequence input_column_name in second create_dataloader call
* fix: skip statistics and text cleaning for multi-column video tasks
* fix: pass explicit None for TypedDict fields in multi-column statistics
* address Kenneth review: rename collators, update docs, simplify annotations
- Rename VideoCollator -> FramesCollator, MultimodalCollator -> VideoCollator
- Update VideoInput docstring to clarify frames-only, audio in AudioInput
- Update input_column_name docs in classification/clustering base classes
- Use ClassVar[Sequence[str]] for video task input_column_name
- Extract isinstance check to top of zeroshot evaluator __call__
- Improve task_pipelines.py skip comment for multi-column tasks
- Add TODO for MSR-VTT dataset reupload
* docs: link to encoder I/O types for default column names in input_column_name
* fix: raise NotImplementedError for multi-column task cleaning
* refactor: use tuples for input_column_name to avoid ClassVar
* refactor: move Sequence handling into create_dataloader, simplify callers
---------
Co-authored-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com> ([`5d3c845`](https://github.com/embeddings-benchmark/mteb/commit/5d3c8453db615a1a4ae4d53033711af21c1d502e))
* dataset: add BrowseComp-Plus (#4226)
* dataset: Add BrowseComp-Plus
* fix linting errors
* fixing bibtex formatting
* Split BrowseCompPlusRetrieval into gold_only and gold_and_evidence subsets
* fix: remove qa as a valid tag for metadata files
* simplify data loading by reuploading the data
---------
Co-authored-by: Kenneth <kennethenevoldsen@gmail.com> ([`e722b76`](https://github.com/embeddings-benchmark/mteb/commit/e722b7640ed1abee68c3df5023a186b36a15325f))