Gmail-style advanced search for dandiset listing #2814
Conversation
Recognize `key:value` operators inside the existing `/api/dandisets/?search=` parameter so users can filter by structured properties without leaving the listing's search box. The same syntax works from the dandi CLI, curl, or any HTTP client.

v1 operators (AND-combined; quoted multi-word values supported):
- `created_before` / `created_after` — filter on `Dandiset.created`
- `modified_before` / `modified_after` — filter on most-recent `Version.modified`
- `published_before` / `published_after` — filter on most-recent published `Version.created` (draft-only dandisets excluded)
- `has_species` — substring match on `AssetSearch.species`
- `has_approach` / `has_technique` / `has_standard` — case-insensitive substring match against any element of the `asset_metadata` JSON array's `.name`, via Postgres `jsonb_path_exists` with a parameterized `like_regex`
- `has_file_type` — `encodingFormat` istartswith, with the existing nwb/image/text/video aliases

Free text outside operators continues through the existing full-text path. Unknown operators fall through to free text (Gmail behavior); malformed dates fail closed (return 0 results) so typos don't silently expand the result set.

The frontend search box gets an updated placeholder and a help popover listing the available operators with examples. The `searchDandisets` route already forwards the raw `search=` string, so no other UI changes were needed.

Includes a `create_search_test_data` management command that seeds a matrix of dandisets covering every operator (idempotent — safe to re-run).

Tests:
- 13 unit tests for the parser (tokenization, quoted phrases, unknown operators, malformed dates, regex metacharacters in values)
- 17 integration tests against `/api/dandisets/?search=` covering each operator, multi-element JSON array matching, repeated operators (intersection / annotation reuse), free text + operator combinations, embargoed-asset visibility, and fail-closed date parsing
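A minimal sketch of the tokenization described above, assuming the names `OPERATOR_KEYS` and `parse_search` (the PR's actual code may differ in structure); `shlex` handles the quoted multi-word values:

```python
import re
import shlex

# Operator keys from the PR description.
OPERATOR_KEYS = {
    'created_before', 'created_after',
    'modified_before', 'modified_after',
    'published_before', 'published_after',
    'has_species', 'has_approach', 'has_technique',
    'has_standard', 'has_file_type',
}
_OPERATOR_RE = re.compile(r'^([a-z_]+):(.+)$')


def parse_search(query: str) -> tuple[list[tuple[str, str]], list[str]]:
    """Split a query into (operators, free_text).

    shlex collapses quoted multi-word values: has_technique:"spike sorting"
    arrives as the single token 'has_technique:spike sorting'.
    """
    operators, free_text = [], []
    for token in shlex.split(query):
        match = _OPERATOR_RE.match(token)
        if match and match.group(1) in OPERATOR_KEYS:
            operators.append((match.group(1), match.group(2)))
        else:
            # v1 Gmail-style behavior: unknown keys fall through to free text
            free_text.append(token)
    return operators, free_text
```

For example, `parse_search('mouse has_technique:"spike sorting" created_after:2023-01-01')` yields two operators plus the free-text token `mouse`.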
The seeder used to call delete_dandiset(user=owner, ...) but switched to raw ORM .delete() to bypass the published-version guard. The parameter was left behind with a noqa: ARG001 comment, but the project's ruff config doesn't enable ARG001, so the noqa itself failed RUF100 in CI. Just drop the parameter and the comment.
The Playwright test located the search field by accessibility name "Search Dandisets by name,", which was the original full placeholder. The advanced-search PR shortened that placeholder to surface operator hints, breaking the locator. Switch to the shorter substring "Search Dandisets" so the test survives future copy tweaks.
@yarikoptic I'd love your thoughts on this. I thought this would be enough for a v1. In the future, I'd like to expand this to other types of metadata, including brain area, but I'd like to take it one step at a time. |
|
If there is interest, I'd be happy to create a full semantics guide in the docs |
yarikoptic
left a comment
My initial human review; I'd then hunt for AI issues in the next iteration, after the initial concerns are addressed/responded to
```
@@ -0,0 +1,247 @@
"""Seed a matrix of dandisets that exercises every advanced-search operator.

Idempotent — re-running deletes any prior dandisets it created (those whose
```
this helper command seems to be just for local testing, and is nowhere used in the unit tests — is that correct?
Could/should it be used in the tests, or why do we need it then? As soon as the feature is "merged" we have access to the wide range of data in the archive, and the unit tests should cover testing. Without being exercised regularly this code would rot and eventually become unrunnable, or could even be run by mistake on the main instance and cause "troubles" (imagine the test prefix not matching and it going through deleting...). So I would be wary of adding this in the "final merge".
this is for generating fake data so you can test out the feature. I'd be fine with deleting or somehow archiving it before merge.
Removed in e087cdf — the seeder is gone from the PR.
oh no! we will then rely on your demos ;) that is why I noted to remove it before the final merge. ok either way, as long as you keep showing the demos (unlikely I would try it directly myself, although who knows)
```python
from dataclasses import dataclass, field
import re
import shlex
```
hm, potentially could be cute to "relate" to good old @djarecka's
as in specifying those queries in actual POSIX shell CLI syntax ;)
```python
'has_species',
'has_approach',
'has_technique',
'has_standard',
```
do you see any benefit from the explicit has_ prefix?
IMO, if you omit the verb the implication is "is", which implies that the options are mutually exclusive. "has" implies (correctly) that a dandiset may be associated with more than one standard/approach/technique/species. I'd be fine with removing the "has_" if you think it's confusing
Replied above — the verb explicitly distinguishes "is" (mutually exclusive) from "has" (one of many). Happy to drop it if you'd rather, but I'd prefer to keep it.
good counter-point. I did a little digging. Gmail, GitHub, and Google Scholar all have tag-like fields and none of them uses the prefix "has" the way I used it here. I'm removing it.
```python
if (match := _OPERATOR_RE.match(token)) and match.group(1) in OPERATOR_KEYS:
    parsed.operators.append((match.group(1), match.group(2)))
else:
    parsed.free_text.append(token)
```
here, again, I would prefer "explicit is better than implicit": if there is "some-operator:value" with some unknown (today!) operator, it should not make it "transparently" into free_text -- it should return an error that such an operator is unknown, and potentially use difflib (or whichever battery in Python that is) to suggest a close candidate. This helps with typos a great deal:

```
❯ git commmit
git: 'commmit' is not a git command. See 'git --help'.

The most similar command is
	commit
```

so envision people mistyping operators...
Yes, that makes sense. I'm just worried about the case where the user actually wants to include a colon character in the free text. I think that would be pretty rare so I am not so worried about it. Maybe we can protect that ability by allowing the user to enter it in quotes?
Done in e087cdf. Unknown operators raise SearchSyntaxError with a difflib-based "Did you mean?" suggestion. To preserve the colon-in-free-text use case, wrapping the token in quotes ("foo:bar") is the documented escape — covered by a parser test and an integration test.
yeap, coolio, could you show a screenshot of how it would look channeled back to the user in the web UI?
```python
return {r['identifier'] for r in response.json()['results']}


@pytest.mark.django_db
```
there is no policy/agreement in dandi-archive on this, but I do prefer explicit annotation of AI-produced tests with @pytest.mark.ai_generated, see e.g. dandi-cli. This allows deciding later how close attention to pay to the test bodies; here I immediately would have spotted lots of duplication and recommended using parametrization... I tried to find where in my CLAUDE*.md files I have instructions for that but failed! But I do frequently prompt claude code to establish proper parametric tests when testing various values/boundary conditions etc. I feel like here it could be the case too
Done in e087cdf. Every advanced-search test now carries @pytest.mark.ai_generated, and the marker is registered in pyproject.toml so --strict-markers accepts it. Also collapsed the parser tests to two parametrized tests (10 happy-path cases, 6 error cases).
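The parametrized shape might look roughly like this (the parser here is a tiny self-contained stand-in so the example runs; the real tests exercise the PR's `parse_search` against its full operator set):

```python
import re
import shlex

import pytest


class SearchSyntaxError(ValueError):
    """Illustrative stand-in for the PR's parser error type."""


_KEYS = {'species', 'created_after'}


def parse(query):
    # Tiny stand-in parser so the parametrized shape below is runnable.
    ops = []
    for token in shlex.split(query):
        if m := re.match(r'^([a-z_]+):(.+)$', token):
            if m.group(1) not in _KEYS:
                raise SearchSyntaxError(f'unknown operator {m.group(1)!r}')
            ops.append((m.group(1), m.group(2)))
    return ops


@pytest.mark.ai_generated
@pytest.mark.parametrize(
    ('query', 'expected'),
    [
        ('species:mouse', [('species', 'mouse')]),
        ('species:"house mouse"', [('species', 'house mouse')]),
        ('created_after:2023-01-01', [('created_after', '2023-01-01')]),
    ],
)
def test_parse_happy_path(query, expected):
    assert parse(query) == expected


@pytest.mark.ai_generated
@pytest.mark.parametrize('query', ['speciez:mouse', 'unknown:x'])
def test_parse_unknown_operator_raises(query):
    with pytest.raises(SearchSyntaxError):
        parse(query)
```

One function per outcome (happy path vs. error), with the case matrix living in the `parametrize` table, keeps each added boundary condition a one-line diff.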
FYI and FTR: in dandi-cli we switched from pyproject.toml marker annotation to https://github.com/dandi/dandi-cli/blob/master/dandi/pytest_plugin.py with additional goodies added to it... maybe we should come up with some dandi-core library to use uniformly for such things across repos...
Per @yarikoptic's review of dandi#2814:

- Drop the create_search_test_data management command. It was a dev-only seeder, never exercised by the test suite, and posed a risk of bit-rot and accidental destructive use against a real DB.
- Tighten the parser's error handling: unbalanced quotes and unknown operator keys now raise SearchSyntaxError instead of silently degrading to free text. Unknown keys get a "Did you mean?" suggestion via difflib.get_close_matches. Malformed dates also raise (was previously fail-closed). The view layer translates SearchSyntaxError into a DRF ValidationError so the user sees a 400 with the error message.
- Add an explicit escape hatch: wrapping a token in double quotes opts out of operator parsing, so users can search for a literal colon (e.g. `"foo:bar"`).
- Add a 1024-char length cap on the search term as DoS hardening (an unauthenticated endpoint).
- Move the function-local `from django.db import connection` to the top-of-file imports.
- Mark every advanced-search test with `@pytest.mark.ai_generated` and register the marker in pyproject so `--strict-markers` accepts it.
- Restructure the parser tests as two parametrized tests (10 happy-path cases, 6 error cases) instead of 13 individual tests.

Test count: 35 (16 parametrized parser cases + 19 integration). All pass; ruff check + format clean.
Round-2 review

TL;DR: round-1 feedback addressed cleanly in e087cdf — strict parser, AI-marked tests, cleanup. Two small open threads from round-1 (

Round-1 status
New round-2 findings

1.
Per @yarikoptic's review: GitHub, Jira, Linear, Notion, etc. all use bare operators like `label:bug` for multi-valued filters; Gmail reserves `has:` for boolean checks (`has:attachment`), not for valued ones. The `has_` prefix here was a redundant disambiguator that the broader ecosystem doesn't need.

Renames:
- has_species → species
- has_approach → approach
- has_technique → technique
- has_standard → standard
- has_file_type → file_type

Date operators (`created_before:` etc.) keep their suffix because that suffix conveys comparison direction, not is/has semantics.

Also surface backend SearchSyntaxError 400s inline in DandisetsPage.vue instead of letting them bubble up to the global "something went wrong" snackbar.
`species:` was still using the AssetSearch.species denormalized column, which the materialized view defines as `wasAttributedTo[0].species.name` (only the first attributed subject). That diverged from how the other asset operators (approach/technique/standard) now match — they scan the full array via Postgres jsonpath. An asset attributed to multiple species (xenotransplantation, multi-species recordings) would silently miss `species:` matches against any but the first. Switch `species:` to the same `jsonb_path_exists` path-scan against `$.wasAttributedTo[*].species.name`. Drops the special case in `_apply_asset_filter` and unifies the four NAME_PATH operators behind a single jsonpath dispatch. Adds a regression test that asserts a multi-subject asset matches `species:Human` AND `species:mouse`.
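A sketch of building that jsonpath for `jsonb_path_exists` (the helper name and SQL fragment are illustrative; per the PR, the jsonpath string is bound as a query parameter rather than interpolated into SQL):

```python
import re

# Path taken from the commit message above; the other NAME_PATH operators
# (approach/technique/standard) dispatch through the same shape with their
# own paths into asset_metadata.
SPECIES_PATH = '$.wasAttributedTo[*].species.name'


def species_jsonpath(value: str) -> tuple[str, str]:
    """Return (sql, jsonpath) for a case-insensitive substring scan.

    re.escape neutralizes regex metacharacters in the user value, so a
    query like species:"C. elegans" matches the dot literally inside
    like_regex instead of as a wildcard.
    """
    pattern = re.escape(value)
    # [*] scans every attributed subject, not just the first — the fix
    # this commit makes over the denormalized AssetSearch.species column.
    jsonpath = f'{SPECIES_PATH} ? (@ like_regex "{pattern}" flag "i")'
    # Bound as a parameter (%s placeholder), never string-interpolated.
    sql = 'jsonb_path_exists(asset_metadata, %s::jsonpath)'
    return sql, jsonpath
```

With this, a multi-subject asset matches `species:` against any element of `wasAttributedTo`, which is exactly what the regression test pins down.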
- Drop the `\"` claim from the unbalanced-quote message; the parser doesn't support that escape (no_R2#2).
- Empty operator value (e.g. `species:" "`) now raises 400 instead of being silently dropped (no_R2#6).
- Override OpenAPI help text on the `search` parameter to document the operator vocabulary instead of "Search terms to filter the results" (no_R2#5).
- Acknowledge sequential-scan trade-off on jsonpath operators with a comment, and document cross-key AND-on-same-asset semantics in the apply_search_filters body (no_R2#3, dandi#4 — comment-only, indexes deferred until needed).
- Pin same-key AND semantics with a new test test_advanced_search_repeated_same_key_operator_combines_with_and.
- Nits: drop no-op re.DOTALL on _BARE_OP_RE; fix "incompatible previously generated" → "incompatible with previously generated" comment in views/dandiset.py; trim __all__ in services.search to the public surface (parse_search, SearchSyntaxError); add a clarifying comment to the cross-key intersect test.
Round-2 status

Thanks for the thorough re-review. Pushed

Test count: 38 (16 parametrized parser cases + 22 integration tests, all carrying `@pytest.mark.ai_generated`).

Frontend error rendering — typing
(Will attach a screenshot once the deploy preview rebuilds.)
Screen.Recording.2026-05-06.at.3.57.03.PM.mov
```python
def _parse_date(value: str) -> datetime | None:
    try:
        return datetime.strptime(value, '%Y-%m-%d').replace(tzinfo=UTC)
```
since it's one-time use -- just move this try/except with datetime.strptime in place of the _parse_date call, and raise directly from the except instead of doing the dance of returning None just so the caller can raise
Done in 334938c — strptime + try/except inlined directly, raises SearchSyntaxError from the except. Helper deleted.
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Per @yarikoptic: it's a one-time-use helper that returns None just so the caller can raise. Drop the indirection — strptime + try/except in place, raise SearchSyntaxError directly from the except.
we will though need to test-drive it well while in the sandbox, maybe even give it to some claude or a skill to see how it behaves, pushing into the limits etc. -- better on the sandbox than on a real instance.
On the apparent duplication with
| Operator | Pre-existing REST equivalent |
|---|---|
| `created_before`/`after`, `modified_before`/`after`, `published_before`/`after` | none — server-side date filtering didn't exist |
| `owner:<anything but me>` | none — `?user=me` is a `ChoiceField(choices=['me'])`, deliberately limited |
| `owner:me` | `?user=me` (full overlap, two spellings) |
| `species:`, `approach:`, `technique:`, `standard:`, `file_type:` | partial — discrete params on `/api/dandisets/search/` |
Even within the asset-filter overlap, the matching semantics differ: the discrete params do exact match against a controlled vocabulary (MultipleChoiceField with choices populated from AssetSearch.species), while the operators do case-insensitive substring / jsonpath scans. ?species=mouse returns 400 (not in choices); ?search=species:mouse matches "House mouse", "Mus musculus - House mouse", etc. Different tools for different jobs.
Two URL shapes is a normal design pattern
Major search platforms commonly expose both an operator string and a structured-form/discrete-param shape, addressing different UX:
- Google Scholar: the top search bar accepts `intitle:`, `author:`, `source:` operators inside `q=`; the advanced search form uses discrete params (`as_q=`, `as_sauthors=PJ`, `as_ylo=2020`, ...). Both work, both shipped.
- Gmail: the simple search box accepts `from:`, `subject:`, `has:`; the advanced search dropdown is a form that emits the same operator string under the hood. One canonical URL shape, two ways to construct it.
- GitHub: the search box uses `?q=is:open+assignee:foo`. They removed the discrete-param advanced form but the operator string survived.
- Twitter/X, Slack, Linear, Sourcegraph: operator-string search box; a few have separate filter UIs that produce the same query string.
DANDI today already has the dual entry points:
- Top search box on the dandiset list → operator string (this PR)
- Faceted SearchView with checkboxes → discrete REST params on `/api/dandisets/search/` (pre-existing, drives the sidebar checkbox UI)
The duplication is the cost of supporting both UX patterns. Scholar pays it. Gmail pays it (via different mechanism). It's the standard trade-off when you want both quick free-form typing and a guided checkbox experience.
Why not consolidate now?
The cleanest long-term move is a Gmail-style convergence: have the faceted UI emit the operator string under the hood, so there's one canonical filter language with two front-ends. That requires:
- Loosening `?species=` from `MultipleChoiceField` (exact) to `CharField` (substring) — a behavior change for existing API consumers.
- Rewriting `web/src/views/SearchView/SearchView.vue` to construct the operator string instead of building discrete params.
- Deciding whether `/api/dandisets/search/` itself stays or gets folded into the listing endpoint.
That's a worthwhile follow-up, but it's a separate decision about the faceted UI and changes pre-existing API behavior. Bundling it with this PR would conflate "introduce operator search" with "change how the existing search endpoint works" — two reviews in one. Easier to ship the additive feature, see how people actually use the operator syntax, then consolidate intentionally.
Net
The PR adds a search syntax users have learned everywhere else (Gmail/GitHub/Slack), adds missing capabilities (date filtering, anyone-not-me owner filtering), and overlaps with one existing endpoint in a narrow, semantically distinct way that mirrors what Scholar and Gmail already do. The "single source of truth" cleanup is a follow-up worth doing on its own merits.
ok great, let's move this to the sandbox so we can all test it!





Summary
Adds Gmail-style
key:valueoperator parsing to the existing/api/dandisets/?search=parameter, so users can filter the dandisetlisting by structured properties (dates, species, approach, etc.) from
the same search box. The same syntax works from any HTTP client.
Example query:
v1 operators
created_before:/created_after:(YYYY-MM-DD)Dandiset.createdmodified_before:/modified_after:Version.modifiedpublished_before:/published_after:Version.created(draft-only dandisets excluded)has_species:AssetSearch.specieshas_approach:/has_technique:/has_standard:.name(Postgresjsonb_path_exists+like_regex)has_file_type:encodingFormatstartswith, withnwb/image/text/videoaliasesFree text outside operators continues through the existing full-text path. Unknown operators fall through to free text (Gmail behavior); malformed dates fail closed (return zero results) so a typo doesn't silently expand the set. Multi-word values support quoting (
has_technique:"spike sorting").Negation (
-key:value) andORare intentionally deferred to v2.Frontend
The search field on the dandiset listing gets an updated placeholder and a help popover listing the operators with examples. No routing changes — the existing
searchDandisetsroute forwards the rawsearch=string, so the parser does the work.Other
Adds a
./manage.py create_search_test_data --owner <email>management command that seeds a matrix of dandisets covering every operator. Idempotent — safe to re-run while iterating.Test plan
tox -e test -- dandiapi/api/tests/test_search_parser.py— 13 unit tests for the parsertox -e test -- dandiapi/api/tests/test_dandiset.py -k advanced_search— 17 integration tests covering each operator, multi-element JSON array matching, repeated operators (intersection / annotation reuse), free text + operator combinations, embargoed-asset visibility, and fail-closed date parsingtox -e lint