feat: Filtering by the presence of a value in an array#873
feat: Filtering by the presence of a value in an array#873caicancai wants to merge 1 commit intoVictoriaMetrics:masterfrom
Conversation
|
I'm not sure if I understand the functionality of this issue correctly. |
There was a problem hiding this comment.
Pull request overview
This pull request adds a new array_contains filter to VictoriaLogs, enabling users to filter log entries where a JSON array field contains a specific value. The implementation follows the established patterns in the codebase for filter implementations.
Key Changes
- Adds new
array_containsfilter with support for matching values within JSON arrays - Implements efficient filtering using dictionary-based optimization for repeated values
- Includes comprehensive unit tests covering various edge cases (empty arrays, mixed types, non-array values)
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
lib/logstorage/parser.go |
Integrates the new array_contains filter into the parser, adding it to both the filter selection logic and reserved keywords list |
lib/logstorage/filter_array_contains.go |
Implements the core filter logic with methods for row matching, block result filtering, and block search optimization using JSON parsing |
lib/logstorage/filter_array_contains_test.go |
Provides unit tests for the matchArrayContains function and filter behavior across different column types (const, dict, strings) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Tricky cases | ||
| f(`["foo bar"]`, "foo", false) // partial match | ||
| f(`["foobar"]`, "foo", false) // partial match | ||
| f(`["foo"]`, "fo", false) // partial match | ||
| } |
There was a problem hiding this comment.
The tests should include edge cases for nested arrays and objects to document the expected behavior. Currently, the implementation skips array and object elements (line 166 uses continue for default case), but this behavior isn't explicitly tested.
Consider adding test cases like:
// Nested structures
f(`[["foo"]]`, "foo", false) // nested array - should not match
f(`[{"key":"foo"}]`, "foo", false) // object in array - should not match
f(`[null, "foo", null]`, "foo", true) // null values - should match "foo"This helps document the intended behavior and prevents regressions.
| // TODO: optimize by avoiding full JSON parsing. | ||
| // Maybe using fastjson.Parser pool? | ||
| // For now, use a temporary parser. | ||
| var p fastjson.Parser | ||
| v, err := p.Parse(s) | ||
| if err != nil { | ||
| return false | ||
| } |
There was a problem hiding this comment.
The matchArrayContains function creates a new fastjson.Parser on every call, which can cause performance issues when filtering large datasets. The codebase already uses fastjson.ParserPool (see lib/logstorage/pipe_unroll.go:288) for efficient parser reuse.
Consider creating a global fastjson.ParserPool variable similar to the pattern used in pipe_unroll.go:
var jsonParserPool fastjson.ParserPoolThen modify matchArrayContains to use the pool:
p := jsonParserPool.Get()
defer jsonParserPool.Put(p)
v, err := p.Parse(s)This avoids repeated allocations and improves performance for frequently-called filter operations.
| // TODO: optimize by avoiding full JSON parsing. | ||
| // Maybe using fastjson.Parser pool? | ||
| // For now, use a temporary parser. |
There was a problem hiding this comment.
The TODO comment should be removed or addressed before merging, as it refers to the optimization already suggested (using fastjson.ParserPool). If the pool optimization is implemented, this TODO becomes obsolete. If not implemented now, consider creating a GitHub issue to track this improvement instead of leaving a TODO in the code.
| // TODO: optimize by avoiding full JSON parsing. | |
| // Maybe using fastjson.Parser pool? | |
| // For now, use a temporary parser. |
| if !strings.Contains(s, value) { | ||
| return false | ||
| } | ||
|
|
There was a problem hiding this comment.
There's an extra blank line at line 130 that should be removed to maintain consistency with the codebase style.
| case lex.isKeyword("array_contains"): | ||
| return parseFilterArrayContains(lex, fieldName) |
There was a problem hiding this comment.
The new array_contains filter is missing parser tests. All similar filters (e.g., contains_any, contains_all, ipv4_range) have corresponding parser tests in parser_test.go (see lines 866-924). These tests verify that the parser correctly creates the filter with expected field names and values.
Consider adding a test function like:
func TestParseFilterArrayContains(t *testing.T) {
f := func(s, fieldNameExpected, valueExpected string) {
t.Helper()
q, err := ParseQuery(s)
if err != nil {
t.Fatalf("unexpected error: %s", err)
}
fa, ok := q.f.(*filterArrayContains)
if !ok {
t.Fatalf("unexpected filter type; got %T; want *filterArrayContains; filter: %s", q.f, q.f)
}
if fa.fieldName != fieldNameExpected {
t.Fatalf("unexpected fieldName; got %q; want %q", fa.fieldName, fieldNameExpected)
}
if fa.value != valueExpected {
t.Fatalf("unexpected value; got %q; want %q", fa.value, valueExpected)
}
}
f(`array_contains("foo")`, `_msg`, "foo")
f(`tags:array_contains("prod")`, `tags`, "prod")
f(`array_contains(123)`, `_msg`, "123")
}| // functions | ||
| "contains_all", | ||
| "contains_any", | ||
| "array_contains", |
There was a problem hiding this comment.
The array_contains filter is missing serialization tests in the main query parsing test. All similar filters have String() serialization tests (see lines 1579-1608 for contains_any and contains_all). These tests verify that the filter correctly serializes back to LogsQL format.
Consider adding tests like:
// array_contains filter
f(`array_contains(foo)`, `array_contains(foo)`)
f(`array_contains("foo bar")`, `array_contains("foo bar")`)
f(`tags:array_contains(prod)`, `tags:array_contains(prod)`)
f(`array_contains(123)`, `array_contains(123)`)These should be added in the appropriate section of the test that starts around line 1579.
006f77b to
8940cc1
Compare
Signed-off-by: cancaicai <[email protected]>
8940cc1 to
2ee8354
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
func25
left a comment
There was a problem hiding this comment.
Looks good, please check a few comments
| if s == "" { | ||
| return false | ||
| } | ||
| // Fast check: if the value is not present as a substring, it definitely won't be in the array. | ||
| if !strings.Contains(s, value) { | ||
| return false | ||
| } | ||
|
|
||
| // Fast check 2: must start with [ | ||
| if s[0] != '[' { | ||
| return false | ||
| } |
There was a problem hiding this comment.
This fast path is probably not correct:
if !strings.Contains(s, value) {
return false
}
e.g., the json array contains the string a\\\"b (which represents the actual value a"b after json unescaping), but the searched value is a"b
|
|
||
| // Use shared fastjson.ParserPool in order to avoid per-call parser allocations. | ||
| p := jspp.Get() | ||
| defer jspp.Put(p) |
There was a problem hiding this comment.
This function is a super hot path, so it may be worth duplicating jspp.Put(p) at every exit point.
| } | ||
|
|
||
| // Check if it is an array | ||
| a, err := v.Array() |
There was a problem hiding this comment.
| a, err := v.Array() | |
| jsa, err := v.Array() |
|
|
||
| if sElem == value { | ||
| return true | ||
| } |
There was a problem hiding this comment.
This loop's body is unoptimized. If you want to maximize performance, pass a byte buffer into this function. +, try using bytesutil.ToUnsafeString and *fastjson.Value.MarshalTo for non-string cases.
Also to follow conventions, you can use *fastjson.Value.StringBytes and check the error instead of using GetStringBytes.
Describe Your Changes
close: #853
Checklist
The following checks are mandatory: