Skip to content

feat: Filtering by the presence of a value in an array#873

Open
caicancai wants to merge 1 commit intoVictoriaMetrics:masterfrom
caicancai:array_contains
Open

feat: Filtering by the presence of a value in an array#873
caicancai wants to merge 1 commit intoVictoriaMetrics:masterfrom
caicancai:array_contains

Conversation

@caicancai
Copy link
Contributor

@caicancai caicancai commented Dec 8, 2025

Describe Your Changes

close: #853

image

Checklist

The following checks are mandatory:

@caicancai
Copy link
Contributor Author

I'm not sure if I understand the functionality of this issue correctly.

@caicancai caicancai marked this pull request as draft December 8, 2025 15:51
@caicancai caicancai marked this pull request as ready for review December 8, 2025 16:36
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds a new array_contains filter to VictoriaLogs, enabling users to filter log entries where a JSON array field contains a specific value. The implementation follows the established patterns in the codebase for filter implementations.

Key Changes

  • Adds new array_contains filter with support for matching values within JSON arrays
  • Implements efficient filtering using dictionary-based optimization for repeated values
  • Includes comprehensive unit tests covering various edge cases (empty arrays, mixed types, non-array values)

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
lib/logstorage/parser.go Integrates the new array_contains filter into the parser, adding it to both the filter selection logic and reserved keywords list
lib/logstorage/filter_array_contains.go Implements the core filter logic with methods for row matching, block result filtering, and block search optimization using JSON parsing
lib/logstorage/filter_array_contains_test.go Provides unit tests for the matchArrayContains function and filter behavior across different column types (const, dict, strings)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 42 to 51
// Tricky cases
f(`["foo bar"]`, "foo", false) // partial match
f(`["foobar"]`, "foo", false) // partial match
f(`["foo"]`, "fo", false) // partial match
}
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests should include edge cases for nested arrays and objects to document the expected behavior. Currently, the implementation skips array and object elements (line 166 uses continue for default case), but this behavior isn't explicitly tested.

Consider adding test cases like:

// Nested structures
f(`[["foo"]]`, "foo", false)           // nested array - should not match
f(`[{"key":"foo"}]`, "foo", false)     // object in array - should not match
f(`[null, "foo", null]`, "foo", true)  // null values - should match "foo"

This helps document the intended behavior and prevents regressions.

Copilot uses AI. Check for mistakes.
Comment on lines 136 to 143
// TODO: optimize by avoiding full JSON parsing.
// Maybe using fastjson.Parser pool?
// For now, use a temporary parser.
var p fastjson.Parser
v, err := p.Parse(s)
if err != nil {
return false
}
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matchArrayContains function creates a new fastjson.Parser on every call, which can cause performance issues when filtering large datasets. The codebase already uses fastjson.ParserPool (see lib/logstorage/pipe_unroll.go:288) for efficient parser reuse.

Consider creating a global fastjson.ParserPool variable similar to the pattern used in pipe_unroll.go:

var jsonParserPool fastjson.ParserPool

Then modify matchArrayContains to use the pool:

p := jsonParserPool.Get()
defer jsonParserPool.Put(p)
v, err := p.Parse(s)

This avoids repeated allocations and improves performance for frequently-called filter operations.

Copilot uses AI. Check for mistakes.
Comment on lines 136 to 138
// TODO: optimize by avoiding full JSON parsing.
// Maybe using fastjson.Parser pool?
// For now, use a temporary parser.
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment should be removed or addressed before merging, as it refers to the optimization already suggested (using fastjson.ParserPool). If the pool optimization is implemented, this TODO becomes obsolete. If not implemented now, consider creating a GitHub issue to track this improvement instead of leaving a TODO in the code.

Suggested change
// TODO: optimize by avoiding full JSON parsing.
// Maybe using fastjson.Parser pool?
// For now, use a temporary parser.

Copilot uses AI. Check for mistakes.
if !strings.Contains(s, value) {
return false
}

Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an extra blank line at line 130 that should be removed to maintain consistency with the codebase style.

Suggested change

Copilot uses AI. Check for mistakes.
Comment on lines +2010 to +2011
case lex.isKeyword("array_contains"):
return parseFilterArrayContains(lex, fieldName)
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new array_contains filter is missing parser tests. All similar filters (e.g., contains_any, contains_all, ipv4_range) have corresponding parser tests in parser_test.go (see lines 866-924). These tests verify that the parser correctly creates the filter with expected field names and values.

Consider adding a test function like:

func TestParseFilterArrayContains(t *testing.T) {
	f := func(s, fieldNameExpected, valueExpected string) {
		t.Helper()
		q, err := ParseQuery(s)
		if err != nil {
			t.Fatalf("unexpected error: %s", err)
		}
		fa, ok := q.f.(*filterArrayContains)
		if !ok {
			t.Fatalf("unexpected filter type; got %T; want *filterArrayContains; filter: %s", q.f, q.f)
		}
		if fa.fieldName != fieldNameExpected {
			t.Fatalf("unexpected fieldName; got %q; want %q", fa.fieldName, fieldNameExpected)
		}
		if fa.value != valueExpected {
			t.Fatalf("unexpected value; got %q; want %q", fa.value, valueExpected)
		}
	}

	f(`array_contains("foo")`, `_msg`, "foo")
	f(`tags:array_contains("prod")`, `tags`, "prod")
	f(`array_contains(123)`, `_msg`, "123")
}

Copilot uses AI. Check for mistakes.
// functions
"contains_all",
"contains_any",
"array_contains",
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array_contains filter is missing serialization tests in the main query parsing test. All similar filters have String() serialization tests (see lines 1579-1608 for contains_any and contains_all). These tests verify that the filter correctly serializes back to LogsQL format.

Consider adding tests like:

// array_contains filter
f(`array_contains(foo)`, `array_contains(foo)`)
f(`array_contains("foo bar")`, `array_contains("foo bar")`)
f(`tags:array_contains(prod)`, `tags:array_contains(prod)`)
f(`array_contains(123)`, `array_contains(123)`)

These should be added in the appropriate section of the test that starts around line 1579.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@func25 func25 self-requested a review December 18, 2025 16:52
Copy link
Contributor

@func25 func25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please check a few comments

Comment on lines +123 to +134
if s == "" {
return false
}
// Fast check: if the value is not present as a substring, it definitely won't be in the array.
if !strings.Contains(s, value) {
return false
}

// Fast check 2: must start with [
if s[0] != '[' {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fast path is probably not correct:

if !strings.Contains(s, value) {
	return false
}

e.g., the json array contains the string a\\\"b (which represents the actual value a"b after json unescaping), but the searched value is a"b


// Use shared fastjson.ParserPool in order to avoid per-call parser allocations.
p := jspp.Get()
defer jspp.Put(p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a super hot path, so it may be worth duplicating jspp.Put(p) at every exit point.

}

// Check if it is an array
a, err := v.Array()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a, err := v.Array()
jsa, err := v.Array()


if sElem == value {
return true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop's body is unoptimized. If you want to maximize performance, pass a byte buffer into this function. +, try using bytesutil.ToUnsafeString and *fastjson.Value.MarshalTo for non-string cases.

Also to follow conventions, you can use *fastjson.Value.StringBytes and check the error instead of using GetStringBytes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Filtering by the presence of a value in an array

2 participants