Describe the bug
Observed panics due to segmentation faults in the ruler.
To Reproduce
Steps to reproduce the behavior:
Run Cortex 1.10.0 and run the ruler (example launch command below)
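For reference, the ruler is started by pointing Cortex at the config file our manifests mount; the path below is only illustrative, and the config itself (included under Additional Context) already sets target: ruler:

cortex -config.file=/etc/cortex/cortex.yaml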
Expected behavior
Ruler should not panic
Environment:
- Infrastructure: kubernetes - AKS
- Deployment tool: customized yaml manifests
- Storage Engine: blocks
Additional Context
We are seeing consistent panics from the ruler, with errors like
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e20df3]
goroutine 14595 [running]:
github.com/cortexproject/cortex/pkg/querier.querier.Select(0x2bdf130, 0xc0047b2820, 0xc0046fc440, 0x2, 0x2, 0x28908e0, 0x2bdeaa0, 0xc00206be60, 0x17ba6ec5d74, 0x17ba7234bf4, ...)
	/__w/cortex/cortex/pkg/querier/querier.go:323 +0x193
github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select.func1(0xc0020e13e0, 0x2be0da0, 0xc000157c00, 0xc004122900, 0x0, 0xc004122900, 0xa, 0x10)
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:52 +0x72
created by github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:51 +0xad
Below is the configuration diff from the defaults, as emitted by the ruler.
Note that I also tried with blocks_storage.bucket_store.index_header_lazy_loading_enabled: false and experienced the same error.
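For clarity, that override corresponds to the following fragment of the config (same path as in the full dump below, with only the value flipped):

blocks_storage:
  bucket_store:
    index_header_lazy_loading_enabled: false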
alertmanager:
  enable_api: true
  external_url: https://alertmanager.cluster-monitor.*******.com/alertmanager
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-alertmanagers/
      store: etcd
alertmanager_storage:
  s3:
    access_key_id: ******
    bucket_name: cortex-alertmanager
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
api:
  response_compression_enabled: true
blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.chunks-cache.cluster-monitor-cortex.svc.cluster.local
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
    index_header_lazy_loading_enabled: true
    metadata_cache:
      backend: memcached
      bucket_index_content_ttl: 2m0s
      memcached:
        addresses: dnssrv+_memcached._tcp.metadata-cache.cluster-monitor-cortex.svc.cluster.local
      metafile_doesnt_exist_ttl: 2m0s
      tenant_blocks_list_ttl: 2m0s
    sync_interval: 5m0s
  s3:
    access_key_id: *****
    bucket_name: cortex
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
  tsdb:
    close_idle_tsdb_timeout: 15m0s
    dir: /var/cortex/tsdb
    max_exemplars: 1000
compactor:
  block_deletion_marks_migration_enabled: false
  cleanup_interval: 5m0s
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-ha-tracker/
      store: etcd
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: snappy
  log_queries_longer_than: 1s
  query_stats_enabled: true
frontend_worker:
  frontend_address: query-frontend.cluster-monitor-cortex.svc.cluster.local:9095
  grpc_client_config:
    grpc_compression: snappy
    max_send_msg_size: 33554432
ingester:
  lifecycler:
    availability_zone: westeurope-2
    observe_period: 3s
    ring:
      kvstore:
        etcd:
          endpoints:
          - client.etcd.svc.cluster.local:2379
        prefix: cortex-collectors/
        store: etcd
  walconfig:
    wal_enabled: true
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
limits:
  accept_ha_samples: true
  ingestion_burst_size: 75000
  ingestion_rate: 55000
  max_series_per_metric: 70000
querier:
  at_modifier_enabled: true
  query_store_for_labels_enabled: true
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 12h0m0s
      memcached_client:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
  split_queries_by_interval: 24h0m0s
ruler:
  alertmanager_url: http://alertmanager.cluster-monitor-cortex.svc.cluster.local:3100/alertmanager
  enable_api: true
  enable_sharding: true
  external_url: https://alertmanager.cluster-monitor.******.com
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-rulers/
      store: etcd
  ruler_client:
    grpc_compression: snappy
ruler_storage:
  s3:
    access_key_id: ********
    bucket_name: cortex-ruler
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
server:
  http_listen_port: 3100
  log_level: debug
storage:
  engine: blocks
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
    zone_awareness_enabled: true
target: ruler