Patch release 2.3.3#20099

Closed
stelfrag wants to merge 48 commits into netdata:v2.3 from stelfrag:patch_release

Conversation

stelfrag (Collaborator) commented Apr 9, 2025

Summary

ktsaou and others added 30 commits April 9, 2025 16:00
added DMI HW information (anonymous - no serial numbers)

(cherry picked from commit 82f6681)
* allow filtering on fields, without making them facets

* show all values on histogram

* histogram now has both ids and names for dimensions

* added colors for PRIORITY on linux and LEVEL on windows

* Revert "added colors for PRIORITY on linux and LEVEL on windows"

This reverts commit f93ce72.

(cherry picked from commit 60c0047)
(cherry picked from commit 4ea9384)
* make trim_all() never return null

* no need to memmove() after trim_all()

(cherry picked from commit 25a62da)
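
To illustrate the two items above, here is a minimal sketch of an in-place trim that always returns its argument (an empty string when nothing is left) and writes the result at the start of the buffer, so the caller has nothing to memmove(). The name trim_all_sketch and the exact whitespace rules are assumptions; this is not the trim_all() from the tree.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for trim_all(): strips leading and trailing
 * whitespace and collapses internal runs to a single space, in place.
 * It never returns NULL: an all-whitespace input becomes "". */
char *trim_all_sketch(char *s) {
    assert(s != NULL);

    char *src = s, *dst = s;
    int pending_space = 0;

    while (*src) {
        if (isspace((unsigned char)*src)) {
            /* remember a separator only once some output exists */
            if (dst != s) pending_space = 1;
        } else {
            if (pending_space) { *dst++ = ' '; pending_space = 0; }
            *dst++ = *src;   /* dst never runs ahead of src */
        }
        src++;
    }

    *dst = '\0';
    return s;   /* result already starts at s[0] */
}
```

Because the output starts at the original pointer, callers can keep using their buffer directly and only need to check for an empty string.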
* Add locks to protect the datafile disk space calculation

* Avoid queueing RRDENG_OPCODE_DATABASE_ROTATE operations needlessly

(cherry picked from commit 9019b56)
Add null checks and timer cleanup in aclk configuration

(cherry picked from commit 239c340)
* improved dmi information

* prepare it for multi-os support

* move the spawn server before the crash check

* undo last

* added freebsd, macos and windows versions of the dmi strings

* new version for windows

* new macos version

* error checking

* fix mac compile

* code review and cleanup

* refactor(status-file): improve Windows SMBIOS parsing with better memory management and error handling

- Create dedicated structure types for SMBIOS header and data
- Refactor string parsing with improved boundary checks
- Separate structure processing into type-specific functions
- Replace malloc/free with mallocz/freez for better integration
- Improve safety checks against malformed SMBIOS data
- Add reasonable limits to prevent infinite loops
- Use const qualifiers for read-only data

(cherry picked from commit e01e27b)
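
The boundary checks listed for the SMBIOS refactor can be sketched roughly as follows. The header layout (type, length, handle, followed by NUL-terminated strings indexed from 1) follows the SMBIOS specification; the function name, the 64-string limit and the use of plain return codes are assumptions for illustration, not the code in status-file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SMBIOS structure header as laid out in the DMTF specification. */
typedef struct {
    uint8_t type;
    uint8_t length;    /* size of the formatted area, header included */
    uint16_t handle;
} smbios_header_t;

#define SMBIOS_MAX_STRINGS 64   /* hypothetical cap to bound the loop */

/* Returns string number `index` (1-based) of the structure starting at
 * `structure`, or NULL if the data is malformed or the string is absent.
 * `end` points one past the last valid byte of the table. */
const char *smbios_get_string(const uint8_t *structure, const uint8_t *end, uint8_t index) {
    assert(sizeof(smbios_header_t) == 4);

    if (!structure || !end || end <= structure || index == 0) return NULL;
    if ((size_t)(end - structure) < sizeof(smbios_header_t)) return NULL;

    uint8_t length = structure[1];
    if (length < sizeof(smbios_header_t) || (size_t)(end - structure) <= length)
        return NULL;

    const uint8_t *p = structure + length;   /* first byte of the string set */
    for (unsigned i = 1; i <= SMBIOS_MAX_STRINGS && p < end && *p; i++) {
        const uint8_t *nul = memchr(p, 0, (size_t)(end - p));
        if (!nul) return NULL;               /* unterminated string */
        if (i == index) return (const char *)p;
        p = nul + 1;
    }
    return NULL;
}
```

Every read is checked against `end` first, so a truncated or corrupted table yields NULL instead of an out-of-bounds access.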
* Add libuv close callback to improve handle management and replace duplicate code
Check if the handle is closing before attempting a close

* Do not close handle

* Disable handle check for now

* Fix typo

* Delete commented out code

* Simplify code

(cherry picked from commit c688d51)
* detect issue at uv_worker job 13

* assertion at journalfile_migrate_to_v2_callback()

(cherry picked from commit 934c4cc)
* polishing of dmi decoding

* added system vendor, product name and product type to host labels

* rename added rrdhost labels

(cherry picked from commit 5e2d756)
more polishing

(cherry picked from commit 28d40e1)
* filter signal handler from stack traces

When capturing stack traces from crashes where signal handlers are involved,
filter out the signal handler function itself and all functions after it.
This provides cleaner and more focused stack traces by showing only the
code path leading up to the crash.

* improve stack trace processing for crashes

1. Add root cause tracking to store the first netdata source function in a stack trace - only for libbacktrace
2. Update status-file.c to use the root cause function for crash reporting

* mark functions as never inline to ensure they are always in the stack traces

* show the logger function in the stack traces

* never skip frames from stack traces

(cherry picked from commit 568766f)
Set SQLite heap memory limits and log their values on initialization

(cherry picked from commit 111fba2)
* fix cgroup netdev renames when multiple renames exist for the same device

* remove obsolete members and functions

(cherry picked from commit bead882)
do log the logger functions as root cause; always set the root cause function to something, so that we can easily find the stack traces

(cherry picked from commit 3f689e6)
add deduplicating web server

(cherry picked from commit e032ee0)
* dmi polishing

* more code cleanup

* keep original dmi information and add product labels separately

* isolate dmi from the status file

* dmi module is now standalone

* DMI polishing

* remove serial numbers and asset tags from the status file

* prefer product family over product name

* combine product family, name and board name to build the final product label

* use more space for product name

* fix(agent-events): prevent field leakage in server.go JSON processing

Fix cross-contamination issue in agent-events server by using explicit map
allocation when unmarshaling JSON. Previously, fields could leak between
requests when using an empty interface due to memory reuse, causing mixed
fields from different providers to appear in the same log entry.

(cherry picked from commit 1de029e)
* feat(dbengine): add ARAL memory accounting for EPDL_EXTENT structures

  Fixes memory leaks detected by address and leak sanitizers in the database
  engine's EPDL_EXTENT structures.

  - Create an ARAL for EPDL_EXTENT with init/get/release functions in pdc.c
  - Add the structure to the global initialization in rrdengine.c
  - Add a common cleanup function for both datafile deletion and finalization
  - Add RRDENG_MEM_EPDL_EXTENT to the enum in rrdengineapi.h
  - Integrate with Pulse for proper memory monitoring in netdata.dbengine_buffers

  This improves memory management by properly tracking and accounting for all
  EPDL_EXTENT allocations, ensuring consistent cleanup in all code paths.

fix(dbengine): properly free EPDL_EXTENT structures during shutdown

  This change ensures proper cleanup of EPDL_EXTENT structures during database
  shutdown when FSANITIZE_ADDRESS is enabled. Although this wasn't causing
  actual memory leaks during normal operation, it was reported as leaks by
  the address and leak sanitizers.

  The fix extracts the EPDL_EXTENT cleanup code into a common function
  that's called from both datafile deletion and finalization paths, ensuring
  consistent memory management across all shutdown scenarios.

* Fix memory leak in health alarm entries with ARAL-based management

   This commit addresses a memory leak in health alarm entries by:

   1. Implementing ARAL memory management for ALARM_ENTRY structures
      - Added ARAL initialization in health_plugin_init()
      - Created health_alarm_entry_get() and health_alarm_entry_release() functions
      - Modified health_create_alarm_entry() to use ARAL allocation

   2. Adding automatic cleanup of old alarm entries based on retention settings
      - Implemented health_alarm_log_cleanup() to remove old in-memory entries
      - Integrated with sql_health_alarm_log_cleanup() to ensure both database
        and memory entries are cleaned up together

   3. Adding Pulse monitoring for health alarm entries
      - Added health log memory tracking to netdata.memory chart

   This fix ensures that alarm entries are properly tracked and freed when
   they become older than the configured retention period.

* Fix memory leak in worker spinlock contention tracking

  This commit fixes a memory leak in the pulse worker monitoring system where
  memory allocated for spinlock contention tracking wasn't properly freed. The
  leak occurred in worker_utilization_charts_callback, where spinlock structures
  were allocated with callocz() but never freed when the thread exited.

  Added a proper cleanup mechanism in pulse_workers_cleanup() with a helper function that:
  1. Iterates through each spinlock structure in the Judy arrays
  2. Frees individual spinlock entries via callback
  3. Properly cleans up both per-worker and global spinlock collections

  This ensures all memory is properly freed when the pulse thread terminates.

* Fix memory leak in health alert prototypes pattern matching

  This commit addresses a memory leak in the health alerting system where
  pattern arrays containing SIMPLE_PATTERN structures were not being properly
  freed when the health plugin was destroyed. The memory leak occurred when
  these structures were created during the health_plugin_init() function but
  never cleaned up.

  The fix implements proper cleanup in health_plugin_destroy() by:
  1. Destroying the health_globals.prototypes.dict dictionary
  2. Setting the initialization state to false
  3. Ensuring proper thread safety with spinlocks

  This addresses the 1248 bytes memory leak in 39 objects allocated during pattern matching for health alert prototypes.

* add basic aral accounting under FSANITIZE_ADDRESS

* cleanup popen_instance from alarm entry and freeing it

* minor cleanup to ensure memory is zeroed

* rework the destructor of pattern-arrays and split them to a separate file

* increase the vendor name field

* do not match computer when searching for compute

* fix rrdlabels memory accounting

* make remove_this_page_from_index_unsafe() log pointers in hex

* the shutdown watcher should wait any amount of time with FSANITIZE_ADDRESS

* use void pointers when printing cache indexes

* remove no_status from fatal function

* make alarm entries use double linked list

* use double linked list for alarm entries in progress

* assertion when trying to add an alert in progress twice

* fix sensors message

* set journal file unmount inactivity to 10 minutes

* prevent the dbengine unittest from exiting prematurely

(cherry picked from commit cfd98f4)
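
The retention-based cleanup of in-memory alarm entries, combined with the switch to a double linked list, could look roughly like this. All names here are hypothetical, and plain calloc()/free() stand in for the ARAL allocator used by the commit; the sketch only shows the unlink-and-free walk.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical, simplified alarm entry on a double linked list. */
typedef struct alarm_entry_sketch {
    time_t when;
    struct alarm_entry_sketch *prev, *next;
} alarm_entry_sketch;

typedef struct {
    alarm_entry_sketch *head;
    size_t count;
} alarm_log_sketch;

/* Prepends a new entry; returns NULL if allocation fails. */
alarm_entry_sketch *alarm_log_add(alarm_log_sketch *log, time_t when) {
    assert(log != NULL);
    alarm_entry_sketch *e = calloc(1, sizeof(*e));
    if (!e) return NULL;
    e->when = when;
    e->next = log->head;
    if (log->head) log->head->prev = e;
    log->head = e;
    log->count++;
    return e;
}

/* Frees every entry older than `retention` seconds; returns how many. */
size_t alarm_log_cleanup(alarm_log_sketch *log, time_t now, time_t retention) {
    assert(log != NULL);
    size_t removed = 0;
    alarm_entry_sketch *e = log->head;
    while (e) {
        alarm_entry_sketch *next = e->next;   /* save before freeing */
        if (e->when < now - retention) {
            if (e->prev) e->prev->next = e->next; else log->head = e->next;
            if (e->next) e->next->prev = e->prev;
            free(e);
            log->count--;
            removed++;
        }
        e = next;
    }
    return removed;
}
```

With prev/next pointers an entry can be unlinked in constant time from anywhere in the list, which is what makes a single cleanup pass cheap.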
- Add strict JSON validation to correctly reject malformed input
- Modernize logging using structured logging (slog)
- Add comprehensive OTEL & Prometheus metrics
- Extract and include Cloudflare headers in output
- Implement graceful shutdown handling
- Add health endpoint with server status
- Improve error handling and status reporting
- Update tests to cover various scenarios and edge cases
- Update .gitignore to exclude agent-events binary and test files

(cherry picked from commit b68d6cc)
* fix metric names

* fix logs path

(cherry picked from commit 4eb6012)
* agent-events: fix metrics

* fix also the test

(cherry picked from commit 60d8e6b)
…ata#20067)

- Replace multiple individual metrics with a unified 'agent_events_requests_ratio_total' metric
- Uses status as a label to distinguish between different types of requests: total, success, duplicate, error, etc.
- Makes visualization in dashboards easier with a single status-based chart
- Maintains full metric coverage with improved efficiency

(cherry picked from commit 6680cc8)
* cleanup

* make all counters have analysis per status

(cherry picked from commit 0df2060)
* 1. Enhanced run.sh with multiple dedup keys and a dedicated log file
  2. Added Cloudflare headers processing for all requests, not just successful ones
  3. Implemented logging of duplicated requests to a separate file
  4. Added health check endpoint improvements with more deduplication info
  5. Improved test code structure with output capture

* don't log duplicates by default

(cherry picked from commit 59cba7e)
* Use one uv worker per tier to populate journal v2 files
Each worker will share a pool of netdata_conf_cpus() threads

* Improve MRG population logging and handle empty datafile scenario

(cherry picked from commit eb6942a)
Release mrg load thread list structure / Destroy semaphore on exit

(cherry picked from commit 0b93209)
…etdata#20080)

* Notify the ACLK sync event loop that the MQTT connection is shutting down
Prevent queueing messages if mqtt client has been destroyed / free resources

* Refactor MQTT client usage in HTTP API V2 handling

(cherry picked from commit 3f82436)
ktsaou and others added 7 commits April 9, 2025 16:07
* have a more meaningful fatal.function on signals without a root cause

* set worker job id idle id to WORKER_UTILIZATION_MAX_JOB_TYPES

(cherry picked from commit 2b88dd8)
…#20092)

* protected access against SIGBUS/SIGSEGV for journal v2 files

* fix get_page_list_from_journal_v2

(cherry picked from commit 1f0615c)
zero last_session_status.fatal when there is no previous status

(cherry picked from commit b82d202)
* Release memory

* Avoid division by zero

(cherry picked from commit 4bf664f)
* Add support for nested protected access regions

This change enhances the protected_access mechanism to support nested calls:
- Replace single state with a stack of frames (up to 8 levels deep)
- Modified signal handler to find and recover from the correct nesting level
- Proper unwinding of frames when errors occur
- Maintains original API and performance characteristics on the happy path

* Move diagnostic functions to C file and update all PROTECTED_ACCESS_SETUP calls

- Move protected_access_format_error and protected_access_get_last_fault from header to .c file
- Fix remaining call in pagecache.c to use new API with resource name and operation
- Use SIGNAL_CODE_2str_h directly for proper signal code printing
- Clean up header file and improve implementation organization

Add enhanced diagnostic information to protected access

This change adds detailed diagnostic information to the protected access mechanism:
- Extend API to require resource name and operation type for all callers
- Capture signal codes using SIGNAL_CODE type with human-readable representations
- Automatically log detailed error messages at the macro level
- Format errors with contextual information about signal, memory region and offsets
- Update journalfile.c to use the enhanced diagnostics

(cherry picked from commit 3145497)
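
As a rough illustration of the nested protected access described above, the sketch below keeps a stack of up to 8 frames, each tagged with a resource name, an operation and a memory region, and finds the innermost frame that covers a faulting address. Everything here (names, fields, matching by address range) is an assumption for illustration; a real implementation would presumably also keep a jump buffer per frame and recover with siglongjmp() from the signal handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PA_MAX_DEPTH 8   /* the commit mentions up to 8 levels */

/* Hypothetical frame: what is being accessed, how, and where. */
typedef struct {
    const char *resource;    /* e.g. a journal file name */
    const char *operation;   /* e.g. "read" */
    uintptr_t start;
    size_t length;
} pa_frame_sketch;

typedef struct {
    pa_frame_sketch frames[PA_MAX_DEPTH];
    int depth;
} pa_stack_sketch;

/* Enters a protected region; fails when the nesting limit is reached. */
bool pa_push(pa_stack_sketch *st, const char *resource, const char *operation,
             const void *start, size_t length) {
    assert(st != NULL);
    if (st->depth >= PA_MAX_DEPTH) return false;
    st->frames[st->depth++] =
        (pa_frame_sketch){ resource, operation, (uintptr_t)start, length };
    return true;
}

/* Leaves the innermost protected region (the happy path). */
void pa_pop(pa_stack_sketch *st) {
    assert(st != NULL && st->depth > 0);
    st->depth--;
}

/* Finds the innermost frame whose region contains fault_addr and unwinds
 * every deeper frame. Returns its index, or -1 (stack untouched) if the
 * fault is outside all protected regions. */
int pa_find_and_unwind(pa_stack_sketch *st, const void *fault_addr) {
    assert(st != NULL);
    uintptr_t a = (uintptr_t)fault_addr;
    for (int i = st->depth - 1; i >= 0; i--) {
        const pa_frame_sketch *f = &st->frames[i];
        if (a >= f->start && a - f->start < f->length) {
            st->depth = i + 1;
            return i;
        }
    }
    return -1;
}
```

The resource and operation strings kept in each frame are what would let an error message name the file and access type, as the diagnostic items above describe.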
* Enable release memory via sqlite3_release_memory(bytes)

* Enable sqlite3_release_memory

(cherry picked from commit ad39ad3)
Do not release statements if sqlite library has been shut down

(cherry picked from commit 02fcff8)
ktsaou added 2 commits April 9, 2025 17:52
* Fix Windows registry name crashes and optimize memory with Judy arrays

This commit addresses two key issues with Windows performance counter registry handling:

1. Fixes crashes due to improper registry data parsing:
   - Added comprehensive validation of registry data structure
   - Added bounds checking to prevent buffer overruns
   - Added validation of ID string conversion
   - Improved error logging for malformed registry entries

2. Optimizes memory usage with Judy arrays:
   - Replaced fixed-size array with memory-efficient Judy array
   - Memory usage now scales with actual data rather than highest ID
   - Eliminated arbitrary limit on registry ID values
   - Added proper resource cleanup on shutdown

Added unit tests to verify functionality and demonstrate memory efficiency
with sparseness statistics. Run with: -W perflibnamestest

* Enhance Windows registry validation and add comprehensive tests

(cherry picked from commit 6024620)
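
The validation steps listed for the registry data can be sketched as a bounds-checked parser over a buffer of alternating id and name strings that ends with an empty string. This is a portable approximation under stated assumptions: it uses narrow chars (the Windows registry data is wide-character), stores results in a caller-provided array rather than a Judy array, and all names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long id;
    const char *name;   /* points into the caller's buffer */
} id_name_sketch;

/* Parses "id\0name\0id\0name\0...\0" from data[0..size). Entries with a
 * non-numeric id are skipped; parsing stops at the first structural
 * error (missing terminator or id without a name). Returns the number
 * of pairs stored in out[]. */
size_t parse_id_name_pairs(const char *data, size_t size,
                           id_name_sketch *out, size_t out_cap) {
    assert(data != NULL && out != NULL);
    const char *p = data, *end = data + size;
    size_t n = 0;

    while (p < end && *p && n < out_cap) {
        const char *id_end = memchr(p, 0, (size_t)(end - p));
        if (!id_end) break;                   /* id not terminated */
        const char *name = id_end + 1;
        if (name >= end || !*name) break;     /* id without a name */
        const char *name_end = memchr(name, 0, (size_t)(end - name));
        if (!name_end) break;                 /* name not terminated */

        if (isdigit((unsigned char)*p)) {
            errno = 0;
            char *stop = NULL;
            unsigned long id = strtoul(p, &stop, 10);
            if (errno == 0 && stop == id_end) {   /* whole id was numeric */
                out[n].id = id;
                out[n].name = name;
                n++;
            }
        }
        p = name_end + 1;
    }
    return n;
}
```

Both terminators are located with memchr() limited to the remaining bytes, so a buffer that was cut short cannot cause a read past its end.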
stelfrag and others added 3 commits April 9, 2025 18:51
…tion (netdata#20098)

* Improve journal file access error logging and handle corrupted journal files during retention calculation

* Limit log frequency in PROTECTED_ACCESS_SETUP

(cherry picked from commit 76d0679)
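
One simple way to limit log frequency, shown here only as a sketch, is to allow one message per interval and count the ones that were suppressed in between. The type and function names are hypothetical, and this is not necessarily how PROTECTED_ACCESS_SETUP does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-call-site limiter state. */
typedef struct {
    time_t interval;       /* minimum seconds between two messages */
    time_t last_logged;    /* valid only when logged_once is true */
    unsigned suppressed;   /* messages dropped since the last one */
    bool logged_once;
} log_limit_sketch;

/* Returns true when the caller may log now. On true, *suppressed_out
 * (if not NULL) receives how many messages were dropped since the
 * previous allowed one. */
bool log_limit_allow(log_limit_sketch *l, time_t now, unsigned *suppressed_out) {
    assert(l != NULL);

    if (l->logged_once && now - l->last_logged < l->interval) {
        l->suppressed++;
        return false;
    }

    if (suppressed_out) *suppressed_out = l->suppressed;
    l->suppressed = 0;
    l->last_logged = now;
    l->logged_once = true;
    return true;
}
```

Reporting the suppressed count with the next allowed message keeps the log small without hiding how often the error actually happened.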
…0030)

go.d virtual node send _hostname label

(cherry picked from commit fe87192)
while shutting down, keep track of the shutdown timings

(cherry picked from commit 19444ec)
@ilyam8 ilyam8 closed this Apr 11, 2025
@stelfrag stelfrag deleted the patch_release branch April 14, 2025 17:29
4 participants