Skip to content

Fix: persist environments with long IDs safely#948

Open
omkarsureshs wants to merge 2 commits intofossasia:masterfrom
omkarsureshs:fix-long-env-persistence
Open

Fix: persist environments with long IDs safely#948
omkarsureshs wants to merge 2 commits intofossasia:masterfrom
omkarsureshs:fix-long-env-persistence

Conversation

@omkarsureshs
Copy link

@omkarsureshs omkarsureshs commented Jan 24, 2026

Description

This PR fixes issue #648: "Envs with >255 chars do not get saved".

When environment names (eid) exceed filesystem filename length limits (typically 255 characters), the serialize_env function fails silently because it cannot create the corresponding .json file. This fix ensures environments with arbitrarily long names are persisted safely.

Technical Changes:

  1. Filename Truncation: Environment IDs longer than 200 characters are truncated
  2. Hash-Based Uniqueness: A 10-character MD5 hash suffix prevents collisions between different long names with similar prefixes
  3. Backward Compatibility: The load_env function is updated to locate truncated files
  4. Path Correction: Fixed incorrect file path construction in load_env (changed from eid/.json to eid.json)

Implementation Details:

  • Safe Length: 200 characters (leaves buffer for .json extension and directory paths)
  • Hash Algorithm: MD5, using first 10 characters (sufficient for this use case)
  • Filename Pattern: {first_190_chars}_{10_char_hash}.json
  • Scope: Changes limited to py/visdom/utils/server_utils.py

Motivation and Context

Filesystems impose filename length limitations (255 bytes on Linux ext4, 260 characters total path length on Windows). Without this fix:

  • Environments with long names fail to persist silently
  • Users lose data without error messages
  • The issue disproportionately affects auto-generated or UUID-based environment names

This addresses a real-world usability issue where environment names naturally exceed safe limits.

How Has This Been Tested?

Test Environment:

  • Windows 10, Python 3.12
  • Visdom server running on port 8097
  • Both short and long environment names tested

Test Cases:

  1. Short Names: "test_env" - Saves as test_env.json (unchanged behavior)
  2. Long Names: "a" * 300 - Saves as truncated file with hash suffix
  3. Multiple Long Names: Different 300-character names produce different hashes
  4. Load Verification: Confirmed truncated files can be loaded successfully

Test Results:

  • ✅ Short environment names work as before
  • ✅ 300-character environment names are successfully persisted
  • ✅ Different long names produce unique hashes (no collisions)
  • ✅ Truncated files can be loaded correctly
  • ✅ No regression in existing functionality

Example:

# Before fix: Environment with 300 'a's fails silently
# After fix: Saves as "aaaaaaaa...aaa_4e5475d125.json" (truncated + hash)

## Summary by Sourcery

Ensure environments with very long IDs are safely persisted and can be reloaded without filesystem filename length issues.

Bug Fixes:
- Prevent serialization from failing silently when environment IDs exceed filesystem filename length limits by truncating and hashing long IDs for filenames.
- Correct the environment file path used during load to reference `<eid>.json` instead of an incorrect directory-based path.
- Allow loading of environments whose IDs were truncated at save time by resolving the corresponding hashed filename when the original long ID is requested.

- Truncate environment names longer than 200 characters
- Append MD5 hash for uniqueness
- Update load_env to handle truncated filenames
- Fix incorrect path in load_env (eid/.json -> eid.json)

Fixes fossasia#648
@sourcery-ai
Copy link

sourcery-ai bot commented Jan 24, 2026

Reviewer's Guide

Updates Visdom environment serialization to safely persist and reload environments whose IDs exceed typical filesystem filename limits by truncating long IDs, appending a hash-based suffix for uniqueness, and fixing the load path logic.

Sequence diagram for environment save and load with safe filenames

sequenceDiagram
    actor User
    participant VisdomServer
    participant serialize_env
    participant load_env
    participant Filesystem

    User->>VisdomServer: request_save_env(eids)
    VisdomServer->>serialize_env: serialize_env(state, eids, env_path)
    loop for_each_env_id
        serialize_env->>serialize_env: check_length(env_id)
        alt env_id_length <= 200
            serialize_env->>serialize_env: env_path_file = env_path/env_id.json
        else env_id_length > 200
            serialize_env->>serialize_env: hash_part = md5(env_id)[:10]
            serialize_env->>serialize_env: safe_env_id = env_id[:190] + _ + hash_part
            serialize_env->>serialize_env: env_path_file = env_path/safe_env_id.json
        end
        serialize_env->>Filesystem: open(env_path_file, write)
        serialize_env->>Filesystem: write(env_state_json)
        Filesystem-->>serialize_env: file_written
    end
    serialize_env-->>VisdomServer: save_complete

    User->>VisdomServer: request_load_env(eid)
    VisdomServer->>load_env: load_env(state, eid, socket, env_path)
    alt eid_in_memory
        load_env-->>VisdomServer: env_from_state
    else eid_not_in_memory
        load_env->>load_env: p = env_path/eid.json
        alt file_exists(p)
            load_env->>Filesystem: open(p, read)
            Filesystem-->>load_env: env_json
            load_env-->>VisdomServer: env_from_file
        else not_exists(p) and len(eid) > 200
            load_env->>load_env: hash_part = md5(eid)[:10]
            load_env->>load_env: safe_eid = eid[:190] + _ + hash_part
            load_env->>load_env: p = env_path/safe_eid.json
            alt file_exists(p)
                load_env->>Filesystem: open(p, read)
                Filesystem-->>load_env: env_json
                load_env-->>VisdomServer: env_from_truncated_file
            else file_not_found
                load_env-->>VisdomServer: env_not_found
            end
        end
    end
Loading

File-Level Changes

Change Details Files
Introduce safe, deterministic filename generation for long environment IDs during serialization.
  • Add a maximum filename length threshold (200 chars) for environment IDs when persisting to disk.
  • For IDs exceeding the threshold, compute a 10-character MD5 hash of the full ID and construct a safe ID as the first 190 characters plus an underscore and the hash.
  • Use the safe ID instead of the raw environment ID when constructing the .json file path in serialize_env while preserving existing behavior for short IDs.
py/visdom/utils/server_utils.py
Update environment loading logic to handle both original and truncated filenames and fix the file path construction.
  • Correct the existing load_env path construction from joining eid as a directory with ".json" to directly using "{eid}.json" in the environment path.
  • When the direct filename lookup fails and the environment ID is longer than the threshold, reconstruct the truncated safe ID using the same 190-char + 10-char MD5 pattern and attempt to load that file instead.
  • Ensure that long-ID environments saved with the new scheme can be reloaded without breaking existing short-ID behavior.
py/visdom/utils/server_utils.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 2 issues, and left some high level feedback:

  • The filename truncation and hashing logic (including the 200/190 limits and MD5 suffix) is duplicated between serialize_env and load_env; consider extracting a shared helper (e.g., get_env_filename(eid)) so behavior stays consistent if you ever adjust the strategy.
  • You import hashlib inside both serialize_env and load_env; moving this to a single top-level import would simplify the functions and avoid repeated imports.
  • In serialize_env you introduce MAX_FILENAME_LENGTH = 200 but load_env uses the literal 200 and 190; using the same named constant (or helper) in both places would reduce the risk of mismatches if the limit needs to change later.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The filename truncation and hashing logic (including the 200/190 limits and MD5 suffix) is duplicated between `serialize_env` and `load_env`; consider extracting a shared helper (e.g., `get_env_filename(eid)`) so behavior stays consistent if you ever adjust the strategy.
- You import `hashlib` inside both `serialize_env` and `load_env`; moving this to a single top-level import would simplify the functions and avoid repeated imports.
- In `serialize_env` you introduce `MAX_FILENAME_LENGTH = 200` but `load_env` uses the literal `200` and `190`; using the same named constant (or helper) in both places would reduce the risk of mismatches if the limit needs to change later.

## Individual Comments

### Comment 1
<location> `py/visdom/utils/server_utils.py:121-122` </location>
<code_context>
     if env_path is not None:
         for env_id in env_ids:
-            env_path_file = os.path.join(env_path, "{0}.json".format(env_id))
+            # Truncate long environment IDs to prevent filesystem errors
+            MAX_FILENAME_LENGTH = 200  # Leave room for .json extension
+            if len(env_id) > MAX_FILENAME_LENGTH:
+                # Keep first 190 chars + 10 char hash for uniqueness
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Avoid re-defining truncation constants inline and make the filename-length logic self-consistent.

The current math doesn’t match the comment: `190 + 1 (underscore) + 10 (hash) + 5 (.json) = 206`, so `MAX_FILENAME_LENGTH = 200` isn’t actually respected. Also, the truncation values (`200`, `190`, `10`) are hard-coded here and in `load_env`. Consider centralizing this into a helper that, given an `env_id`, returns a safe filename using a single `MAX_FILENAME_LENGTH`, and derives the slice length from that constant instead of duplicating magic numbers.

Suggested implementation:

```python
def _safe_env_filename(env_id, max_length=200, extension=".json"):
    """
    Return a filesystem-safe filename for the given env_id, enforcing max_length.

    The resulting filename (including extension) will not exceed max_length
    characters. If truncation is required, a hash suffix is appended to
    preserve uniqueness.
    """
    base_length = max_length - len(extension)
    if len(env_id) <= base_length:
        return f"{env_id}{extension}"

    # Reserve space for "_{hash}"
    HASH_LEN = 10
    SEP_LEN = 1  # "_"
    prefix_length = base_length - (HASH_LEN + SEP_LEN)
    if prefix_length <= 0:
        # Fallback: if configuration is inconsistent, just hash the whole id
        import hashlib
        hash_part = hashlib.md5(env_id.encode()).hexdigest()[:HASH_LEN]
        return f"{hash_part}{extension}"

    import hashlib
    hash_part = hashlib.md5(env_id.encode()).hexdigest()[:HASH_LEN]
    safe_env_id = f"{env_id[:prefix_length]}_{hash_part}"
    return f"{safe_env_id}{extension}"


def serialize_env(state, eids, env_path=DEFAULT_ENV_PATH):
    env_ids = [i for i in eids if i in state]
    if env_path is not None:
        for env_id in env_ids:
            safe_filename = _safe_env_filename(env_id)
            env_path_file = os.path.join(env_path, safe_filename)

            with open(env_path_file, "w") as fn:
                if isinstance(state[env_id], LazyEnvData):
                    fn.write(json.dumps(state[env_id]._raw_dict))

```

There is similar truncation logic in `load_env` that should also use `_safe_env_filename` for consistency. Update that code to:
1. Call `_safe_env_filename(env_id)` (or derive the safe env id from the returned filename if needed) instead of duplicating the truncation logic and magic numbers.
2. Ensure any tests that assume a specific truncation pattern are updated to use the centralized helper so they remain aligned with the new behavior.
</issue_to_address>

### Comment 2
<location> `py/visdom/utils/server_utils.py:396-401` </location>
<code_context>
-        p = os.path.join(env_path, eid.strip(), ".json")
+        # Try to find the environment file
+        # First try exact filename
+        p = os.path.join(env_path, "{0}.json".format(eid.strip()))
+        
+        # If not found and env_id is long, try truncated version
+        if not os.path.exists(p) and len(eid) > 200:
+            import hashlib
+            hash_part = hashlib.md5(eid.encode()).hexdigest()[:10]
+            safe_eid = eid[:190] + "_" + hash_part
+            p = os.path.join(env_path, "{0}.json".format(safe_eid))
</code_context>

<issue_to_address>
**issue:** Align whitespace handling between the saved filenames and the load-time lookup.

The first lookup strips whitespace (`eid.strip()`), but the hashed fallback and `serialize_env` both use the raw `env_id`. This leads to inconsistent filenames when IDs have leading/trailing spaces: plain lookups use the stripped name, while saved and hashed filenames use the unstripped one. Normalize the ID once (strip or not) and use that consistently for both save and load paths, including the hashed fallback.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

- Extract shared _safe_env_filename helper function
- Fix math error in filename length calculation (190+1+10+5=206 > 200)
- Use consistent whitespace stripping
- Move hashlib import to top of file
- Remove duplicate truncation logic

Maintains same functionality with cleaner implementation.
@omkarsureshs
Copy link
Author

@fossasia/visdom-maintainers @maintainers

Could someone familiar with py/visdom/utils/server_utils.py please review this fix for issue #648?

The changes:

  • Add _safe_env_filename() helper for consistent truncation+hashing of long environment IDs
  • Update serialize_env() and load_env() to handle filenames >255 chars
  • Fix path construction bug in load_env (was eid/.json, now eid.json)

This fixes the silent failure when saving environments with long names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant