
Initial support for remote debugging with VSCode #3033

Open
romain-intel wants to merge 1 commit into master from romain/worktree/remote_debug

Conversation

@romain-intel
Contributor

This allows you to debug your Metaflow flow remotely from the comfort of your IDE. It behaves exactly as if you were debugging locally, and all launched tasks appear as subprocesses in the VSCode debugging call stack.

To try it out:

  • Write a simple flow (or anything, really)
  • In the directory containing your flow, run `metaflow debug vscode install-config --remote-root <root>`
  • Set a few breakpoints in VSCode in your file
  • Launch your flow using `python ./myflow.py run --with debugger`
  • The flow will launch and then stop for you to attach the debugger. The install-config step will have created a `Metaflow: Attach` debugging configuration. Launch that.
  • Enjoy!

This needs more testing, and we still need to make sure a mix of local and remote nodes also works (unclear yet), but it's a start and shows that it is possible.

PR Type

  • [ ] Bug fix
  • [x] New feature
  • [ ] Core Runtime change (higher bar -- see CONTRIBUTING.md)
  • [ ] Docs / tooling
  • [ ] Refactoring

Summary

Remote debugging capabilities

…at uses debugpy)

@romain-intel romain-intel requested a review from npow March 14, 2026 08:08
"""
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 0))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces (Medium)

'0.0.0.0' binds a socket to all interfaces.

Copilot Autofix (AI, 2 days ago)

In general, to fix this type of issue you avoid binding listening sockets to 0.0.0.0 unless you explicitly require remote access from arbitrary hosts, and instead bind to a specific interface (most commonly 127.0.0.1 for local-only access). If multiple interfaces must be supported, you create one socket per interface instead of a single socket bound to all.

For this specific function _start_callback_server in metaflow/plugins/debugger_step_decorator.py, the best minimal change is to bind the callback server to the loopback interface. This keeps callbacks available to local tools (e.g., VSCode/debugpy) while preventing exposure on external interfaces. Concretely, change line 261 from server.bind(("0.0.0.0", 0)) to server.bind(("127.0.0.1", 0)). No new imports or additional methods are required, and the rest of the logic (accept loop, threading, return of dynamically assigned port) remains unchanged.
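The suggested fix can be sketched in isolation. This is a minimal standalone example, not the actual `_start_callback_server` code: bind to loopback only, let the OS assign an ephemeral port (port 0), and recover the assigned port from the socket.

```python
import socket

# Bind the callback server to the loopback interface only, with an
# OS-assigned ephemeral port (port 0).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))
server.listen(16)
host, port = server.getsockname()  # recover the dynamically assigned port
server.close()
```

Only local tools (the VSCode/debugpy adapter running on the same host) can reach a socket bound this way, which is exactly the access the callback server needs.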

Suggested changeset 1
metaflow/plugins/debugger_step_decorator.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/metaflow/plugins/debugger_step_decorator.py b/metaflow/plugins/debugger_step_decorator.py
--- a/metaflow/plugins/debugger_step_decorator.py
+++ b/metaflow/plugins/debugger_step_decorator.py
@@ -258,7 +258,7 @@
     """
     server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
-    server.bind(("0.0.0.0", 0))
+    server.bind(("127.0.0.1", 0))
     server.listen(16)
     _, port = server.getsockname()
 
EOF
"""Create a TCP server socket bound to *host*:*port*."""
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind((host, port))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces (Medium)

'' binds a socket to all interfaces.

Copilot Autofix (AI, 2 days ago)

In general, to fix this issue you either (1) restrict the binding address to a safe, explicit interface (like 127.0.0.1 or a configured host), or (2) enforce that callers cannot pass values that mean “all interfaces” ("0.0.0.0", "", etc.). Since this helper is meant to bind to an arbitrary host provided by callers, the best fix without changing external behavior is to validate host and raise a clear error if an unsafe “all interfaces” value is supplied. Callers that already pass a proper interface will be unaffected; callers that were implicitly relying on “all interfaces” will now fail fast instead of silently exposing the service.

Concretely, in metaflow/plugins/debugger_step_decorator.py around line 281, update _create_listen_socket so that it checks host against disallowed values before calling bind. For example, treat None, "", "0.0.0.0", and "::" as invalid and raise a MetaflowException (already imported at the top of the file) with a descriptive message. Then keep the rest of the socket creation logic unchanged. No new imports are required, and existing behavior for valid host values remains the same.
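The validation described above could look roughly like this. This is a standalone sketch: the helper name mirrors `_create_listen_socket` but the example raises `ValueError` rather than `MetaflowException` so it is self-contained.

```python
import socket

def create_listen_socket(host, port):
    """Hypothetical mirror of _create_listen_socket with host validation."""
    # Refuse wildcard addresses that would expose the listener on every
    # network interface.
    if host in (None, "", "0.0.0.0", "::"):
        raise ValueError(
            "Refusing to bind debug listener to all interfaces; "
            "please provide a specific interface address (e.g. 127.0.0.1)."
        )
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock
```

Failing fast here converts a silent exposure into an immediate, descriptive error for any caller that passes an "all interfaces" value.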

Suggested changeset 1
metaflow/plugins/debugger_step_decorator.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/metaflow/plugins/debugger_step_decorator.py b/metaflow/plugins/debugger_step_decorator.py
--- a/metaflow/plugins/debugger_step_decorator.py
+++ b/metaflow/plugins/debugger_step_decorator.py
@@ -280,6 +280,13 @@
 
 def _create_listen_socket(host, port):
     """Create a TCP server socket bound to *host*:*port*."""
+    # Avoid binding to all interfaces (e.g. "0.0.0.0" or an empty string),
+    # which would expose the debug listener on every network interface.
+    if host in (None, "", "0.0.0.0", "::"):
+        raise MetaflowException(
+            "Refusing to bind debug listener to all interfaces; "
+            "please provide a specific interface address (e.g. 127.0.0.1)."
+        )
     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
     sock.bind((host, port))
EOF
@greptile-apps
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR introduces remote debugging support for Metaflow flows via VSCode and debugpy. It adds a @debugger step decorator, a new metaflow debug vscode install-config CLI command for generating .vscode/launch.json, and plumbing in the Batch/Kubernetes CLI layers to forward METAFLOW_DEBUGPY_* environment variables to remote containers. The extension command registry is also refactored to support multiple providers for the same command name via subcommand merging.

Key changes:

  • metaflow/plugins/debugger_step_decorator.py — Core implementation: local mode uses debugpy.connect() back to the adapter; remote mode starts a raw pydevd listener and implements a custom TCP bridge + DAP handshake to make the remote task appear as a child process to VSCode.
  • metaflow/cmd/debug_cli.py — New CLI to write/merge a Metaflow: Attach configuration into .vscode/launch.json.
  • metaflow/extension_support/cmd.py — Command registry refactored from last-wins dict to a merge-capable list, allowing extensions to contribute subcommands to the same top-level group.
  • metaflow/plugins/aws/batch/batch_cli.py / kubernetes_cli.py — Forward debugger env vars to remote containers.

Notable issues found:

  • The pydevdSystemInfo DAP handshake response in _handle_callback uses hardcoded values ("version": "3.11.0", "platform": {"name": "linux"}, "bitness": 64) regardless of the actual remote container's Python version and OS.
  • Multiple concurrent remote tasks that land on the same worker node will all try to bind the same base_port, causing all but the first to fail with "Address already in use".
  • task = cli_args.task in runtime_step_cli is dead code.
  • No read timeout is set on incoming callback connections in _handle_callback, which can cause thread stalls on slow/partial senders.

Confidence Score: 2/5

  • Not yet safe to merge — two logic issues in the core decorator need to be resolved before this is reliable in production.
  • The ancillary files (CLI, registry, Batch/K8s forwarding) are clean and well-implemented. However, the core debugger_step_decorator.py has two concrete logic bugs: hardcoded Python/platform metadata in the DAP handshake will mislead the debugger for non-3.11/non-Linux containers, and the fixed base_port in remote mode will cause silent failures whenever two tasks share a node. The PR description also acknowledges it needs more testing ("needs more testing and making sure a mix of local and remote nodes also work"), which warrants caution before merging.
  • Pay close attention to metaflow/plugins/debugger_step_decorator.py — specifically _handle_callback (hardcoded system info) and _task_listen (port collision).

Important Files Changed

  • metaflow/plugins/debugger_step_decorator.py — New file implementing the core debugpy integration. Has two significant logic issues: hardcoded Python version/platform in the pydevdSystemInfo DAP handshake response, and a port collision risk when multiple remote tasks run on the same node. Also contains an unused task variable and a missing read timeout on callback connections.
  • metaflow/cmd/debug_cli.py — New CLI file for metaflow debug vscode install-config. Correctly handles creating/merging/overwriting .vscode/launch.json. Logic is clean and well-structured.
  • metaflow/extension_support/cmd.py — Changed command registry from last-wins dict to a list-based multi-provider merge. Logic is sound but _load_cmd_cls is redundantly redefined on every loop iteration — minor style issue.
  • metaflow/plugins/aws/batch/batch_cli.py — Small addition to forward METAFLOW_DEBUGPY_* env vars to the remote Batch container. Change is minimal and correct.
  • metaflow/plugins/kubernetes/kubernetes_cli.py — Mirrors the Batch change — forwards METAFLOW_DEBUGPY_* env vars to the remote Kubernetes pod. Minimal and correct.

Sequence Diagram

sequenceDiagram
    participant VSCode
    participant Adapter as debugpy adapter (local :5678)
    participant Runtime as Metaflow Runtime
    participant CallbackSrv as Callback Server (ephemeral)
    participant Task as Remote Task (Batch/K8s)

    Note over Runtime: runtime_init()
    Runtime->>Adapter: debugpy.listen(:5678)
    Runtime->>CallbackSrv: _start_callback_server()
    Runtime->>VSCode: Wait for attach (if wait_for_client=True)
    VSCode->>Adapter: DAP attach

    Note over Task: task_pre_step() — listen mode
    Task->>Task: bind server socket on base_port
    Task->>CallbackSrv: TCP callback: {host, port}
    CallbackSrv->>Adapter: bridge.connect() (internal pydevd port)
    CallbackSrv->>Adapter: pydevdAuthorize handshake
    CallbackSrv->>Adapter: pydevdSystemInfo handshake
    CallbackSrv->>Task: remote.connect(task_host:base_port)
    Note over CallbackSrv: pipe threads: Adapter ↔ Task

    VSCode-->>Adapter: auto-attach new session
    Adapter-->>Task: DAP debug traffic (via bridge)

Last reviewed commit: 39a49d3

if self._wait_for_client:
    cli_args.env[_ENV_WAIT_FOR_CLIENT] = "1"

task = cli_args.task

Unused variable task

task = cli_args.task is assigned but never referenced again. This appears to be dead code left over from development. It should be removed.

Suggested change
task = cli_args.task

Comment on lines +228 to +242
"body": {
    "python": {
        "version": "3.11.0",
        "implementation": {"name": "cpython", "version": "3.11.0"},
    },
    "platform": {"name": "linux"},
    "process": {
        "pid": _next_fake_pid(),
        "ppid": adapter_info["parent_pid"],
        "executable": "python",
        "bitness": 64,
    },
},
},
)

Hardcoded Python version and platform in pydevdSystemInfo

The pydevdSystemInfo response sent to the adapter is fully hardcoded with "version": "3.11.0", "platform": {"name": "linux"}, and "bitness": 64. This will be incorrect for:

  • Remote containers running a different Python version (3.9, 3.10, 3.12, etc.)
  • Non-Linux containers (e.g., Windows-based images)
  • 32-bit environments

The adapter uses this information to populate the IDE's debug session metadata. While the bridge approach may still forward packets correctly, VSCode/debugpy could behave unexpectedly or display misleading process information. These values should be dynamically populated from the actual remote task environment (e.g., sent back as part of the callback JSON payload alongside host and port).
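One way to populate these fields dynamically is for the task to collect them from its own interpreter and ship them alongside host and port in the callback JSON. A hedged sketch — the function name and payload shape are hypothetical, not part of this PR:

```python
import platform
import struct
import sys

def collect_system_info():
    # Gather the values that the pydevdSystemInfo response currently
    # hardcodes, from the interpreter actually running the remote task.
    return {
        "python": {
            "version": platform.python_version(),
            "implementation": {
                "name": platform.python_implementation().lower(),
                "version": platform.python_version(),
            },
        },
        "platform": {"name": sys.platform},
        "bitness": struct.calcsize("P") * 8,  # 32 or 64, from pointer size
    }
```

The callback server could then splice this dict into the pydevdSystemInfo response body instead of the hardcoded values.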

Comment on lines +532 to +540
base_port = int(os.environ[_ENV_BASE_PORT])

import debugpy._vendored.force_pydevd # noqa: F401
import pydevd

# Pre-create the listening socket before sending the callback to
# eliminate the race where the runtime tries to connect before
# pydevd.settrace() has bound the port.
server_sock = _create_listen_socket("", base_port)

Port collision for concurrent remote tasks on the same node

In remote mode, every task binds to the same base_port (default 5678). If multiple tasks from a parallel step (e.g., @parallel or a foreach) are scheduled on the same worker node, every task after the first will fail at _create_listen_socket("", base_port) with an "Address already in use" error.

Even if Kubernetes/Batch typically schedules one task per pod, the single fixed port also means a second --with debugger run that reuses a node could collide with a lingering debugpy listener from the previous run.

Consider allocating a per-task port, for example by using port 0 (letting the OS pick an ephemeral port) and then reporting the actual bound port in the callback payload — the rest of the machinery already supports this since task_port is read from the callback JSON.
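The ephemeral-port approach can be sketched as follows. This is standalone illustration, not the actual task code; variable names follow the snippet above:

```python
import json
import socket

# Bind port 0 so the OS picks a free port for this task, instead of a
# fixed base_port shared by every task on the node.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(("127.0.0.1", 0))
server_sock.listen(1)
_, task_port = server_sock.getsockname()

# Hypothetical callback payload; since the runtime already reads the
# task's port from this JSON, reporting the real bound port suffices.
payload = json.dumps({"host": "127.0.0.1", "port": task_port})
server_sock.close()
```

Two concurrent tasks on the same node each get a distinct OS-assigned port, so the "Address already in use" failure mode disappears.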

Comment on lines +176 to +184
try:
# Read endpoint JSON from the task.
data = b""
while True:
chunk = conn.recv(4096)
if not chunk:
break
data += chunk
conn.close()

No read timeout on callback connection

The conn socket accepted from _accept_loop has no receive timeout. If a remote task connects to the callback server but stalls before finishing the JSON payload (e.g., due to a network glitch), _handle_callback will block indefinitely on conn.recv(4096). Since this runs in a daemon thread the main process won't hang, but it will silently consume a thread slot and delay bridge setup.

Consider adding a conn.settimeout(30) before the read loop, similar to the 30-second timeout applied to the outgoing callback socket in _task_listen.
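A bounded read loop might look like this. The helper name is hypothetical, and whether a timeout should be treated as end-of-payload or as a fatal error is a design choice left open here:

```python
import socket

def read_callback_payload(conn, timeout=30.0):
    # Bound the read so a stalled sender cannot block the callback
    # handler thread indefinitely.
    conn.settimeout(timeout)
    data = b""
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # sender closed the connection cleanly
                break
            data += chunk
    except socket.timeout:
        pass  # or raise, if a partial payload should be treated as fatal
    finally:
        conn.close()
    return data
```

A well-behaved task that sends its JSON and closes the socket is unaffected; only a stalled connection gives up after the timeout.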

Comment on lines +93 to +114
def _load_cmd_cls(class_path, name):
path, cls_name = class_path.rsplit(".", 1)
try:
cmd_module = importlib.import_module(path)
except ImportError:
raise ValueError("Cannot locate command '%s' at '%s'" % (name, path))
cls = getattr(cmd_module, cls_name, None)
if cls is None:
raise ValueError(
"Cannot locate '%s' class for command at '%s'" % (cls_name, path)
)
all_cmds = list(cls.commands)
if len(all_cmds) > 1:
raise ValueError(
"%s defines more than one command -- use a group" % path
)
if all_cmds[0] != name:
raise ValueError(
"%s: expected name to be '%s' but got '%s' instead"
% (path, name, all_cmds[0])
)
return cls

Inner function defined inside loop on every iteration

_load_cmd_cls is defined anew on every iteration of for name in set_of_commands:. While this doesn't cause a bug here (it doesn't close over the loop variable), it unnecessarily creates a new function object each time and makes the structure harder to follow.

Consider hoisting _load_cmd_cls to module level or at least to the top of resolve_cmds() so it's defined once.
