
Initial support for remote debugging with VSCode #3033

Open
romain-intel wants to merge 1 commit into master from romain/worktree/remote_debug

Conversation

@romain-intel
Contributor

This allows you to debug your Metaflow flow remotely from the comfort of your IDE. It behaves exactly as if you were debugging locally, and all launched tasks appear as subprocesses in the VSCode debugging call stack.

To try it out:

  • Write a simple flow (or anything, really)
  • In the directory containing your flow, run `metaflow debug vscode install-config --remote-root <root>`
  • Set a few breakpoints in VSCode in your file
  • Launch your flow using `python ./myflow.py run --with debugger`
  • The flow will launch and then stop for you to attach the debugger. The install-config step will have created a `Metaflow: Attach` debugging configuration. Launch that.
  • Enjoy!

This needs more testing, and we still need to make sure a mix of local and remote nodes also works (unclear yet), but it's a start and shows that it is possible.

PR Type

  • [ ] Bug fix
  • [x] New feature
  • [ ] Core Runtime change (higher bar -- see CONTRIBUTING.md)
  • [ ] Docs / tooling
  • [ ] Refactoring

Summary

Remote debugging capabilities

…at uses debugpy)

@romain-intel romain-intel requested a review from npow March 14, 2026 08:08
"""
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 0))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces (Medium)

'0.0.0.0' binds a socket to all interfaces.

Copilot Autofix (AI, 2 days ago)

In general, to fix this type of issue you avoid binding listening sockets to 0.0.0.0 unless you explicitly require remote access from arbitrary hosts, and instead bind to a specific interface (most commonly 127.0.0.1 for local-only access). If multiple interfaces must be supported, you create one socket per interface instead of a single socket bound to all.

For this specific function _start_callback_server in metaflow/plugins/debugger_step_decorator.py, the best minimal change is to bind the callback server to the loopback interface. This keeps callbacks available to local tools (e.g., VSCode/debugpy) while preventing exposure on external interfaces. Concretely, change line 261 from server.bind(("0.0.0.0", 0)) to server.bind(("127.0.0.1", 0)). No new imports or additional methods are required, and the rest of the logic (accept loop, threading, return of dynamically assigned port) remains unchanged.
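The suggested fix can be sketched in isolation. This is a minimal standalone example, not the actual `_start_callback_server` code: bind to loopback only, let the OS assign an ephemeral port (port 0), and recover the assigned port from the socket.

```python
import socket

# Bind the callback server to the loopback interface only, with an
# OS-assigned ephemeral port (port 0).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))
server.listen(16)
host, port = server.getsockname()  # recover the dynamically assigned port
server.close()
```

Only local tools (the VSCode/debugpy adapter running on the same host) can reach a socket bound this way, which is exactly the access the callback server needs.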

Suggested changeset 1
metaflow/plugins/debugger_step_decorator.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/metaflow/plugins/debugger_step_decorator.py b/metaflow/plugins/debugger_step_decorator.py
--- a/metaflow/plugins/debugger_step_decorator.py
+++ b/metaflow/plugins/debugger_step_decorator.py
@@ -258,7 +258,7 @@
     """
     server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
-    server.bind(("0.0.0.0", 0))
+    server.bind(("127.0.0.1", 0))
     server.listen(16)
     _, port = server.getsockname()
 
EOF
"""Create a TCP server socket bound to *host*:*port*."""
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind((host, port))

Check warning

Code scanning / CodeQL

Binding a socket to all network interfaces (Medium)

'' binds a socket to all interfaces.

Copilot Autofix (AI, 2 days ago)

In general, to fix this issue you either (1) restrict the binding address to a safe, explicit interface (like 127.0.0.1 or a configured host), or (2) enforce that callers cannot pass values that mean “all interfaces” ("0.0.0.0", "", etc.). Since this helper is meant to bind to an arbitrary host provided by callers, the best fix without changing external behavior is to validate host and raise a clear error if an unsafe “all interfaces” value is supplied. Callers that already pass a proper interface will be unaffected; callers that were implicitly relying on “all interfaces” will now fail fast instead of silently exposing the service.

Concretely, in metaflow/plugins/debugger_step_decorator.py around line 281, update _create_listen_socket so that it checks host against disallowed values before calling bind. For example, treat None, "", "0.0.0.0", and "::" as invalid and raise a MetaflowException (already imported at the top of the file) with a descriptive message. Then keep the rest of the socket creation logic unchanged. No new imports are required, and existing behavior for valid host values remains the same.
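The validation described above could look roughly like this. This is a standalone sketch: the helper name mirrors `_create_listen_socket` but the example raises `ValueError` rather than `MetaflowException` so it is self-contained.

```python
import socket

def create_listen_socket(host, port):
    """Hypothetical mirror of _create_listen_socket with host validation."""
    # Refuse wildcard addresses that would expose the listener on every
    # network interface.
    if host in (None, "", "0.0.0.0", "::"):
        raise ValueError(
            "Refusing to bind debug listener to all interfaces; "
            "please provide a specific interface address (e.g. 127.0.0.1)."
        )
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock
```

Failing fast here converts a silent exposure into an immediate, descriptive error for any caller that passes an "all interfaces" value.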

Suggested changeset 1
metaflow/plugins/debugger_step_decorator.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/metaflow/plugins/debugger_step_decorator.py b/metaflow/plugins/debugger_step_decorator.py
--- a/metaflow/plugins/debugger_step_decorator.py
+++ b/metaflow/plugins/debugger_step_decorator.py
@@ -280,6 +280,13 @@
 
 def _create_listen_socket(host, port):
     """Create a TCP server socket bound to *host*:*port*."""
+    # Avoid binding to all interfaces (e.g. "0.0.0.0" or an empty string),
+    # which would expose the debug listener on every network interface.
+    if host in (None, "", "0.0.0.0", "::"):
+        raise MetaflowException(
+            "Refusing to bind debug listener to all interfaces; "
+            "please provide a specific interface address (e.g. 127.0.0.1)."
+        )
     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
     sock.bind((host, port))
EOF
@greptile-apps
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR introduces remote debugging support for Metaflow flows via VSCode and debugpy. It adds a @debugger step decorator, a new metaflow debug vscode install-config CLI command for generating .vscode/launch.json, and plumbing in the Batch/Kubernetes CLI layers to forward METAFLOW_DEBUGPY_* environment variables to remote containers. The extension command registry is also refactored to support multiple providers for the same command name via subcommand merging.

Key changes:

  • metaflow/plugins/debugger_step_decorator.py — Core implementation: local mode uses debugpy.connect() back to the adapter; remote mode starts a raw pydevd listener and implements a custom TCP bridge + DAP handshake to make the remote task appear as a child process to VSCode.
  • metaflow/cmd/debug_cli.py — New CLI to write/merge a Metaflow: Attach configuration into .vscode/launch.json.
  • metaflow/extension_support/cmd.py — Command registry refactored from last-wins dict to a merge-capable list, allowing extensions to contribute subcommands to the same top-level group.
  • metaflow/plugins/aws/batch/batch_cli.py / kubernetes_cli.py — Forward debugger env vars to remote containers.

Notable issues found:

  • The pydevdSystemInfo DAP handshake response in _handle_callback uses hardcoded values ("version": "3.11.0", "platform": {"name": "linux"}, "bitness": 64) regardless of the actual remote container's Python version and OS.
  • Multiple concurrent remote tasks that land on the same worker node will all try to bind the same base_port, causing all but the first to fail with "Address already in use".
  • task = cli_args.task in runtime_step_cli is dead code.
  • No read timeout is set on incoming callback connections in _handle_callback, which can cause thread stalls on slow/partial senders.

Confidence Score: 2/5

  • Not yet safe to merge — two logic issues in the core decorator need to be resolved before this is reliable in production.
  • The ancillary files (CLI, registry, Batch/K8s forwarding) are clean and well-implemented. However, the core debugger_step_decorator.py has two concrete logic bugs: hardcoded Python/platform metadata in the DAP handshake will mislead the debugger for non-3.11/non-Linux containers, and the fixed base_port in remote mode will cause silent failures whenever two tasks share a node. The PR description also acknowledges it needs more testing ("needs more testing and making sure a mix of local and remote nodes also work"), which warrants caution before merging.
  • Pay close attention to metaflow/plugins/debugger_step_decorator.py — specifically _handle_callback (hardcoded system info) and _task_listen (port collision).

Important Files Changed

  • metaflow/plugins/debugger_step_decorator.py — New file implementing the core debugpy integration. Has two significant logic issues: hardcoded Python version/platform in the pydevdSystemInfo DAP handshake response, and a port collision risk when multiple remote tasks run on the same node. Also contains an unused task variable and a missing read timeout on callback connections.
  • metaflow/cmd/debug_cli.py — New CLI file for metaflow debug vscode install-config. Correctly handles creating/merging/overwriting .vscode/launch.json. Logic is clean and well-structured.
  • metaflow/extension_support/cmd.py — Changed command registry from last-wins dict to a list-based multi-provider merge. Logic is sound but _load_cmd_cls is redundantly redefined on every loop iteration — minor style issue.
  • metaflow/plugins/aws/batch/batch_cli.py — Small addition to forward METAFLOW_DEBUGPY_* env vars to the remote Batch container. Change is minimal and correct.
  • metaflow/plugins/kubernetes/kubernetes_cli.py — Mirrors the Batch change — forwards METAFLOW_DEBUGPY_* env vars to the remote Kubernetes pod. Minimal and correct.

Sequence Diagram

sequenceDiagram
    participant VSCode
    participant Adapter as debugpy adapter (local :5678)
    participant Runtime as Metaflow Runtime
    participant CallbackSrv as Callback Server (ephemeral)
    participant Task as Remote Task (Batch/K8s)

    Note over Runtime: runtime_init()
    Runtime->>Adapter: debugpy.listen(:5678)
    Runtime->>CallbackSrv: _start_callback_server()
    Runtime->>VSCode: Wait for attach (if wait_for_client=True)
    VSCode->>Adapter: DAP attach

    Note over Task: task_pre_step() — listen mode
    Task->>Task: bind server socket on base_port
    Task->>CallbackSrv: TCP callback: {host, port}
    CallbackSrv->>Adapter: bridge.connect() (internal pydevd port)
    CallbackSrv->>Adapter: pydevdAuthorize handshake
    CallbackSrv->>Adapter: pydevdSystemInfo handshake
    CallbackSrv->>Task: remote.connect(task_host:base_port)
    Note over CallbackSrv: pipe threads: Adapter ↔ Task

    VSCode-->>Adapter: auto-attach new session
    Adapter-->>Task: DAP debug traffic (via bridge)

Last reviewed commit: 39a49d3

if self._wait_for_client:
    cli_args.env[_ENV_WAIT_FOR_CLIENT] = "1"

task = cli_args.task

Unused variable task

task = cli_args.task is assigned but never referenced again. This appears to be dead code left over from development. It should be removed.

Suggested change
task = cli_args.task

Comment on lines +228 to +242
"body": {
    "python": {
        "version": "3.11.0",
        "implementation": {"name": "cpython", "version": "3.11.0"},
    },
    "platform": {"name": "linux"},
    "process": {
        "pid": _next_fake_pid(),
        "ppid": adapter_info["parent_pid"],
        "executable": "python",
        "bitness": 64,
    },
},
},
)

Hardcoded Python version and platform in pydevdSystemInfo

The pydevdSystemInfo response sent to the adapter is fully hardcoded with "version": "3.11.0", "platform": {"name": "linux"}, and "bitness": 64. This will be incorrect for:

  • Remote containers running a different Python version (3.9, 3.10, 3.12, etc.)
  • Non-Linux containers (e.g., Windows-based images)
  • 32-bit environments

The adapter uses this information to populate the IDE's debug session metadata. While the bridge approach may still forward packets correctly, VSCode/debugpy could behave unexpectedly or display misleading process information. These values should be dynamically populated from the actual remote task environment (e.g., sent back as part of the callback JSON payload alongside host and port).
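One way to populate these fields dynamically is for the task to collect them from its own interpreter and ship them alongside host and port in the callback JSON. A hedged sketch — the function name and payload shape are hypothetical, not part of this PR:

```python
import platform
import struct
import sys

def collect_system_info():
    # Gather the values that the pydevdSystemInfo response currently
    # hardcodes, from the interpreter actually running the remote task.
    return {
        "python": {
            "version": platform.python_version(),
            "implementation": {
                "name": platform.python_implementation().lower(),
                "version": platform.python_version(),
            },
        },
        "platform": {"name": sys.platform},
        "bitness": struct.calcsize("P") * 8,  # 32 or 64, from pointer size
    }
```

The callback server could then splice this dict into the pydevdSystemInfo response body instead of the hardcoded values.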

Comment on lines +532 to +540
base_port = int(os.environ[_ENV_BASE_PORT])

import debugpy._vendored.force_pydevd # noqa: F401
import pydevd

# Pre-create the listening socket before sending the callback to
# eliminate the race where the runtime tries to connect before
# pydevd.settrace() has bound the port.
server_sock = _create_listen_socket("", base_port)

Port collision for concurrent remote tasks on the same node

In remote mode, every task binds to the same base_port (default 5678). If multiple tasks from a parallel step (e.g., @parallel or a foreach) are scheduled on the same worker node, every task after the first will fail at _create_listen_socket("", base_port) with an "Address already in use" error.

Even if Kubernetes/Batch typically schedules one task per pod, the single fixed port also means a second --with debugger run that reuses a node could collide with a lingering debugpy listener from the previous run.

Consider allocating a per-task port, for example by using port 0 (letting the OS pick an ephemeral port) and then reporting the actual bound port in the callback payload — the rest of the machinery already supports this since task_port is read from the callback JSON.
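The ephemeral-port approach can be sketched as follows. This is standalone illustration, not the actual task code; variable names follow the snippet above:

```python
import json
import socket

# Bind port 0 so the OS picks a free port for this task, instead of a
# fixed base_port shared by every task on the node.
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_sock.bind(("127.0.0.1", 0))
server_sock.listen(1)
_, task_port = server_sock.getsockname()

# Hypothetical callback payload; since the runtime already reads the
# task's port from this JSON, reporting the real bound port suffices.
payload = json.dumps({"host": "127.0.0.1", "port": task_port})
server_sock.close()
```

Two concurrent tasks on the same node each get a distinct OS-assigned port, so the "Address already in use" failure mode disappears.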

Comment on lines +176 to +184
try:
# Read endpoint JSON from the task.
data = b""
while True:
chunk = conn.recv(4096)
if not chunk:
break
data += chunk
conn.close()

No read timeout on callback connection

The conn socket accepted from _accept_loop has no receive timeout. If a remote task connects to the callback server but stalls before finishing the JSON payload (e.g., due to a network glitch), _handle_callback will block indefinitely on conn.recv(4096). Since this runs in a daemon thread the main process won't hang, but it will silently consume a thread slot and delay bridge setup.

Consider adding a conn.settimeout(30) before the read loop, similar to the 30-second timeout applied to the outgoing callback socket in _task_listen.
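A bounded read loop might look like this. The helper name is hypothetical, and whether a timeout should be treated as end-of-payload or as a fatal error is a design choice left open here:

```python
import socket

def read_callback_payload(conn, timeout=30.0):
    # Bound the read so a stalled sender cannot block the callback
    # handler thread indefinitely.
    conn.settimeout(timeout)
    data = b""
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # sender closed the connection cleanly
                break
            data += chunk
    except socket.timeout:
        pass  # or raise, if a partial payload should be treated as fatal
    finally:
        conn.close()
    return data
```

A well-behaved task that sends its JSON and closes the socket is unaffected; only a stalled connection gives up after the timeout.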

Comment on lines +93 to +114
def _load_cmd_cls(class_path, name):
path, cls_name = class_path.rsplit(".", 1)
try:
cmd_module = importlib.import_module(path)
except ImportError:
raise ValueError("Cannot locate command '%s' at '%s'" % (name, path))
cls = getattr(cmd_module, cls_name, None)
if cls is None:
raise ValueError(
"Cannot locate '%s' class for command at '%s'" % (cls_name, path)
)
all_cmds = list(cls.commands)
if len(all_cmds) > 1:
raise ValueError(
"%s defines more than one command -- use a group" % path
)
if all_cmds[0] != name:
raise ValueError(
"%s: expected name to be '%s' but got '%s' instead"
% (path, name, all_cmds[0])
)
return cls

Inner function defined inside loop on every iteration

_load_cmd_cls is defined anew on every iteration of for name in set_of_commands:. While this doesn't cause a bug here (it doesn't close over the loop variable), it unnecessarily creates a new function object each time and makes the structure harder to follow.

Consider hoisting _load_cmd_cls to module level or at least to the top of resolve_cmds() so it's defined once.
