Initial support for remote debugging with VSCode #3033
romain-intel wants to merge 1 commit into master
Conversation
…that uses debugpy)
This allows you to remote-debug your Metaflow flow from the comfort of your
IDE. It behaves exactly as if you were debugging locally, and all launched
tasks appear as subprocesses in the VSCode debugging call stack.
To try it out:
- write a simple flow (or anything really)
- In the directory with your flow, run `metaflow debug vscode install-config --remote-root <root>`
- Set a few breakpoints in VSCode in your file
- Launch your flow using `python ./myflow.py run --with debugger`
- The flow will launch and then stop for you to attach the debugger. The install-config step
will have created a `Metaflow: Attach` debugging configuration; launch that
- Enjoy!
This needs more testing, and making sure a mix of local and remote nodes also works (unclear yet),
but it's a start and shows that it is possible.
| """ | ||
| server = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) | ||
| server.bind(("0.0.0.0", 0)) |
Check warning, Code scanning / CodeQL: Binding a socket to all network interfaces (severity: Medium)

Copilot Autofix:
In general, to fix this type of issue you avoid binding listening sockets to 0.0.0.0 unless you explicitly require remote access from arbitrary hosts, and instead bind to a specific interface (most commonly 127.0.0.1 for local-only access). If multiple interfaces must be supported, you create one socket per interface instead of a single socket bound to all.
For this specific function _start_callback_server in metaflow/plugins/debugger_step_decorator.py, the best minimal change is to bind the callback server to the loopback interface. This keeps callbacks available to local tools (e.g., VSCode/debugpy) while preventing exposure on external interfaces. Concretely, change line 261 from server.bind(("0.0.0.0", 0)) to server.bind(("127.0.0.1", 0)). No new imports or additional methods are required, and the rest of the logic (accept loop, threading, return of dynamically assigned port) remains unchanged.
```diff
@@ -258,7 +258,7 @@
     """
     server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
-    server.bind(("0.0.0.0", 0))
+    server.bind(("127.0.0.1", 0))
     server.listen(16)
     _, port = server.getsockname()
```
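The effect of the suggested change can be illustrated with a stdlib-only sketch: bound to `127.0.0.1` with port `0`, the callback server stays reachable from local tools (e.g., the VSCode/debugpy adapter) while remaining invisible to other hosts, and the OS still picks a free ephemeral port that the code can report back.

```python
# Stdlib-only sketch of loopback binding with an OS-assigned ephemeral port.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 0))   # loopback only; port 0 = let the OS choose
server.listen(16)
host, port = server.getsockname()  # discover which port was assigned
server.close()
```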
```python
    """Create a TCP server socket bound to *host*:*port*."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
```
Check warning, Code scanning / CodeQL: Binding a socket to all network interfaces (severity: Medium)

Copilot Autofix:
In general, to fix this issue you either (1) restrict the binding address to a safe, explicit interface (like 127.0.0.1 or a configured host), or (2) enforce that callers cannot pass values that mean “all interfaces” ("0.0.0.0", "", etc.). Since this helper is meant to bind to an arbitrary host provided by callers, the best fix without changing external behavior is to validate host and raise a clear error if an unsafe “all interfaces” value is supplied. Callers that already pass a proper interface will be unaffected; callers that were implicitly relying on “all interfaces” will now fail fast instead of silently exposing the service.
Concretely, in metaflow/plugins/debugger_step_decorator.py around line 281, update _create_listen_socket so that it checks host against disallowed values before calling bind. For example, treat None, "", "0.0.0.0", and "::" as invalid and raise a MetaflowException (already imported at the top of the file) with a descriptive message. Then keep the rest of the socket creation logic unchanged. No new imports are required, and existing behavior for valid host values remains the same.
```diff
@@ -280,6 +280,13 @@
 def _create_listen_socket(host, port):
     """Create a TCP server socket bound to *host*:*port*."""
+    # Avoid binding to all interfaces (e.g. "0.0.0.0" or an empty string),
+    # which would expose the debug listener on every network interface.
+    if host in (None, "", "0.0.0.0", "::"):
+        raise MetaflowException(
+            "Refusing to bind debug listener to all interfaces; "
+            "please provide a specific interface address (e.g. 127.0.0.1)."
+        )
     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
     sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
     sock.bind((host, port))
```
Greptile Summary

This PR introduces remote debugging support for Metaflow flows via VSCode and debugpy.

Key changes:
Notable issues found:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant VSCode
    participant Adapter as debugpy adapter (local :5678)
    participant Runtime as Metaflow Runtime
    participant CallbackSrv as Callback Server (ephemeral)
    participant Task as Remote Task (Batch/K8s)
    Note over Runtime: runtime_init()
    Runtime->>Adapter: debugpy.listen(:5678)
    Runtime->>CallbackSrv: _start_callback_server()
    Runtime->>VSCode: Wait for attach (if wait_for_client=True)
    VSCode->>Adapter: DAP attach
    Note over Task: task_pre_step() — listen mode
    Task->>Task: bind server socket on base_port
    Task->>CallbackSrv: TCP callback: {host, port}
    CallbackSrv->>Adapter: bridge.connect() (internal pydevd port)
    CallbackSrv->>Adapter: pydevdAuthorize handshake
    CallbackSrv->>Adapter: pydevdSystemInfo handshake
    CallbackSrv->>Task: remote.connect(task_host:base_port)
    Note over CallbackSrv: pipe threads: Adapter ↔ Task
    VSCode-->>Adapter: auto-attach new session
    Adapter-->>Task: DAP debug traffic (via bridge)
```

Last reviewed commit: 39a49d3
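The "pipe threads: Adapter ↔ Task" step can be sketched as a generic bidirectional socket relay. This is a simplified stand-in for the PR's bridge, with illustrative names (`_pipe`, `bridge`), not the actual implementation:

```python
import socket
import threading

def _pipe(src, dst):
    # Copy bytes one way until the source side closes.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # propagate EOF to the other side
        except OSError:
            pass

def bridge(a, b):
    # Two daemon threads relay traffic in both directions: a <-> b.
    t1 = threading.Thread(target=_pipe, args=(a, b), daemon=True)
    t2 = threading.Thread(target=_pipe, args=(b, a), daemon=True)
    t1.start()
    t2.start()
    return t1, t2
```

Daemon threads keep the relay from blocking process shutdown, mirroring how the callback server's threads are described in the review.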
```python
        if self._wait_for_client:
            cli_args.env[_ENV_WAIT_FOR_CLIENT] = "1"

        task = cli_args.task
```
Unused variable task
`task = cli_args.task` is assigned but never referenced again. This appears to be dead code left over from development; it should be removed.
| "body": { | ||
| "python": { | ||
| "version": "3.11.0", | ||
| "implementation": {"name": "cpython", "version": "3.11.0"}, | ||
| }, | ||
| "platform": {"name": "linux"}, | ||
| "process": { | ||
| "pid": _next_fake_pid(), | ||
| "ppid": adapter_info["parent_pid"], | ||
| "executable": "python", | ||
| "bitness": 64, | ||
| }, | ||
| }, | ||
| }, | ||
| ) |
Hardcoded Python version and platform in pydevdSystemInfo
The pydevdSystemInfo response sent to the adapter is fully hardcoded with "version": "3.11.0", "platform": {"name": "linux"}, and "bitness": 64. This will be incorrect for:
- Remote containers running a different Python version (3.9, 3.10, 3.12, etc.)
- Non-Linux containers (e.g., Windows-based images)
- 32-bit environments
The adapter uses this information to populate the IDE's debug session metadata. While the bridge approach may still forward packets correctly, VSCode/debugpy could behave unexpectedly or display misleading process information. These values should be dynamically populated from the actual remote task environment (e.g., sent back as part of the callback JSON payload alongside host and port).
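A minimal sketch of how the remote task could collect these values dynamically, using only the standard library (the helper name `collect_system_info` and the exact shape expected by the adapter are assumptions; `sys.platform` values like `"win32"` may also need mapping to whatever names the adapter expects):

```python
# Sketch (not part of the PR): gather real interpreter/platform details on
# the remote task so they can travel back in the callback JSON payload
# instead of being hardcoded in the pydevdSystemInfo response.
import os
import platform
import struct
import sys

def collect_system_info():
    version = "%d.%d.%d" % sys.version_info[:3]
    return {
        "python": {
            "version": version,
            "implementation": {
                "name": platform.python_implementation().lower(),
                "version": version,
            },
        },
        "platform": {"name": sys.platform},
        "process": {
            "pid": os.getpid(),
            "executable": sys.executable,
            # 8-byte pointers -> 64-bit interpreter, 4 bytes -> 32-bit.
            "bitness": struct.calcsize("P") * 8,
        },
    }
```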
```python
        base_port = int(os.environ[_ENV_BASE_PORT])

        import debugpy._vendored.force_pydevd  # noqa: F401
        import pydevd

        # Pre-create the listening socket before sending the callback to
        # eliminate the race where the runtime tries to connect before
        # pydevd.settrace() has bound the port.
        server_sock = _create_listen_socket("", base_port)
```
Port collision for concurrent remote tasks on the same node
In remote mode, every task binds to the same base_port (default 5678). If multiple tasks from a parallel step (e.g., @parallel or a foreach) are scheduled on the same worker node, every task after the first will fail at _create_listen_socket("", base_port) with an "Address already in use" error.
Even if Kubernetes/Batch typically schedules one task per pod, the single fixed port also means a second --with debugger run that reuses a node could collide with a lingering debugpy listener from the previous run.
Consider allocating a per-task port, for example by using port 0 (letting the OS pick an ephemeral port) and then reporting the actual bound port in the callback payload — the rest of the machinery already supports this since task_port is read from the callback JSON.
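A stdlib sketch of that suggestion (the helper name `bind_ephemeral` and the payload shape are illustrative): bind to port `0` so the kernel assigns a free port, then report the port actually bound in the callback payload.

```python
import json
import socket

def bind_ephemeral(host="127.0.0.1"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, 0))          # port 0: kernel assigns a free port
    sock.listen(16)
    _, port = sock.getsockname()  # discover the port actually assigned
    return sock, port

sock, port = bind_ephemeral()
# The callback machinery already reads task_port from this JSON, so the
# runtime side would not need to change.
payload = json.dumps({"host": socket.gethostname(), "port": port})
sock.close()
```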
```python
        try:
            # Read endpoint JSON from the task.
            data = b""
            while True:
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            conn.close()
```
No read timeout on callback connection
The `conn` socket accepted from `_accept_loop` has no receive timeout. If a remote task connects to the callback server but stalls before finishing the JSON payload (e.g., due to a network glitch), `_handle_callback` will block indefinitely on `conn.recv(4096)`. Since this runs in a daemon thread the main process won't hang, but it will silently consume a thread slot and delay bridge setup.
Consider adding a `conn.settimeout(30)` before the read loop, similar to the 30-second timeout applied to the outgoing callback socket in `_task_listen`.
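A self-contained sketch of that fix (function and demo names are illustrative): with a timeout on the accepted socket, a stalled peer raises `socket.timeout` instead of blocking the handler thread forever.

```python
import socket

def read_payload(conn, timeout=30):
    conn.settimeout(timeout)  # applies to every recv() on this socket
    data = b""
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    except socket.timeout:
        # Stalled peer: give up instead of blocking the thread forever.
        return None
    finally:
        conn.close()
    return data

# Demo: a client that connects, sends nothing, and never closes.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
client = socket.socket()
client.connect(server.getsockname())
conn, _ = server.accept()
result = read_payload(conn, timeout=0.2)  # short timeout for the demo
client.close()
server.close()
```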
```python
            def _load_cmd_cls(class_path, name):
                path, cls_name = class_path.rsplit(".", 1)
                try:
                    cmd_module = importlib.import_module(path)
                except ImportError:
                    raise ValueError("Cannot locate command '%s' at '%s'" % (name, path))
                cls = getattr(cmd_module, cls_name, None)
                if cls is None:
                    raise ValueError(
                        "Cannot locate '%s' class for command at '%s'" % (cls_name, path)
                    )
                all_cmds = list(cls.commands)
                if len(all_cmds) > 1:
                    raise ValueError(
                        "%s defines more than one command -- use a group" % path
                    )
                if all_cmds[0] != name:
                    raise ValueError(
                        "%s: expected name to be '%s' but got '%s' instead"
                        % (path, name, all_cmds[0])
                    )
                return cls
```
Inner function defined inside loop on every iteration
`_load_cmd_cls` is defined anew on every iteration of `for name in set_of_commands:`. While this doesn't cause a bug here (it doesn't close over the loop variable), it unnecessarily creates a new function object each time and makes the structure harder to follow.
Consider hoisting `_load_cmd_cls` to module level, or at least to the top of `resolve_cmds()`, so it is defined once.