[For discussion only]tolerate empty __init__.py in namespace package discovery#3007
[For discussion only]tolerate empty __init__.py in namespace package discovery#3007liuyinglao wants to merge 1 commit intomasterfrom
Conversation
Build systems like Bazel's rules_python auto-generate empty (0-byte)
__init__.py files for every directory, including those that are meant
to be implicit namespace packages. This causes metaflow's extension
discovery to reject the metaflow_extensions namespace package with:
RuntimeError: Package '...' providing 'metaflow_extensions' is not
an implicit namespace package as required
An empty __init__.py is functionally equivalent to no __init__.py for
namespace package purposes, so we can safely skip it.
Two changes:
1. At the metaflow_extensions/ top level: skip empty __init__.py
instead of raising RuntimeError, since the file carries no package
initialization code.
2. At extension point directories (e.g. alias/, config/): do not treat
an empty __init__.py as a configuration file. Only __init__.py
files with actual content should be considered config modules.
Fixes compatibility with Bazel's rules_python which sets
enable_implicit_namespace_pkgs=True but still generates empty
__init__.py files in extracted pip wheels.
Made-with: Cursor
Greptile SummaryThis PR fixes Metaflow's extension discovery to tolerate empty (0-byte) Key changes:
Issues found:
Confidence Score: 2/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[process_file called] --> B{parts is EXT_PKG\nand len greater than 1?}
B -- No --> Z[Ignore file]
B -- Yes --> C{Top-level file?\nlen == 2 and parts-1 == init}
C -- Yes --> D[Build init_path via\nos.path.join root_dir + file]
D --> E{isfile and size > 0?}
E -- Yes, non-empty --> F[raise RuntimeError\nnot a NS package]
E -- No, empty or bad path --> G[return early]
C -- No --> H{Meta regexp match?}
H -- Yes --> I[Record meta_module]
H -- No --> J[Add to state files]
J --> K{Extension point depth?\nconfig file check}
K -- EXT_CONFIG_REGEXP --> L[Set config_module]
K -- init.py + isfile + size > 0 --> L
K -- init.py + isfile returns False\nbad path for pth or PYTHONPATH --> M[config_module stays None]
L --> N[Register extension package]
M --> O[RuntimeError later:\nno config file defined]
Last reviewed commit: c8db26e |
| init_path = os.path.join(root_dir, file) | ||
| if os.path.isfile(init_path) and os.path.getsize(init_path) > 0: |
There was a problem hiding this comment.
Incorrect path construction for non-distribution (PYTHONPATH / .pth) packages
os.path.join(root_dir, file) produces the wrong absolute path when process_file is called from the additional-dirs walk (call site 2, line 756) or the PYTHONPATH walk (call site 3, line 871).
In those two call sites, root_dir already ends with EXT_PKG (e.g. /path/to/metaflow_extensions), while file starts with EXT_PKG (e.g. metaflow_extensions/__init__.py). The join therefore becomes /path/to/metaflow_extensions/metaflow_extensions/__init__.py, a path that does not exist. os.path.isfile(init_path) returns False, the RuntimeError is never raised, and the function silently returns — even when the __init__.py is non-empty. This is a regression: the original code would have raised the error unconditionally.
The rest of the function consistently uses os.path.join(root_dir, *parts[1:]) (stripping the leading metaflow_extensions component) when it needs a real on-disk path (full_path_files, line 571). The same pattern is correct here:
init_path = os.path.join(root_dir, *parts[1:])With root_dir = "/path/to/metaflow_extensions" and parts = ["metaflow_extensions", "__init__.py"]:
os.path.join(root_dir, *parts[1:])→/path/to/metaflow_extensions/__init__.py✓os.path.join(root_dir, file)→/path/to/metaflow_extensions/metaflow_extensions/__init__.py✗
Note that for the distribution call site (line 723), root_dir is the parent of metaflow_extensions, so os.path.join(root_dir, *parts[1:]) would give /dist_root/__init__.py which is also wrong for that call site. A safe approach is to probe both candidates and pick whichever exists:
init_path_v1 = os.path.join(root_dir, file) # correct for dist
init_path_v2 = os.path.join(root_dir, *parts[1:]) # correct for addl_dirs / PYTHONPATH
init_path = init_path_v1 if os.path.isfile(init_path_v1) else init_path_v2or alternatively, accept the absolute path as a parameter from the call sites that already walk the filesystem.
| and os.path.isfile(os.path.join(root_dir, *parts)) | ||
| and os.path.getsize(os.path.join(root_dir, *parts)) > 0 |
There was a problem hiding this comment.
Same incorrect path for non-distribution packages in the config-module check
os.path.join(root_dir, *parts) has exactly the same double-EXT_PKG issue as the change above. When root_dir ends with metaflow_extensions, the join produces /path/to/metaflow_extensions/metaflow_extensions/X/alias/__init__.py instead of /path/to/metaflow_extensions/X/alias/__init__.py.
os.path.isfile(...) returns False, so a non-empty extension-point __init__.py in a PYTHONPATH/pth-installed package is no longer recognized as a configuration module. The package then hits the "does not define a configuration file" RuntimeError at line 933 — a breaking regression for any non-distribution consumer that uses __init__.py as the extension-point init file.
The correct path (consistent with line 571's full_path_files) is:
os.path.join(root_dir, *parts[1:])Because parts[0] is always EXT_PKG, stripping it yields the path relative to the EXT_PKG root, regardless of how root_dir was constructed.
| and os.path.isfile(os.path.join(root_dir, *parts)) | ||
| and os.path.getsize(os.path.join(root_dir, *parts)) > 0 | ||
| ) |
There was a problem hiding this comment.
Duplicate path computation
os.path.join(root_dir, *parts) is evaluated twice — once for isfile and again for getsize. Consider computing it once:
| and os.path.isfile(os.path.join(root_dir, *parts)) | |
| and os.path.getsize(os.path.join(root_dir, *parts)) > 0 | |
| ) | |
| or ( | |
| parts[-1] == "__init__.py" | |
| and os.path.getsize( | |
| p := os.path.join(root_dir, *parts) | |
| ) > 0 | |
| if os.path.isfile( | |
| p := os.path.join(root_dir, *parts) | |
| ) | |
| else False | |
| ) |
Or more readably, extract a local variable before the outer if:
_init_abs = os.path.join(root_dir, *parts[1:])
if len(parts) == len(ext_list) + 3 and (
EXT_CONFIG_REGEXP.match(parts[-1]) is not None
or (
parts[-1] == "__init__.py"
and os.path.isfile(_init_abs)
and os.path.getsize(_init_abs) > 0
)
):
Build systems like Bazel's rules_python auto-generate empty (0-byte) init.py files for every directory, including those that are meant to be implicit namespace packages. This causes metaflow's extension discovery to reject the metaflow_extensions namespace package with:
An empty init.py is functionally equivalent to no init.py for namespace package purposes, so we can safely skip it.
Two changes:
At the metaflow_extensions/ top level: skip empty init.py instead of raising RuntimeError, since the file carries no package initialization code.
At extension point directories (e.g. alias/, config/): do not treat an empty init.py as a configuration file. Only init.py files with actual content should be considered config modules.
Fixes compatibility with Bazel's rules_python which sets enable_implicit_namespace_pkgs=True but still generates empty init.py files in extracted pip wheels.
Made-with: Cursor
Here's the filled-in PR template:
PR Type
Summary
Tolerate empty (0-byte)
__init__.pyfiles inmetaflow_extensionsnamespace package discovery. Build systems like Bazel'srules_pythonauto-generate these for every directory, breaking metaflow's extension loading withRuntimeError: Package '...' providing 'metaflow_extensions' is not an implicit namespace package as required.Issue
No existing issue. This affects any metaflow user running under Bazel with
rules_pythonpip integration.Fixes compatibility with Bazel's
rules_python(tested with v1.8.4) which generates empty__init__.pyfiles in extracted pip wheels even whenenable_implicit_namespace_pkgs=True.Reproduction
Runtime: local (Bazel
rules_python1.8.4, Python 3.10)Commands to run:
Where evidence shows up: parent console
Before (error / log snippet)
Additionally, even after patching only the top-level check, a second error surfaces at extension point directories (e.g.
alias/,config/) where empty__init__.pyis incorrectly treated as a configuration file:After (evidence that fix works)
No workaround code needed in the flow file -- plain
from metaflow import ...works.Root Cause
Bazel's
rules_pythonextracts each pip wheel into its own isolatedsite-packagesdirectory. For namespace packages likemetaflow_extensions, it generates empty (0-byte)__init__.pyfiles even whenenable_implicit_namespace_pkgs=Trueis set (this is a knownrules_pythonissue).In
_get_extension_packages()→process_file():Line 534: Any
metaflow_extensions/__init__.pytriggers an unconditionalRuntimeError, even if the file is empty and carries no package initialization code.Line 597-599:
__init__.pyin extension point directories (e.g.alias/) is unconditionally treated as a configuration file. When Bazel putsnflx-metaflowandnflx-fastdatain separatesys.pathentries, each has its ownalias/__init__.py(auto-generated, empty). This causes a "more than one configuration file" conflict since the empty__init__.pyis counted alongside the realmfextinit_*.pyconfig.In a standard pip install, both packages merge into a single
site-packages/metaflow_extensions/directory with no__init__.pyat the namespace levels, so neither issue surfaces.Why This Fix Is Correct
An empty
__init__.pyis functionally equivalent to no__init__.pyfor namespace package purposes -- it contains no initialization code and Python treats the directory identically. The fix preserves the existing invariant (reject non-empty__init__.pythat would makemetaflow_extensionsa regular package) while tolerating the harmless empty files that build tools generate.The fix is minimal: two
os.path.getsize() == 0checks, no behavioral change for any non-empty__init__.py.Failure Modes Considered
Non-empty
__init__.pystill rejected: If someone intentionally creates ametaflow_extensions/__init__.pywith content (which would break namespace package merging), theRuntimeErrorstill fires becausegetsize() > 0.Missing file on disk: If
__init__.pyappears in a directory listing but doesn't exist on disk (e.g. broken symlink),os.path.isfile()returnsFalseand we skip it, which is the safe default -- a missing file can't break namespace package semantics.Backward compatibility with pip installs: In a standard pip/uv install,
metaflow_extensions/__init__.pydoesn't exist at all, so the new code path is never reached. Behavior is identical to before.Extension config file detection: Empty
__init__.pyin extension point dirs (e.g.alias/) is no longer treated as a config module. Only__init__.pywith actual content and files matchingEXT_CONFIG_REGEXPare considered. This matches the intended behavior since empty__init__.pywas never meant to be a config file.Tests
apply the fix locally and it works
Non-Goals
AI Tool Usage