Description
Goal
I'd like to bring new index types to anndata to support a few use cases, i.e.,
adata = AnnData(obs=pd.DataFrame(index=some_new_index...)...)
adata[some_subset_of_the_index] # works
without the index being converted to strings, as is usually done, so that subsequent operations also work smoothly.
Use cases
- Geometry-based indexes: Points or shapes that represent "segmentations" or "observations" in an image can be used as indices. This is similar to https://geopandas.org/en/stable/, but instead of designating a column as the "geometry" and then doing operations on that, we'd make it a first-class citizen. So we need to investigate why GeoPandas made the decision not to use indexes (i.e., is it because "An Index instance can only contain hashable objects"? Are geoarrow objects not hashable? What about shapely?).
- "Anonymous" indexes: A cartesian product of coordinates can be used for pixel-based annotation in proteomics, see https://github.com/complextissue/spatiomic. Similarly, https://icb-pandas-uuid.readthedocs-hosted.com/en/latest/ could be used to save space on string indexes that lack semantics!
- Multi-level annotations: See "Mapping between features for MS-proteomics support" (mudata#111) for the proteomics use case. Maybe this is a `pandas.MultiIndex` and maybe not; that is not clear yet.
Getting started
I've started a branch that should allow declaring a new AnnData object with these indexes and, ideally, running basic operations on it: https://github.com/scverse/anndata/tree/ig/custom_index_objects
We'll need to work out the specifics of constructing a pandas.Index object in each of the above cases.
- Geometry-based indexes: We should probably start with `GeometryArray`, which is private in GeoPandas, and restrict ourselves to arrow inputs only (or maybe shapely, but just start with one of the two, although I don't really see why you would use shapely as the "backing format" for the array ATM).
- "Anonymous" indexes: This one is probably simple and can be done via a `pandas.Index` wrapping a custom `ExtensionArray` whose `__getitem__` just generates `X`-`Y` coordinates in row-major order based on the initializing shape, i.e., `XYArray(shape=[5, 5])[11] = (2, 1)` if I got that right.
- Multi-level annotations: I think starting with a `MultiIndex` and then seeing whether it meets the needs makes sense before moving to something more complicated.
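The row-major arithmetic for the "anonymous" case can be sketched in plain Python. This is only the indexing logic, under the assumption that coordinates are (row, column) pairs; the `XYArray` class below is hypothetical and is not a real pandas `ExtensionArray`, which would need considerably more interface methods.

```python
from typing import Tuple


class XYArray:
    """Hypothetical lazy sequence of (row, col) pairs in row-major order.

    Only illustrates the index arithmetic; it does NOT implement the
    pandas ExtensionArray interface.
    """

    def __init__(self, shape: Tuple[int, int]) -> None:
        self.shape = tuple(shape)

    def __len__(self) -> int:
        n_rows, n_cols = self.shape
        return n_rows * n_cols

    def __getitem__(self, i: int) -> Tuple[int, int]:
        if not -len(self) <= i < len(self):
            raise IndexError(i)
        i %= len(self)  # map negative positions onto positive ones
        # Row-major order: the column varies fastest, so
        # row = i // n_cols and col = i % n_cols.
        return divmod(i, self.shape[1])


coords = XYArray(shape=(5, 5))
print(coords[11])  # (2, 1)
```

Nothing is materialized here: each coordinate is computed on demand from the shape, which is the property that would make such an index cheap to hold.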
Then, once we are confident about the pandas.Index object, we can try putting it inside an AnnData object.
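One way to build that confidence before involving AnnData is to check on a plain DataFrame that subsetting keeps the index type intact. A minimal sketch for the `MultiIndex` case follows; the protein and peptide labels and the `score` column are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical protein/peptide labels, purely for illustration.
index = pd.MultiIndex.from_tuples(
    [("P1", "pepA"), ("P1", "pepB"), ("P2", "pepC")],
    names=["protein", "peptide"],
)
obs = pd.DataFrame({"score": [0.1, 0.2, 0.3]}, index=index)

# Positional subsetting keeps the index as a MultiIndex.
subset = obs.iloc[[0, 2]]
assert isinstance(subset.index, pd.MultiIndex)

# Label-based selection on one level; xs drops that level by default.
p1 = obs.xs("P1", level="protein")
print(list(p1.index))
```

Whether the same round trip survives inside an AnnData object is exactly what the branch above is meant to find out.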
Limitations
See pandas-dev/pandas#64889 for potential hiccups with unhashable arrow objects
Furthermore, this says nothing about serializability: none of these would be writable to disk, and all will need custom I/O handling. Luckily, I don't think that's needed to create immediately useful things.
- Geometry-based indexes: These are written with parquet anyway in the spatialdata format, so they just need to be read into the `AnnData` object. We can then prevent writing on the `AnnData` side via a new setting or something like "feat: `AnnData.can_write` based on `AnnData.reduce` + refactored `AnnData.__sizeof__`" (anndata#2372).
- "Anonymous" indexes: These are totally anonymous, and you probably wouldn't want them serialized anyway.
- Multi-level annotations: I am not clear how these are constructed ATM - from databases? So do they need to be serializable or just reconstructable from some simple metadata?
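To illustrate what "reconstructable from some simple metadata" could look like, here is a sketch using the anonymous coordinate case, since its metadata is just a shape. The `coords_from_metadata` helper and the `"shape"` key are assumptions made up for this example, not an existing anndata or spatialdata format.

```python
import json
from itertools import product
from typing import List, Tuple


def coords_from_metadata(meta_json: str) -> List[Tuple[int, ...]]:
    """Rebuild row-major coordinates from a JSON blob holding only a shape.

    Hypothetical helper: the {"shape": [...]} layout is an assumption.
    """
    shape = json.loads(meta_json)["shape"]
    # product() varies the last axis fastest, i.e. row-major order.
    return list(product(*(range(n) for n in shape)))


meta = json.dumps({"shape": [2, 3]})  # all that would be stored
coords = coords_from_metadata(meta)
print(coords)
```

If multi-level annotations can be regenerated from a similarly small description, they may not need a dedicated on-disk representation either.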