Investigate New Indexes for AnnData #1

@ilan-gold

Goal

I'd like to bring new index types into anndata to support a few use cases, i.e.,

adata = AnnData(obs=pd.DataFrame(index=some_new_index...)...)
adata[some_subset_of_the_index] # works

without the index being converted to strings, as is usually done, so that subsequent operations work smoothly.

Use cases

Getting started

I've started a branch that should allow constructing a new AnnData object with these indexes and, ideally, performing basic operations on it: https://github.com/scverse/anndata/tree/ig/custom_index_objects

We'll need to work out the specifics of constructing a pandas.Index object in each of the above cases.

  • Geometry-based indexes: We should probably start with GeometryArray, which is private in GeoPandas, and restrict ourselves to Arrow inputs only (or maybe Shapely, but we should start with just one of the two; I don't really see why you would use Shapely as the "backing format" for the array at the moment).
  • "Anonymous" indexes: This one is probably simple and can be done via a pandas.Index wrapping a custom ExtensionArray whose __getitem__ just generates X-Y coordinates in row-major order based on the initializing shape, i.e., XYArray(shape=[5, 5])[11] = (2, 1), if I got that right.
  • Multi-level annotations: I think it makes sense to start with a MultiIndex and see whether it meets the needs before moving to something more complicated.
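To make the row-major arithmetic in the "anonymous" index bullet concrete, here is a minimal sketch. It is a plain Python class, not a real pandas ExtensionArray (which would need many more methods), and the name XYArray and its constructor are simply taken from the example above as a hypothetical interface. It assumes the coordinate is reported as (row, column).

```python
from typing import Tuple


class XYArray:
    """Hypothetical lazy coordinate array (NOT a pandas ExtensionArray).

    Nothing is stored except the grid shape; coordinates are computed
    on demand from the flat position, in row-major order.
    """

    def __init__(self, shape: Tuple[int, int]) -> None:
        self.shape = tuple(shape)

    def __len__(self) -> int:
        n_rows, n_cols = self.shape
        return n_rows * n_cols

    def __getitem__(self, i: int) -> Tuple[int, int]:
        n = len(self)
        if not -n <= i < n:
            raise IndexError(f"index {i} is out of bounds for length {n}")
        if i < 0:
            i += n  # support negative positions like a normal sequence
        # Row-major: row = i // n_cols, column = i % n_cols.
        return divmod(i, self.shape[1])


coords = XYArray(shape=(5, 5))
example = coords[11]  # divmod(11, 5)
```

Under the (row, column) assumption, position 11 in a 5 x 5 grid is (2, 1), which matches the value guessed above. If the intended convention is (x, y) with x as the column, the tuple would be reversed.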

Then, once we are confident about the pandas.Index object, we can try putting it inside an AnnData object.

Limitations

See pandas-dev/pandas#64889 for potential hiccups with unhashable arrow objects

Furthermore, this says nothing of serializability: none of these would be writable to disk, and they will all need custom I/O handling. Luckily, I don't think that's needed to create immediately useful things.

  • Geometry-based indexes: These are written with parquet anyway in the spatialdata format, so they just need to be read into the AnnData object. We can then prevent writing on the AnnData side via a new setting, or something like anndata#2372 ("feat: AnnData.can_write based on AnnData.reduce + refactored AnnData.__sizeof__").
  • "Anonymous" indexes: These are totally anonymous, so you probably wouldn't want them serialized anyway.
  • Multi-level annotations: I am not clear how these are constructed ATM - from databases? So do they need to be serializable or just reconstructable from some simple metadata?
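On the "reconstructable from some simple metadata" question for multi-level annotations, the following is a rough pandas-only sketch of what that could look like if a plain MultiIndex turns out to be enough. The level names and labels are made up for illustration, and the metadata dictionary layout is an assumption, not an existing anndata format.

```python
import pandas as pd

# Illustrative two-level annotation; the labels are invented.
index = pd.MultiIndex.from_tuples(
    [("T cell", "CD4"), ("T cell", "CD8"), ("B cell", "naive")],
    names=["lineage", "subtype"],
)

# Hypothetical "simple metadata": plain lists that could be stored
# as ordinary arrays or JSON instead of serializing the index itself.
metadata = {
    "names": list(index.names),
    "levels": [list(level) for level in index.levels],
    "codes": [[int(c) for c in codes] for codes in index.codes],
}

# Rebuild the index from the metadata alone.
rebuilt = pd.MultiIndex(
    levels=metadata["levels"],
    codes=metadata["codes"],
    names=metadata["names"],
)

# The rebuilt index supports level-based selection without any
# conversion of the labels to flat strings.
obs = pd.DataFrame({"n_counts": [10, 20, 30]}, index=rebuilt)
t_cells = obs.xs("T cell", level="lineage")
```

Whether this is sufficient depends on how the annotations are actually produced; if they come from a database, the query itself might be the more natural piece of metadata to keep.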
