
Write out zarr with combined variables#2

Merged
abarciauskas-bgse merged 13 commits into developmentseed:main from EarthStackLLC:combine_variables
Dec 22, 2025

Conversation

@jbusecke

This PR is built on top of #1 and aims to produce a 'collapsed' group structure by combining all variables belonging to any combination of model/experiment into a single group.

Modifications

  • I am now parallelizing with lithops over the batches, instead of within each batch. This is needed for combining variables, but it might also speed up Virtualize with lithops #1, because there are 400+ batches with about 15-30 members each.
  • Temporary for testing: I am writing to a local filesystem store and using a local lithops runner config, until I can confirm which google project to use.
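The map-over-batches pattern above can be sketched with the stdlib `concurrent.futures` standing in for lithops. Note that `process_batch` here is a hypothetical placeholder, not the PR's actual function, which virtualizes the files of one model/experiment batch:

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    """Hypothetical stand-in for the real per-batch virtualization step."""
    # The real function would virtualize the ~15-30 members of this batch
    # and return a success/failure record.
    return {"batch": batch, "success": True}

def process_all_batches(batches, max_workers=10):
    # One task per batch (rather than one per file), mirroring the
    # lithops map-over-batches approach described above.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_batch, batches))
```

With lithops the same shape would use a `FunctionExecutor` and its `map`/`get_result` calls, so switching between a local runner and cloud functions is mostly a config change.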

Results

I am currently running the code in a notebook like this:

from virtualize_with_lithops_combine_variables import process_all_files
result = process_all_files()
  • I was able to run through all files in less than 2 hours. Note that my 1G internet connection was mostly saturated with 10 lithops workers, so I expect this to scale better on Google Cloud Functions!
  • I can open the store with a vanilla xr.open_datatree() call (which builds all the indices) in ~35s, which is a significant improvement over the more complex group structure
# test reading
from virtualize_with_lithops_combine_variables import open_or_create_repo
import xarray as xr
repo_read = open_or_create_repo()
session = repo_read.readonly_session('main')
dt = xr.open_datatree(session.store, engine='zarr')
  • The tree has 378 leaves currently (out of 440 expected).

Remaining errors
Most of the remaining errors occur during virtualization. I have started to fix them, but also wanted feedback (@espg, @abarciauskas-bgse, @maxrjones, @sharkinsspatial). I believe most of these will carry over to #1 too!

  • Fixed: one model ("SICOPOLIS1") uses NetCDF3 files; I worked around this with a hardcoded parser selection. We could make this more sophisticated, but I think it works for now.
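One way the hardcoded parser selection could be made more sophisticated is to sniff the file signature: classic NetCDF3 files begin with the bytes `CDF`, while NetCDF4 files are HDF5 containers and begin with the HDF5 magic number. A minimal, hypothetical helper (not part of this PR):

```python
def sniff_netcdf_format(first_bytes: bytes) -> str:
    """Guess the NetCDF flavor from the first few bytes of a file."""
    if first_bytes.startswith(b"\x89HDF\r\n\x1a\n"):
        return "netcdf4"  # HDF5-based NetCDF4
    if first_bytes.startswith((b"CDF\x01", b"CDF\x02")):
        return "netcdf3"  # classic / 64-bit-offset NetCDF3
    return "unknown"
```

Reading the first 8 bytes of each object before choosing a parser would cost one extra small GET per file, so whether that is worth it over a per-model hardcode is a judgment call.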

I can group the remaining errors like this:

# hacky error parsing: group failed results by error message
from collections import defaultdict

fails_by_error = defaultdict(list)
for r in result['failed']:
    fails_by_error[r['error']].append(r)

def get_id_from_batch(result):
    b = result['batch']
    return f"{b['institution_id']}_{b['source_id']}/{b['experiment_id']}"

id_by_error = {
    err: [get_id_from_batch(r) for r in results]
    for err, results in fails_by_error.items()
}
id_by_error
Details
{"Virtualizing failed with cannot align objects with join='override' with matching indexes along dimension 'time' that don't have the same size": ['CPOM_BISICLES/expT71_08',
  'CPOM_BISICLES/expT73_08',
  'LSCE_GRISLI2/ctrl_proj_std',
  'LSCE_GRISLI2/exp05',
  'LSCE_GRISLI2/exp06',
  'LSCE_GRISLI2/exp07',
  'LSCE_GRISLI2/exp08',
  'LSCE_GRISLI2/exp09',
  'LSCE_GRISLI2/exp10',
  'LSCE_GRISLI2/exp12',
  'LSCE_GRISLI2/exp13',
  'LSCE_GRISLI2/expA5',
  'LSCE_GRISLI2/expA6',
  'LSCE_GRISLI2/expA7',
  'LSCE_GRISLI2/expA8',
  'LSCE_GRISLI2/expB10',
  'LSCE_GRISLI2/expB6',
  'LSCE_GRISLI2/expB7',
  'LSCE_GRISLI2/expB8',
  'LSCE_GRISLI2/expB9',
  'LSCE_GRISLI2/expC1',
  'LSCE_GRISLI2/expC10',
  'LSCE_GRISLI2/expC12',
  'LSCE_GRISLI2/expC3',
  'LSCE_GRISLI2/expC4',
  'LSCE_GRISLI2/expC6',
  'LSCE_GRISLI2/expC7',
  'LSCE_GRISLI2/expC9',
  'LSCE_GRISLI2/expD1',
  'LSCE_GRISLI2/expD10',
  'LSCE_GRISLI2/expD11',
  'LSCE_GRISLI2/expD12',
  'LSCE_GRISLI2/expD13',
  'LSCE_GRISLI2/expD3',
  'LSCE_GRISLI2/expD51',
  'LSCE_GRISLI2/expD52',
  'LSCE_GRISLI2/expD53',
  'LSCE_GRISLI2/expD54',
  'LSCE_GRISLI2/expD55',
  'LSCE_GRISLI2/expD56',
  'LSCE_GRISLI2/expD57',
  'LSCE_GRISLI2/expD58',
  'LSCE_GRISLI2/expD6',
  'LSCE_GRISLI2/expD7',
  'LSCE_GRISLI2/expD8',
  'LSCE_GRISLI2/expD9',
  'LSCE_GRISLI2/expE10',
  'LSCE_GRISLI2/expE6',
  'LSCE_GRISLI2/expE7',
  'LSCE_GRISLI2/expE8',
  'LSCE_GRISLI2/expE9'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD14%2Fzvelsurf%5FAIS%5FLSCE%5FGRISLI2%5FexpD14.nc in 30.109337042s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD14'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD15%2Fzvelsurf%5FAIS%5FLSCE%5FGRISLI2%5FexpD15.nc in 30.108617459s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD15'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD16%2Fzvelsurf%5FAIS%5FLSCE%5FGRISLI2%5FexpD16.nc in 30.109858917s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD16'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD17%2Fzvelbase%5FAIS%5FLSCE%5FGRISLI2%5FexpD17.nc in 30.108704s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD17'],
 'Virtualizing failed with Generic GCS error: Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD18%2Fzvelbase%5FAIS%5FLSCE%5FGRISLI2%5FexpD18.nc in 30.108935709s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD18'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD2%2Fzvelbase%5FAIS%5FLSCE%5FGRISLI2%5FexpD2.nc in 30.11049675s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD2'],
 'Virtualizing failed with Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD4%2Fligroundf%5FAIS%5FLSCE%5FGRISLI2%5FexpD4.nc in 3.222380541s - HTTP error: error sending request': ['LSCE_GRISLI2/expD4'],
 'Virtualizing failed with Generic GCS error: Generic GCS error: Error performing GET https://storage.googleapis.com/ismip6/Projection%2DAIS%2FLSCE%2FGRISLI2%2FexpD5%2Fligroundf%5FAIS%5FLSCE%5FGRISLI2%5FexpD5.nc in 30.109925375s, after 1 retries, max_retries: 10, retry_timeout: 180s  - HTTP error: error sending request': ['LSCE_GRISLI2/expD5'],
 'Virtualizing failed with Unable to synchronously open file (file signature not found)': ['LSCE_GRISLI2/hist_std'],
 "Virtualizing failed with unable to determine if these variables should be coordinates or not in the merged result: {'lon', 'lat'}": ['NCAR_CISM/expD12'],
 'Virtualizing failed with early eof': ['ULB_fETISh_16km/expA6']}

There are a few general categories:

  • Virtualizing failed with cannot align objects with join='override' with matching indexes along dimension 'time' that don't have the same size: This simply means that not all files have the same size in the time dimension. I have not looked into this in detail, but it seems to be largely the 'LSCE_GRISLI2' model. @espg I would be curious if you have thoughts on how to handle this.
  • Generic GCS errors: I hope these are just transient, but I can investigate.
  • Possibly corrupted files: I interpret 'Unable to synchronously open file (file signature not found)' and 'early eof' as signs of possibly corrupted source files. I need to confirm that, but I wonder if it is within scope to redownload/check certain files via Globus.
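If the GCS errors do turn out to be transient, a simple retry with exponential backoff around the per-file virtualization call might be enough. A hypothetical sketch (the wrapped function and the caught exception types are stand-ins, not the PR's actual code):

```python
import time

def with_retries(fn, *args, max_retries=3, base_delay=1.0, exceptions=(OSError,)):
    """Call fn(*args), retrying on the given exceptions with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except exceptions:
            if attempt == max_retries:
                raise  # out of retries, surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

The underlying object store client already retries GETs (the errors above mention max_retries: 10), so an outer wrapper like this would mainly help with failures that exhaust the client's own retry budget.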

Todo

  • I need to implement error handling and logging for the new logic; I currently only parse errors manually in a notebook.
  • Discuss and finish handling the errors (or leave them as is for now).

@espg

espg commented Dec 16, 2025

Looks like we're already getting a pretty large performance boost opening things locally-- excited to see how this translates for opening a cloud based store, let us know how it goes!

@jbusecke for this issue with the time dimension, I'm wondering if we can take a similar approach to what pandas does for joining two dataframes (say one representing salinity and the other maybe temperature) that are time-indexed with disjoint / partially overlapping time indices... the behavior there is to match rows between the dataframes that have the same timestamp, and then fill in NaN for values that don't have a match for that row / column combo. My assumption (hope?) is that the LSCE_GRISLI2 model is probably missing a few dates-- so we should populate the datetime-indexed arrays that match and exist in the rest of the datacube, and then fill any missing dates with NaN.

This is obviously harder to do if the mismatch is in the other direction (i.e., if LSCE_GRISLI2 has extra dates), since my understanding is that we need to define the xarray output coordinate space ahead of time and can't change it. It's also an issue we'll probably have to figure out for ISMIP7 down the line... we were at a meeting for that effort yesterday, and for the version 7 intercomparison they're allowing flexibility in what year the model simulations start, so some might begin in 1990 and others could begin in 2015. My guess is we can accommodate this if we set bounds on the allowed start and end dates, but it will only take one submission that exceeds the range to break the grouped approach here, so it sounds like we should set that limit for the group now and enforce it on dataset submission.

@jbusecke
Author

Hey @espg,

I think that we can basically choose from the xarray built in options for join:

join ({"outer", "inner", "left", "right", "exact", "override"}, default: "outer") – String indicating how to combine differing indexes in objects.

    “outer”: use the union of object indexes

    “inner”: use the intersection of object indexes

    “left”: use indexes from the first object with each dimension

    “right”: use indexes from the last object with each dimension

    “exact”: instead of aligning, raise ValueError when indexes to be aligned are not equal

    “override”: if indexes are of same size, rewrite indexes to be those of the first object with that dimension. Indexes for the same dimension must have the same size in all objects.

I believe you are pointing to the 'outer' method. Let me see if that works with the ManifestArrays in principle, and report back.
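To illustrate the difference on plain (non-virtual) arrays: join='override' requires equal-sized indexes, while join='outer' takes the union of the indexes and fills the gaps with NaN. A small sketch with toy yearly data; whether this also works with ManifestArrays is exactly what needs checking:

```python
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, 2.0, 3.0], dims="time", coords={"time": [2015, 2016, 2017]})
b = xr.DataArray([10.0, 20.0], dims="time", coords={"time": [2015, 2016]})

# 'outer' aligns on the union of the time indexes; b gains a NaN slot for 2017
a2, b2 = xr.align(a, b, join="outer")
```

The NaN fill is what makes 'outer' attractive for the LSCE_GRISLI2 case, but it implies writing fill values for chunks that have no backing file, which may interact with the virtual references.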

@abarciauskas-bgse

@jbusecke in order to keep working incrementally, do you think we can merge this, close #1, and then open issues for the ongoing work? That ongoing work would be what you have already enumerated above as outstanding errors, plus putting the output store in Google Cloud Storage with a demonstration notebook.

@jbusecke
Author

Yeah we totally could, but maybe I should verify that it actually runs via cloud functions?

@abarciauskas-bgse

@jbusecke yes or if it's running fine locally on your laptop, perhaps just document how to run it locally, if it's not a priority to re-run it on the cloud (yet).

@jbusecke
Author

Ok I have wrapped this up locally for now:
Key files and changes:

@jbusecke
Author

Should I point this to #1 @abarciauskas-bgse? I based this PR on yours, so we can either merge this here or merge them in order. No preference from my end.

@abarciauskas-bgse

I think we should just merge this one to developmentseed:main and close #1.
