
Kerchunk

Warning

Unsorted notes

kerchunk supports cloud-friendly access to data, with specific reference to netCDF4/HDF5 files. [1]

How? Kerchunk

  • extracts metadata in a single scan
  • arranges multiple chunks from multiple files
  • with dask and zarr, reads chunks in parallel and/or concurrently within a single indexable aggregate dataset
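
The three steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the function names `build_references` and `open_aggregate`, the `"time"` concatenation dimension, and the `"file"` protocol are assumptions made for the example, and the calls rely on kerchunk, fsspec, xarray, zarr, and dask being installed. Imports sit inside the functions so the definitions load even where those packages are absent.

```python
def build_references(paths, concat_dim="time"):
    """Scan each netCDF4/HDF5 file once, then combine the per-file references.

    `paths` and `concat_dim` are hypothetical inputs for this sketch.
    """
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr
    from kerchunk.combine import MultiZarrToZarr

    per_file = []
    for path in paths:
        with fsspec.open(path, "rb") as f:
            # Single scan of the file's metadata; no array data is copied.
            per_file.append(SingleHdf5ToZarr(f, path).translate())

    # Arrange the chunks of all files into one logical dataset.
    return MultiZarrToZarr(per_file, concat_dims=[concat_dim]).translate()


def open_aggregate(refs, remote_protocol="file"):
    """Open the combined references as one lazily loaded, dask-backed dataset."""
    import xarray as xr

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        chunks={},  # ask xarray for dask arrays so chunk reads can be parallel
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": refs, "remote_protocol": remote_protocol},
        },
    )
```

A call such as `open_aggregate(build_references(["file_1.nc", "file_2.nc"]))` would then return one dataset spanning both files, assuming they share a compatible structure along the concatenation dimension.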

+ advantages

  • supports parallel and concurrent reads
  • memory efficiency
  • parallel processing
  • data locality

- drawbacks

  • ?

How does it work?

  • Combines fsspec, dask, and zarr
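# Pseudocode: read_chunk is a hypothetical helper that reads one chunk (one day) without loading the whole file
one_day_data = read_chunk("10_year_data_chunked.hdf5", chunk_index=0)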
  • Reference file (simplified illustration):
{
  "version": 1,
  "shapes": {"var1": [365, 24]},
  "refs": {
    "var1/0": ["file_1.nc", "0:24"],
    "var1/1": ["file_2.nc", "0:24"],
    // ...
  }
}
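
To make the idea of a reference file concrete, here is a small standard-library sketch of how chunk keys can be resolved to byte ranges and fetched concurrently. It is not kerchunk's implementation: the `(path, offset, length)` tuple layout, the helper names `read_reference` and `read_chunks`, and the stand-in binary files are all assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_reference(ref):
    """Read one chunk given a (path, offset, length) reference."""
    path, offset, length = ref
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_chunks(refs, keys, max_workers=4):
    """Fetch several chunks concurrently and return them keyed by chunk name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        payloads = pool.map(lambda key: read_reference(refs[key]), keys)
        return dict(zip(keys, payloads))


# Build two small stand-in files, each with an 8-byte header before the data.
tmpdir = tempfile.mkdtemp()
refs = {}
for i in range(2):
    path = os.path.join(tmpdir, f"file_{i + 1}.bin")
    payload = bytes([i]) * 24  # 24 bytes stand in for one day of hourly values
    with open(path, "wb") as f:
        f.write(b"HEADER__" + payload)
    refs[f"var1/{i}"] = (path, 8, 24)  # skip the header, read 24 bytes

chunks = read_chunks(refs, ["var1/0", "var1/1"])
```

Only the requested byte ranges are read, so the headers and any other chunks in each file are never loaded. Threads suit this pattern because the work is I/O-bound.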

  1. Development supported by NASA funding: https://doi.org/10.6084/m9.figshare.22266433.v1

  2. see Parallel 

  3. see Concurrency