rekx 🦖

rekx [1]

Experimental

Everything is under heavy development and subject to change! Interested? Have a peek at the To Do list!

What?

rekx is an interactive command line interface to the Kerchunk library [2]. It assists in creating virtual aggregate datasets, also known as Kerchunk reference sets, which allow efficient, parallel and cloud-friendly access to data in situ without duplicating the original datasets.

More than a functional tool, rekx serves an educational purpose on matters around chunking, compression and efficient data reading from common scientific file formats such as NetCDF, which is used extensively to store large time series. While there is abundant documentation on such topics, it is often highly technical and oriented towards developers. rekx tries to simplify these concepts through practical examples.

Why?

Existing tools for managing HDF and NetCDF data, such as cdo and nco, often overlap in functionality and present a steep learning curve for non-experts. rekx focuses on the practical aspects of efficient data access and tries to simplify these processes.

It features simple command line tools to:

  • diagnose data structures
  • validate uniform chunking across files
  • suggest good chunking shapes
  • parameterise the rechunking of datasets
  • create and aggregate Kerchunk reference sets
  • time data read operations for performance analysis

rekx is dedicated to practicality, simplicity, and the essentials.

The Zen of Chunking

Chunks are equal-sized data blocks.

Chunks are required for compression, extendible data and subsetting large datasets.

Small chunks lead to a large number of chunks.

Large chunks lead to a small number of chunks.

Appropriately sized chunks can improve performance.

Carelessly sized chunks will decrease performance.

Good chunk sizes depend on data access patterns.

Good chunk sizes balance read/write operations and computational efficiency.

There is no one size fits all.
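
The arithmetic behind these statements can be sketched in a few lines of plain Python. This is only an illustration and not part of rekx: the dataset shape, the two chunk shapes and the helper names (`chunk_grid`, `total_chunks`, `chunks_touched`) are all assumptions made up for this example. It counts how many chunks a given chunk shape produces, and how many of them a read has to touch for two different access patterns.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last one may be partial)."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunks))


def total_chunks(shape, chunks):
    """Total number of chunks needed to cover an array of the given shape."""
    return math.prod(chunk_grid(shape, chunks))


def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a selection of (start, stop) per dimension."""
    count = 1
    for (start, stop), c in zip(selection, chunks):
        first = start // c          # index of the first chunk hit
        last = (stop - 1) // c      # index of the last chunk hit
        count *= last - first + 1
    return count


# Hypothetical dataset: one year of hourly values on a 180 x 360 grid,
# with dimensions ordered as (time, lat, lon).
shape = (8760, 180, 360)

# Two hypothetical chunk shapes.
map_chunks = (1, 180, 360)      # small chunks: one full map per time step
series_chunks = (8760, 18, 36)  # large chunks: full time series per spatial tile

# Two access patterns.
time_series = ((0, 8760), (90, 91), (180, 181))  # all time steps at one grid cell
single_map = ((0, 1), (0, 180), (0, 360))        # the full grid at one time step

print(total_chunks(shape, map_chunks), total_chunks(shape, series_chunks))
print(chunks_touched(time_series, map_chunks), chunks_touched(time_series, series_chunks))
print(chunks_touched(single_map, map_chunks), chunks_touched(single_map, series_chunks))
```

With these made-up numbers, the small chunks give 8760 chunks and the large ones give 100. Reading a time series at one grid cell touches 8760 of the map-shaped chunks but only 1 of the series-shaped chunks, while reading a single map touches 1 and 100 respectively. Neither layout wins for both reads, which is the point of the last line above.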


  1. Original T-Rex drawn by pikisuperstar on Freepik 

  2. Martin Durant, Max Jones, Ryan Abernathey, David Hoese, and James Bednar. Pangeo-ML Augmentation - Enabling Cloud-native access to archival data with Kerchunk. March 2023. URL: https://figshare.com/articles/preprint/Pangeo-ML_Augmentation_-_Enabling_Cloud-native_access_to_archival_data_with_Kerchunk/22266433, doi:10.6084/m9.figshare.22266433.v1