rekx 🦖

Experimental

Everything is under heavy development and subject to change! Interested? Take a peek at the To Do list!

What?

rekx interfaces with the Kerchunk library ² interactively from the command line. It assists in creating virtual aggregate datasets, also known as Kerchunk reference sets, which allow efficient, parallel and cloud-friendly access to data in situ without duplicating the original datasets.
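To make the idea of a reference set more tangible, here is a minimal sketch using only the Python standard library. It is not how rekx or Kerchunk are implemented: the layout is only loosely modelled on Kerchunk's JSON references, and the file, the variable name `temperature`, the byte offsets and the helper `read_chunk` are all made up for illustration. The point it tries to show is that a reference maps a chunk key to a location (file, offset, length) in the original file, so the data can be read in place without being copied.

```python
import json
import os
import tempfile

# Write a small binary file that stands in for an original data file (hypothetical content).
payload = b"HEADER" + b"chunk-0!" + b"chunk-1!"
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as handle:
    handle.write(payload)
    path = handle.name

# A toy reference set: metadata is stored inline as a string,
# while each chunk key points to [file, offset, length] in the original file.
references = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "temperature/0": [path, 6, 8],
        "temperature/1": [path, 14, 8],
    },
}


def read_chunk(refs, key):
    """Fetch the bytes of one chunk straight from the original file."""
    target, offset, length = refs["refs"][key]
    with open(target, "rb") as source:
        source.seek(offset)
        return source.read(length)


first = read_chunk(references, "temperature/0")
second = read_chunk(references, "temperature/1")
os.remove(path)  # clean up the temporary stand-in file
```

Because every chunk is addressed independently, several readers could in principle fetch different chunks at the same time, which is what makes this kind of access friendly to parallel and cloud-based workflows.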

More than a functional tool, rekx serves an educational purpose on matters of chunking, compression and efficient data reading from common scientific file formats such as NetCDF, which is used extensively to store large time series. While documentation on such topics is abundant, it is often highly technical and oriented towards developers. rekx tries to simplify these concepts through practical examples.

Why?

Existing tools for managing HDF and NetCDF data, such as cdo and nco, often overlap in functionality and present a steep learning curve for non-experts. rekx focuses on the practical aspects of efficient data access and tries to simplify these processes.

It features simple command line tools to:

diagnose data structures
validate uniform chunking across files
suggest good chunking shapes
parameterise the rechunking of datasets
create and aggregate Kerchunk reference sets
time data read operations for performance analysis

rekx is dedicated to practicality, simplicity, and the essentials.
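As an illustration of what validating uniform chunking across files can mean, the following sketch groups files by the chunk shape of each variable and reports whether all files agree. It does not use rekx itself: the file names, the variable name and the chunk shapes are hypothetical, and in practice the shapes would be read from the files' metadata rather than typed in by hand.

```python
from collections import defaultdict

# Hypothetical chunk shapes per file and variable, as (time, lat, lon).
chunk_shapes = {
    "data_2020.nc": {"temperature": (48, 4, 4)},
    "data_2021.nc": {"temperature": (48, 4, 4)},
    "data_2022.nc": {"temperature": (24, 8, 8)},
}


def group_by_chunking(shapes):
    """Map each (variable, chunk shape) pair to the files that use it."""
    groups = defaultdict(list)
    for filename, variables in shapes.items():
        for variable, chunks in variables.items():
            groups[(variable, chunks)].append(filename)
    return dict(groups)


def is_uniform(shapes):
    """True when every variable has one and the same chunk shape in all files."""
    groups = group_by_chunking(shapes)
    variables = {variable for variable, _ in groups}
    return len(groups) == len(variables)


groups = group_by_chunking(chunk_shapes)
uniform = is_uniform(chunk_shapes)
```

Here `uniform` is `False` because `data_2022.nc` uses a different chunk shape, and `groups` shows which files would need rechunking before they line up with the rest.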

The Zen of Chunking

Chunks are equal-sized data blocks.

Chunks are required for compression, extendible data and subsetting large datasets.

Small chunks lead to a large number of chunks.

Large chunks lead to a small number of chunks.

Appropriately sized chunks can improve performance.

Thoughtlessly sized chunks will decrease performance.

Good chunk sizes depend on data access patterns.

Good chunk sizes balance read/write operations and computational efficiency.

There is no one-size-fits-all.
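The arithmetic behind these statements can be sketched in a few lines of Python. The array shape (hourly values for one year on a 1-degree grid), the candidate chunk shapes and the helper names `chunk_count` and `chunks_touched` are assumptions chosen for illustration, not anything rekx prescribes. The sketch counts how many chunks a dataset splits into, and how many chunks must be read for two access patterns: a full time series at one location and a single map at one time step.

```python
import math


def chunk_count(shape, chunks):
    """Total number of chunks needed to cover an array of the given shape."""
    return math.prod(math.ceil(size / chunk) for size, chunk in zip(shape, chunks))


def chunks_touched(selection, chunks):
    """Number of chunks overlapping a selection of (start, stop) pairs, stop exclusive."""
    return math.prod(
        (stop - 1) // chunk - start // chunk + 1
        for (start, stop), chunk in zip(selection, chunks)
    )


shape = (8760, 180, 360)  # hypothetical (time, lat, lon) array

time_series = ((0, 8760), (90, 91), (180, 181))  # all time steps at one grid cell
single_map = ((0, 1), (0, 180), (0, 360))        # one time step over the whole grid

summary = {}
for chunks in [(1, 180, 360), (8760, 1, 1), (24, 30, 30)]:
    summary[chunks] = {
        "total": chunk_count(shape, chunks),
        "time_series": chunks_touched(time_series, chunks),
        "single_map": chunks_touched(single_map, chunks),
    }
```

With these numbers, one chunk per time step serves a map with a single read but needs 8760 reads for a time series, one chunk per grid cell does the opposite (1 read versus 64800), and the intermediate shape `(24, 30, 30)` needs 365 and 72 reads respectively. Which of these is "good" depends entirely on how the data will be accessed.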

Original T-Rex drawn by pikisuperstar on Freepik
Martin Durant, Max Jones, Ryan Abernathey, David Hoese, and James Bednar. Pangeo-ML Augmentation - Enabling Cloud-native access to archival data with Kerchunk. March 2023. URL: https://figshare.com/articles/preprint/Pangeo-ML_Augmentation_-_Enabling_Cloud-native_access_to_archival_data_with_Kerchunk/22266433, doi:10.6084/m9.figshare.22266433.v1.