rekx 🦖
Experimental
Everything is under heavy development and subject to change! Interested? Take a peek at the To Do list!
What?
rekx is an interactive command line interface to the Kerchunk library [2].
It assists in creating virtual aggregate datasets, also known as Kerchunk reference sets,
which allow efficient, parallel and cloud-friendly access to data in situ,
without duplicating the original datasets.
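To make the idea of a reference set concrete, here is a minimal sketch in plain Python. It assumes the version 1 Kerchunk reference layout, in which each chunk key maps to a `[url, offset, length]` entry; the file, the variable name `temperature` and the `read_chunk` helper are hypothetical and exist only for illustration. Real reference sets also carry Zarr metadata entries and are normally read through fsspec rather than by hand.

```python
import os
import tempfile

# Stand-in for an original data file: a header followed by two raw chunks.
header = b"HEADER--"
chunk_a = b"AAAAAAAAAAAA"
chunk_b = b"BBBBBBBBBBBB"
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    f.write(header + chunk_a + chunk_b)

# A reference set records where each chunk lives: [url, offset, length].
# The original file is never copied or rewritten.
references = {
    "version": 1,
    "refs": {
        "temperature/0": [path, len(header), len(chunk_a)],
        "temperature/1": [path, len(header) + len(chunk_a), len(chunk_b)],
    },
}

def read_chunk(reference_set, key):
    """Read one chunk in place using only its byte range (hypothetical helper)."""
    url, offset, length = reference_set["refs"][key]
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)

first = read_chunk(references, "temperature/0")
second = read_chunk(references, "temperature/1")
print(first, second)
```

Because each chunk is addressed by an independent byte range, chunks can be fetched in parallel, and the same approach works over remote storage that supports range requests.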
More than a functional tool, rekx serves an educational purpose:
it covers chunking, compression and efficient data reading
from common scientific file formats such as NetCDF,
which is used extensively to store large time series.
Documentation on these topics is abundant,
but it is often highly technical and oriented towards developers.
rekx tries to simplify these concepts through practical examples.
Why?
Existing tools for managing HDF and NetCDF data, such as cdo, nco and others,
often overlap in functionality and present a steep learning curve for non-experts.
rekx focuses on the practical aspects of efficient data access
and tries to simplify these processes.
It features simple command line tools to:
- diagnose data structures
- validate uniform chunking across files
- suggest good chunking shapes
- parameterise the rechunking of datasets
- create and aggregate Kerchunk reference sets
- time data read operations for performance analysis
rekx is dedicated to practicality, simplicity and the essentials.
The Zen of Chunking
Chunks are equal-sized data blocks.
Chunks are required for compression, extendible data and subsetting large datasets.
Small chunks lead to a large number of chunks.
Large chunks lead to a small number of chunks.
Appropriately sized chunks can improve performance.
Unthoughtfully sized chunks will decrease performance.
Good chunk sizes depend on data access patterns.
Good chunk sizes balance read/write operations and computational efficiency.
There is no one-size-fits-all.
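The trade-offs above can be illustrated with a little arithmetic. The sketch below is not part of rekx: the dataset shape, the three chunk layouts and the helper functions are assumptions chosen for illustration. It counts how many chunks each layout produces and how many chunks two typical reads would touch.

```python
import math

def chunk_count(shape, chunks):
    """Total number of chunks needed to tile an array of `shape`."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

def chunks_touched(chunks, selection):
    """Number of chunks overlapping a selection of half-open (start, stop) ranges."""
    return math.prod(
        (stop - 1) // c - start // c + 1
        for c, (start, stop) in zip(chunks, selection)
    )

# Hypothetical dataset: hourly values for one year on a 1-degree global grid.
shape = (8760, 180, 360)  # (time, lat, lon)

layouts = {
    "map-oriented": (1, 180, 360),   # one chunk per time step
    "time-oriented": (8760, 1, 1),   # one chunk per grid cell
    "balanced": (365, 30, 30),       # a compromise between the two
}

# Two access patterns: a full time series at one cell, and one full map.
time_series = ((0, 8760), (10, 11), (20, 21))
single_map = ((100, 101), (0, 180), (0, 360))

for name, chunks in layouts.items():
    print(
        name,
        chunk_count(shape, chunks),
        chunks_touched(chunks, time_series),
        chunks_touched(chunks, single_map),
    )
```

In this example the map-oriented layout serves a single map from 1 chunk but needs 8760 chunks for a time series, the time-oriented layout does the opposite (1 versus 64800), and the balanced layout touches 24 and 72 chunks respectively. Which layout is best depends entirely on how the data will be read.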
[1] Original T-Rex drawn by pikisuperstar on Freepik.
[2] Martin Durant, Max Jones, Ryan Abernathey, David Hoese, and James Bednar. Pangeo-ML Augmentation - Enabling Cloud-native access to archival data with Kerchunk. March 2023. URL: https://figshare.com/articles/preprint/Pangeo-ML_Augmentation_-_Enabling_Cloud-native_access_to_archival_data_with_Kerchunk/22266433, doi:10.6084/m9.figshare.22266433.v1.