Chunking shape
The chunking shape refers to the chunk sizes of the variables typically
found in a NetCDF file, or in any other Xarray-supported file format.
rekx can scan a source_directory for files that match a given pattern
and report the chunking shapes across all of them.
Given the following files in the current directory
SISin200001010000004231000101MA_1_2600_2600.nc
SISin202001010000004231000101MA.nc
SRImm201301010000003231000101MA.nc
we scan for filenames starting with SIS and having the suffix .nc:
rekx shapes . --pattern "SIS*.nc"
Variable   Shapes        Files                                          C…
──────────────────────────────────────────────────────────────────────────────
SIS        1 x 1 x 26…   SISin202001010000004231000101MA.nc             1
SIS        1 x 2600 x…   SISin200001010000004231000101MA_1_2600_2600.…  1
lat_bnds   2600 x 2      SISin200001010000004231000101MA_1_2600_2600.…  2
time       512           SISin202001010000004231000101MA.nc             1
time       1             SISin200001010000004231000101MA_1_2600_2600.…  1
record_s…  48            SISin202001010000004231000101MA.nc             1
record_s…  1             SISin200001010000004231000101MA_1_2600_2600.…  1
lon_bnds   2600 x 2      SISin200001010000004231000101MA_1_2600_2600.…  2
lat        2600          SISin200001010000004231000101MA_1_2600_2600.…  2
lon        2600          SISin200001010000004231000101MA_1_2600_2600.…  2
or restrict the same scan to data variables only:
Varia…  Shapes             Files                                         Cou…
──────────────────────────────────────────────────────────────────────────────
SIS     1 x 1 x 2600       SISin202001010000004231000101MA.nc            1
SIS     1 x 2600 x 2…      SISin200001010000004231000101MA_1_2600_2600…  1
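The file selection step above can be pictured with Python's standard `fnmatch` module. This is only an illustrative sketch and not rekx's own implementation: the helper name `select_files` is hypothetical, and it assumes the pattern is interpreted as a case-sensitive, shell-style wildcard.

```python
from fnmatch import fnmatchcase

# File names copied from the directory listing above.
filenames = [
    "SISin200001010000004231000101MA_1_2600_2600.nc",
    "SISin202001010000004231000101MA.nc",
    "SRImm201301010000003231000101MA.nc",
]


def select_files(names, pattern):
    """Return the names matching a shell-style wildcard pattern (hypothetical helper)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Keep only names that start with "SIS" and end with ".nc".
selected = select_files(filenames, "SIS*.nc")
print(selected)
```

Under these assumptions the two SIS files are kept and the SRI file is left out, which agrees with the files reported in the tables above.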
Uniform chunking shape?
We can also verify that a single chunking shape is shared by all input files. For example, in a directory containing:
SISin202001010000004231000101MA.nc
SISin202001010000004231000101MA_structure.csv
SISin202001020000004231000101MA.nc
sarah3_sis_kerchunk_reference_json
sarah3_sis_kerchunk_references_json
we navigate to that directory and scan it for chunking shapes:
rekx shapes .
Variable  Shapes         Files                                     Count
─────────────────────────────────────────────────────────────────────────
SIS       1 x 1 x 2600   SISin202001020000004231000101MA.nc ..     2
We can confirm that this is the one and only chunking shape via --validate-consistency
Next, let us scan another directory, whose files have the following shapes:
Varia…  Shapes             Files                                         Cou…
──────────────────────────────────────────────────────────────────────────────
SIS     1 x 1 x 2600       SISin202001010000004231000101MA.nc            1
SIS     1 x 2600 x 2…      SISin200001010000004231000101MA_1_2600_2600…  1
we check for chunking consistency and expect a negative response, since there is more than one shape.
Interested in a longer table? Use the verbosity flag:
🗴 Variables are not consistently shaped across all files!
SIS
Variable  Shape                Files
─────────────────────────────────────────────────────────────────────────────
SIS       1 x 1 x 2600         SISin202001010000004231000101MA.nc
SIS       1 x 2600 x 2600      SISin200001010000004231000101MA_1_2600_2600.nc
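The idea behind the check above can be sketched in a few lines of plain Python: collect the distinct chunking shapes of each variable and call the set of files consistent only if every variable has exactly one. This is a minimal sketch and not rekx's implementation. The function names are hypothetical, and the records are typed in by hand from the table above rather than read from the files.

```python
from collections import defaultdict

# (variable, chunking shape, file) records copied from the table above.
records = [
    ("SIS", (1, 1, 2600), "SISin202001010000004231000101MA.nc"),
    ("SIS", (1, 2600, 2600), "SISin200001010000004231000101MA_1_2600_2600.nc"),
]


def shapes_per_variable(records):
    """Map each variable name to the set of distinct shapes seen for it."""
    shapes = defaultdict(set)
    for variable, shape, _filename in records:
        shapes[variable].add(shape)
    return dict(shapes)


def is_consistent(records):
    """True only if every variable has exactly one chunking shape."""
    return all(len(shapes) == 1 for shapes in shapes_per_variable(records).values())


consistent = is_consistent(records)
print("consistent" if consistent else "not consistent")
```

With the two SIS records above the sketch reports an inconsistency, in line with the negative answer shown in the output.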
Maximum common chunking shape
It might be useful to know the maximum common chunking shape across files of a product series, like the SIS or SID products from the SARAH3 climate data records.
Say, for example, a directory contains the following SARAH3 products:
SISin200001010000004231000101MA_1_2600_2600.nc
SISin202001010000004231000101MA.nc
SRImm201301010000003231000101MA.nc
rekx will fetch the maximum common shapes like so:
Output format subject to change
The output format will likely be merged into a single table that features both the Shapes and Common Shape columns.
Variable    Common Shape
───────────────────────────────
SRI         1 x 4 x 401 x 401
kato_bnds   29 x 2
SIS         1 x 2600 x 2600
Variab…  Shapes              Files                                         C…
──────────────────────────────────────────────────────────────────────────────
SRI      1 x 4 x 401 x…      SRImm201301010000003231000101MA.nc            1
kato_b…  29 x 2              SRImm201301010000003231000101MA.nc            1
SIS      1 x 1 x 2600        SISin202001010000004231000101MA.nc            1
SIS      1 x 2600 x 26…      SISin200001010000004231000101MA_1_2600_2600…  1
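Comparing the two tables, the common shape reported for SIS (1 x 2600 x 2600) equals the largest chunk size found along each dimension of its two shapes (1 x 1 x 2600 and 1 x 2600 x 2600). Assuming that this per-dimension maximum is what "maximum common shape" means, the computation can be sketched as below. This reading is inferred from the example output only; the function name is hypothetical and rekx's actual implementation may differ.

```python
def maximum_common_shape(shapes):
    """Per-dimension maximum over chunking shapes of equal rank (assumed definition)."""
    ranks = {len(shape) for shape in shapes}
    if len(ranks) != 1:
        raise ValueError("need at least one shape, all with the same number of dimensions")
    # zip(*shapes) groups the chunk sizes dimension by dimension.
    return tuple(max(sizes) for sizes in zip(*shapes))


# The two SIS chunking shapes listed in the table above.
sis_shapes = [(1, 1, 2600), (1, 2600, 2600)]
common = maximum_common_shape(sis_shapes)
print(" x ".join(str(size) for size in common))  # 1 x 2600 x 2600
```

A variable found in a single file, such as kato_bnds with 29 x 2, simply keeps its own shape under this definition.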
Consistency
Consider a case where we only want to know
whether our NetCDF data are uniformly chunked or not.
No more and no less than a yes or a no answer.
rekx can validate a uniform chunking shape across multiple NetCDF files.
Let's list some NetCDF files that differ in their chunking shapes:
data/multiple_files_multiple_products/SISin200001010000004231000101MA_1_2600_2600.nc
data/multiple_files_multiple_products/SISin202001010000004231000101MA.nc
data/multiple_files_multiple_products/SRImm201301010000003231000101MA.nc
From the file names, we expect at least two different chunking shapes. Let's first try the files that share a common naming pattern:
cd data/multiple_files_multiple_products/
rekx shapes . --pattern "SIS*MA.nc" --validate-consistency
Indeed, the requested files are chunked identically.
What about the other file, SISin200001010000004231000101MA_1_2600_2600.nc?
Voilà, this is a no!