Chunking shape

The chunking shape refers to the chunk sizes of the variables in a NetCDF file, or in a file of any other Xarray-supported format. rekx can scan a source_directory for files that match a given pattern and report the chunking shapes across all of them.

Given the following files in the current directory

ls -1 data/multiple_files_multiple_products/
SISin200001010000004231000101MA_1_2600_2600.nc
SISin202001010000004231000101MA.nc
SRImm201301010000003231000101MA.nc

we scan for filenames that start with SIS and end with the suffix .nc:

rekx shapes data/multiple_files_multiple_products/ --pattern "SIS*.nc"
  Variable     Shapes         Files                                         C…  
 ────────────────────────────────────────────────────────────────────────────── 
  SIS          1 x 1 x 2600   SISin202001010000004231000101MA.nc            1   
  SIS          1 x 2600 x …   SISin200001010000004231000101MA_1_2600_260…   1   
  lat_bnds     2600 x 2       SISin202001010000004231000101MA.nc ..         2   
  record_st…   48             SISin202001010000004231000101MA.nc            1   
  record_st…   1              SISin200001010000004231000101MA_1_2600_260…   1   
  lon_bnds     2600 x 2       SISin202001010000004231000101MA.nc ..         2   
  time         512            SISin202001010000004231000101MA.nc            1   
  time         1              SISin200001010000004231000101MA_1_2600_260…   1   
  lon          2600           SISin202001010000004231000101MA.nc ..         2   
  lat          2600           SISin202001010000004231000101MA.nc ..         2   

or restrict the same scan to data variables only

rekx shapes data/multiple_files_multiple_products/ --pattern "SIS*.nc" --variable-set data
  Varia…   Shapes          Files                                          Cou…  
 ────────────────────────────────────────────────────────────────────────────── 
  SIS      1 x 1 x 2600    SISin202001010000004231000101MA.nc             1     
  SIS      1 x 2600 x 2…   SISin200001010000004231000101MA_1_2600_2600…   1     
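The Files and Count columns above group files by variable and chunking shape. As a rough illustration only (not rekx's actual implementation), the sketch below reproduces that grouping in plain Python. The `chunking` dictionary is transcribed by hand from the output on this page, and the helper name `group_shapes` is hypothetical.

```python
from collections import defaultdict

# Chunking shapes per file, transcribed by hand from the rekx output on this
# page (data variable SIS only). This is illustrative input, not read from disk.
chunking = {
    "SISin202001010000004231000101MA.nc": {"SIS": (1, 1, 2600)},
    "SISin200001010000004231000101MA_1_2600_2600.nc": {"SIS": (1, 2600, 2600)},
}


def group_shapes(chunking):
    """Map each (variable, shape) pair to the sorted list of files that use it."""
    groups = defaultdict(list)
    for filename, variables in chunking.items():
        for variable, shape in variables.items():
            groups[(variable, tuple(shape))].append(filename)
    return {key: sorted(files) for key, files in groups.items()}


# Print one row per (variable, shape): first file name and number of files.
for (variable, shape), files in group_shapes(chunking).items():
    print(variable, " x ".join(map(str, shape)), files[0], len(files))
```

Each distinct shape of a variable gets its own row, which is why SIS appears twice in the table above.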

Uniform chunking shape?

We can also verify that a single chunking shape is shared across all input files. For example, in a directory containing:

ls -1 data/multiple_files_unique_shape/
SISin202001010000004231000101MA.nc
SISin202001010000004231000101MA_structure.csv
SISin202001020000004231000101MA.nc
sarah3_sis_kerchunk_reference_json
sarah3_sis_kerchunk_references_json

we change into that directory and scan it for chunking shapes:

cd data/multiple_files_unique_shape/
rekx shapes . --variable-set data
  Variable   Shapes         Files                                   Count  
 ───────────────────────────────────────────────────────────────────────── 
  SIS        1 x 1 x 2600   SISin202001020000004231000101MA.nc ..   2      

We can confirm that there is only one chunking shape via --validate-consistency:

rekx shapes . --variable-set data --validate-consistency
✓ Variables are consistently shaped across all files!

Otherwise, consider another directory containing the files

ls data/multiple_files_multiple_shapes/
SISin200001010000004231000101MA_1_2600_2600.nc
SISin202001010000004231000101MA.nc

For the following shapes

cd data/multiple_files_multiple_shapes/
rekx shapes . --variable-set data
  Varia…   Shapes          Files                                          Cou…  
 ────────────────────────────────────────────────────────────────────────────── 
  SIS      1 x 1 x 2600    SISin202001010000004231000101MA.nc             1     
  SIS      1 x 2600 x 2…   SISin200001010000004231000101MA_1_2600_2600…   1     

we check for chunking consistency and expect a negative response, since there is more than one shape:

rekx shapes . --variable-set data --validate-consistency
🗴 Variables are not consistently shaped across all files!

Interested in a longer table? Use the verbosity flag:

rekx shapes . --variable-set data --validate-consistency -v
🗴 Variables are not consistently shaped across all files!
                                      SIS                                      

  Variable   Shape             Files                                           
 ───────────────────────────────────────────────────────────────────────────── 
  SIS        1 x 1 x 2600      SISin202001010000004231000101MA.nc              
  SIS        1 x 2600 x 2600   SISin200001010000004231000101MA_1_2600_2600.nc  
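Conceptually, the consistency check asks whether each variable has exactly one chunking shape across all scanned files. The following is a minimal sketch of that idea, not the rekx implementation: the function name `is_consistent` is hypothetical and the input dictionaries are transcribed by hand from the outputs above.

```python
def is_consistent(chunking):
    """Return True if every variable has exactly one chunking shape across files."""
    shapes = {}
    for variables in chunking.values():
        for variable, shape in variables.items():
            # Collect the distinct shapes seen for each variable.
            shapes.setdefault(variable, set()).add(tuple(shape))
    return all(len(found) == 1 for found in shapes.values())


# Transcribed from data/multiple_files_unique_shape/ as reported above.
unique_shape = {
    "SISin202001010000004231000101MA.nc": {"SIS": (1, 1, 2600)},
    "SISin202001020000004231000101MA.nc": {"SIS": (1, 1, 2600)},
}

# Transcribed from data/multiple_files_multiple_shapes/ as reported above.
multiple_shapes = {
    "SISin202001010000004231000101MA.nc": {"SIS": (1, 1, 2600)},
    "SISin200001010000004231000101MA_1_2600_2600.nc": {"SIS": (1, 2600, 2600)},
}

print(is_consistent(unique_shape))     # True
print(is_consistent(multiple_shapes))  # False
```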

Maximum common chunking shape

It might be useful to know the maximum common chunking shape across files of a product series, like the SIS or SID products from the SARAH3 climate data records.

Say, for example, a directory contains the following SARAH3 products:

ls data/multiple_files_multiple_products/
SISin200001010000004231000101MA_1_2600_2600.nc
SISin202001010000004231000101MA.nc
SRImm201301010000003231000101MA.nc

rekx reports the maximum common shapes like so:

Output format subject to change

The output will likely be merged into a single table that features both the Shapes and Common Shape columns.

rekx shapes . --variable-set data --common-shapes
  Variable    Common Shape       
 ─────────────────────────────── 
  SIS         1 x 2600 x 2600    
  kato_bnds   29 x 2             
  SRI         1 x 4 x 401 x 401  


  Variab…   Shapes           Files                                          C…  
 ────────────────────────────────────────────────────────────────────────────── 
  SIS       1 x 1 x 2600     SISin202001010000004231000101MA.nc             1   
  SIS       1 x 2600 x 26…   SISin200001010000004231000101MA_1_2600_2600…   1   
  kato_b…   29 x 2           SRImm201301010000003231000101MA.nc             1   
  SRI       1 x 4 x 401 x…   SRImm201301010000003231000101MA.nc             1   
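The Common Shape column above is consistent with taking, for each variable, the largest chunk size along every dimension: 1 x 1 x 2600 and 1 x 2600 x 2600 yield 1 x 2600 x 2600. Whether rekx computes it exactly this way is an assumption drawn from this output alone. The sketch below shows that per-dimension maximum, with `common_shape` as a hypothetical helper name.

```python
def common_shape(shapes):
    """Per-dimension maximum over chunking shapes of equal rank.

    Assumption: this mirrors the "Common Shape" column only as far as the
    output shown on this page suggests.
    """
    shapes = [tuple(shape) for shape in shapes]
    if not shapes:
        raise ValueError("at least one shape is required")
    rank = len(shapes[0])
    if any(len(shape) != rank for shape in shapes):
        raise ValueError("all shapes must have the same number of dimensions")
    # zip(*shapes) regroups the sizes dimension by dimension.
    return tuple(max(sizes) for sizes in zip(*shapes))


# The two SIS shapes reported above.
print(common_shape([(1, 1, 2600), (1, 2600, 2600)]))  # (1, 2600, 2600)

# A variable found in a single file keeps its shape.
print(common_shape([(29, 2)]))  # (29, 2)
```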

Consistency

Consider a case where we only want to know whether our NetCDF data are uniformly chunked or not: no more and no less than a yes-or-no answer. rekx can validate a uniform chunking shape across multiple NetCDF files.

Let's list some NetCDF files which differ in terms of their chunking shapes:

ls data/multiple_files_multiple_products/*.nc
data/multiple_files_multiple_products/SISin200001010000004231000101MA_1_2600_2600.nc
data/multiple_files_multiple_products/SISin202001010000004231000101MA.nc
data/multiple_files_multiple_products/SRImm201301010000003231000101MA.nc

Judging from the file names, we expect at least two different chunking shapes. Let's first try the files that match a common pattern:

cd data/multiple_files_multiple_products/
rekx shapes . --pattern "SIS*MA.nc" --validate-consistency
✓ Variables are consistently shaped across all files!

Indeed, the requested files are chunked identically. What about the other file, SISin200001010000004231000101MA_1_2600_2600.nc?

cd data/multiple_files_multiple_products/
rekx shapes . --validate-consistency
🗴 Variables are not consistently shaped across all files!

Voilà, this is a no!
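To see which files a pattern such as "SIS*MA.nc" selects, the sketch below applies shell-style wildcard matching to the three file names listed above. It assumes that --pattern follows the usual glob semantics, which this page does not state explicitly; `select` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# File names as listed in data/multiple_files_multiple_products/ above.
filenames = [
    "SISin200001010000004231000101MA_1_2600_2600.nc",
    "SISin202001010000004231000101MA.nc",
    "SRImm201301010000003231000101MA.nc",
]


def select(filenames, pattern):
    """Keep the names matching a shell-style pattern (case-sensitive)."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


print(select(filenames, "SIS*MA.nc"))  # ['SISin202001010000004231000101MA.nc']
print(len(select(filenames, "*.nc")))  # 3
```

Under these assumptions, "SIS*MA.nc" leaves out the file ending in _1_2600_2600.nc, which is the one with the differing SIS chunking shape. Scanning without a pattern brings it back in, so the check fails.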