Skip to content

Fix Me

Explain better the idea, the commands and the example !

Chunking parameterisation

Combining parameters

Identifying the appropriate configuration for data stored in NetCDF files, involves systematic experimentation with various structural parameters. The rechunk-generator command varies structural data parameters and options for NetCDF files that influence the file size and data access speed and generates a series of nccopy commands for rechunking NetCDF files. The list of commands can be fed to GNU Parallel which will take care to run them in parallel. Subsequently, rekx inspect can report on the new data structures and the average time it takes to retrieve data over a geographic location.

Given the initial NetCDF file SISin202001010000004231000101MA.nc, we can use rekx rechunk-generator to generate a series of experimental nccopy commands and save them in a file prefixed with rechunk_commands_for_ followed after the name of the input NetCDF file :

 rekx rechunk-generator SISin202001010000004231000101MA.nc rechunking_test --time 48 --latitude 64,128,256 --longitude 64,128,256 --compression-level 0,3,6,9 -v --shuffling --memory

Example

In this example, we ask for possible chunk sizes for :

  • time to be \(48\), that is only one size

  • latitude and longitude sizes \(64\), \(128\) and \(256\)

  • compression-levels \(0\), \(3\), \(64\), and \(9\)

In addition, we ask for --shuffling. Since shuffling wouldn't make sense for uncompressed data, rekx takes care to only add it along with compression levels greater than 0.

There are more options and we can list them with the typical --help option.

The above command will generate

Writing generated commands into rechunk_commands_for_SISin202001010000004231000101MA.txt
nccopy -c time/48,lat/64,lon/64 -d 0  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_0.nc
nccopy -c time/48,lat/64,lon/64 -d 3 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_3_shuffled.nc
nccopy -c time/48,lat/64,lon/64 -d 3  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_3.nc
nccopy -c time/48,lat/64,lon/64 -d 6 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_6_shuffled.nc
nccopy -c time/48,lat/64,lon/64 -d 6  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_6.nc
nccopy -c time/48,lat/64,lon/64 -d 9 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_9_shuffled.nc
nccopy -c time/48,lat/64,lon/64 -d 9  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_64_64_zlib_9.nc
nccopy -c time/48,lat/128,lon/128 -d 0  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_0.nc
nccopy -c time/48,lat/128,lon/128 -d 3 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_3_shuffled.nc
nccopy -c time/48,lat/128,lon/128 -d 3  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_3.nc
nccopy -c time/48,lat/128,lon/128 -d 6 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_6_shuffled.nc
nccopy -c time/48,lat/128,lon/128 -d 6  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_6.nc
nccopy -c time/48,lat/128,lon/128 -d 9 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_9_shuffled.nc
nccopy -c time/48,lat/128,lon/128 -d 9  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_128_128_zlib_9.nc
nccopy -c time/48,lat/256,lon/256 -d 0  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_0.nc
nccopy -c time/48,lat/256,lon/256 -d 3 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_3_shuffled.nc
nccopy -c time/48,lat/256,lon/256 -d 3  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_3.nc
nccopy -c time/48,lat/256,lon/256 -d 6 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_6_shuffled.nc
nccopy -c time/48,lat/256,lon/256 -d 6  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_6.nc
nccopy -c time/48,lat/256,lon/256 -d 9 -s -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_9_shuffled.nc
nccopy -c time/48,lat/256,lon/256 -d 9  -h 16777216 -e 4133 -w SISin202001010000004231000101MA.nc rechunking_test/SISin202001010000004231000101MA_48_256_256_zlib_9.nc

Process in parallel

We let the mighty GNU Parallel execute these commands in parallel

parallel < rechunk_commands_for_SISin202001010000004231000101MA.txt

While the output comprises as many new NetCDF files as the nccopy commands, for the sake of showcasing rekx, let us only inspect new NetCDF files whose filename contains the string 256 :

 rekx inspect rechunking_test --repetitions 3 --humanize --long-table --variable-set data --pattern "*256*"

The above command will return

  Name                   Size        Dimensions            Variable   Shape              Chunks           Cache      Elements   Preemption   Type    Scale   Offset   Compression     Level   Shuffling   Read Time
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  SISin20200101000000…   726.1 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -                        0       False       0.024
                                     2600
  SISin20200101000000…   137.6 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib            6       False       0.025
                                     2600
  SISin20200101000000…   137.4 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib            9       False       0.030
                                     2600
  SISin20200101000000…   123.7 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib, shuffle   3       True        0.032
                                     2600
  SISin20200101000000…   140.1 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib            3       False       0.032
                                     2600
  SISin20200101000000…   120.9 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib, shuffle   6       True        0.034
                                     2600
  SISin20200101000000…   119.8 MiB   2 x 48 x 2600 x       SIS        48 x 2600 x 2600   48 x 256 x 256   16777216   4133       0.75         int16   -       -        zlib, shuffle   9       True        0.034
                                     2600

                                               ^ Dimensions: lat x time x lon x bnds * Cache: Size in bytes, Number of elements, Preemption strategy ranging in [0, 1]

Info

Note, the reported reading times are averages of repeated reads of the data in the memory to ensure we are really retrieving data values! Look for the repetitions parameter in rekx inspect --help.

The output reports on dataset structure, chunking, compression levels, and the average time to read data over a geographic location. Analysing such results, can guide us in choosing an effective chunking shape and compression strategy in order to optimize our data structure.

To get a machine readable output of such an analysis, rekx can write this out in a CSV file via the --csv option.

Complete Me

Complete documentation of this example!

See also