
Warning

To sort out!

What is Chunking?

Chunking splits the data of a variable into blocks, so that subsets of the data can be read and written efficiently.

By default, the data of a variable is stored contiguously on disk. In NetCDF/HDF5 files, however, the physical storage can instead be chunked, i.e. split into fixed, equal-sized pieces.

Original

| Hex                         | Text                      |
|-----------------------------|---------------------------|
| 01 02 03 04 05 06 07 08 09  | 1, 2, 3, 4, 5, 6, 7, 8, 9 |

Chunked

| Hex | Text | Size | Number |
|-----|------|------|--------|
| AA 05 01 01 02 03 04 05 AA 05 01 06 07 08 09 - | [1, 2, 3, 4, 5] [6, 7, 8, 9, -] | 5 | 2 |
| AA 04 01 01 02 03 04 AA 04 01 05 06 07 08 AA 04 01 09 - - - | [1, 2, 3, 4] [5, 6, 7, 8] [9, -, -, -] | 4 | 3 |
| AA 03 01 01 02 03 AA 03 01 04 05 06 AA 03 01 07 08 09 | [1, 2, 3] [4, 5, 6] [7, 8, 9] | 3 | 3 |
| AA 02 01 01 02 AA 02 01 03 04 AA 02 01 05 06 AA 02 01 07 08 AA 02 01 09 - | [1, 2] [3, 4] [5, 6] [7, 8] [9, -] | 2 | 5 |

AA 0? 01 sequences mark the start of a chunk
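As a minimal illustration of the layout above, here is a sketch in plain Python that splits a flat sequence of values into equal-sized chunks, using `-` as a pad value (the `chunk` function is purely illustrative, not part of any library):

```python
def chunk(values, size, pad="-"):
    """Split a flat list into equal-sized chunks, padding the last chunk."""
    padded = values + [pad] * (-len(values) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]

chunk(list(range(1, 10)), 5)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, '-']]
chunk(list(range(1, 10)), 2)  # [[1, 2], [3, 4], [5, 6], [7, 8], [9, '-']]
```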

Why?

While chunking does not affect the logical view of a variable's data, it can heavily impact read/write operations. Chunking can optimize read operations for accessing data in various ways:

- by rows
- by columns
- as a rectangular subgrid

The idea is to minimize the number of disk accesses.

How?

By aligning chunk sizes/borders with the most common or preferred data access patterns.

Good chunking shapes

  • For 2D data, rectangular chunks help balance disk access times for both row-wise and column-wise access (see the sketch after this list).

  • For 3D data or higher, the chunking strategy may need to be adjusted based on the most common access patterns.
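A minimal sketch of this balancing act: counting how many chunks an aligned rectangular read touches, for a hypothetical 10 000 × 10 000 array (the function name and the numbers are illustrative, not from any library):

```python
import math

def chunks_touched(read_shape, chunk_shape):
    """Number of chunks an aligned rectangular read touches."""
    return math.prod(math.ceil(r / c) for r, c in zip(read_shape, chunk_shape))

# hypothetical 10_000 x 10_000 array
chunks_touched((1, 10_000), (1, 10_000))  # row chunks: 1 chunk per full-row read
chunks_touched((10_000, 1), (1, 10_000))  # ... but 10_000 chunks per full-column read
chunks_touched((1, 10_000), (100, 100))   # square chunks: 100 chunks per row ...
chunks_touched((10_000, 1), (100, 100))   # ... and 100 per column, balanced
```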

Optimal layout?

An algorithm is discussed in "Chunking Data: Choosing Shapes" by Russ Rew: chunk_shape_3D.py

See also

An algebraic formulation for optimal chunking is based on equalizing the number of chunks accessed for a 1D time series and for a 2D horizontal slice of a 3D dataset.

Note

rekx prefers to avoid the word optimal as far as possible. There is no one-size-fits-all chunking shape; hence, it may be more appropriate to speak of good, appropriate, or preferred chunking.

Let \(D\) be the number of values you want in a chunk, and \(c\) a scaling factor derived from \(D\) and the dimensions of the array. For example, for a 4D array with dimensions \(25256 \times 37 \times 256 \times 512\), the scaling factor is given by the formula:

\[c = \left( \frac{D}{25256 \times 37 \times 256 \times 512} \right)^{1/4}\]

The resulting chunk shape is obtained by multiplying each dimension size by \(c\) and truncating to an integer. The chunk shape will thus be:

\[\text{chunk shape} = \left( \left\lfloor 25256 \times c \right\rfloor, \left\lfloor 37 \times c \right\rfloor, \left\lfloor 256 \times c \right\rfloor, \left\lfloor 512 \times c \right\rfloor \right)\]

This formula assumes that the optimal chunk shape will distribute the chunks equally along each dimension, and the scaling factor \(c\) is calculated to ensure the total number of values in a chunk is close to \(D\), without exceeding it.
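A minimal sketch of this computation, generalized to any number of dimensions (the function name is hypothetical, and this is not the chunk_shape_3D.py algorithm itself, which additionally accounts for disk block sizes):

```python
import math

def scaled_chunk_shape(dims, D):
    """Chunk shape from the scaling factor c = (D / prod(dims)) ** (1 / ndim)."""
    c = (D / math.prod(dims)) ** (1 / len(dims))
    return tuple(max(1, int(dim * c)) for dim in dims)

scaled_chunk_shape((25256, 37, 256, 512), 1_000_000)
# (1350, 1, 13, 27) -> 473_850 values per chunk, close to and below D
```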

Size and number of chunks

Tip

What matters is the total number of chunks.

Chunks affect both memory usage and processing efficiency. An important distinction for efficient data handling is between the size and the number of chunks in a dataset. The chunks keyword in NetCDF, Xarray, and similar libraries specifies the size of each chunk along a dimension, not the number of chunks! Thus, the smaller the chunk size, the larger the number of chunks, and vice versa.

Consequently, decreasing the size of chunks will increase the number of chunks, which in turn can lead to higher memory overhead due to chunk bookkeeping, rather than to higher direct memory consumption of the data.

| Chunking shape | Number of elements in a chunk | Number of chunks |
|----------------|-------------------------------|------------------|
| {'time': 1, 'y': 768, 'x': 922} | 708,096 (smaller) | Larger |
| {'time': 168, 'y': 384, 'x': 288} | 18,579,456 (larger) | Smaller |
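For instance, a minimal sketch of specifying the first chunking shape from the table when opening a dataset with Xarray (assuming dask is installed; the file name data.nc and the dimension names are assumptions):

```python
import xarray as xr

# chunks gives the size of each chunk per dimension, not the number of chunks
ds = xr.open_dataset("data.nc", chunks={"time": 1, "y": 768, "x": 922})
print(ds.chunks)  # the resulting chunk layout per dimension
```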

While a larger number of smaller chunks might increase overhead, it does not necessarily mean a higher memory footprint. Memory usage is not exclusively determined by the size or number of chunks; rather, it depends on how many chunks are loaded into memory at once. Hence, even a large number of small chunks won't necessarily increase the total memory used by the data, as long as not all chunks are loaded into memory simultaneously.

Memory usage depends on the number of chunks loaded into memory at once.

The focus is thus on the efficiency of data processing and access, rather than on the memory usage implied by the number and size of the chunks.

Compression

  • Chunking splits the data into equal-sized blocks or chunks of a pre-defined size
  • Only the chunks of data required are accessed
  • The HDF5 file format stores compressed data in chunks
  • A chunk is the atomic unit of compression as well as of disk access
  • Thus, compressed data is necessarily chunked
  • Rechunking compressed data involves several steps (sketched after this list):

    read \(\rightarrow\) uncompress \(\rightarrow\) rechunk \(\rightarrow\) recompress \(\rightarrow\) write new chunks

  • rechunking compressed data can sometimes be faster due to savings in disk I/O!
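A sketch of that pipeline using Xarray, which performs the uncompress/recompress steps transparently (the file names, variable name t2m, and chunk shapes here are assumptions):

```python
import xarray as xr

# read (decompression happens transparently on access)
ds = xr.open_dataset("input.nc", chunks={"time": 1, "y": 768, "x": 922})

# rechunk in memory, then recompress and write the new chunks to disk
encoding = {"t2m": {"zlib": True, "complevel": 4, "chunksizes": (168, 384, 288)}}
ds.chunk({"time": 168, "y": 384, "x": 288}).to_netcdf("rechunked.nc", encoding=encoding)
```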

Chunking is required for:

  • compression and other filters
  • creating extendible datasets (i.e. with an unlimited dimension)
  • subsetting very large datasets to improve performance
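As an illustration of these requirements, a minimal netCDF4-python sketch creating a compressed variable with an unlimited dimension, which is therefore necessarily chunked (the file name, variable name, and sizes are illustrative):

```python
import netCDF4

nc = netCDF4.Dataset("compressed.nc", mode="w")
nc.createDimension("time", None)  # unlimited dimension: chunking is mandatory
nc.createDimension("y", 768)
nc.createDimension("x", 922)
nc.createVariable(
    "t2m", "f4", ("time", "y", "x"),
    zlib=True, complevel=4,    # compression is applied chunk by chunk
    chunksizes=(1, 768, 922),  # one 2D horizontal slice per chunk
)
nc.close()
```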

While chunking can improve performance for large datasets, using a chunking layout without considering the consequences of the chunk size can lead to poor performance. Unfortunately, it is easy to end up with a random and inefficient chunking layout due to ...

Caching

To Do

Discuss caching in HDF5!

Problems

Issues that can cause performance problems with chunking include:

  • Very small chunks can result in a very large number of chunks, which can degrade access performance.

    • The smaller the chunk size, the more chunks HDF5 has to keep track of, and the more time it takes to search for a chunk.
  • Very large chunks need to be read and uncompressed entirely before any access operation.

    • There can be a performance penalty for reading a small subset, if the chunk size is substantially larger than the subset. Also, a dataset may be larger than expected if there are chunks that only contain a small amount of data.
  • A chunk cache that is smaller than the chunk size (see the sketch after this list).

    • A chunk does not fit in the Chunk Cache. Every chunked dataset has a chunk cache associated with it that has a default size of 1 MB. The purpose of the chunk cache is to improve performance by keeping chunks that are accessed frequently in memory so that they do not have to be accessed from disk.
    • If a chunk is too large to fit in the chunk cache, it can significantly degrade performance.
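A hedged sketch of working around the 1 MB default with h5py, raising the raw data chunk cache when opening a file (the file name data.h5 and dataset name t2m are assumptions):

```python
import h5py

# raise the per-dataset chunk cache from the 1 MB default to 64 MB,
# so that chunks larger than 1 MB can still be cached in memory
with h5py.File("data.h5", "r", rdcc_nbytes=64 * 1024**2) as h5:
    dset = h5["t2m"]    # hypothetical dataset name
    print(dset.chunks)  # the dataset's chunk shape
```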

Recommendations

It is a good idea to:

  • Avoid very small chunk sizes
  • Be aware of the 1 MB chunk cache size default
  • Test the data with different chunk sizes to determine a suitable chunk size to use (see the sketch after this list).
  • Consider the chunk size in terms of the most common access patterns for the data.
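For example, a rough sketch for timing a common access pattern against files written with different chunking shapes (the file names, variable name, and access pattern are assumptions; see the rechunking sketch above for producing such files):

```python
import time
import xarray as xr

# files pre-rechunked to different on-disk chunking shapes
for path in ("chunked_per_timestep.nc", "chunked_per_block.nc"):
    ds = xr.open_dataset(path)
    start = time.perf_counter()
    ds["t2m"].isel(time=0).load()  # a common access pattern: one horizontal slice
    ds.close()
    print(path, time.perf_counter() - start)
```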

Resources

See also:

Information regarding chunking and various data-structure characteristics is largely sourced from resources provided by Unidata, the developer and maintainer of NetCDF and an authoritative source on the subject.


  1. Unidata. NetCDF-4 Chunking - Workshop 2011. https://www.unidata.ucar.edu/software/netcdf/workshops/2011/nc4chunking/. Accessed: 10.01.2024.