
Massive JSON Kerchunking

Experimental

The following is an incomplete experiment, worth revisiting eventually.

Overview

  • Creating the SARAH3 daily NetCDF reference files can take 4+ hours
  • Optimising the chunking scheme can reduce this time

Input

  • Daily NetCDF files from 1999 to 2021
    (one year is missing; otherwise the series would extend to 2022), about 1.2T in total

    • Each daily NetCDF file contains 48 half-hourly maps of 2600 x 2600 pixels
    • Noted here: mixed chunking shapes between years
      (e.g. time x lat x lon: 1 x 2600 x 2600, 1 x 1300 x 1300, and maybe more)
    • Rechunked to 1 x 32 x 32, which grows the total size to about 1.25T
  • A first set of JSON reference files (one reference file per rechunked input NetCDF file) amounts to ~377G.

  • A second step aggregates these into yearly JSON reference files (24 in total for a complete series), amounting to ~300G; both steps are sketched right after this list

  • Finally, the goal is to create a single reference file covering the complete time series
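
Roughly, the two reference-generation steps look like the minimal sketch below, assuming kerchunk's SingleHdf5ToZarr and MultiZarrToZarr APIs. The data/ paths, file-name patterns, and the inline_threshold / identical_dims choices are illustrative, not taken from the actual run.

```python
import glob

import fsspec
import ujson
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Step 1: one JSON reference file per rechunked daily NetCDF file.
for path in sorted(glob.glob("data/*.nc")):
    with fsspec.open(path) as f:
        # Scan the HDF5 layout and record a byte-range reference for
        # every chunk; chunks below inline_threshold are embedded.
        refs = SingleHdf5ToZarr(f, path, inline_threshold=300).translate()
    with open(path.replace(".nc", ".json"), "w") as out:
        ujson.dump(refs, out)

# Step 2: merge each year's daily reference files along the time axis.
for year in range(1999, 2022):
    daily = sorted(glob.glob(f"data/*{year}*.json"))
    if not daily:  # the one missing year
        continue
    mzz = MultiZarrToZarr(
        daily,
        concat_dims=["time"],
        identical_dims=["lat", "lon"],  # coordinates shared by all files
    )
    with open(f"sarah3_sid_reference_{year}.json", "w") as out:
        ujson.dump(mzz.translate(), out)
```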

Hardware

 free -hm
              total        used        free      shared  buff/cache   available
Mem:          503Gi       4.7Gi       495Gi       2.8Gi       3.1Gi       494Gi
Swap:            0B          0B          0B

Trials

13G Nov  2 10:07 sarah3_sid_reference_1999.json
13G Nov  2 09:58 sarah3_sid_reference_2000.json
13G Nov  2 11:00 sarah3_sid_reference_2001.json
13G Nov  2 11:08 sarah3_sid_reference_2002.json
13G Nov  2 12:04 sarah3_sid_reference_2003.json
13G Nov  2 12:12 sarah3_sid_reference_2004.json
13G Nov  2 13:07 sarah3_sid_reference_2005.json
13G Nov  2 14:29 sarah3_sid_reference_2006.json
13G Nov  2 15:27 sarah3_sid_reference_2007.json
13G Nov  2 16:45 sarah3_sid_reference_2008.json
13G Nov  2 17:43 sarah3_sid_reference_2009.json
13G Nov  2 19:02 sarah3_sid_reference_2010.json
13G Nov  2 19:58 sarah3_sid_reference_2011.json
13G Nov  2 21:25 sarah3_sid_reference_2012.json
13G Nov  2 22:13 sarah3_sid_reference_2013.json
13G Nov  2 23:43 sarah3_sid_reference_2014.json
13G Nov  3 00:36 sarah3_sid_reference_2015.json
13G Nov  3 02:03 sarah3_sid_reference_2016.json
13G Nov  3 02:58 sarah3_sid_reference_2017.json
13G Nov  3 04:24 sarah3_sid_reference_2018.json
13G Nov  3 05:21 sarah3_sid_reference_2019.json
13G Nov  3 06:48 sarah3_sid_reference_2020.json
13G Nov  3 07:41 sarah3_sid_reference_2021.json

Trying to combine the above into a single reference set fails with the following error message:

JSONDecodeError: Could not reserve memory block
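
The error presumably comes from the JSON parser itself: each yearly file is ~13G of text that must be decoded into an in-memory dict in one piece before combining. A rough sketch of the failing attempt, with the file names as listed above:

```python
import glob

import ujson
from kerchunk.combine import MultiZarrToZarr

yearly = sorted(glob.glob("sarah3_sid_reference_*.json"))
mzz = MultiZarrToZarr(yearly, concat_dims=["time"])

# translate() loads and parses every input reference file; with ~13G
# per file, the parser presumably fails to allocate and raises the
# "Could not reserve memory block" error above.
combined = mzz.translate()

with open("sarah3_sid_reference.json", "w") as out:
    ujson.dump(combined, out)
```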

Takeaway message

The limiting factor for the size of the reference sets is not the total number of bytes but the total number of references. Hence, the chunking scheme is likely the more important factor here, as the estimate below suggests.
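
For a sense of scale, here is a back-of-the-envelope count of the references implied by the 1 x 32 x 32 chunking, using only the figures from the Input section above:

```python
import math

maps_per_day = 48   # half-hourly maps per daily file
pixels = 2600       # each map is 2600 x 2600 pixels
chunk = 32          # rechunked to 1 x 32 x 32 (time x lat x lon)

chunks_per_map = math.ceil(pixels / chunk) ** 2  # 82 * 82 = 6724
refs_per_day = maps_per_day * chunks_per_map     # 322,752
refs_per_year = 365 * refs_per_day               # ~117.8 million
refs_total = 23 * refs_per_year                  # ~2.7 billion (23 years listed)

print(f"{refs_per_year:,} references per year, {refs_total:,} in total")
```

At billions of key-value entries, a single JSON document cannot realistically be parsed in memory regardless of the hardware above; fewer, larger chunks would shrink the reference count quadratically.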