Skip to content

Kerchunking to Parquet

Proof-of-Concept with an issue pending software updates

The example works with not-yet-released versions of :

Example data

This goes on with the same example data as in Kerchunking to JSON.

ls -1
SISin202001010000004231000101MA.nc
SISin202001020000004231000101MA.nc
SISin202001030000004231000101MA.nc
SISin202001040000004231000101MA.nc

Reference to Parquet store

We create Parquet stores using the Kerchunk engine

rekx reference-multi-parquet . -v
Creating the following Parquet stores in . :
  SISin202001020000004231000101MA.parquet
  SISin202001030000004231000101MA.parquet
  SISin202001040000004231000101MA.parquet
  SISin202001010000004231000101MA.parquet
Done!

Let's check for the new files :

ls -1
SISin202001010000004231000101MA.nc
SISin202001010000004231000101MA.parquet
SISin202001020000004231000101MA.nc
SISin202001020000004231000101MA.parquet
SISin202001030000004231000101MA.nc
SISin202001030000004231000101MA.parquet
SISin202001040000004231000101MA.nc
SISin202001040000004231000101MA.parquet

There is one .parquet store for each input file.

Combine references

We then combine the multiple Parquet stores into a single one

rekx combine-parquet-stores . -v
Combined reference name : combined_kerchunk.parquet

We verify the new file is there :

ls -1tr
SISin202001030000004231000101MA.nc
SISin202001010000004231000101MA.nc
SISin202001020000004231000101MA.nc
SISin202001040000004231000101MA.nc
SISin202001010000004231000101MA.parquet
SISin202001030000004231000101MA.parquet
SISin202001020000004231000101MA.parquet
SISin202001040000004231000101MA.parquet
combined_kerchunk.parquet

Verify

Does it work ? We verify the aggregated Parquet store is readable

rekx read-performance combined_kerchunk.parquet SIS 8 45 -v
Data read in memory in : 0.132 โšกโšก

read-performance won't show more than just the time it took to load the data in memory. Let's go a step further and print out the values :

rekx select-parquet combined_kerchunk.parquet SIS 8 45
๐Ÿ—ด Something went wrong in selecting the data : "not all values found in index 'lon'. Try setting the
`method` keyword argument (example: method='nearest')."

Error

No panic! The above error is a good sign actually since there is no exact pair of coordinates at longitude, latitude : (8, 45) over which location to retrieve data.

Let's get the closest pair of coordinates that really exists in the data by instructing the --neighbor-lookup nearest option :

rekx select-parquet combined_kerchunk.parquet SIS 8 45 --neighbor-lookup nearest
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46.0, 114.0, 179.0, 238.0,
290.0, 333.0, 359.0, 379.0, 377.0, 372.0, 344.0, 306.0, 262.0, 206.0, 137.0, 69.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46.0, 110.0, 175.0, 231.0, 291.0, 332.0, 356.0, 378.0, 376.0, 370.0,
344.0, 308.0, 260.0, 203.0, 137.0, 69.0, 7.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 61.0,
74.0, 112.0, 142.0, 162.0, 185.0, 251.0, 251.0, 176.0, 152.0, 136.0, 111.0, 84.0, 65.0, 44.0, 3.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46.0, 105.0, 173.0, 236.0, 259.0, 322.0, 371.0, 373.0, 382.0,
358.0, 347.0, 311.0, 267.0, 205.0, 147.0, 74.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0

or add -v for a Xarray-styled output

rekx select-parquet combined_kerchunk.parquet SIS 8 45 --neighbor-lookup nearest -v
โœ“ Coordinates : 8.0, 45.0.
<xarray.DataArray 'SIS' (time: 192)>
array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,  46., 114., 179., 238., 290., 333., 359., 379., 377.,
       372., 344., 306., 262., 206., 137.,  69.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,  46., 110., 175., 231., 291., 332., 356., 378., 376.,
       370., 344., 308., 260., 203., 137.,  69.,   7.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,  13.,  61.,  74., 112., 142., 162., 185., 251., 251.,
       176., 152., 136., 111.,  84.,  65.,  44.,   3.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,  46., 105., 173., 236., 259., 322., 371., 373., 382.,
       358., 347., 311., 267., 205., 147.,  74.,   9.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
      dtype=float32)
Coordinates:
    lat      float32 45.03
    lon      float32 8.025
  * time     (time) datetime64[ns] 2020-01-01 ... 2020-01-04T23:30:00
Attributes:
    cell_methods:   time: point
    long_name:      Surface Downwelling Shortwave Radiation
    standard_name:  surface_downwelling_shortwave_flux_in_air
    units:          W m-2

Now it worked! One more option : let's get a statistical overview instead :

rekx select-parquet combined_kerchunk.parquet SIS 8 45 --neighbor-lookup nearest --statistics
                   Selected series

           Statistic   Value
 โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
               Start   2020-01-01T00:00:00.000000000
                 End   2020-01-04T23:30:00.000000000
               Count   192

                 Min   0.0
     25th Percentile   0.0
                Mean   72.97396
              Median   0.0
                Mode   0.0
                 Max   382.0
                 Sum   14011.0
            Variance   14933.45
  Standard deviation   122.2025

         Time of Min   2020-01-01T00:00:00.000000000
        Index of Min   0
         Time of Max   2020-01-04T11:30:00.000000000
        Index of Max   167

                     Caption text