one element per chunk


one element per chunk

Efim Dyadkin-2

Hi,

 

I need to implement a storage for data with the following properties:

1) a multi-dimensional, unlimited-size dataset of variable-length records

2) it may be highly sparse

3) it is usually accessed randomly, one record at a time

4) each record may vary in size from tens of kilobytes to tens of megabytes

 

I am thinking of an unlimited, chunked dataspace. However, to make it efficient in terms of disk space and access time, I would need my chunks to be as small as one element. Could you please save me a performance test and tell me whether such a configuration is practical with HDF5?

 

Thanks,

Efim

------------------- This e-mail, including any attached files, may contain confidential and privileged information for the sole use of the intended recipient. Any review, use, distribution, or disclosure by others is strictly prohibited. If you are not the intended recipient (or authorized to receive information for the intended recipient), please contact the sender by reply e-mail and delete all copies of this message.
_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: one element per chunk

JohnReadey

Hi Efim,

 

   Unfortunately, chunking + compression doesn't really help much with variable-length datatypes. A variable-length dataset consists of an array of heap pointers, so the bulk of the data doesn't participate in any compression.

 

    On the other hand, your record size is large enough that you could set up your storage as a collection of scalar datasets. Since there is just one element per dataset, you can make the datatype whatever the size of the record is and use a compression filter. So rather than accessing a row via an index into a dataset, you'd access a dataset via a link name (which could just be a stringified version of a numeric index).

 

   If you go this route, use the "libver=latest" option when opening the file. Recent changes in the file format have made accessing objects in a large group collection much more efficient.
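
   A minimal sketch of this approach in h5py, assuming records are opaque byte strings; the helper names `put_record`/`get_record` and the "one dataset per record, named by stringified index" layout are illustrative, not a fixed API:

```python
import numpy as np
import h5py

def put_record(f, index, payload):
    # One 1-D byte dataset per record; chunked layout is required
    # for compression filters, so use one chunk covering the record.
    data = np.frombuffer(payload, dtype=np.uint8)
    f.create_dataset(str(index), data=data,
                     chunks=(len(data),), compression="gzip")

def get_record(f, index):
    # Look the record up by its link name rather than a row index.
    return f[str(index)][...].tobytes()

# libver="latest" enables the newer group format, which handles
# very large numbers of links far more efficiently.
with h5py.File("records.h5", "w", libver="latest") as f:
    put_record(f, 42, b"x" * 10_000)
    assert get_record(f, 42) == b"x" * 10_000
```

   Sparsity comes for free here: an absent index is simply a dataset that was never created, so no space is reserved for missing records.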

 

John

 

From: Hdf-forum <[hidden email]> on behalf of Efim Dyadkin <[hidden email]>
Reply-To: HDF Users Discussion List <[hidden email]>
Date: Monday, June 5, 2017 at 3:40 PM
To: "[hidden email]" <[hidden email]>
Subject: [Hdf-forum] one element per chunk

 
