Questions regarding H5Dcreate_anon() and reclaiming unused disk file space


Questions regarding H5Dcreate_anon() and reclaiming unused disk file space

Kevin B. McCarty
Hi,

I have some questions regarding H5Dcreate_anon() as implemented in
version 1.8.18 of the HDF5 library...

I'd like to use this function to create a temporary test dataset.  If
it meets a certain condition, which I basically can't evaluate until
the test dataset has been completely written to disk, I'll make it
accessible in the HDF5 file on disk with H5Olink().  Otherwise I'll
discard the temporary dataset and try again with relevant changes.
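
To make the intent concrete, here is a rough, untested sketch of the
workflow I have in mind (the file name, dataset shape, chunk size,
link path and the "keep" test are all just placeholders):

    #include "hdf5.h"

    /* Sketch only: error checking omitted. */
    int main(void)
    {
        hid_t   file = H5Fopen("data.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        hsize_t dims[1]  = {1000000};
        hsize_t chunk[1] = {4096};

        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);

        /* Anonymous dataset: space is allocated in the file, but no
         * link to it exists yet. */
        hid_t dset = H5Dcreate_anon(file, H5T_NATIVE_DOUBLE, space,
                                    dcpl, H5P_DEFAULT);

        /* ... write the data chunk by chunk, then evaluate the test ... */
        int keep = 0;   /* placeholder: known only after writing */

        if (keep)
            H5Olink(dset, file, "/accepted", H5P_DEFAULT, H5P_DEFAULT);
        /* If no link is made, the dataset is unreachable once the
         * handles are closed. */

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }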

I'd like to be certain of two things that are needed for this approach
to work well:

1) Does the dataset generated by H5Dcreate_anon() actually exist
(transiently) on-disk, rather than being a clever wrapper for some
memory buffer?  I am generating the dataset chunked and writing it out
chunk-by-chunk, so insufficient RAM isn't a problem UNLESS there is a
concern with using H5Dcreate_anon() for a dataset too large to fit in
memory at once.

2) I understand that "normal" H5Dcreate() and a dataset write,
followed sometime later by H5Ldelete(), can end up (in 1.8.18)
resulting in wasted space in the file on disk.  Can wasted space be
produced similarly by H5Dcreate_anon() when no later call to H5Olink()
is made?  [Assume that H5Dclose() gets properly called.]  I'm hoping
not ... ?

Thanks in advance for info on this subject!


Regarding the wasted space, I have secondary questions.

3) I know that h5repack can be used to produce a new file without
wasted space.  But without h5repack, would the creation of more
datasets in the same file (with library version 1.8.18) re-use that
wasted disk space when possible?

4) There are apparently some mechanisms in 1.10.x for managing /
reclaiming wasted space on disk in HDF5 files?  Does it happen
automatically upon any call to H5Ldelete() with the 1.10.x library, or
are some additional function calls needed?  I can't really find
anything in the docs about this so a pointer would be much
appreciated.  (As noted on this list previously, my employer can't
upgrade to 1.10.x until there is a way to produce 1.8.x backwards
compatible output, but eventually I guess we'll all get there...)

Thanks again,

--
Kevin B. McCarty
<[hidden email]>


fully dynamic construction of HDF Table dataset

Rafal Lichwala
Hi,

I've read many examples for both the H5TB high-level API and the
low-level API for compound HDF data types, but I haven't found a good
solution for my particular use case. All of those examples share one
problematic assumption: the data structure (the number of fields,
their types and their values) must be known a priori. That's the
problem in my case: I don't know this structure in advance, and I
need to build an HDF table dataset not only row-by-row, but also
field-by-field within each row.

I need your advice on how to achieve this using a proper sequence of
HDF API calls.

Let's say my final HDF table will look like this:
['a', 1, 3.14]
['b', 2, 2.11]
['c', 3, 1.89]

So we simply have an HDF table with 3 columns of types char, int and
float, and 3 rows with some values.

Creation of that table must be divided into some "steps".
After 1st "step" I should have a table:
['a']

After 2nd step:
['a', 1]

After 3rd step:
['a', 1, 3.14]

After 4th step:
['a', 1, 3.14]
['b', x, x]

where the x values after the 4th step are undefined and can be some
default values that will be overwritten in the next steps.

How can I achieve this use case?

Is it possible to create a table by calling H5TBmake_table() with no
fields and no records at the beginning, and then just call
H5TBinsert_field() in the following steps?

Is it possible to pass NULL as the "data" argument of
H5TBinsert_field() when inserting a new field into a table dataset
that has no records yet?

What about the 4th step - can I set just the first column's value for
a new record in the table?

I know it's maybe a strange use case, but the problem is that I could
have a really huge structure model (a lot of columns and a lot of
records) which should be stored in the HDF table dataset, so I need
to avoid "collecting" the required information (number of fields,
their types, their values) by first iterating over the whole
structure.
The second problem is that I have a vector of objects which needs to
be stored as an HDF table (where a table row is a given object and
the columns are its fields), but all the examples I've seen work only
on C structs.

I would appreciate any advice!

Regards,
Rafal



Re: fully dynamic construction of HDF Table dataset

Walter Landry
Rafal Lichwala <[hidden email]> wrote:

> Let's say my final HDF table will look like this:
> ['a', 1, 3.14]
> ['b', 2, 2.11]
> ['c', 3, 1.89]
>
> So we simply have a HDF table with 3 columns of types: char, int,
> float
> and 3 rows with some values.
>
> Creation of that table must be divided into some "steps".
> After 1st "step" I should have a table:
> ['a']
>
> After 2nd step:
> ['a', 1]
>
> After 3rd step:
> ['a', 1, 3.14]
>
> After 4th step:
> ['a', 1, 3.14]
> ['b', x, x]
>
> where x after 4th step is undefined and can be some default values
> which will be overwritten in the next steps.
>
> How to achieve that use case?

I have to do something similar for my program tablator

  https://github.com/Caltech-IPAC/tablator

I read in tables in other formats and write out as HDF5.  So I do not
know the types of the rows at compile time.  It is all in C++.  The
details of writing HDF5 are in

  src/Table/write_hdf5/

> Is it possible to create a table by calling H5TBmake_table(), but
> having no fields and no records at the beginning and then just call
> H5TBinsert_field() in the next steps?

I do not think that is going to work, because you need to know the
sizes of rows when you create the table.

> Is it possible to have "data" attribute of H5TBinsert_field() function
> a NULL value when we insert a new field to a table dataset with no
> records yet?
>
> What about 4th step - can I create just a first column value for a new
> record in a table?

I do not know of a way to do that.  I would end up creating a whole
new table with the new field.  You can then populate the empty fields
with appropriate default values.

> I know it's maybe a strange use case, but the problem is that I could
> have really huge structure model (a lot of columns and a lot of
> records) which should be stored in the HDF table dataset, so I need to
> avoid "collecting" required information (number of fields, their
> types, values) by initial iterating over whole structure.
> The second problem is that I have a vector of objects which need to be
> stored as HDF table (where table row is the given object and columns
> are its fields), but all examples I've seen just work on C struct.

That sounds similar to the internal data structure I use in tablator.

Hope that helps,
Walter Landry



Re: Questions regarding H5Dcreate_anon() and reclaiming unused disk file space

Quincey Koziol
In reply to this post by Kevin B. McCarty
Hi Kevin,

> On Aug 22, 2017, at 7:13 PM, Kevin B. McCarty <[hidden email]> wrote:
>
> Hi,
>
> I have some questions regarding H5Dcreate_anon() as implemented in
> version 1.8.18 of the HDF5 library...
>
> I'd like to use this function to create a temporary test dataset.  If
> it meets a certain condition, which I basically can't determine until
> writing of the test dataset to disk is finished, I'll make it
> accessible in the HDF5 file on disk with H5Olink().  Otherwise I'll
> discard the temporary dataset and try again with relevant changes.
>
> I'd like to be certain of two things that are needed for this approach
> to work well:
>
> 1) Does the dataset generated by H5Dcreate_anon() actually exist
> (transiently) on-disk, rather than being a clever wrapper for some
> memory buffer?  I am generating the dataset chunked and writing it out
> chunk-by-chunk, so insufficient RAM isn't a problem UNLESS there is a
> concern with using H5Dcreate_anon() for a dataset too large to fit in
> memory at once.

        Yes, it’s really on disk.

> 2) I understand that "normal" H5Dcreate() and a dataset write,
> followed sometime later by H5Ldelete(), can end up (in 1.8.18)
> resulting in wasted space in the file on disk.  Can wasted space be
> produced similarly by H5Dcreate_anon() when no later call to H5Olink()
> is made?  [Assume that H5Dclose() gets properly called.]  I'm hoping
> not … ?

        Yes, there could be some wasted space in the file with H5Dcreate_anon, although it will be less than what could occur with H5Dcreate.

> Thanks in advance for info on this subject!
>
>
> Regarding the wasted space, I have secondary questions.
>
> 3) I know that h5repack can be used to produce a new file without
> wasted space.  But without h5repack, would the creation of more
> datasets in the same file (with library version 1.8.18) re-use that
> wasted disk space when possible?

        Yes, as long as you don't close & reopen the file.  In the 1.8 release sequence, the free file space info is tracked in memory until the file is closed.  (In the 1.10 sequence, there's a property to request that this information be tracked persistently in the file.)

> 4) There are apparently some mechanisms in 1.10.x for managing /
> reclaiming wasted space on disk in HDF5 files?  Does it happen
> automatically upon any call to H5Ldelete() with the 1.10.x library, or
> are some additional function calls needed?  I can't really find
> anything in the docs about this so a pointer would be much
> appreciated.  (As noted on this list previously, my employer can't
> upgrade to 1.10.x until there is a way to produce 1.8.x backwards
> compatible output, but eventually I guess we'll all get there…)

        Yes, you want the H5Pset_file_space_strategy property (alluded to above).
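
        For example, a minimal sketch of setting it at file creation time (assuming 1.10.1 or later; the strategy, persist flag and threshold values here are only illustrative):

    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);

    /* Track free space with the free-space managers, persist that
     * information in the file, and track freed blocks of any size
     * (threshold of 1 byte). */
    H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_FSM_AGGR,
                               1 /* persist */, (hsize_t)1 /* threshold */);

    hid_t file = H5Fcreate("reclaim.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);

    /* ... create, write and later H5Ldelete() objects; the freed space
     * is recorded and can be reused by later allocations ... */

    H5Pclose(fcpl);
    H5Fclose(file);

        Note that this is a file creation property, so it has to be chosen when the file is first created.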

                Quincey




Re: Questions regarding H5Dcreate_anon() and reclaiming unused disk file space

Kevin B. McCarty
Thank you for the answers, Quincey!

On Wed, Aug 23, 2017 at 8:01 AM, Quincey Koziol <[hidden email]> wrote:

>         Yes, there could be some wasted space in the file with H5Dcreate_anon, although it will be less than what could occur with H5Dcreate.

OK, that is good to know.  I guess I'll have to try some practical
tests to see whether the amount of wasted space is small enough for
this approach to be acceptable to us or not.

[me]

>> 4) There are apparently some mechanisms in 1.10.x for managing /
>> reclaiming wasted space on disk in HDF5 files?  Does it happen
>> automatically upon any call to H5Ldelete() with the 1.10.x library, or
>> are some additional function calls needed?  I can't really find
>> anything in the docs about this so a pointer would be much
>> appreciated.  (As noted on this list previously, my employer can't
>> upgrade to 1.10.x until there is a way to produce 1.8.x backwards
>> compatible output, but eventually I guess we'll all get there…)

>         Yes, you want the H5Pset_file_space_strategy property (alluded to above).

I've found the documentation for this function... I also found a
couple of documents there with more detailed and higher-level
descriptions, and I'll post the link for the sake of anyone else
looking for this info:

https://support.hdfgroup.org/HDF5/docNewFeatures/FileSpace/

I'll spend some time looking at these.  Are these documents accurate
with respect to the library (as of 1.10.1)?  Or is there a more
up-to-date version that I should look at?

Thanks again,

--
Kevin B. McCarty
<[hidden email]>


Re: fully dynamic construction of HDF Table dataset

Rafal Lichwala
In reply to this post by Walter Landry
Hi Walter, hi All,

Thank you for sharing your work.
I've briefly analyzed your code, and it seems you manually create a
dataset with a compound type and then put your values into that
dataset - is that correct?

But that means that for my use case I need to collect all the columns
(and their types) first and then create a compound dataset.
When the number of columns is really huge, this operation can be
time- and resource-consuming. But that's OK if there is no other
solution...

If I understand your code correctly, you collect your columns (which
are separate classes in your case) in vectors and then calculate the
column offsets based on std::vector::data - is that correct?

Are there any other suggestions from the HDF Forum team that could
help with my use case?

Thank you.

Regards,
Rafal

On 2017-08-23 at 16:25, Walter Landry wrote:

> Rafal Lichwala <[hidden email]> wrote:
>> Let's say my final HDF table will look like this:
>> ['a', 1, 3.14]
>> ['b', 2, 2.11]
>> ['c', 3, 1.89]
>>
>> So we simply have a HDF table with 3 columns of types: char, int,
>> float
>> and 3 rows with some values.
>>
>> Creation of that table must be divided into some "steps".
>> After 1st "step" I should have a table:
>> ['a']
>>
>> After 2nd step:
>> ['a', 1]
>>
>> After 3rd step:
>> ['a', 1, 3.14]
>>
>> After 4th step:
>> ['a', 1, 3.14]
>> ['b', x, x]
>>
>> where x after 4th step is undefined and can be some default values
>> which will be overwritten in the next steps.
>>
>> How to achieve that use case?
>
> I have to do something similar for my program tablator
>
>    https://github.com/Caltech-IPAC/tablator
>
> I read in tables in other formats and write out as HDF5.  So I do not
> know the types of the rows at compile time.  It is all in C++.  The
> details of writing HDF5 are in
>
>    src/Table/write_hdf5/
>
>> Is it possible to create a table by calling H5TBmake_table(), but
>> having no fields and no records at the beginning and then just call
>> H5TBinsert_field() in the next steps?
>
> I do not think that is going to work, because you need to know the
> sizes of rows when you create the table.
>
>> Is it possible to have "data" attribute of H5TBinsert_field() function
>> a NULL value when we insert a new field to a table dataset with no
>> records yet?
>>
>> What about 4th step - can I create just a first column value for a new
>> record in a table?
>
> I do not know of a way to do that.  I would end up creating a whole
> new table with the new field.  You can then populate the empty fields
> with appropriate default values.
>
>> I know it's maybe a strange use case, but the problem is that I could
>> have really huge structure model (a lot of columns and a lot of
>> records) which should be stored in the HDF table dataset, so I need to
>> avoid "collecting" required information (number of fields, their
>> types, values) by initial iterating over whole structure.
>> The second problem is that I have a vector of objects which need to be
>> stored as HDF table (where table row is the given object and columns
>> are its fields), but all examples I've seen just work on C struct.
>
> That sounds similar to the internal data structure I use in tablator.
>
> Hope that helps,
> Walter Landry
>



Re: fully dynamic construction of HDF Table dataset

Werner Benger
Hi Rafal,

  it looks like both the dimensions and the structural type of your
dataset are supposed to be dynamic in your use case. That would be
possible, but it would perform very poorly if everything were put
into one dataset that is dynamically updated and reorganized, both in
memory and on disk, whenever new data are inserted.

I'd rather use a group with many datasets in such a case. Each
dataset can have an unlimited dimension in the one direction in which
you append data, but uses only a single type per dataset, not a
compound structure. So when a new column is added, you add a new
dataset. To iterate over the fields of your data type, you iterate
over the containing group and check the type of each dataset there.
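
A rough sketch of what I mean (untested; the names, types and chunk
size are only placeholders):

    hid_t   file = H5Fcreate("table.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
                             H5P_DEFAULT);
    hid_t   grp  = H5Gcreate(file, "/table", H5P_DEFAULT, H5P_DEFAULT,
                             H5P_DEFAULT);

    /* Adding a column == creating one extendible 1-D dataset. */
    hsize_t dims[1] = {0}, maxdims[1] = {H5S_UNLIMITED}, chunk[1] = {1024};
    hid_t   space = H5Screate_simple(1, dims, maxdims);
    hid_t   dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t   col = H5Dcreate(grp, "col_int", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Appending a row to this column: extend by one element and write
       the new slot through a hyperslab selection. */
    hsize_t newsize[1] = {1}, start[1] = {0}, count[1] = {1};
    int     value = 1;
    H5Dset_extent(col, newsize);
    hid_t   fspace = H5Dget_space(col);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t   mspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(col, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, &value);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(col); H5Pclose(dcpl);
    H5Sclose(space);  H5Gclose(grp);    H5Fclose(file);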

Admittedly I have no experience with the HDF5 Table API; it is
probably not possible to use it here, and you would need to use the
lower-level H5D and H5G APIs.

Regards,

            Werner



On 24.08.2017 06:47, Rafal Lichwala wrote:

> Hi Walter, hi All,
>
> Thank you for sharing your work.
> I've analyzed your codes briefly and it seems you manually create a
> dataset with compound type and then you put your values in such a
> dataset - is that correct?
>
> But that means for my use case I need to collect all columns (their
> types) first and then create a compound dataset.
> When the number of such columns is really huge this operation can be
> time and resource consuming. But that's OK if there is no other
> solution...
>
> If I well understand your codes, you are collecting your columns
> (which are separate classes in your case) just in vectors and then you
> calculating columns offsets basing on std::vector::data - is that
> correct?
>
> Any other suggestions from HDF Forum Team which could help to solve my
> use case?
>
> Thank you.
>
> Regards,
> Rafal
>
> On 2017-08-23 at 16:25, Walter Landry wrote:
>> Rafal Lichwala <[hidden email]> wrote:
>>> Let's say my final HDF table will look like this:
>>> ['a', 1, 3.14]
>>> ['b', 2, 2.11]
>>> ['c', 3, 1.89]
>>>
>>> So we simply have a HDF table with 3 columns of types: char, int,
>>> float
>>> and 3 rows with some values.
>>>
>>> Creation of that table must be divided into some "steps".
>>> After 1st "step" I should have a table:
>>> ['a']
>>>
>>> After 2nd step:
>>> ['a', 1]
>>>
>>> After 3rd step:
>>> ['a', 1, 3.14]
>>>
>>> After 4th step:
>>> ['a', 1, 3.14]
>>> ['b', x, x]
>>>
>>> where x after 4th step is undefined and can be some default values
>>> which will be overwritten in the next steps.
>>>
>>> How to achieve that use case?
>>
>> I have to do something similar for my program tablator
>>
>>    https://github.com/Caltech-IPAC/tablator
>>
>> I read in tables in other formats and write out as HDF5.  So I do not
>> know the types of the rows at compile time.  It is all in C++. The
>> details of writing HDF5 are in
>>
>>    src/Table/write_hdf5/
>>
>>> Is it possible to create a table by calling H5TBmake_table(), but
>>> having no fields and no records at the beginning and then just call
>>> H5TBinsert_field() in the next steps?
>>
>> I do not think that is going to work, because you need to know the
>> sizes of rows when you create the table.
>>
>>> Is it possible to have "data" attribute of H5TBinsert_field() function
>>> a NULL value when we insert a new field to a table dataset with no
>>> records yet?
>>>
>>> What about 4th step - can I create just a first column value for a new
>>> record in a table?
>>
>> I do not know of a way to do that.  I would end up creating a whole
>> new table with the new field.  You can then populate the empty fields
>> with appropriate default values.
>>
>>> I know it's maybe a strange use case, but the problem is that I could
>>> have really huge structure model (a lot of columns and a lot of
>>> records) which should be stored in the HDF table dataset, so I need to
>>> avoid "collecting" required information (number of fields, their
>>> types, values) by initial iterating over whole structure.
>>> The second problem is that I have a vector of objects which need to be
>>> stored as HDF table (where table row is the given object and columns
>>> are its fields), but all examples I've seen just work on C struct.
>>
>> That sounds similar to the internal data structure I use in tablator.
>>
>> Hope that helps,
>> Walter Landry
>>
>
>

--
___________________________________________________________________________
Dr. Werner Benger                Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019  Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362



Re: fully dynamic construction of HDF Table dataset

Walter Landry
In reply to this post by Rafal Lichwala
Rafal Lichwala <[hidden email]> wrote:

> Hi Walter, hi All,
>
> Thank you for sharing your work.
> I've analyzed your codes briefly and it seems you manually create a
> dataset with compound type and then you put your values in such a
> dataset - is that correct?
>
> But that means for my use case I need to collect all columns (their
> types) first and then create a compound dataset.
> When the number of such columns is really huge this operation can be
> time and resource consuming. But that's OK if there is no other
> solution...
>
> If I well understand your codes, you are collecting your columns
> (which are separate classes in your case) just in vectors and then you
> calculating columns offsets basing on std::vector::data - is that
> correct?

Correct.

Cheers,
Walter Landry


Re: fully dynamic construction of HDF Table dataset

Jason Newton
In reply to this post by Rafal Lichwala
What you want is run-time type definitions - which is something HDF supports but C doesn't.  Have a look at Python/NumPy's dtype for a good idea of the task you're in for, especially on the C side; it maps really well to HDF's type system.  If you just want something really simple, you don't have to get too crazy.

Basically you need to know how big a type is (a run-time version of sizeof), how many fields are in it, and the types of those fields, and then prefix-sum (a cumsum with a zero shifted in from the left) the sizes and arity of those fields to get the offset of each field of the struct-blob-at-runtime.  Then you have to work with those fields adaptively, since their run-time types vary - someone has to work with bytes at the end of the day.  It's still quite fast, and it's essentially what you see happening with Python's NumPy API.
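
For concreteness, here is a rough sketch of that prefix-sum bookkeeping and of building the corresponding compound type at run time with the plain HDF5 type API (the field names and types are just examples):

    hid_t  field_types[3] = {H5T_NATIVE_CHAR, H5T_NATIVE_INT, H5T_NATIVE_FLOAT};
    const char *names[3]  = {"c", "i", "f"};
    size_t offsets[3], total = 0;
    int    k;

    for (k = 0; k < 3; ++k) {          /* prefix sum of the field sizes */
        offsets[k] = total;
        total     += H5Tget_size(field_types[k]);
    }

    /* Packed compound type describing one record of the runtime "struct". */
    hid_t record_type = H5Tcreate(H5T_COMPOUND, total);
    for (k = 0; k < 3; ++k)
        H5Tinsert(record_type, names[k], offsets[k], field_types[k]);

    /* A record is then a byte buffer of size 'total': fill each field at
       its computed offset and hand the buffer to H5Dwrite (or the packet
       table append call) with 'record_type' as the memory type. */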

The table/packet table APIs will have no problem with these types as records.

FYI, for performance you should still buffer some of these records up before writing.  The alternative approach is storing a structure of arrays, which is a very common technique - MATLAB uses it when it stores to HDF5, for instance, and it is more accessible to other users of HDF5 who would rather not venture into these things.  I side with runtime structs, but other software isn't always so clever.  The structure-of-arrays approach boils down to each column having its own dataset/table/packet table of whatever primitive HDF5 type fits, so you will likely not have to touch compounds at all.  The drawback of this approach is that you always have to do work to reconstruct the object, which can be relatively slow in Python, for instance, and involve lots of copying and metaprogramming.  In MATLAB there is no generic solution for gluing it back together, so you end up writing boilerplate, and supporting that is difficult as the documents change.

-Jason


On Wed, Aug 23, 2017 at 5:10 AM, Rafal Lichwala <[hidden email]> wrote:
Hi,

I've read many examples from both H5TB high level API and low level API for compound HDF data type, but I didn't find a good solution for my special use case. All those examples have one problematic assumption: data structure (which means number of fields and their types and values) must be known a priori - that's the problem in my case, when I don't know this structure and I need to create a table HDF dataset not only row-by-row, but also field-by-field in the row.

I need your advice how to achieve what I want using a proper sequence of HDF API calls.

Let's say my final HDF table will look like this:
['a', 1, 3.14]
['b', 2, 2.11]
['c', 3, 1.89]

So we simply have a HDF table with 3 columns of types: char, int, float
and 3 rows with some values.

Creation of that table must be divided into some "steps".
After 1st "step" I should have a table:
['a']

After 2nd step:
['a', 1]

After 3rd step:
['a', 1, 3.14]

After 4th step:
['a', 1, 3.14]
['b', x, x]

where x after 4th step is undefined and can be some default values which will be overwritten in the next steps.

How to achieve that use case?

Is it possible to create a table by calling H5TBmake_table(), but having no fields and no records at the beginning and then just call H5TBinsert_field() in the next steps?

Is it possible to have "data" attribute of H5TBinsert_field() function a NULL value when we insert a new field to a table dataset with no records yet?

What about 4th step - can I create just a first column value for a new record in a table?

I know it's maybe a strange use case, but the problem is that I could have really huge structure model (a lot of columns and a lot of records) which should be stored in the HDF table dataset, so I need to avoid "collecting" required information (number of fields, their types, values) by initial iterating over whole structure.
The second problem is that I have a vector of objects which need to be stored as HDF table (where table row is the given object and columns are its fields), but all examples I've seen just work on C struct.

I would appreciate any advice!

Regards,
Rafal



