
File corruption and hdf5 design considerations

File corruption and hdf5 design considerations

Eelco Hoogendoorn
As much as I love HDF5 (and PyTables), I find that it becomes
increasingly unusable for storing large amounts of data when the code
writing it is potentially unstable.

I have already learned the hard way never to store original experimental
data in any database that might be opened with write access; and now I
am finding that storing several days' worth of simulation data in HDF5
isn't quite feasible either. Perhaps it'd be fine once my code is done
and bug-free; for now, it crashes frequently. That's part of development,
but I'd like to be able to do development without losing days' worth of
data at a time, AND still use HDF5.

My question, then, is: what are the best practices for dealing with
these kinds of situations? One thing I am doing at the moment is
splitting my data over several different .h5 files, so that writing to
one table cannot take my whole dataset down with it (a sketch of what I
mean follows below). It is unfortunate, though, that standard OS file
systems are more robust than HDF5; I'd rather see it the other way around.
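For illustration, a minimal sketch of this file-per-run splitting using
PyTables (the naming scheme and array name are just placeholders):

import numpy as np
import tables

def write_run(run_id, samples):
    # One file per simulation run: a crash while writing run N cannot
    # corrupt the files of runs that were already written and closed.
    fname = "run_%04d.h5" % run_id
    with tables.open_file(fname, mode="w") as h5:
        h5.create_array(h5.root, "samples", samples)
        h5.flush()

# usage: each run lands in its own file
write_run(0, np.random.rand(1000, 3))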

I understand that there isn't much one can do about a program crashing
in the middle of a B-tree update; that is not going to be pretty. But I
can envision a rather simple solution: keep one or more fully redundant
copies of the metadata structures in memory, and only ever write to one
of them at a time. If one becomes corrupted, you still have all your
data up to the last flush available. I couldn't care less about the
extra disk space overhead, but in case anyone does, it should be easy to
make the number of metadata histories that are maintained configurable.

Is there already such functionality that I have not noticed, is it (or
should it be) planned functionality, or am I missing other techniques
for dealing with these kinds of situations?

Thank you for your input,
Eelco Hoogendoorn

Re: File corruption and hdf5 design considerations

Quincey Koziol
Hi Eelco,
        We are working on two features for the 1.10.0 release that should address the file corruption issue: journaling changes to the HDF5 file, and writing updates to the file in a specific order that prevents the file from being corrupted if the application crashes. Journaling has no space overhead but may be slower, while ordered updates should perform as normal for HDF5 but will have some space overhead. We are trying to get both of these features ready by around November.

        Quincey

On Aug 11, 2012, at 11:31 AM, Eelco Hoogendoorn wrote:

> [...]

Re: File corruption and hdf5 design considerations

Eelco Hoogendoorn

Hi Quincey,

I am really glad to hear that; this sounds like exactly the kind of
features I am looking for.

Would it be accurate to say that, in the meantime, the best practice for
avoiding this type of problem is to store data in separate .h5 files?

Kind regards,
Eelco


On 13-Aug-12 18:12, Quincey Koziol wrote:

> [...]

Re: File corruption and hdf5 design considerations

Alexander van Amesfoort
Hello Eelco,

To reduce the chance of data loss, we store all datasets in one or more
external raw files; HDF5 can store the filenames and access the data
transparently (see the sketch below).
We also avoid chunked storage. Chunking works best for large data sets
that are sparse and clustered, but it was not faster for our data set
processing the last time we checked.
Then, if our HDF5 file ever gets corrupted, we still have the data in
the most straightforward layout and can regenerate the .h5 from our
experiment config file (a list of key=value(s) lines) and the data file
sizes.
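As an example, a minimal sketch of this layout with h5py (file and
dataset names are made up; as far as I know h5py exposes HDF5's external
storage via the external= argument of create_dataset since h5py 2.9, the
C-level equivalent being H5Pset_external):

import numpy as np
import h5py

data = np.arange(1000, dtype='f8')

# Write the raw values to a flat binary file first.
data.tofile('samples.raw')

with h5py.File('index.h5', 'w') as f:
    # The dataset's storage lives in samples.raw; index.h5 only holds
    # the metadata, so losing the .h5 does not lose the raw samples.
    f.create_dataset('samples', shape=data.shape, dtype=data.dtype,
                     external=[('samples.raw', 0, data.nbytes)])

# Reading goes through HDF5 transparently.
with h5py.File('index.h5', 'r') as f:
    print(f['samples'][:10])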

I assume the HDF5 bookkeeping code works, so I don't see how B-trees
etc. can be corrupted unless you receive a fatal signal while changing
them (e.g. another thread crashes). (You need to use a thread-safe HDF5
build or ensure mutual exclusion yourself.)
If it is the HDF5 caching, then you can flush after risky operations
(e.g. a dataset extend).
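For instance, with h5py (a minimal sketch; the file and dataset names
are placeholders, and note that a resizable dataset has to be chunked):

import numpy as np
import h5py

with h5py.File('results.h5', 'w') as f:
    dset = f.create_dataset('log', shape=(0, 3), maxshape=(None, 3),
                            dtype='f8', chunks=(256, 3))
    for step in range(10):
        block = np.random.rand(100, 3)

        # Risky operation: extend the dataset and append a block.
        n = dset.shape[0]
        dset.resize(n + block.shape[0], axis=0)
        dset[n:] = block

        # Flush after each extend so everything written so far is pushed
        # out of the HDF5 library's cache to the file.
        f.flush()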

Overall, I understand how one can lose cached or stored HDF5 data, but
not really why it happens to you so dramatically/often(?). Maybe you
should reconsider your work/processing flow instead of cutting losses
with separate HDF5 files.
I don't know your workflow, but if you write several separate programs
that together form processing pipeline(s), and each program operates
out-of-place, then you cannot corrupt your inputs and can always retry a
failed pipeline stage. You don't have to fully recreate new HDF5 files
or even use HDF5 for all intermediate files; it only costs some extra
(temporary) disk space.
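A minimal sketch of such an out-of-place stage in Python (the stage
function and paths are placeholders; the point is the write-to-temp and
atomic rename):

import os
import tempfile

def run_stage(stage, in_path, out_path):
    # Write to a temporary file in the target directory, then atomically
    # rename it over the final name only once the stage has finished.
    # A crash mid-stage leaves in_path untouched and out_path absent,
    # so the failed stage can simply be rerun.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(out_path) or '.')
    os.close(fd)
    try:
        stage(in_path, tmp_path)   # e.g. reads one .h5 and writes another
        os.replace(tmp_path, out_path)
    except BaseException:
        os.remove(tmp_path)
        raise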

Regards,
Alexander van Amesfoort
--
HPC Software Engineer
ASTRON (Netherlands Institute for Radio Astronomy)
Oude Hoogeveensedijk 4 (PW 05),
7991 PD Dwingeloo,
The Netherlands.
Tel: +31 521 595 754
http://www.astron.nl/~amesfoort/


On 08/14/2012 04:11 PM, Eelco Hoogendoorn wrote:

> [...]
