HDF5 file grew to 15TB during filesystem failure


We had a file system failure not too long ago, and after cleaning things
up, we discovered an HDF5 file that was listed as being 15TB, even though
it held only 27GB of data (you can see this with h5stat -S).
The file was being created by a program that 'translates' data into an
HDF5 schema. The file system is Lustre 2.4, unstriped, on Linux running
Red Hat 7. The app uses HDF5 1.8.15, which we built with parallel MPI
support. Although the app is an MPI program, it does not use the parallel
HDF5 interface.

Our current theory is that, because of the filesystem failure, the app
could allocate space for the file but not write to it. I'm no expert on
file system issues like this, but I understand it is possible to
allocate more space than is physically available on disk.
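
One way to check whether a file's listed size is backed by real disk blocks is to compare its apparent size with its allocated size (roughly what `ls -l` and `du` report). The sketch below is only an illustration, assuming a Linux/POSIX system where `st_blocks` counts 512-byte units; the helper names `size_report` and `make_sparse_file` are made up for this example, and nothing here confirms that the 15TB file was actually sparse.

```python
import os
import tempfile


def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


def make_sparse_file(path, apparent_bytes):
    """Create a file with a large listed size but only one byte written."""
    with open(path, "wb") as f:
        # Seeking past the end and writing leaves a "hole" that many
        # filesystems do not back with physical blocks.
        f.seek(apparent_bytes - 1)
        f.write(b"\0")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        p = os.path.join(d, "sparse.bin")
        make_sparse_file(p, 1 << 30)  # listed as 1 GiB
        apparent, allocated = size_report(p)
        print(f"apparent={apparent} bytes, allocated={allocated} bytes")
```

Running `size_report` on the suspect HDF5 file would show whether the 15TB figure is mostly holes or space that was really allocated.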

Does anyone know HDF5's behavior in this regard? If a write fails, will
it keep making new allocations, which would explain why the file grew so
large? Or maybe this is a bug in HDF5, and it should error out on the
write?

Then there is the question of what I can do to make the translating app
more robust. One option is to upgrade to 1.8.17. I have also been
looking at the document


and wondering if I should use a non-default file space management
strategy. Currently we just use the default. The translating app does
not delete HDF5 objects from the output; it creates hundreds of chunked
datasets. Some datasets are small, but the larger ones have chunks
capped at 100MB. The document suggests that a file space management
strategy other than the default may work better if you never remove
HDF5 objects.
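
For readers unfamiliar with the chunk cap mentioned above, here is a minimal sketch of one way such a cap could be computed. It is not the translator's actual code: the function name `capped_chunk_shape`, the choice to chunk along the first axis only, and reading "100MB" as 100 MiB are all assumptions made for illustration.

```python
from math import prod

# Assumption: "100MB" is taken to mean 100 MiB.
MAX_CHUNK_BYTES = 100 * 1024 * 1024


def capped_chunk_shape(shape, itemsize, max_bytes=MAX_CHUNK_BYTES):
    """Pick a chunk shape that splits only the first axis and stays
    at or under max_bytes whenever a single row fits under the cap."""
    # Bytes in one "row" (one index along the first axis).
    row_bytes = max(1, itemsize * prod(shape[1:]))
    # As many whole rows as fit under the cap, at least 1,
    # and no more than the dataset actually has.
    rows = max(1, min(shape[0], max_bytes // row_bytes))
    return (rows,) + tuple(shape[1:])


if __name__ == "__main__":
    # 1,000,000 x 1024 array of 8-byte values.
    print(capped_chunk_shape((1_000_000, 1024), 8))
    # Small dataset: the chunk is the whole dataset.
    print(capped_chunk_shape((10, 4), 4))
```

If a single row is already larger than the cap, this sketch still returns a one-row chunk, so the cap is a target and not a hard guarantee.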


