File size

Carlos Penedo Rocha


Hi,

 

I have a scenario in which my compressed h5 file needs to be updated with new data that is coming in every, say, 5 seconds.

 

Approach #1: keep the file opened and just write data as they come, or write a buffer at once.

Approach #2: open the file (RDWR), write the data (or a buffer) and then close the file.

 

Approach #1 is not desirable for my case because if there’s any problem (outage, etc.), the h5 file will likely get corrupted. Also, if I want to have a look at the file, I can’t, because it’s still being written (still open).

 

Approach #2 addresses the issue above, BUT I noticed that if I open/write/close the file every 5 seconds, the compression gets really bad and the file size goes up big time. Approach #1 doesn’t suffer from this problem.

 

So, my question is: is there an “Approach #3” that gives me the best of both worlds, i.e. less likely to leave me with a corrupted h5 file and, at the same time, a good compression ratio?

 

Thanks,

Carlos R.

 



Re: File size

Landon Clipp

Hello Carlos,

Why not write a program that collects data for a given amount of time, say 5 minutes, and stores it in a temporary text file? Then, at the end of the 5 minutes, store that data in HDF5, purge the temporary file, and continue collecting. If an outage happens, you should still have the data available in your temporary file, which can be recovered.
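Roughly, the journaling side could be sketched like this in C (the file name, the interval, and the data source are just placeholders, and the actual transfer into HDF5 is left as a stub):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Append each incoming sample to a plain-text journal so nothing is lost
 * if the process dies; every few minutes the journal is moved into HDF5. */
static void journal_sample(FILE *journal, double t, double value)
{
    fprintf(journal, "%.6f %.10g\n", t, value);
    fflush(journal);                        /* keep the journal recoverable */
}

int main(void)
{
    FILE *journal = fopen("pending_samples.txt", "a+");
    if (!journal)
        return 1;

    time_t last_transfer = time(NULL);
    for (;;) {
        double value = 0.0;                 /* placeholder: read the real data here */
        journal_sample(journal, (double)time(NULL), value);

        if (time(NULL) - last_transfer >= 300) {    /* roughly every 5 minutes */
            /* ...write the journaled samples into the HDF5 file here... */
            journal = freopen("pending_samples.txt", "w+", journal);  /* purge */
            last_transfer = time(NULL);
        }
        sleep(5);                           /* new sample every ~5 seconds */
    }
}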

Regards,
Landon Clipp



Re: File size

Mark Koennecke
In reply to this post by Carlos Penedo Rocha
Hi,

I think these may be two separate issues: if the compression is bad, it may be because your chunk size is too small. Try writing only every 10 seconds or so.
Then there is an approach #3, which does not close the file but just flushes it. Flushing also ensures that the on-disk structure is intact, so you are
safe against a crashing program. The call would be H5Fflush().
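For example, assuming the dataset is already created as a chunked, extendible, compressed 1-D dataset (all identifiers below are just placeholders), a periodic append might look roughly like this:

#include "hdf5.h"

/* Append `n` new values to an extendible 1-D dataset and flush the file,
 * so the on-disk structure stays consistent without closing the file. */
herr_t append_and_flush(hid_t file_id, hid_t dset_id,
                        const double *buf, hsize_t n)
{
    hsize_t old_dims[1], max_dims[1], new_dims[1], start[1], count[1];

    hid_t fspace = H5Dget_space(dset_id);
    H5Sget_simple_extent_dims(fspace, old_dims, max_dims);
    H5Sclose(fspace);

    new_dims[0] = old_dims[0] + n;
    if (H5Dset_extent(dset_id, new_dims) < 0)
        return -1;

    fspace = H5Dget_space(dset_id);
    start[0] = old_dims[0];
    count[0] = n;
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(1, count, NULL);
    herr_t status = H5Dwrite(dset_id, H5T_NATIVE_DOUBLE,
                             mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);

    /* The key point: flush instead of closing the file. */
    if (status >= 0)
        status = H5Fflush(file_id, H5F_SCOPE_GLOBAL);
    return status;
}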

Regards,

      Mark Könnecke




Re: File size

Werner Benger
In reply to this post by Carlos Penedo Rocha

Hi Carlos,

 use HDF5 1.10. It provides the ability to write to a file while that file always remains readable by another process, and it ensures the file will never be corrupted. That feature is called SWMR (single-writer / multiple-reader) and was introduced with 1.10.

Also, you may consider using the LZ4 filter for compression instead of the internal deflate filter. LZ4 does not compress as strongly as deflate, but it is faster by an order of magnitude, almost as fast as uncompressed read/write, so it may be worth it, especially for time-constrained data I/O. You may also want to optimize the chunked layout of the dataset according to your data updates, since each chunk is compressed on its own.
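The writer side would roughly need the 1.10 file format and SWMR access, something along these lines (error checking omitted, names are placeholders):

#include "hdf5.h"

/* Open an existing file for SWMR writing (requires HDF5 >= 1.10 and a file
 * created with the latest file-format version). Readers can then open the
 * same file with H5F_ACC_RDONLY | H5F_ACC_SWMR_READ while we keep writing. */
hid_t open_swmr_writer(const char *filename)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* SWMR needs the new (1.10) metadata layout. */
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    hid_t file_id = H5Fopen(filename, H5F_ACC_RDWR | H5F_ACC_SWMR_WRITE, fapl);
    H5Pclose(fapl);
    return file_id;
}

/* Alternatively, open the file read-write as usual, create or open the
 * datasets that will be appended to, and then switch on SWMR mode with
 *     H5Fstart_swmr_write(file_id);
 * After each append, H5Dflush(dset_id) (or H5Fflush) makes the new data
 * visible to readers. */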

Cheers,

             Werner



-- 
___________________________________________________________________________
Dr. Werner Benger                Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019  Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                        Fax.: +1 225 578-5362 


Re: File size

Patrick Vacek
In reply to this post by Carlos Penedo Rocha
This sounds like a problem I encountered before. Here's my post about
the issue and resolution:

http://hdf-forum.184993.n3.nabble.com/Deflate-and-partial-chunk-writes-td4028713.html

Basically, my solution was to locally buffer data until I'd filled up an
entire chunk before writing to disk. Otherwise every partial write forces
the compressed chunk to be rewritten, and the space freed by the old copy
isn't always reused, which will cause your files to be oversized.
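In code the idea is roughly this (CHUNK is a placeholder that has to match the dataset's actual chunk size, and the append routine is the kind of extend-and-write helper sketched earlier in the thread):

#include "hdf5.h"

#define CHUNK 4096   /* placeholder: must equal the dataset's chunk size */

/* extend-and-write helper of the kind sketched earlier in the thread */
extern herr_t append_and_flush(hid_t file_id, hid_t dset_id,
                               const double *buf, hsize_t n);

static double  pending[CHUNK];
static hsize_t pending_n = 0;

/* Collect samples locally; only when a complete chunk has accumulated is it
 * appended, so every compressed chunk is written to the file exactly once. */
void buffer_sample(hid_t file_id, hid_t dset_id, double value)
{
    pending[pending_n++] = value;
    if (pending_n == CHUNK) {
        append_and_flush(file_id, dset_id, pending, CHUNK);
        pending_n = 0;
    }
}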

--Patrick


Re: [Ext] Re: File size

Carlos Penedo Rocha
In reply to this post by Werner Benger


Thanks Landon, Koennecke, Werner and Patrick for the feedback.

 

I’m suffering from exactly the problem described in section 3 of this link:

https://support.hdfgroup.org/HDF5/doc/H5.user/Performance.html

 

I like the suggestion to use the new SWMR feature, but I’m not that confident about what happens if there’s, say, a shutdown exactly during a write operation.

 

I’ll proceed by having a temporary uncompressed H5 file and then, from time to time, running the h5repack tool on that file to compress it into another, permanent file. In my tests, I could open/write/close an uncompressed H5 file every x seconds and not suffer from the problem described in the link above; the issue really happens with compressed datasets. In another test, I ran h5repack on a bloated file, and the size really did go down to what it should be.
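(For reference, assuming the deflate filter at level 6 and placeholder file names, that repack step would be something like: h5repack -f GZIP=6 temp_uncompressed.h5 archive.h5.)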

 

As a side note, I don’t think my chunk size is too small for the data I have, but it’s not as big as the chunk cache (1 MB).

 

Thanks and Regards,

Carlos

 

 
