Questions about size of generated HDF5 files


guillaume.jacquenot
Hello everyone!

I am creating an HDF5 file from a Fortran program, and I am confused about the size of the generated file.

I am writing 19000 datasets, each containing 21 values of 64-bit real numbers.
I write one value at a time, extending each of the 19000 datasets by one element every time.
All data are correctly written.
But the generated file is more than 48 MB.
I expected the total size of the file to be a little bigger than the raw data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file, which means each empty dataset takes about 400 bytes.
I therefore guessed that a ~10 MB (6 MB + 3.2 MB) HDF5 file could contain everything.

For comparison, if I write everything to a text file, where each real number is written with 15 characters, I obtain a 6 MB CSV file.

Question 1)
Is this behaviour normal?

Question 2)
Can extending a dataset each time data is written to it significantly increase the total required disk space?
Can preallocating datasets and writing with hyperslabs save some space?
Can the chunk parameters impact the size of the generated HDF5 file?

Question 3)
If I pack everything into a compound dataset with 19000 columns, will the resulting file be smaller?

N.B.:
When looking at the example that generates 100000 groups (grplots.c), the size of the generated HDF5 file is 78 MB for 100000 empty groups.
That means each group takes about 780 bytes.
https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c

Guillaume Jacquenot





Re: Questions about size of generated HDF5 files

Quincey Koziol-3
Hi Guillaume,
Are you using chunked or contiguous datasets?  If chunked, what size are you using?  Also, can you use the “latest” version of the format, which should be smaller, but is only compatible with HDF5 1.10.x or later?  (i.e. H5Pset_libver_bounds with “latest” for low and high bounds, https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm )
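
In the Fortran API, a minimal sketch of that (an illustration only; note that the file has to be created with the same access property list for the setting to take effect) would be something like:

    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, error)
    call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, H5F_LIBVER_LATEST_F, error)
    ! create the file with that property list, otherwise the bounds have no effect
    call h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp=fapl_id)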

Quincey



Re: Questions about size of generated HDF5 files

guillaume.jacquenot
Hello Quincey

I am using version 1.8.16.
I am using a chunk size of 1.
I have tried contiguous datasets, but I get an error at runtime (see below).

I have written a test program that creates 3000 datasets filled with 64-bit floating-point numbers.
I can specify a number n, which controls how many times my data is saved (the number of timesteps of a simulation in my case).

To summarize, the test program does:

    call hdf5_init(filename)
    do i = 1, n
        call hdf5_write(datatosave)
    end do
    call hdf5_close()
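
Each hdf5_write call appends one value to each of the 3000 datasets; per dataset, the step is roughly the following (a simplified sketch with hypothetical names, not my exact code):

    ! Append one value to a 1-D extendible dataset (chunk size 1 in my test).
    subroutine append_value(dset_id, current_size, value)
        use hdf5
        implicit none
        integer(hid_t),   intent(in)    :: dset_id
        integer(hsize_t), intent(inout) :: current_size
        real(kind=8),     intent(in)    :: value
        integer(hid_t)   :: filespace, memspace
        integer(hsize_t) :: new_size(1), offset(1), count(1)
        integer          :: error

        new_size(1) = current_size + 1
        call h5dset_extent_f(dset_id, new_size, error)       ! grow the dataset by one element

        count(1)  = 1
        offset(1) = current_size
        call h5screate_simple_f(1, count, memspace, error)   ! one-element memory dataspace
        call h5dget_space_f(dset_id, filespace, error)
        call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, error)
        call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, (/ value /), count, error, &
                        mem_space_id=memspace, file_space_id=filespace)

        call h5sclose_f(memspace, error)
        call h5sclose_f(filespace, error)
        current_size = new_size(1)
    end subroutine append_value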



With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about 370 bytes per empty dataset (totally reasonable).
With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why such an increase?
With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB, which is logical (3000*8*1/1e6 = 0.024 MB).

When setting the chunk size to 10, I obtain the following results:

With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about 370 bytes per empty dataset.
With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
With n = 2, I get an HDF5 file of size 7.15 MB, which corresponds to an increase of 3000*8*10/1e6 MB, which is logical.

I don't understand the first increase in size. It does not make this data storage very efficient.
Do you think a compound dataset with 3000 columns would show the same behaviour? I have not tried it, since I don't know how to map the content of an array when calling the h5dwrite_f function for a compound dataset.


If I ask for 30000 datasets, I observe the same behaviour:
n = 0 -> 10.9 MB
n = 1 -> 73.2 MB

Thanks



Here is the error I get with a contiguous dataset:


  #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: hdf5-1.8.16/src/H5Gtraverse.c line 861  in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct(): extendible contiguous non-external dataset
    major: Dataset
    minor: Feature is unsupported
HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C


Re: Questions about size of generated HDF5 files

Pierre de Buyl-2
Hello,

I am just reacting to this because of the chunk size. Every chunk carries metadata, so a chunk should contain a non-negligible amount of data to avoid inefficiencies and large file sizes. The guideline in the HDF5 documentation is a chunk size on the order of 1 MB.
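
A minimal sketch in the Fortran API of creating an extendible 1-D dataset with a larger chunk (a code fragment; the 1024-element chunk and the already-opened file_id are placeholders, not a recommendation for your exact case):

    integer(hsize_t) :: dims(1), maxdims(1), chunk(1)
    integer(hid_t)   :: space_id, dcpl_id, dset_id
    integer          :: error

    dims(1)    = 0                   ! start empty
    maxdims(1) = H5S_UNLIMITED_F     ! extendible along the single dimension
    chunk(1)   = 1024                ! much larger chunk than 1
    call h5screate_simple_f(1, dims, space_id, error, maxdims)
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
    call h5pset_chunk_f(dcpl_id, 1, chunk, error)
    call h5dcreate_f(file_id, "dataset_0001", H5T_NATIVE_DOUBLE, space_id, &
                     dset_id, error, dcpl_id)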

Regards,

Pierre


Re: Questions about size of generated HDF5 files

guillaume.jacquenot
Hello HDF5 community, Quincey,

I have tested versions 1.8.16 and 1.10.1, also with the h5pset_libver_bounds_f subroutine.

I have inserted these calls in my benchmark program:

    call h5open_f(error)
    call h5pcreate_f( H5P_FILE_ACCESS_F, fapl_id, error)
    call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, H5F_LIBVER_LATEST_F, error)


However, I can't see any difference in the size of the generated HDF5 files.
Below are the size and md5sum of the generated HDF5 files, for the two HDF5 library versions and different numbers of elements (0, 1 and 2) per dataset.



Version 1.8.16
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ee8157f1ce74936021b1958fb796741e *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:17 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
1790a5650bb945b17c0f8a4e59adec85 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:17 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
7d3dff2c6a1c29fa0fe827e4bd5ba79e *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:17 results.h5


Version 1.10.1
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ec8169773b9ea015c81fc4cb2205d727 *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:12 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
fae64160fe79f4af0ef382fd1790bf76 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:14 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
20aaf160b3d8ab794ab8c14a604dacc5 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:14 results.h5






Re: Questions about size of generated HDF5 files

Quincey Koziol-3
Hi Guillaume,
As Pierre mentioned, a chunk size of 1 is not reasonable and will generate a lot of metadata overhead.  Something closer to 1MB of data elements would be much better.
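
As a rough illustration, assuming 8-byte reals: 1 MB corresponds to about 1e6 / 8 = 125000 elements per chunk, and for datasets that will only ever hold 21 values, a chunk equal to the full 21-element extent already keeps each dataset in a single chunk.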

Quincey
