[hdf-forum] RFC: Special Values in HDF5


[hdf-forum] RFC: Special Values in HDF5

Ruth Aydt
Administrator

A new Request for Comments (RFC) on the handling of Special Values in
HDF5 has just been published at
http://hdfgroup.com/pubs/rfcs/RFC_Special_Values_in_HDF5.pdf.

The HDF Group is currently soliciting feedback on this RFC.    
Community comments will be one of the factors considered by The HDF  
Group in making the final design and implementation decisions.

Comments may be sent to help at hdfgroup.org.

-Ruth Aydt
The HDF Group


[hdf-forum] RFC: Special Values in HDF5

Francesc Alted
Hi Ruth,

On Tuesday, 02 September 2008, Ruth Aydt wrote:
> A new Request for Comments (RFC) on the handling of Special Values in
> HDF5 has just been published at
> http://hdfgroup.com/pubs/rfcs/RFC_Special_Values_in_HDF5.pdf .
>
> The HDF Group is currently soliciting feedback on this RFC.
> Community comments will be one of the factors considered by The HDF
> Group in making the final design and implementation decisions.

Thanks for sharing this with us.  After pondering the different
possibilities for a bit, I'd say that the "Parallel Special Values
Dataset" option looks best to me.  Here is my rationale:

- I think that the "Parallel Special Values Dataset" is more general
than the "Attribute Triplet" in that the former can describe highly
scattered special values more efficiently than the latter.  I
personally find the "Attribute Triplet" better suited to geographical
data than to general special-value distributions.

- As you said in your report, compression will greatly reduce the space
overhead of keeping several datasets for the special values in
the "Parallel Special Values Dataset" approach.  On the other hand,
the "Attribute Triplet" won't let you compress the data, so it is
entirely possible that, in the end, the "Parallel Special Values
Dataset" would require less space on disk in many situations (and not
only in the scattered special-values scenario).

- Moreover, reading a specific dataset of special values out of
a "Parallel Special Values Dataset" setup would probably be similar in
speed to, or perhaps faster than, an "Attribute Triplet" one.  The
former will probably be much faster in a highly scattered
special-values scenario.  In a more 'geographic' scenario (i.e. the
special values are relatively contiguous), the "Attribute Triplet"
approach could be marginally faster, but a compressed bit-mask dataset
used to keep the special values in a "Parallel Special Values Dataset"
setup can be very fast to read too (where the cross-over point between
the two approaches lies will depend on the spatial distribution of the
special values).

- Simple operations on dataset region selections (i.e. union,
intersection, complement) would be very easy to implement with
the "Parallel Special Values Dataset" approach, and would also perform
fast, IMO.  This is because there is an easy conversion path from
special-values datasets to bit-mask datasets (in many cases the special
value will be a bit-mask itself, so no conversion at all would be
needed), and computing unions, intersections or complements on
contiguous datasets is a fast operation on today's superscalar
processors (the integer '&', '|' and '~' operators).
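To illustrate the point about set operations reducing to integer bitwise
operators, here is a minimal sketch; the flag names and bit assignments
are invented for illustration and are not part of the RFC:

```python
# Hypothetical per-element special-value flags, encoded as bit-masks.
ICE   = 0b01
CLOUD = 0b10

# A "parallel special values dataset" for a 1-D dataset of 8 elements:
# each entry is the bit-mask for the corresponding element.
flags = [0, ICE, CLOUD, ICE | CLOUD, 0, ICE, 0, CLOUD]

# Union ("ice or cloud"), intersection ("ice and cloud") and
# complement ("not ice") become cheap element-wise integer operations.
ice_or_cloud  = [i for i, f in enumerate(flags) if f & (ICE | CLOUD)]
ice_and_cloud = [i for i, f in enumerate(flags) if f & ICE and f & CLOUD]
not_ice       = [i for i, f in enumerate(flags) if not f & ICE]

print(ice_or_cloud)   # [1, 2, 3, 5, 7]
print(ice_and_cloud)  # [3]
```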

- Finally, and in my opinion, a "Parallel Special Values Dataset" would
integrate better with existing "masked array" implementations in
numerical libraries (I'm thinking of NumPy here, but there should be
others out there), in that those set up a pair of arrays in memory:
one containing the regular values, and another (the mask) saying
whether each regular value is valid or not.  Clearly the "Parallel
Special Values Dataset" approach is more general than this, but the
parallelism between the two layouts is equally evident, and it should
allow a better and more efficient integration between the libraries.
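The masked-array analogy can be sketched with plain Python lists standing
in for the two parallel HDF5 datasets (an illustrative sketch only; NumPy's
numpy.ma keeps the same values-plus-mask pair in memory):

```python
# A values array plus a parallel boolean mask, as a masked-array
# library would hold them in memory.  The fill value -999.0 and the
# data are invented for illustration.
values = [1.5, -999.0, 3.2, -999.0, 7.1]
mask   = [False, True, False, True, False]  # True = special/invalid

# A consumer that understands masked arrays simply pairs the two
# datasets element by element.
valid = [v for v, m in zip(values, mask) if not m]
print(valid)  # [1.5, 3.2, 7.1]
```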

Having said this, I'm not especially against the "Attribute Triplet"
approach (it is better than nothing), but I think the "Parallel
Special Values Dataset" has a lot of virtues and could be the better bet
in the long term (given its generality, compressibility, simplicity
and close fit with existing computing libraries).

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.





[hdf-forum] RFC: Special Values in HDF5

Andrew Collette
Hi,

I'd like to add a little to what Francesc is saying here.  What struck
me when I read the RFC is that with the Attribute Triplet scenario it's
unclear how to efficiently turn a point coordinate into a "compiled"
special value (i.e. a single bitmasked integer). The "parallel dataset"
scenario solves this nicely, at the cost of having to explicitly
enumerate the values across the dataspace.  I think this design is the
better one for general use, given the current limitations of the
dataspace API and the availability of chunking and compression to manage
the storage cost.

It seems like the issue addressed by the "attribute triplet" idea is how
to express the concept of set membership.  We have a collection of
points (coordinates in the "main" dataset) which can be a member of one
or more externally defined categories.  Attribute triplet arrays express
this through a list of region references labelled with strings.  Each
region reference defines a set. In this sense, you don't even need to
store a "special value" with the reference; it could easily be an
"attribute doublet". The user can do whatever they like with the region
at read-time, including using it to impose a mask value (via H5Dfill).

It's easy to select and read points which belong to a certain set (like
"ice") with this approach. The weakness, as the RFC implies, is that
there's no easy way to perform set-like operations on dataspaces (union,
intersection, complement, etc.). It's unclear how I would read all
"cloud and ice" points, or all "cloud but not ice" points in a single
operation, or even "all cloud points within this box".  The dataspace
API would have to advance significantly for this strategy to be useful
beyond single-dataspace selections.

Conversely, a "parallel dataset" is an explicitly populated lookup
table. Each point contains a bitmask with the containing "sets"
explicitly listed. It's very easy to go from a coordinate to a list of
the containing sets.  As Francesc pointed out, using bitmasks also
allows you to use bitwise & and | to replace the missing set operations.
This is much more in line with the traditional "element mask" idea found
in many numerical analysis environments.  Even if the specification
didn't require a bitmask, the element-wise addressing is still much
better suited to this convention than regions are.
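The "explicitly populated lookup table" idea can be sketched as follows;
the category names and bit positions are invented for illustration:

```python
# Hypothetical mapping from bit position to externally defined "set".
CATEGORIES = {0: "ice", 1: "cloud", 2: "land"}

def containing_sets(mask_value):
    """Decode a per-element bit-mask into the names of the sets
    containing that element."""
    return [name for bit, name in CATEGORIES.items() if mask_value >> bit & 1]

# Parallel lookup table: one bit-mask per element of the main dataset.
lookup = [0b000, 0b011, 0b100]

# Going from a coordinate to its containing sets is a single read.
print(containing_sets(lookup[1]))  # ['ice', 'cloud']
```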

The RFC mentions the obvious disadvantages; there's no way to get hold
of all points in one category ("ice" or "ice and cloud") without polling
the entire table.  It's also more expensive to add or remove categories
as you need to explicitly write to each member point, and the number of
categories is limited to the number of bits in the mask.

The limitations of each approach indicate to me that there are really
two well-distinguished use cases here.  Perhaps there could even be two
specifications, one for "masked" datasets backed by lookup tables with
bitmasked/enumeration/user-provided values, and one for "set-like"
datasets, with a standardized storage convention for an unlimited number
of annotated region references, perhaps not even associated with
specific numerical values. Finally, I strongly agree with keeping this
out of the core library and in the form of a specification, at least for
now.  All of this should be on top of the existing low-level
infrastructure.

Thanks,

Andrew Collette
h5py.alfven.org

On Wed, 2008-09-03 at 10:15 +0200, Francesc Alted wrote:

> Hi Ruth,
>
> On Tuesday, 02 September 2008, Ruth Aydt wrote:
> > A new Request for Comments (RFC) on the handling of Special Values in
> > HDF5 has just been published at
> > http://hdfgroup.com/pubs/rfcs/RFC_Special_Values_in_HDF5.pdf .






[hdf-forum] RFC: Special Values in HDF5

Ger van Diepen
I fully agree with the remarks made by Andrew and Francesc. Masks and regions are distinct cases.
Note that it is also possible to define regions in world coordinates (e.g. geographic longitude and latitude), but that is beyond the scope of this RFC.

I would like to remark that I think it is usually much more efficient to store a region as a bounding box plus a mask than as a list of element indices. This holds not only for space, but also for testing whether a dataset element is part of the region. Calculating the union, intersection, etc. of regions is also much more efficient that way (and could be done on the fly).
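The bounding-box-plus-mask representation can be sketched as follows
(coordinates, mask contents and function names are all invented for
illustration):

```python
# A 2-D region stored as an inclusive bounding box plus a row-major
# boolean mask covering only that box.
bbox = ((2, 3), (4, 6))                    # ((row0, col0), (row1, col1))
ncols = bbox[1][1] - bbox[0][1] + 1        # width of the box (4 columns)
mask = [True, True, False, True,
        False, True, True, True,
        True, False, True, True]

def in_region(r, c):
    """Membership test: a cheap box check, then one mask lookup."""
    (r0, c0), (r1, c1) = bbox
    if not (r0 <= r <= r1 and c0 <= c <= c1):
        return False                       # outside the box: cheap reject
    return mask[(r - r0) * ncols + (c - c0)]

print(in_region(3, 4))  # True
print(in_region(0, 0))  # False (outside the bounding box)
```

Compared with a list of element indices, this makes membership a
constant-time index calculation, and unions/intersections of two regions
reduce to element-wise boolean operations over the overlapping boxes.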

Cheers,
Ger van Diepen







[hdf-forum] RFC: Special Values in HDF5

Dimitris Servis
Hi all,

I tried to read the RFC and the responses as carefully as I could, so here
are my 2c:

I agree with the previous opinions that the parallel datasets are much more
flexible. Moreover, I assume they will require less intervention in the
library itself. For me the most important disadvantage is that it would make
the library more complex than it should be, and therefore less attractive for
people to join in. People already complain that it is pretty complex, and I
always advocate that it should be as simple (but less ugly ;-) ) as XML. Taking
this into consideration, and from an architectural point of view, HDF5
should be as simple a data format as possible (no special attributes that
the user cannot recognize having put there), with a library for storing and
retrieving the data. The rest is business domain that has no business in the
core library. And I agree with Francesc that the "parallel dataset" design
is better suited to general purposes.

In my own use case, I use the parallel datasets to define one-to-many
mappings between topologies. Using the attribute method would bloat the
attributes of the source topology.

Regards,

--  dimitris


[hdf-forum] RFC: Special Values in HDF5

Dimitris Servis

What would be interesting for me would be to read a dataset as a
combination of a mask or a mapping and the target dataset, without having
to read both into memory (i.e. do the masking operation on the fly). I do
not know how interesting this is for others.


[hdf-forum] Performance issue of HDF data group

Zhengying Wang
In reply to this post by Ger van Diepen
Hi,

I have come across a performance issue with HDF5 groups which really
puzzles me.

Here are the layouts of two HDF5 files:

1) Data organized with groups

HDF5 "/tmp/test.h5" {
FILE_CONTENTS {
 group      /group2
 dataset    /group2/dataset1
 dataset    /group2/dataset2
 dataset    /group2/dataset3
 datatype   /datatype1
 group      /group3
 dataset    /group3/dataset1
 dataset    /group3/dataset2
 dataset    /group3/dataset3
 group      /group4
 dataset    /group4/dataset1
 dataset    /group4/dataset2
 dataset    /group4/dataset3
 datatype   /datatype2
 group      /group1
 dataset    /group1/dataset1
 dataset    /group1/dataset2
 dataset    /group1/dataset3
 }
}

2) Data organized with flat datasets

HDF5 "/tmp/test_un.h5" {
FILE_CONTENTS {
 dataset    /group2dataset1
 dataset    /group2dataset2
 dataset    /group2dataset3
 datatype   /datatype1
 dataset    /group3dataset1
 dataset    /group3dataset2
 dataset    /group3dataset3
 dataset    /group4dataset1
 dataset    /group4dataset2
 dataset    /group4dataset3
 datatype   /datatype2
 dataset    /group1dataset1
 dataset    /group1dataset2
 dataset    /group1dataset3
 }
}

The same data is stored, in exactly the same per-dataset format, in both
files.  Surprisingly, the performance of accessing the two files is quite
different.

To read the same amount of data (with the same compression level and chunk
size), it's about 2 times faster to read the data in 2) than in 1).  By
running callgrind to profile the program, I found that the function
inflate_fast() spends much less time in case 2) than in case 1).

Does anyone know why?  How would groups affect compression performance
in HDF5?

Any help would be really appreciated!

Thanks,
Zane






[hdf-forum] RFC: Special Values in HDF5

Francesc Alted
In reply to this post by Dimitris Servis
On Thursday, 04 September 2008, Dimitris Servis wrote:

> What would be interesting for me would be to read in a dataset as a
> combination of a mask or a mapping and the target dataset without
> having to read both in memory (i.e. do the masking operation on the
> fly). I do not know how interesting this is for others.

Hmm, I hadn't thought about this, but it could be interesting in many
situations (for example, when the user doesn't have an implementation
of masked arrays in memory at hand).  Ideally, one could even think of
a filter able to do such automatic masking/mapping, so that
the 'parallel' dataset would be transparent to the end user (they would
only have to specify the dataset to read, the region/elements, and the
desired mask/map).  Pretty cool.
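The on-the-fly masking idea can be sketched with plain Python generators
standing in for chunked HDF5 dataset reads (all names and the chunk size
are invented for illustration):

```python
def read_chunks(data, size):
    """Stand-in for a chunked HDF5 dataset read."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def masked_stream(values, mask, chunk=4, fill=None):
    """Stream the values dataset with its parallel mask applied chunk
    by chunk, so neither array is fully materialized in memory."""
    for vchunk, mchunk in zip(read_chunks(values, chunk),
                              read_chunks(mask, chunk)):
        yield [fill if m else v for v, m in zip(vchunk, mchunk)]

values = list(range(10))
mask = [i % 3 == 0 for i in range(10)]   # mask every third element
out = [x for chunk in masked_stream(values, mask) for x in chunk]
print(out)  # [None, 1, 2, None, 4, 5, None, 7, 8, None]
```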

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249






[hdf-forum] RFC: Special Values in HDF5

Dimitris Servis
Hi Francesc,

I have always had this idea of defining operations (mathematical and the
like) that can be performed on datasets while they are being loaded, so
that you could, for example, add two datasets on the fly, probably saving
a lot of memory for the same I/O. Mapping/masking is also an operation in
this sense. Your mention of filters may be a good solution for this.
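The on-the-fly combination idea can be sketched the same way, with chunk
readers standing in for HDF5 dataset reads (names and chunk size are
invented for illustration):

```python
def chunks(seq, size):
    """Stand-in for a chunked HDF5 dataset read."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def added_on_the_fly(a, b, size=3):
    """Add two datasets element-wise while streaming, so only one
    chunk of each is in memory at a time."""
    for ca, cb in zip(chunks(a, size), chunks(b, size)):
        yield [x + y for x, y in zip(ca, cb)]

a = [1, 2, 3, 4, 5, 6, 7]
b = [10, 20, 30, 40, 50, 60, 70]
print([x for c in added_on_the_fly(a, b) for x in c])
# [11, 22, 33, 44, 55, 66, 77]
```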

thanks!

-- dimitris


[hdf-forum] RFC: Special Values in HDF5

Ger van Diepen
Hi Dimitris,

Several years ago C++ classes and an expression grammar were developed
in the casacore package for on-the-fly mathematical operations on N-dim
astronomical image datasets (which can now also be in HDF5 format). It
can also apply masks (or calculate them on the fly) and regions (boxes,
polygons, etc.). It indeed works nicely.

Note that when adding, etc., datasets on the fly, you cannot let HDF5
apply the mask; otherwise you don't know what the corresponding elements
are.

Cheers,
Ger








[hdf-forum] RFC: Special Values in HDF5

Francesc Alted
Hi Ger & Dimitris,

On Thursday, 04 September 2008, Ger van Diepen wrote:
> Hi Dimitris,
>
> Several years ago C++ classes and an expression grammar were
> developed in the casacore package for on-the-fly mathematical
> operations on N-dim astronomical image datasets (which can now also
> be in HDF5 format). It can also apply masks (or calculate them on the
> fly) and regions (boxes, polygons, etc.). It indeed works nicely.

Hmm, I'm not sure that overloading HDF5 filters is the correct path for
implementing such general operations; I tend to think this is more a
matter for the application on top of HDF5.  For example, if the
application already has a buffer for doing the I/O, implementing
general operations at the application layer could be quite easy and
probably much more flexible than using filters.

> Note that when adding, etc datasets on the fly you cannot let HDF5
> apply the mask, otherwise you don't know what the corresponding
> elements are.

That's true.  And this is another reason to defer these complex
operations to the application layer, even masking, provided that the
app already knows how to deal with masked arrays.  Filters should only
be used to perform very simple and well-defined operations.

Francesc




--
Francesc Alted
Freelance developer
Tel +34-964-282-249

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.





[hdf-forum] RFC: Special Values in HDF5

Ger van Diepen
Hi Francesc,

Sorry I was a bit unclear.
I don't use filters; it is done in application code on top of HDF5 or
other storage formats for the image data (like FITS).
I fully agree filters are not the appropriate place to do it; I don't
see how a single filter could add two or more data sets from possibly
different HDF5 files.  

Cheers,
Ger
 
>>> Francesc Alted <faltet at pytables.com> 09/04/08 2:03 PM >>>
Hi Ger & Dimitris,

On Thursday 04 September 2008, Ger van Diepen wrote:
> Hi Dimitris,
>
> Several years ago, C++ classes and an expression grammar were
> developed in the casacore package for on-the-fly mathematical
> operations on N-dim astronomical image datasets (which can now also
> be in HDF5 format). It can also apply masks (or calculate them on the
> fly) and regions (boxes, polygons, etc.). It indeed works nicely.

Hmm, I'm not sure if overloading HDF5 filters is the correct path to
implement such general operations; I tend to think that this is
more a matter for the application on top of HDF5.  For example, if the
application already has a buffer for doing the I/O, implementing
general operations at the application layer could be quite easy and
probably much more flexible than using filters.

> Note that when adding, etc., datasets on the fly you cannot let HDF5
> apply the mask, otherwise you don't know what the corresponding
> elements are.

That's true.  And this is another reason to defer these complex
operations to the application layer --even masking, provided that the
app already knows how to deal with masked arrays.  Filters should
only be used to perform very simple and well defined operations.

Francesc

> [...]



--
Francesc Alted
Freelance developer
Tel +34-964-282-249









[hdf-forum] RFC: Special Values in HDF5

Dimitris Servis
In reply to this post by Francesc Alted
Hi Ger & Francesc

2008/9/4 Francesc Alted <faltet at pytables.com>

> [...]
>
I think Francesc refers to my mentioning of filters, and I think you are
both right that filters are not the appropriate way to do it. My mind
jumped to the idea of using one dataset as a filter for another.

I absolutely agree that the definition of such operations is business
related. However, I'm afraid the implementation of generic dataset
combinations would be more low-level, as one would have to operate on an
element basis in order to do it efficiently and not load entire datasets
into memory. AFAIK the only alternative would be to use H5Diterate (can
we use two reading threads there?), and I admit I have no idea how
expensive and optimized it is for such operations. On the other hand, I
see the benefit of an application managing its own buffers (PyTables
does this, I think), loading parts of the two datasets and manipulating
them. I am not sure I have got all the concepts right so far...

That's a really interesting subject, thanks for the replies and help!

-- dimitris
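[Editorial note: the chunk-at-a-time combination discussed above can be sketched without HDF5 specifics. In the following minimal Python sketch, `read_chunk` and `combine_streamed` are hypothetical names; in real code `read_chunk` would be a hyperslab read (`H5Sselect_hyperslab` + `H5Dread`, or slicing in a high-level binding), with plain lists standing in for the on-disk datasets.]

```python
CHUNK = 4  # elements per read; real code would align this with the HDF5 chunk size

def read_chunk(dataset, start, count):
    """Stand-in for a partial (hyperslab) read of `count` elements at `start`."""
    return dataset[start:start + count]

def combine_streamed(a, b, op, length):
    """Apply `op` element-wise to datasets a and b, one chunk pair at a time,
    so peak memory is two chunks rather than two full datasets."""
    out = []
    for start in range(0, length, CHUNK):
        ca = read_chunk(a, start, CHUNK)
        cb = read_chunk(b, start, CHUNK)
        out.extend(op(x, y) for x, y in zip(ca, cb))
    return out

if __name__ == "__main__":
    a = list(range(10))
    b = [10 * v for v in range(10)]
    print(combine_streamed(a, b, lambda x, y: x + y, len(a)))  # → [0, 11, 22, ..., 99]
```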
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20080904/23aa3e5b/attachment.html>


[hdf-forum] RFC: Special Values in HDF5

Ger van Diepen
Hi Dimitris,

casacore does not load entire datasets into memory; the datasets can be
much bigger than memory.
The expression is another type of image and is evaluated on the fly:
the user asks for a chunk of data from the image expression, which is
then evaluated for that chunk only. Internally it uses iterators to do
this efficiently (usually chunk by chunk).
A reduction function like median can be part of the expression (e.g. to
clip based on a median); it has a special implementation because it
requires histogramming and a partial sort.
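[Editorial note: the lazy-expression idea described here can be illustrated with a toy sketch; an expression over images is itself an image, and requesting a chunk evaluates the expression for that chunk only. The class names below are illustrative; casacore's actual C++ interfaces differ.]

```python
class Image:
    """A concrete image; get_chunk would be a disk read in real code."""
    def __init__(self, data):
        self._data = data

    def get_chunk(self, start, count):
        return self._data[start:start + count]

class LazySum(Image):
    """The image expression a + b, evaluated per requested chunk."""
    def __init__(self, a, b):
        self._a, self._b = a, b

    def get_chunk(self, start, count):
        # Pull only the requested chunk from each operand, then combine.
        ca = self._a.get_chunk(start, count)
        cb = self._b.get_chunk(start, count)
        return [x + y for x, y in zip(ca, cb)]

if __name__ == "__main__":
    expr = LazySum(Image([1, 2, 3, 4]), Image([10, 20, 30, 40]))
    print(expr.get_chunk(1, 2))  # → [22, 33]
```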

Ger
 
>>> "Dimitris Servis" <servisster at gmail.com> 09/04/08 3:22 PM >>>

[...]





[hdf-forum] RFC: Special Values in HDF5

Francesc Alted
In reply to this post by Dimitris Servis
On Thursday 04 September 2008, Dimitris Servis wrote:

> I think Francesc refers to my mentioning of filters and I think you
> are both right that filters are not the appropriate way to do it. My
> mind jumped to the idea of using one dataset as filter for another.
>
> I absolutely agree that the definition of such operations is business
> related. However the implementation of generic dataset combinations,
> I'm afraid would be more low level as one would have to operate on an
> element basis in order to do it efficiently and not load entire
> datasets in memory. AFAIK the only alternative would be to use
> H5Diterate (can we use 2 reading threads there?)

You can use threads, yes.  However, HDF5 blocks threads internally at a
quite coarse-grained level while the library is accessing critical
sections (I don't know whether they are working on reducing this), so in
the end you should not see much of an increase in performance, IMO.

> and I admit I have
> no idea how expensive and optimized this is for such operations. On
> the other hand I see the benefit of an application managing its own
> buffers (PyTables for example do it I think), loading parts of the
> two datasets and manipulating them. I am not sure if I got all the
> concepts right until now...

Yes, PyTables implements buffered I/O (only on compound datasets, as they
are the cornerstone of PyTables).  And in fact, the I/O buffers are
used to compute arbitrarily complex expressions (using an enhanced
computing kernel in C named Numexpr [1]) between the columns in the
user tables, without the need to read the complete dataset into memory.

This approach seems similar to how Ger is using casacore, but with a
larger buffer instead of chunks.  Using a large buffer is important
because, in order to achieve maximum speed, modern CPUs require
buffers that are generally larger than the usual chunksizes (my
experiments say that buffers 10x larger are generally enough to keep
the pipelines working most of the time, although that depends on the
number of fields and the chunksize).

And this works pretty well, as you can see in the speed-ups achieved by
in-kernel [2] and indexed [3] queries (both use the computational
kernel to speed up data selections).

[1] http://code.google.com/p/numexpr/
[2] http://www.pytables.org/docs/manual/ch05.html#inkernelSearch
[3] http://www.pytables.org/docs/manual/ch05.html#indexedSearches
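[Editorial note: the buffered-query scheme described above can be illustrated with a small pure-Python stand-in. The function name and the plain-dict "table" are hypothetical; real PyTables evaluates the condition with the compiled Numexpr kernel on each buffer instead of a per-row Python call.]

```python
def in_kernel_query(table, predicate, buffer_rows=4):
    """Yield global row indices where predicate(row_dict) is true,
    reading the table `buffer_rows` rows at a time, so only one
    buffer of rows is ever in memory."""
    names = list(table)
    nrows = len(table[names[0]])
    for start in range(0, nrows, buffer_rows):
        # "Read" one buffer of each column (a partial read in real code).
        block = {name: col[start:start + buffer_rows]
                 for name, col in table.items()}
        for i in range(len(block[names[0]])):
            row = {name: block[name][i] for name in names}
            if predicate(row):
                yield start + i

if __name__ == "__main__":
    table = {"x": list(range(10)), "y": [v * v for v in range(10)]}
    print(list(in_kernel_query(table, lambda r: r["y"] > 20)))  # → [5, 6, 7, 8, 9]
```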

> That's a really interesting subject, thanks for the replies and help!

Yeah, I'm learning quite a lot too!

--
Francesc Alted
Freelance developer
Tel +34-964-282-249






[hdf-forum] Performance issue of HDF data group

Quincey Koziol
In reply to this post by Zhengying Wang
Hi Zane,

On Sep 4, 2008, at 4:55 AM, Zhengying Wang wrote:

> Hi,
>
> I have come across a performance issue with HDF5 groups, which really
> puzzles me.
>
> Here listed the formats of two HDF files:
>
> 1) Data organized with groups
>
> HDF5 "/tmp/test.h5" {
> FILE_CONTENTS {
> group      /group2
> dataset    /group2/dataset1
> dataset    /group2/dataset2
> dataset    /group2/dataset3
> datatype   /datatype1
> group      /group3
> dataset    /group3/dataset1
> dataset    /group3/dataset2
> dataset    /group3/dataset3
> group      /group4
> dataset    /group4/dataset1
> dataset    /group4/dataset2
> dataset    /group4/dataset3
> datatype   /datatype2
> group      /group1
> dataset    /group1/dataset1
> dataset    /group1/dataset2
> dataset    /group1/dataset3
> }
> }
>
> 2) Data organized with flat datasets
>
> HDF5 "/tmp/test_un.h5" {
> FILE_CONTENTS {
> dataset    /group2dataset1
> dataset    /group2dataset2
> dataset    /group2dataset3
> datatype   /datatype1
> dataset    /group3dataset1
> dataset    /group3dataset2
> dataset    /group3dataset3
> dataset    /group4dataset1
> dataset    /group4dataset2
> dataset    /group4dataset3
> datatype   /datatype2
> dataset    /group1dataset1
> dataset    /group1dataset2
> dataset    /group1dataset3
> }
> }
>
> The same data is stored in exactly the same format in each dataset.
> Amazingly, the performance of accessing the two files is quite
> different.
>
> To read the same amount of data (with the same compression level and
> chunk size), it is about 2 times faster to read the data in 2) than in
> 1). Running callgrind to profile the program, I found that the calls to
> inflate_fast() in 2) spent much less time than in format 1).
>
> Does anyone know why? How would groups affect the compression
> performance in HDF5?

        That's definitely counterintuitive... :-?  Can you write some simple  
programs that show this behavior, so we can see the details of what  
you are doing?

        Quincey
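[Editorial note: as a first step toward the simple program Quincey asks for, one could isolate the inflate stage. The sketch below uses plain zlib, not HDF5 — a real reproducer would create the two files with the C API or h5py and time H5Dread — but it makes the point that identical compressed bytes take identical work to inflate, so a 2x difference between the files would have to come from the data or layout actually differing, not from the presence of groups per se.]

```python
import time
import zlib

# ~1 MB of repetitive data, compressed once at a typical deflate level.
payload = bytes(range(256)) * 4096
compressed = zlib.compress(payload, 6)

def time_inflate(blob, repeats=20):
    """Wall-clock time for `repeats` decompressions of the same bytes."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        assert zlib.decompress(blob) == payload  # verify round-trip each pass
    return time.perf_counter() - t0

if __name__ == "__main__":
    # The same compressed bytes always cost the same to inflate, so a 2x
    # gap between the two HDF5 files points at chunk layout or caching.
    print("inflate time: %.4f s" % time_inflate(compressed))
```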

