[hdf-forum] Selection and crosscuts in HDF5 files


[hdf-forum] Selection and crosscuts in HDF5 files

Ger van Diepen
We are thinking of storing the data observed with our radio telescopes in HDF5. The amount of data can be ten to a few hundred GBytes. The data arrives in order of time.
The data have basically 4 axes: polarisation, frequency, baseline, and time. Depending on the application, a slice of data along one or more of those axes is needed. So a chunked dataset seems like a good candidate.
However, the axes are not regular. E.g. for longer baselines the integration times can be shorter. So we cannot use a simple 4-dim dataset of float values which would allow for easy access in all directions.

An option would be to store the data in a hierarchical way. E.g. a group per time, then a group per baseline and finally a dataset containing an array of data for the pol/freq axes. However, I fear that in that way it is expensive to get, say, a slice containing all data for a given baseline and frequency.

Another option is to structure it like the groups above, but store the data in a dataset with variable-length entries. However, I guess I cannot chunk such a dataset. So again it would be expensive to get the slice mentioned above.

So I'm wondering what is the best way to store such data while having reasonable access times along all axes?

Regards,
Ger


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.





[hdf-forum] Selection and crosscuts in HDF5 files

Francesc Alted
Hi Ger,

On Wednesday 10 September 2008, Ger van Diepen wrote:

> We are thinking of storing the data observed with our radio
> telescopes in HDF5. The amount of data can be ten to a few hundred
> GBytes. The data arrives in order of time. The data have basically 4
> axes: polarisation, frequency, baseline, and time. Depending on the
> application a slice of data along one or more of those axes is
> needed. So a chunked dataset seems like a good candidate. However,
> the axes are not regular. E.g. for longer baselines the integration
> times can be shorter. So we cannot use a simple 4-dim dataset of
> float values which would allow for easy access in all directions.
>
> An option would be to store the data in a hierarchical way. E.g. a
> group per time, then a group per baseline and finally a dataset
> containing an array of data for the pol/freq axes. However, I fear
> that in that way it is expensive to get, say, a slice containing all
> data for a given baseline and frequency.
>
> Another option is to store it like groups, but then in a dataset with
> variable length entries. However, I guess I cannot chunk such a
> dataset. So again it would be expensive to get the slice mentioned
> above.

I'm not sure I understand you, but it seems that by "chunking" you are
referring to what is called 'hyperslab selection' in HDF5 jargon.

> So I'm wondering what is the best way to store such data while having
> reasonable access times along all axes?

One possibility would be to use a table as in a traditional database.  
In terms of HDF5 that can be implemented as a compound, chunked (in the
HDF5 sense) dataset with one field for each irregular axis, plus an
additional field holding the actual float values.  The length of such a
dataset would be the product of the lengths of the axes.  This would
arguably take much more space on disk than other solutions (the entries
contain not only the actual values but also the *axis values*), but as
the axis information has relatively low entropy, the shuffle+compression
filters could greatly reduce the amount of space needed (to something
reasonably close to what your original values alone would take).

For accessing the values as slices of your axes, you should add some
logic on your app that allows you to select the information you are
interested in.  For example, if you want the values within a range
of 'polarization' and 'frequency', you can traverse the dataset and
select those values.

However, in order to avoid traversing the complete table, you may want
to index all the fields that are treated as axis, so as to speed-up the
lookups (as a matter of fact, this is what traditional databases do).
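A minimal h5py sketch of this table-like layout (the file name, field
names, and chunk size are all illustrative, not prescriptive):

```python
import numpy as np
import h5py

# Illustrative record layout: one row per sample, with the axis
# coordinates stored alongside the value.
row_dtype = np.dtype([
    ("time",     np.float64),
    ("baseline", np.int32),
    ("freq",     np.float32),
    ("pol",      np.int8),
    ("value",    np.float32),
])

with h5py.File("visibilities.h5", "w") as f:
    # Chunked + shuffle + gzip: the low-entropy axis columns compress well.
    vis = f.create_dataset(
        "vis", shape=(0,), maxshape=(None,), dtype=row_dtype,
        chunks=(65536,), shuffle=True, compression="gzip",
    )
    # Append one batch of 4 pol x 256 freq samples as it arrives in time.
    batch = np.zeros(1024, dtype=row_dtype)
    batch["pol"] = np.repeat(np.arange(4), 256)
    batch["freq"] = np.tile(np.arange(256), 4)
    vis.resize(vis.shape[0] + len(batch), axis=0)
    vis[-len(batch):] = batch
```

Because the dataset is one-dimensional and extendable along it, appending
batches in time order is cheap; the cost is paid at query time, which is
where the indexing discussed below comes in.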

<blurb-mode>
In case you are using Python for your analysis job, you may want to use
PyTables Pro [1] for this.  It implements an indexing engine that can
cope with very large datasets, and lets you do operations like:

slice = table.readWhere('(pol>10) & (pol<20) | (pres<1.3)',
                        field='actual_value')

where 'slice' holds the data you are interested in.  Of course, if
the 'pol' or 'pres' fields are indexed, traversing the complete dataset
is avoided.

In addition to using HDF5 as the container for all of its data, the
indexing engine behind PyTables Pro scales much better than the ones in
traditional databases, as can be seen in [2].
</blurb-mode>

[1] http://www.pytables.org/moin/PyTablesPro
[2] http://www.pytables.org/docs/OPSI-indexes.pdf

Hope that helps,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249






[hdf-forum] Selection and crosscuts in HDF5 files

Quincey Koziol
In reply to this post by Ger van Diepen
Hi Ger,

On Sep 10, 2008, at 7:23 AM, Ger van Diepen wrote:

> We are thinking of storing the data observed with our radio  
> telescopes in HDF5. The amount of data can be ten to a few hundred  
> GBytes. The data arrives in order of time.
> The data have basically 4 axes: polarisation, frequency, baseline,  
> and time. Depending on the application a slice of data along one  
> or more of those axes is needed. So a chunked dataset seems like a  
> good candidate.
> However, the axes are not regular. E.g. for longer baselines the  
> integration times can be shorter. So we cannot use a simple 4-dim  
> dataset of float values which would allow for easy access in all  
> directions.

        Just to be certain I know what we're talking about here: are you  
thinking that you want the dimensions of your dataset to be "ragged"  
in one dimension while expanding along another dimension (and holding  
the other two dimensions fixed)?

> An option would be to store the data in a hierarchical way. E.g. a  
> group per time, then a group per baseline and finally a dataset  
> containing an array of data for the pol/freq axes. However, I fear  
> that in that way it is expensive to get, say, a slice containing all  
> data for a given baseline and frequency.
>
> Another option is to store it like groups, but then in a dataset  
> with variable length entries. However, I guess I cannot chunk such a  
> dataset. So again it would be expensive to get the slice mentioned  
> above.

        You can chunk datasets that have variable-length datatype elements.

> So I'm wondering what is the best way to store such data while  
> having reasonable access times along all axes?

        Hmm, HDF5 doesn't tackle the case I mentioned above (ragged dims,  
etc) in "big" ways, just "small" ways with variable-length datatypes.  
The downside of using variable-length datatypes is that there's no way  
to subset along that "dimension" currently.

        Quincey







[hdf-forum] Selection and crosscuts in HDF5 files

Quincey Koziol
Hi Ger,

On Sep 11, 2008, at 1:29 AM, Ger van Diepen wrote:

> Hi Quincey and Francesc,
>
> Thanks for your answers.
> Indeed it expands in the time dimension and it is ragged. For  
> instance, for baseline A we'll have a cube of [ntimeA,nfreq,npol],  
> while for baseline B we'll have [ntimeB,nfreq,npol]. Ragging is very  
> much desired to save a factor of at least 2 in storage, so we cannot  
> have a cube [ntime,nbaseline,nfreq,npol].
> We have 3 main applications with different access patterns.
> - RFI detection needs a sliding window in time and freq per baseline.
> - Calibration needs all data in chunks of time.
> - Imaging needs all data in chunks of frequency.
>
> When chunking a dataset with variable-length datatypes, I cannot  
> see how it can still chunk all 4 dimensions. What does it chunk?

        I was thinking of making a chunked 3-D dataset with a variable-length  
datatype for the ragged dimension.
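For what it's worth, that layout can be expressed directly in h5py; a
sketch (shapes taken from the numbers later in this thread, names
invented):

```python
import numpy as np
import h5py

# A chunked [nbaseline, nfreq, npol] dataset whose elements are
# variable-length float arrays holding each baseline's ragged time series.
vlen_f4 = h5py.vlen_dtype(np.dtype("float32"))

with h5py.File("ragged.h5", "w") as f:
    vis = f.create_dataset(
        "vis", shape=(900, 1024, 4), dtype=vlen_f4,
        chunks=(1, 128, 4),  # chunk over baseline/freq/pol only
    )
    vis[0, 0, 0] = np.zeros(6400, dtype="float32")   # long baseline
    vis[899, 0, 0] = np.zeros(800, dtype="float32")  # short baseline
```

The caveat is exactly the one Quincey mentions: each variable-length
element is read back whole, so there is no subsetting along the ragged
"time" axis.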

> I guess I can use multiple non-ragged chunked data sets, for  
> instance one per baseline or combine baselines with the same time  
> integration. I have to think more about that.
> I assume the chunk cache is shared by all datasets, so it should be  
> large enough when doing, say, the imaging.

        I think you may get some reasonable results with making a  
[ntime,nbaseline,nfreq,npol] cube as long as you set the chunk  
dimensions relatively small and add a compression filter (like  
deflate).  You will benefit from the fact that chunks without any data  
elements aren't instantiated in the file, and chunks that are only  
partially filled with [ragged] elements will compress well.  It won't  
be as good as having a fully supported ragged dimension, but it will  
still give you better subsetting capabilities than having multiple  
datasets.
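A sketch of that in h5py (dimension sizes taken from Ger's numbers; the
chunk shape is a guess that would need tuning):

```python
import numpy as np
import h5py

with h5py.File("cube.h5", "w") as f:
    # Full [ntime, nbaseline, nfreq, npol] cube; small chunks + deflate.
    vis = f.create_dataset(
        "vis", shape=(6400, 900, 1024, 4), dtype="float32",
        chunks=(16, 8, 64, 4), shuffle=True, compression="gzip",
    )
    # A short baseline has only 800 time samples; write just those.
    vis[:800, 0, :, :] = 1.0
    # Chunks never written are not stored, so the ~94 GB logical cube
    # occupies only what the touched (and compressed) chunks need.
    on_disk = vis.id.get_storage_size()  # far below 6400*900*1024*4*4 bytes
```

Reads of never-written regions simply return the fill value, so all four
axes stay addressable with ordinary hyperslab selections.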

> Maybe Francesc's idea of a database-like approach is feasible, but I  
> hesitate to index billions of values that way. Typical values are:
> npol=4
> nfreq=1024
> nbaseline=900
> ntime=6400 for long baselines and 800 for short baselines (and  
> something like 3200 or 1600 for intermediate baselines)

        *ick* :-)

                Quincey

> Cheers,
> Ger







[hdf-forum] Selection and crosscuts in HDF5 files

Francesc Alted
Hi Ger,

On Thursday 11 September 2008, Quincey Koziol wrote:
> On Sep 11, 2008, at 1:29 AM, Ger van Diepen wrote:
[clip]
> > Maybe Francesc's idea of a database-like approach is feasible, but
> > I hesitate to index billions of values that way. Typical values
> > are: npol=4
> > nfreq=1024
> > nbaseline=900
> > ntime=6400 for long baselines and 800 for short baselines (and
> > something like 3200 or 1600 for intermediate baselines)

No problem.  Tables with 5 billion entries (and more) are typical
figures for PyTables Pro, as you can see in the OPSI white paper
(that I mentioned in a previous message).

Sure, the indexes will take quite a bit of space, but much less than,
for example, PostgreSQL's (typically 3x less, and up to 15x less in the
forthcoming Pro 2.1).  Also, index creation is around 10x faster.  For
example, creating an index for a table with 5 billion rows would take
just a couple of hours (on a machine with an Opteron64 processor at
2 GHz and a regular SATA disk), so completely indexing all 5 columns
required for your case (4 if you don't want to index the values) would
take just 10 hours.  All in all, this is not that much for getting
first-class access times to your data along any of your axes.
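A small sketch of what the column indexing looks like (written against
today's open-source PyTables API, into which the Pro indexing engine was
later merged; the schema and column names are invented):

```python
import numpy as np
import tables as tb

# Illustrative schema: one row per sample, axis coordinates as columns.
class Vis(tb.IsDescription):
    time     = tb.Float64Col(pos=0)
    baseline = tb.Int32Col(pos=1)
    freq     = tb.Float32Col(pos=2)
    pol      = tb.Int8Col(pos=3)
    value    = tb.Float32Col(pos=4)

with tb.open_file("vis_indexed.h5", "w") as f:
    table = f.create_table("/", "vis", Vis)
    rows = np.zeros(10000, dtype=table.description._v_dtype)
    rows["pol"] = np.arange(10000) % 4
    table.append(rows)
    table.flush()
    # Index the axis columns so selections avoid a full-table scan.
    for name in ("time", "baseline", "freq", "pol"):
        table.cols._f_col(name).create_index()
    hits = table.read_where("pol == 2", field="value")
```

Once the axis columns are indexed, `read_where` selections use the
indexes instead of traversing every row.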

Finally, I must say that the main drawback of OPSI indexes (but also
the reason behind their high efficiency and compactness) is that
updating values in indexed tables is far slower than in other databases
(10x slower or more).  However, if you are going to use it for mostly
read-only or append-only tables, then there is no problem at all.  
Also, the speed of updating non-indexed values is not affected by
whichever other columns are indexed.

At any rate, whether or not this is a good solution for you depends
largely on your requirements.

--
Francesc Alted
Freelance developer
Tel +34-964-282-249
