Optimising HDF5 data structure

tamasgal

Dear all,

we are using HDF5 in our collaboration to store large event data of neutrino interactions. The data itself has a very simple structure, but I still could not find an acceptable way to design the structure of the HDF5 format. It would be great if some HDF5 experts could give me a hint on how to optimise it.

The data I want to store are basically events, which are simply groups of hits. A hit is a simple structure with the following fields:

Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)

As already mentioned, an event is simply a list of a few thousand hits, and the number of hits changes from event to event.

I tried different approaches to store the information of a few thousand events (thus a couple of million hits); the final two structures, which kind of work but still have poor performance, are:

Approach #1: a single "table" to store all hits (basically one array for each hit-field) with an additional "column" (again, an array) to store the event_id they belong to.

This is of course nice if I want to do analysis on the whole file, including all the events, but it is slow when I want to iterate through each event_id, since I need to select the corresponding hits by looking at the event_ids. In pytables or the Pandas framework, this works using binary search index trees, but it's still a bit slow.
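
For concreteness, a minimal sketch of this first layout (assuming pytables; the file name and the exact query are only illustrative):

    import tables as tb

    class Hit(tb.IsDescription):
        event_id  = tb.Int32Col()   # extra "column": which event the hit belongs to
        dom_id    = tb.Int32Col()
        time      = tb.Int32Col()
        tot       = tb.Int16Col()
        triggered = tb.BoolCol()
        pmt_id    = tb.Int16Col()

    with tb.open_file("hits.h5", "w") as f:
        table = f.create_table("/", "hits", Hit, "all hits of all events")
        # ... append the hits of every event here ...

    # per-event readout then becomes a query on the event_id column
    with tb.open_file("hits.h5", "r") as f:
        hits_23 = f.root.hits.read_where("event_id == 23")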

Approach #2: using a hierarchical structure to group the events. An event can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach. To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
It seems that this is only a tiny bit faster when accessing a specific event, which may be related to the fact that HDF5 stores the nodes in a B-tree, much like pandas stores its index.
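
A minimal sketch of this second layout (assuming h5py here; the names and the placeholder data are only illustrative):

    import numpy as np
    import h5py

    hit_dtype = np.dtype([("dom_id", "i4"), ("time", "i4"), ("tot", "i2"),
                          ("triggered", "?"), ("pmt_id", "i2")])

    with h5py.File("events.h5", "w") as f:
        grp = f.create_group("hits")
        for event_id in (23, 24):                    # placeholder loop over events
            hits = np.zeros(1000, dtype=hit_dtype)   # placeholder hit data
            grp.create_dataset(str(event_id), data=hits)
        grp.attrs["n_events"] = 2                    # the attribute mentioned above

    with h5py.File("events.h5", "r") as f:
        for name in f["hits"]:                       # walk over the event nodes
            hits = f["hits"][name][:]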

The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

I also tried variable-length arrays, but I ran into compression issues. Some other approaches involved creating meta tables to keep track of the indices of the hits for faster lookup, but this was kind of awkward and not self-explanatory enough in my opinion.

So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?

Best regards,
Tamas

Re: Optimising HDF5 data structure

Rafal Lichwala
Hi Tamas,

> So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?

I see two solutions for your purposes.
First - try to switch from Python to C++ - it's much faster.

http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=python3&lang2=gpp

Second - I know this is an HDF5 forum, but for such a huge but simple set
of data, I would suggest using an SQL engine as a backend.
MySQL or PostgreSQL would be a good choice if you need the full feature set
of a relational database engine for your data analysis, but
file-based solutions (SQLite) could also be taken into consideration.
In your case the data would be stored in two tables (hits and events) with
a proper key-based join between them.

Regards,
Rafal



Re: Optimising HDF5 data structure

tamasgal
Dear Rafal,

thanks for your reply.

On 31. Mar 2017, at 09:52, Rafal Lichwala <[hidden email]> wrote:

I see two solutions for your purposes.
First - try to switch from Python to C++ - it's much faster.

I am of course aware of the fact that Python is in general much slower than a statically typed compiled language; however, pytables (http://www.pytables.org) and h5py (http://www.h5py.org) are thin wrappers and are tightly bound to the numpy library (http://www.numpy.org), which is totally competitive. I also use Julia to access HDF5 content and I did not notice better performance. So I am not sure if this is a real bottleneck in our case...

Second - I know this is HDF5 forum, but for such a huge but simple set of data, I would suggest to use some SQL engine as a backend.

We definitely need a file-based approach, so a centralised database engine is not an option. I also tried SQLite, but the performance is very poor compared to our HDF5 solution.

So maybe our data structure is not that bad overall, yet our expectations might be a bit too high?

Cheers,
Tamas


Re: Optimising HDF5 data structure

Francesc Altet-2

Hi Tamas,


I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation.  Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance.  Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches? 


Best,

Francesc Alted



Re: Optimising HDF5 data structure

Andrey Paramonov
In reply to this post by tamasgal

Hello Tamas!

My experience suggests that simply indexing the data is not enough to
achieve top performance. The actual layout of information on disk
(primary index) should be well-suited for your typical queries. For
example, if you need to query by event_id, all values with the same
event_id have to be closely located to minimize the number of disk seeks.
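
A minimal numpy sketch of that idea applied to the hit table (the data below is only a placeholder): sort once by event_id before writing, so every event ends up in one contiguous row range on disk.

    import numpy as np

    hit_dtype = np.dtype([("event_id", "i4"), ("dom_id", "i4"), ("time", "i4")])
    hits = np.zeros(1000000, dtype=hit_dtype)                # placeholder data
    hits["event_id"] = np.random.randint(0, 1000, len(hits))

    order = np.argsort(hits["event_id"], kind="stable")      # stable sort keeps the hit order within an event
    hits_sorted = hits[order]
    # write hits_sorted instead of hits: each event is now one contiguous slice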

If you have several types of typical queries, it might be worth duplicating
the information using different physical layouts. This philosophy is used
to great success in e.g.
http://cassandra.apache.org/

From my experience, HDF5 is almost as fast as a direct disk read, and even
*faster* when using fast compression (LZ4, blosc). On my data, HDF5
proved to be much faster than SQLite and local PostgreSQL databases.

Best wishes,
Andrey Paramonov



Re: Optimising HDF5 data structure

Adrien Devresse
In reply to this post by tamasgal
Dear Tamas,

> we are using HDF5 in our collaboration to store large event data of neutrino interactions. The data itself has a very simple structure, but I still could not find an acceptable way to design the structure of the HDF5 format. It would be great if some HDF5 experts could give me a hint on how to optimise it.

It is a pleasure to see some HEP people here.

> The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

If I remember correctly, ROOT can only read in parallel, not write. Does
that matter for you?


> Approach #2: using a hierarchical structure to group the events. An event can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach. To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
> It seems that this is only a tiny bit faster when accessing a specific event, which may be related to the fact that HDF5 stores the nodes in a B-tree, much like pandas stores its index.

This approach would create a large number of datasets (one per id),
which is, from my experience, a bad idea in HDF5.


I would use Approach #1 and store all your events in a "column" fashion
similar to what ROOT does.

For the fast-querying problem, you can post-process your file and add a
separate column acting as an ordered index / associative array with a
layout of the type "event_id" -> "row range".
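
A hedged sketch of such a post-processing step, assuming the approach #1 layout with one dataset per hit field and the hits already sorted by event_id (all file and dataset names are assumptions):

    import numpy as np
    import h5py

    with h5py.File("hits.h5", "r+") as f:
        event_ids = f["/hits/event_id"][:]
        uniq, start = np.unique(event_ids, return_index=True)   # first row of each event
        stop = np.append(start[1:], len(event_ids))
        index = np.empty(len(uniq),
                         dtype=[("event_id", "i4"), ("start", "i8"), ("stop", "i8")])
        index["event_id"], index["start"], index["stop"] = uniq, start, stop
        f.create_dataset("/hits_index", data=index)

    # reading one event is then a binary search plus one contiguous slice:
    #   row = index[np.searchsorted(index["event_id"], 23)]
    #   times_23 = f["/hits/time"][row["start"]:row["stop"]]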


Best Regards,
Adrien

Re: Optimising HDF5 data structure

tamasgal
In reply to this post by Francesc Altet-2
On 31. Mar 2017, at 10:29, Francesc Altet <[hidden email]> wrote:
I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation.

alright, there is hope ;)

On 31. Mar 2017, at 10:29, Francesc Altet <[hidden email]> wrote:
Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance. 

We experimented with compression levels and libs and ended up using blosc. This is what we used:

tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

We also pass the number of expected rows when creating the tables; however, this is a pytables feature, so there is some magic in the background.

On 31. Mar 2017, at 10:29, Francesc Altet <[hidden email]> wrote:
Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches? 

The chunksizes had no significant impact on the performance, but I admit I need to rerun all the performance scripts to show some actual values. The index level is new to me; I need to read up on that, but I think pytables takes care of it.

Here are some examples comparing the ROOT and HDF5 file formats, reading both with thin C wrappers in Python:

ROOT_readout.py  5.27s user 3.33s system 153% cpu 5.609 total
HDF5_big_table_readout.py  17.88s user 4.29s system 105% cpu 21.585 total

On 31. Mar 2017, at 10:36, Андрей Парамонов <[hidden email]> wrote:
My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of information on disk (primary index) should be well-suited for your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be closely located to minimize the number of disk seeks.

OK, this was also my thought. It seems we went in the wrong direction with this indexing and big table thing.

On 31. Mar 2017, at 10:36, Андрей Парамонов <[hidden email]> wrote:
If you have several types of typical queries, it might be worth to duplicate the information using different physical layouts. This philosophy is utilized to great success in e.g.
http://cassandra.apache.org/

Thanks, I will have a look!

On 31. Mar 2017, at 10:36, Андрей Парамонов <[hidden email]> wrote:
From my experience HDF5 is almost as fast as direct disk read, and even *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to be much faster compared to SQLite and local PostgreSQL databases.

Sounds good ;-)

Cheers,
Tamas


Re: Optimising HDF5 data structure

makepeace@jawasoft.com
In reply to this post by tamasgal
Dear Tamas,

My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event.

You could use event-ID for the dataset name, and any other event metadata could be stored as attributes on the dataset.

>> The events can then be accessed by reading "/hits/event_id", like "/hits/23", which is a similar table used in the first approach.

It sounds as if you have already tried this approach.

>> To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.

I believe you can directly get the number of rows in each dataset and so I am confused by the attribute suggestion. It seems performance was still an issue?

Generally I find that performance is all about the chunk size - HDF will generally read a whole chunk at a time and cache those chunks - have you tried different chunk sizes?

rgds
Ewan



Re: Optimising HDF5 data structure

Francesc Altet-2
In reply to this post by tamasgal

Oh sure, there is always hope 😉


Ok, so based on your report, I'd suggest using other codecs inside Blosc.  By default the "blosc" filter translates to the "blosc:blosclz" codec internally, but you can also specify "blosc:lz4", "blosc:snappy", "blosc:zlib" and "blosc:zstd".  Each codec has its strong and weak points, so my first advice is that you experiment with them (especially with "blosc:zstd", which is a surprisingly good newcomer).


Then, you should experiment with different chunksizes.  If you are using PyTables, then make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.
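
A minimal sketch of both suggestions together (the codec and the chunk sizes below are only example values to scan over, and the Hit description is shortened):

    import tables as tb
    import h5py

    class Hit(tb.IsDescription):
        event_id = tb.Int32Col()
        time     = tb.Int32Col()

    # PyTables: pick a Blosc codec and an explicit chunk shape (in table rows)
    filters = tb.Filters(complevel=5, shuffle=True, complib="blosc:zstd")
    with tb.open_file("pt_chunks.h5", "w") as f:
        f.create_table("/", "hits", Hit, filters=filters, chunkshape=(16384,))

    # h5py: the equivalent parameter is called chunks
    with h5py.File("h5py_chunks.h5", "w") as f:
        f.create_dataset("time", shape=(10000000,), dtype="i4", chunks=(16384,))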


Indexing can make lookups much faster too.  Make sure that you create the index with maximum optimization (look for http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you.  Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.
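
A sketch of what that looks like for the event_id column of the big hits table (file and node names are assumptions):

    import tables as tb

    with tb.open_file("hits.h5", "a") as f:
        col = f.root.hits.cols.event_id
        col.create_csindex(filters=tb.Filters(complevel=5, complib="blosc"))
        # queries like read_where("event_id == 23") can now use the completely sorted index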


In general, to understand how chunksizes, compression and indexing can affect your lookup performance, it is worth having a look at the _Optimization Tips_ chapter of the PyTables User's Guide: http://www.pytables.org/usersguide/optimization.html


Good luck,

Francesc Alted

Re: Optimising HDF5 data structure

tamasgal
Thanks for all the feedback so far!

On 31. Mar 2017, at 13:49, Francesc Altet <[hidden email]> wrote:
Then, you should experiment with different chunksizes.  If you are using PyTables, then make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.
[...]
Indexing can make lookups much faster too.  Make sure that you create the index with maximum optimization (look for http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you.  Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.

Alright, I will study that extensively. :)

I am just curious whether I am tying the HDF5 format too much to the pytables framework. We also use other languages, and as far as I understand, pytables creates some hidden tables to do all the magic behind the scenes. Or are these commonly supported HDF5 features? (sorry for the dumb question)

On 31. Mar 2017, at 10:47, Adrien Devresse <[hidden email]> wrote:
It is a pleasure to see some HEP people here.

Thanks, glad to hear ;)

On 31. Mar 2017, at 10:47, Adrien Devresse <[hidden email]> wrote:
The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

If I remember properly, ROOT can only read in parallel, no write. Does
it matter for you ?

Ahm, with "used in parallel" I was referring to the fact that we use both ROOT and HDF5 files in our collaboration. There is a generation conflict between the two "frameworks" as you may imagine. Younger people refuse to use ROOT (for good reasons, but that's another story). That's why I maintain this branch in parallel.

On 31. Mar 2017, at 10:47, Adrien Devresse <[hidden email]> wrote:
This approach would create a large number of dataset ( one per id ),
which is  from my experience, a bad idea in HDF5

Yes, this is kind of the problem with the second approach. h5py is extremely fast when iterating, whereas pytables takes 50 times longer using the very same code (a for loop and direct access to the nodes). And there are people using other frameworks, so I fear there might be some huge performance variations, which of course is not user-friendly at all.

On 31. Mar 2017, at 10:47, Adrien Devresse <[hidden email]> wrote:
I would use Approach #1 and store all your events in a "column" fashion
similar to what ROOT does.
 
For the fast querying problem, you can post-process your file and add a
separate column acting as an ordered index / associative array with a
layout of the type "event_id" -> "range row"

I see... So there might be a well suited set of chunk/index parameters which could improve the speed of that structure. I need to dig deeper then.

Cheers,
Tamas



Re: Optimising HDF5 data structure

Steven Varga-2
In reply to this post by tamasgal
Hello Tamas,

I use HDF5 to store streams of irregular time series (IRTS) from the financial sector. The events are organised per day into a dataset, where each dataset is a variable-length stream/vector with a custom datatype.
The custom record type is created to increase density, and an iterator in C/C++ walks through the event stream; this is linked against Julia, R and Python code.
Because the custom datatype is saved into the file, it is readily accessible through HDFView.

The access pattern for this database is write once / read many, sequential, and I have had good results over the past 5 years. I use it in an MPI cluster environment, with C++/Julia/Rcpp.

The custom datatype in my case:
[event id, asset, ...]

You see, having optimised access both for reading all events sequentially and for reading only some events sequentially is a bi-objective problem that you can mitigate by using more space to gain time.

As others pointed out, chunk size matters. 

hope it helps,
steve



Re: Optimising HDF5 data structure

tamasgal
Dear Steven,

On 31. Mar 2017, at 17:13, Steven Varga <[hidden email]> wrote:

I use HDF5 to store streams of irregular time series (IRTS) from the financial sector. The events are organised per day into a dataset, where each dataset is a variable-length stream/vector with a custom datatype.
The custom record type is created to increase density, and an iterator in C/C++ walks through the event stream; this is linked against Julia, R and Python code.
Because the custom datatype is saved into the file, it is readily accessible through HDFView.

this sounds very interesting. Do you have some public code to look at the implementation details?

Cheers,
Tamas


Re: Optimising HDF5 data structure

Francesc Altet-2
In reply to this post by tamasgal

Indeed, indexing is a PyTables feature, so if you want to use HDF5 with other interfaces, then it is better not to rely on it.


Francesc Alted



Re: Optimising HDF5 data structure

tamasgal
In reply to this post by makepeace@jawasoft.com
Sorry Ewan, I nearly missed your message!

> On 31. Mar 2017, at 13:28, Ewan Makepeace <[hidden email]> wrote:
> My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event.

Yes, we use compound data structures for the hits right now.

> On 31. Mar 2017, at 13:28, Ewan Makepeace <[hidden email]> wrote:
>>> To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
>
> I believe you can directly get the number of rows in each dataset and so I am confused by the attribute suggestion. It seems performance was still an issue?

That was referring to the one-big-table approach with an event_id array. It is kind of mimicking the pytables indexing feature without having a strict pytables dependency, i.e. an extra dataset which stores the "from-to" index values for each event. Which is of course ugly ;)

> On 31. Mar 2017, at 13:28, Ewan Makepeace <[hidden email]> wrote:
> Generally I find that performance is all about the chunk size - HDF will generally read a whole chunk at a time and cache those chunks - have you tried different chunk sizes?

I tried but obviously I did something wrong... ;)



Re: Optimising HDF5 data structure

Elvis Stansvik
In reply to this post by tamasgal

On 31 March 2017 at 11:15 AM, "Tamas Gal" <[hidden email]> wrote:

> We experimented with compression levels and libs and ended up using blosc. This is what we used:
>
> tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

Typing on my phone so I can't say much. Just wanted to react to this: I haven't used pytables, but if the shuffle parameter here refers to the HDF5 library's built-in shuffle filter, I think you want to turn it off when using blosc, since the blosc compressor does its own shuffling, and I think the two may interfere.
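
For reference, a sketch of the same call with that flag turned off (whether this actually helps is exactly what I would benchmark, since pytables may map the flag to blosc's internal shuffle rather than to the HDF5 filter):

    import tables as tb

    # identical to the settings above, only shuffle is disabled
    filters = tb.Filters(complevel=5, shuffle=False, fletcher32=True, complib='blosc')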

Cheers,
Elvis


Re: Optimising HDF5 data structure

Ray Burkholder
In reply to this post by tamasgal

See below:

> this sounds very interesting. Do you have some public code to look at the implementation details?

Here is something similar I've done for a compound data set, iterators, and random search:

https://github.com/rburkholder/trade-frame/tree/master/lib/TFHDF5TimeSeries

The code currently works on Linux and should work on Windows.



Re: Optimising HDF5 data structure

Ger van Diepen
In reply to this post by tamasgal

Note that the HDF5 chunk cache size can be very important. HDF5 does not look at the access pattern to estimate the optimal cache size. If your access pattern is not sequential, you need to set a cache size that minimizes I/O for that access pattern.
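
A minimal sketch of enlarging that cache from h5py (the keyword names follow h5py's File constructor; the 512 MiB value is only an example to tune):

    import h5py

    f = h5py.File("hits.h5", "r",
                  rdcc_nbytes=512 * 1024 ** 2,   # total chunk cache size in bytes (the default is 1 MiB)
                  rdcc_nslots=1000003)           # hash slots; ideally a prime well above the number of cached chunks
    # ... event-by-event readout ...
    f.close()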

I've noted that accessing small hyperslabs is quite slow in HDF5, probably due to B-tree lookup overhead.


Some colleagues at sister institutes have used the ADIOS data system developed at Oak Ridge and said it was much faster than HDF5. However, AFAIK it can use a lot of memory to achieve it. But a large chunk cache is not much different.


BTW, I assume that in your tests both ROOT and HDF5 used cold data, i.e. no data was already available in the system file buffers.


- Ger


