Opening datasets expensive?

Opening datasets expensive?

Jim Robinson
Hi,  I am using HDF5 as the backend for a genomics visualizer.  The
data is organized by experiment, chromosome, and resolution scale.  A
typical file might have 300 or so experiments, 24 chromosomes, and 8
resolution scales.  My current design uses one dataset for each
(experiment, chromosome, resolution scale) combination, or 57,600
datasets in all.

First question: is that too many datasets?  I could combine the
experiment and chromosome dimensions, with a corresponding reduction in
the number of datasets and an increase in each dataset's size.  It would
complicate the application code but is doable.

The application is a visualization tool and needs to access small portions
of each dataset very quickly.  It is organized similarly to Google Maps:
as the user zooms and pans, small slices of the datasets are accessed and
rendered.  The number of datasets accessed at one time is equal to the
number of experiments.  It is working fine with small numbers of
experiments (< 20), but panning and zooming is noticeably sluggish with
300.  I did some profiling and discovered that about 70% of the time is
spent just opening the datasets.  Is this to be expected?  Is it good
practice to have a few large datasets rather than many smaller ones?

Oh, I'm using the Java JNI wrapper (H5).  I am not using the object
API, just the JNI wrapper functions.
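
In case it helps, the per-tile access pattern looks roughly like this
(sketched with the HDF5 1.8 C API for brevity -- the JNI wrapper calls map
one-to-one onto it; the path scheme /exp_<n>/<chrom>/scale_<k> is a made-up
example, not my real naming):

#include "hdf5.h"
#include <stdio.h>

/* Read one small slice from every experiment at a given chromosome and
 * resolution scale -- one H5Dopen per experiment, which is where the
 * profiling shows ~70% of the time going. */
static void read_tile(hid_t file, int n_exp, const char *chrom, int scale,
                      hsize_t start, hsize_t count, float *buf)
{
    char path[128];
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    for (int e = 0; e < n_exp; e++) {
        snprintf(path, sizeof(path), "/exp_%d/%s/scale_%d", e, chrom, scale);

        hid_t dset  = H5Dopen(file, path, H5P_DEFAULT);   /* the slow part */
        hid_t space = H5Dget_space(dset);

        H5Sselect_hyperslab(space, H5S_SELECT_SET, &start, NULL, &count, NULL);
        H5Dread(dset, H5T_NATIVE_FLOAT, memspace, space, H5P_DEFAULT,
                buf + (size_t)e * count);

        H5Sclose(space);
        H5Dclose(dset);
    }
    H5Sclose(memspace);
}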

Thanks for any tips.

Jim Robinson
Broad Institute



Opening datasets expensive?

Elena Pourmal
Jim,

This is a known performance problem related to the behavior of the HDF5
metadata cache.  We have a fix and will be testing it in the next few
days.

Would you like to get a tarball when it is available and see if the
fix addresses the problem?  The fix will be in the 1.8 branch.

Elena

At 11:31 PM -0500 1/16/08, Jim Robinson wrote:

>Hi,  I am using HDF5 as the backend for a genomics visualizer.
>The data is organized by experiment, chromosome, and resolution
>scale.  A typical file might have 300 or so experiments, 24
>chromosomes, and 8 resolution scales.  My current design uses one
>dataset for each (experiment, chromosome, resolution scale)
>combination, or 57,600 datasets in all.
>
>First question: is that too many datasets?  I could combine the
>experiment and chromosome dimensions, with a corresponding reduction
>in the number of datasets and an increase in each dataset's size.  It
>would complicate the application code but is doable.
>
>The application is a visualization tool and needs to access small
>portions of each dataset very quickly.  It is organized similarly to
>Google Maps: as the user zooms and pans, small slices of the datasets
>are accessed and rendered.  The number of datasets accessed at one
>time is equal to the number of experiments.  It is working fine with
>small numbers of experiments (< 20), but panning and zooming is
>noticeably sluggish with 300.  I did some profiling and discovered
>that about 70% of the time is spent just opening the datasets.  Is
>this to be expected?  Is it good practice to have a few large
>datasets rather than many smaller ones?
>Oh, I'm using the Java JNI wrapper (H5).  I am not using the object
>API, just the JNI wrapper functions.
>
>Thanks for any tips.
>
>Jim Robinson
>Broad Institute
>
>


--

------------------------------------------------------------
Elena Pourmal
The HDF Group
1901 So First ST.
Suite C-2
Champaign, IL 61820

epourmal at hdfgroup.org
(217)333-0238 (office)
(217)333-9049 (fax)
------------------------------------------------------------


Opening datasets expensive?

Francesc Altet
In reply to this post by Jim Robinson
On Thursday, 17 January 2008, Jim Robinson wrote:

> Hi,  I am using HDF5 as the backend for a genomics visualizer.  The
> data is organized by experiment, chromosome, and resolution scale.
> A typical file might have 300 or so experiments, 24 chromosomes, and
> 8 resolution scales.  My current design uses one dataset for each
> (experiment, chromosome, resolution scale) combination, or 57,600
> datasets in all.
>
> First question: is that too many datasets?  I could combine the
> experiment and chromosome dimensions, with a corresponding reduction
> in the number of datasets and an increase in each dataset's size.  It
> would complicate the application code but is doable.
>
> The application is a visualization tool and needs to access small
> portions of each dataset very quickly.  It is organized similarly to
> Google Maps: as the user zooms and pans, small slices of the datasets
> are accessed and rendered.  The number of datasets accessed at one
> time is equal to the number of experiments.  It is working fine with
> small numbers of experiments (< 20), but panning and zooming is
> noticeably sluggish with 300.  I did some profiling and discovered
> that about 70% of the time is spent just opening the datasets.  Is
> this to be expected?  Is it good practice to have a few large
> datasets rather than many smaller ones?

My experience says that it is definitely better to have fewer large
datasets than many smaller ones.

Even if, as Elena says, there is a bug in the metadata cache that the
THG people are fixing, if you want maximum speed when accessing parts of
your data, my guess is that reading different parts of one large dataset
will always be faster than opening and reading many different datasets.
This is because each dataset has its own metadata that has to be
retrieved from disk, while to access part of a large dataset you only
have to read the portion of the B-tree needed to reach it (if it is not
yet in memory) plus the data itself, which is pretty fast.
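
For example, if the 300 experiments for one chromosome and resolution scale
were stored together as a single 2D dataset laid out as
[experiments x positions] (an assumed layout, just for illustration), one
pan/zoom would need a single open plus one hyperslab read instead of 300
opens:

#include "hdf5.h"

/* Read a window of `count` positions, for all experiments at once, from one
 * 2D dataset shaped [n_experiments][n_positions]. */
static void read_window(hid_t file, const char *path, hsize_t n_exp,
                        hsize_t start_pos, hsize_t count, float *buf)
{
    hid_t dset  = H5Dopen(file, path, H5P_DEFAULT);   /* opened only once */
    hid_t space = H5Dget_space(dset);

    hsize_t start[2] = { 0,     start_pos };
    hsize_t cnt[2]   = { n_exp, count     };
    H5Sselect_hyperslab(space, H5S_SELECT_SET, start, NULL, cnt, NULL);

    hid_t memspace = H5Screate_simple(2, cnt, NULL);
    H5Dread(dset, H5T_NATIVE_FLOAT, memspace, space, H5P_DEFAULT, buf);

    H5Sclose(memspace);
    H5Sclose(space);
    H5Dclose(dset);
}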

My 2 cents,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


Opening datasets expensive?

Dougherty, Matthew T.
My preliminary performance analysis on writing 2D 64x64 32-bit pixel images:
1) stdio and HDF are comparable up to 10k images.
2) HDF performance diverges badly and non-linearly after 10k images; stdio stays linear.
3) By 200k images, HDF is 20x slower.
4) If you stack the 2D images as a single 3D 200k x 64 x 64 32-bit pixel stack, HDF is 20-40% faster than stdio.
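
For anyone curious what the layout in 4) looks like in code, here is a
minimal sketch (HDF5 1.8 C API; the dataset name "/stack", the chunking by
single image planes, and the helper names are my assumptions rather than
exactly what I benchmarked):

#include "hdf5.h"

#define NX 64
#define NY 64

/* Create one 3D dataset [n_images][64][64] of 32-bit ints, chunked by
 * single image planes so each image stays contiguous on disk. */
static hid_t create_stack(hid_t file, hsize_t n_images)
{
    hsize_t dims[3]  = { n_images, NX, NY };
    hsize_t chunk[3] = { 1,        NX, NY };

    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);

    hid_t dset = H5Dcreate(file, "/stack", H5T_NATIVE_INT, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

/* Write image number `idx` into its plane of the stack. */
static void write_plane(hid_t dset, hsize_t idx, const int img[NX][NY])
{
    hsize_t start[3] = { idx, 0, 0 };
    hsize_t count[3] = { 1,   NX, NY };

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(3, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, img);

    H5Sclose(mspace);
    H5Sclose(fspace);
}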


 
Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA
=========================================================================
=========================================================================




-----Original Message-----
From: Francesc Altet [mailto:faltet at carabos.com]
Sent: Thu 1/17/2008 6:23 AM
To: hdf-forum at hdfgroup.org
Subject: Re: Opening datasets expensive?
 
On Thursday, 17 January 2008, Jim Robinson wrote:

> Hi,  I am using HDF5 as the backend for a genomics visualizer.  The
> data is organized by experiment, chromosome, and resolution scale.
> A typical file might have 300 or so experiments, 24 chromosomes, and
> 8 resolution scales.  My current design uses one dataset for each
> (experiment, chromosome, resolution scale) combination, or 57,600
> datasets in all.
>
> First question: is that too many datasets?  I could combine the
> experiment and chromosome dimensions, with a corresponding reduction
> in the number of datasets and an increase in each dataset's size.  It
> would complicate the application code but is doable.
>
> The application is a visualization tool and needs to access small
> portions of each dataset very quickly.  It is organized similarly to
> Google Maps: as the user zooms and pans, small slices of the datasets
> are accessed and rendered.  The number of datasets accessed at one
> time is equal to the number of experiments.  It is working fine with
> small numbers of experiments (< 20), but panning and zooming is
> noticeably sluggish with 300.  I did some profiling and discovered
> that about 70% of the time is spent just opening the datasets.  Is
> this to be expected?  Is it good practice to have a few large
> datasets rather than many smaller ones?

My experience says that it is definitely better to have fewer large
datasets than many smaller ones.

Even if, as Elena says, there is a bug in the metadata cache that the
THG people are fixing, if you want maximum speed when accessing parts of
your data, my guess is that reading different parts of one large dataset
will always be faster than opening and reading many different datasets.
This is because each dataset has its own metadata that has to be
retrieved from disk, while to access part of a large dataset you only
have to read the portion of the B-tree needed to reach it (if it is not
yet in memory) plus the data itself, which is pretty fast.

My 2 cents,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"






Opening datasets expensive?

Dimitris Servis
Hi all,

I've only tested with large datasets, using:
1) raw files
2) opaque datatypes (serialize with another library and save the data as an
opaque type)
3) native HDF5 structures with variable-length arrays
4) in-memory buffered native HDF5 structures
5) breakdown of structures into HDF5 native arrays

These are my results for writing files ranging in total from 1GB to 100GB:

Opaque-datatype writing is always faster than the raw file, but that does
not account for the time spent serializing.
Writing HDF5 native structures, especially with variable-length arrays, is
always slower, by up to a factor of 2.
Rearranging the data into fixed-size arrays is usually about 20% slower.

On NFS things seem to be different, with HDF5 outperforming the raw file in
most cases.

Reading a dataset for the first time is usually faster with the raw file,
but faster with HDF5 after that!

Overwriting a dataset is always significantly faster with HDF5.

In all cases, writing opaque datatypes is the fastest.
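
To show what I mean by option 2), here is a minimal sketch (HDF5 1.8 C API;
the tag string and function name are just placeholders) of storing a buffer
that was already serialized by another library as a single opaque-typed
dataset:

#include "hdf5.h"

/* Store an externally serialized blob as one dataset of 1-byte opaque
 * elements, tagged so readers know which serializer produced it. */
static void write_opaque_blob(hid_t file, const char *name,
                              const void *blob, hsize_t nbytes)
{
    hid_t dtype = H5Tcreate(H5T_OPAQUE, 1);
    H5Tset_tag(dtype, "application/x-my-serializer");   /* placeholder tag */

    hid_t space = H5Screate_simple(1, &nbytes, NULL);
    hid_t dset  = H5Dcreate(file, name, dtype, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, dtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, blob);

    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(dtype);
}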

HTH

-- dimitris

2008/1/18, Dougherty, Matthew T. <matthewd at bcm.tmc.edu>:

>
>  Hi Dimitris,
>
> Looks like you did the same thing I did.
> Francesc pointed it out to me a few minutes ago; below is his email to me.
>
>
>
> ========
>
> Matthew,
>
> Interesting experiment.  BTW, you have only sent it to me, which is
> great, but are you sure you don't want to share this with the rest
> of the HDF list? ;)
>
> Cheers, Francesc
>
>
> Matthew Dougherty
> 713-433-3849
> National Center for Macromolecular Imaging
> Baylor College of Medicine/Houston Texas USA
> =========================================================================
> =========================================================================
>
>
>
>
> -----Original Message-----
> From: Dimitris Servis [mailto:servisster at gmail.com]
> Sent: Fri 1/18/2008 4:38 AM
> To: Dougherty, Matthew T.
> Subject: Re: Opening datasets expensive?
>
> Hi all,
>
> I've only tested with large datasets, using:
> 1) raw files
> 2) opaque datatypes (serialize with another library and save the data as
> an opaque type)
> 3) native HDF5 structures with variable-length arrays
> 4) in-memory buffered native HDF5 structures
> 5) breakdown of structures into HDF5 native arrays
>
> These are my results for writing files ranging in total from 1GB to 100GB:
>
> Opaque-datatype writing is always faster than the raw file, but that does
> not account for the time spent serializing.
> Writing HDF5 native structures, especially with variable-length arrays, is
> always slower, by up to a factor of 2.
> Rearranging the data into fixed-size arrays is usually about 20% slower.
>
> On NFS things seem to be different, with HDF5 outperforming the raw file
> in most cases.
>
> Reading a dataset for the first time is usually faster with the raw file,
> but faster with HDF5 after that!
>
> Overwriting a dataset is always significantly faster with HDF5.
>
> In all cases, writing opaque datatypes is the fastest.
>
> HTH
>
> -- dimitris
>
>
> 2008/1/18, Dougherty, Matthew T. <matthewd at bcm.tmc.edu>:
> >
> > My preliminary performance analysis on writing 2D 64x64 32-bit pixel
> > images:
> > 1) stdio and HDF are comparable up to 10k images.
> > 2) HDF performance diverges badly and non-linearly after 10k images;
> > stdio stays linear.
> > 3) By 200k images, HDF is 20x slower.
> > 4) If you stack the 2D images as a single 3D 200k x 64 x 64 32-bit pixel
> > stack, HDF is 20-40% faster than stdio.
> >
> >
> >
> > Matthew Dougherty
> > 713-433-3849
> > National Center for Macromolecular Imaging
> > Baylor College of Medicine/Houston Texas USA
> > =========================================================================
> > =========================================================================
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Francesc Altet [mailto:faltet at carabos.com]
> > Sent: Thu 1/17/2008 6:23 AM
> > To: hdf-forum at hdfgroup.org
> > Subject: Re: Opening datasets expensive?
> >
> > On Thursday, 17 January 2008, Jim Robinson wrote:
> > > Hi,  I am using HDF5 as the backend for a genomics visualizer.  The
> > > data is organized by experiment, chromosome, and resolution scale.
> > > A typical file might have 300 or so experiments, 24 chromosomes, and
> > > 8 resolution scales.  My current design uses one dataset for each
> > > (experiment, chromosome, resolution scale) combination, or 57,600
> > > datasets in all.
> > >
> > > First question: is that too many datasets?  I could combine the
> > > experiment and chromosome dimensions, with a corresponding reduction
> > > in the number of datasets and an increase in each dataset's size.  It
> > > would complicate the application code but is doable.
> > >
> > > The application is a visualization tool and needs to access small
> > > portions of each dataset very quickly.  It is organized similarly to
> > > Google Maps: as the user zooms and pans, small slices of the datasets
> > > are accessed and rendered.  The number of datasets accessed at one
> > > time is equal to the number of experiments.  It is working fine with
> > > small numbers of experiments (< 20), but panning and zooming is
> > > noticeably sluggish with 300.  I did some profiling and discovered
> > > that about 70% of the time is spent just opening the datasets.  Is
> > > this to be expected?  Is it good practice to have a few large
> > > datasets rather than many smaller ones?
> >
> > My experience says that it is definitely better to have fewer large
> > datasets than many smaller ones.
> >
> > Even if, as Elena says, there is a bug in the metadata cache that the
> > THG people are fixing, if you want maximum speed when accessing parts of
> > your data, my guess is that reading different parts of one large dataset
> > will always be faster than opening and reading many different datasets.
> > This is because each dataset has its own metadata that has to be
> > retrieved from disk, while to access part of a large dataset you only
> > have to read the portion of the B-tree needed to reach it (if it is not
> > yet in memory) plus the data itself, which is pretty fast.
> >
> > My 2 cents,
> >
> > --
> > >0,0<   Francesc Altet     http://www.carabos.com/
> > V   V   Cárabos Coop. V.   Enjoy Data
> >  "-"
> >
> >
> >
> >
> >
>
>
> --
> What is the difference between mechanical engineers and civil engineers?
> Mechanical engineers build weapons; civil engineers build targets.
>
>


--
What is the difference between mechanical engineers and civil engineers?
Mechanical engineers build weapons; civil engineers build targets.


Opening datasets expensive?

Quincey Koziol
In reply to this post by Dougherty, Matthew T.

On Jan 18, 2008, at 3:29 AM, Dougherty, Matthew T. wrote:

> My preliminary performance analysis on writing 2D 64x64 32-bit pixel
> images:
> 1) stdio and HDF are comparable up to 10k images.
> 2) HDF performance diverges badly and non-linearly after 10k images;
> stdio stays linear.
> 3) By 200k images, HDF is 20x slower.
>

        I'm pretty confident that our next set of performance improvements  
will remedy this slowdown at the larger scales.  We'll see in a week  
or so... :-)

        Quincey

> 4) If you stack the 2D images as a single 3D 200k x 64 x 64 32-bit pixel
> stack, HDF is 20-40% faster than stdio.
>
>
>
>
> Matthew Dougherty
> 713-433-3849
> National Center for Macromolecular Imaging
> Baylor College of Medicine/Houston Texas USA
> =========================================================================
> =========================================================================
>
>
>
>
> -----Original Message-----
> From: Francesc Altet [mailto:faltet at carabos.com]
> Sent: Thu 1/17/2008 6:23 AM
> To: hdf-forum at hdfgroup.org
> Subject: Re: Opening datasets expensive?
>
> On Thursday, 17 January 2008, Jim Robinson wrote:
> > Hi,  I am using HDF5 as the backend for a genomics visualizer.  The
> > data is organized by experiment, chromosome, and resolution scale.
> > A typical file might have 300 or so experiments, 24 chromosomes, and
> > 8 resolution scales.  My current design uses one dataset for each
> > (experiment, chromosome, resolution scale) combination, or 57,600
> > datasets in all.
> >
> > First question: is that too many datasets?  I could combine the
> > experiment and chromosome dimensions, with a corresponding reduction
> > in the number of datasets and an increase in each dataset's size.  It
> > would complicate the application code but is doable.
> >
> > The application is a visualization tool and needs to access small
> > portions of each dataset very quickly.  It is organized similarly to
> > Google Maps: as the user zooms and pans, small slices of the datasets
> > are accessed and rendered.  The number of datasets accessed at one
> > time is equal to the number of experiments.  It is working fine with
> > small numbers of experiments (< 20), but panning and zooming is
> > noticeably sluggish with 300.  I did some profiling and discovered
> > that about 70% of the time is spent just opening the datasets.  Is
> > this to be expected?  Is it good practice to have a few large
> > datasets rather than many smaller ones?
>
> My experience says that it is definitely better to have fewer large
> datasets than many smaller ones.
>
> Even if, as Elena says, there is a bug in the metadata cache that the
> THG people are fixing, if you want maximum speed when accessing parts of
> your data, my guess is that reading different parts of one large dataset
> will always be faster than opening and reading many different datasets.
> This is because each dataset has its own metadata that has to be
> retrieved from disk, while to access part of a large dataset you only
> have to read the portion of the B-tree needed to reach it (if it is not
> yet in memory) plus the data itself, which is pretty fast.
>
> My 2 cents,
>
> --
> >0,0<   Francesc Altet     http://www.carabos.com/
> V   V   Cárabos Coop. V.   Enjoy Data
>  "-"
>
>
>
>


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe at hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe at hdfgroup.org.



