Writing to a dataset with 'wrong' chunksize

Francesc Altet
Hi,

Some time ago, a PyTables user complained that the following simple
operation was hogging gigantic amounts of memory:

import tables, numpy
N = 600
f = tables.openFile('foo.h5', 'w')
f.createCArray(f.root, 'huge_array',
               tables.Float64Atom(),
               shape = (2,2,N,N,50,50))
for i in xrange(50):
    for j in xrange(50):
        f.root.huge_array[:,:,:,:,j,i] = \
            numpy.array([[1,0],[0,1]])[:,:,None,None]

and I think that the problem could be on the HDF5 side.

The point is that, for the 6-dimensional 'huge_array' dataset, PyTables
computed an 'optimal' chunkshape of (1, 1, 1, 6, 50, 50).  Then, the user
wanted to update the array starting from the trailing dimensions (instead
of using the leading ones, which is the recommended practice for C-ordered
arrays).  This results in PyTables asking HDF5 to do the update using the
traditional procedure:

 /* Create a simple memory data space */
 if ( (mem_space_id = H5Screate_simple( rank, count, NULL )) < 0 )
   return -3;

 /* Get the file data space */
 if ( (space_id = H5Dget_space( dataset_id )) < 0 )
   return -4;

 /* Define a hyperslab in the dataset */
 if ( rank != 0 && H5Sselect_hyperslab( space_id, H5S_SELECT_SET, start,
                                        step, count, NULL) < 0 )
   return -5;

 if ( H5Dwrite( dataset_id, type_id, mem_space_id, space_id,
                H5P_DEFAULT, data ) < 0 )
   return -6;

While I understand that this approach is suboptimal (2*2*600*100 = 240,000
chunks have to be updated for each update operation in the loop), I don't
completely understand why the user reports that the script is consuming so
much memory (the script crashes, but perhaps it is asking for several GB).
My guess is that HDF5 is perhaps trying to load all the affected chunks
into memory before updating them, but I thought it best to report this
here in case this is a bug or, if not, in case the huge memory demand can
be somewhat alleviated.
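
Just to illustrate the difference, here is a minimal C sketch (hypothetical
helper and identifiers; it assumes the 6-D dataset above is already open,
that 's' is a multiple of 6, and that 'buf' holds 1*1*1*6*50*50 doubles
prepared by the caller) of an update that is aligned with the
(1, 1, 1, 6, 50, 50) chunkshape, so that each H5Dwrite touches exactly one
chunk:

static herr_t write_one_chunk(hid_t dataset_id, hid_t type_id,
                              hsize_t p, hsize_t q, hsize_t r, hsize_t s,
                              const double *buf)
{
    hsize_t start[6] = { p, q, r, s, 0, 0 };
    hsize_t count[6] = { 1, 1, 1, 6, 50, 50 };   /* matches the chunkshape */
    hid_t   mem_space_id, space_id;
    herr_t  status;

    /* Memory dataspace describing the in-core buffer */
    if ( (mem_space_id = H5Screate_simple(6, count, NULL)) < 0 )
        return -1;

    /* File dataspace restricted to a single chunk-aligned hyperslab */
    if ( (space_id = H5Dget_space(dataset_id)) < 0 )
        return -2;
    if ( H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start,
                             NULL, count, NULL) < 0 )
        return -3;

    /* Only the one chunk covered by start/count is touched here */
    status = H5Dwrite(dataset_id, type_id, mem_space_id, space_id,
                      H5P_DEFAULT, buf);

    H5Sclose(space_id);
    H5Sclose(mem_space_id);
    return status;
}

Each pass of the original loop would then issue one such call per chunk it
actually modifies, instead of one call that crosses 240,000 chunks.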

In case you need more information, you may find it by following the
details of the discussion in this thread:

http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg00722.html

Thanks!

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


Writing to a dataset with 'wrong' chunksize

Quincey Koziol
Hi Francesc,

On Nov 23, 2007, at 2:06 PM, Francesc Altet wrote:

> [snip]
>
> While I understand that this approach is suboptimal (2*2*600*100 =
> 240,000 chunks have to be updated for each update operation in the
> loop), I don't completely understand why the user reports that the
> script is consuming so much memory (the script crashes, but perhaps it
> is asking for several GB).  My guess is that HDF5 is perhaps trying to
> load all the affected chunks into memory before updating them, but I
> thought it best to report this here in case this is a bug or, if not,
> in case the huge memory demand can be somewhat alleviated.

        Is this with the 1.6.x library code?  If so, it would be worthwhile
checking with the 1.8.0 code, which is designed to do all the I/O on each
chunk at once and then proceed to the next chunk.  However, it does build
information about the selection on each chunk to update, and if the I/O
operation will update 240,000 chunks, that could be a large amount of
memory...
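
        As a back-of-the-envelope check (plain arithmetic, not an HDF5 API
call, and assuming no selected block straddles a chunk boundary), the
number of chunks that trailing-dimension slice crosses can be computed
from the selection extents and the chunkshape:

#include <stdio.h>

/* Chunks touched by a contiguous block selection: the product over all
 * dimensions of ceil(count[d] / chunk[d]). */
int main(void)
{
    const unsigned long long count[6] = { 2, 2, 600, 600, 1, 1 };  /* [:,:,:,:,j,i] */
    const unsigned long long chunk[6] = { 1, 1, 1,   6,  50, 50 };
    unsigned long long nchunks = 1;
    int d;

    for (d = 0; d < 6; d++)
        nchunks *= (count[d] + chunk[d] - 1) / chunk[d];   /* ceiling division */

    /* 2 * 2 * 600 * 100 * 1 * 1 = 240000 */
    printf("chunks touched per write: %llu\n", nchunks);
    return 0;
}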

        Quincey


Writing to a dataset with 'wrong' chunksize

Dominik Szczerba
So what should I choose to have the same default behavior as saying "gzip
file.dat" on the command line?  I am looking for a reasonable default for
my small library that a user can fine-tune later, etc.
thanks,
Dominik
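
A minimal sketch of one reasonable default, assuming the HDF5 C API
(identifiers here are illustrative only): enable the deflate (zlib) filter
at level 6, which is the same compression level the gzip command line tool
uses by default, and remember that the filter only applies to chunked
datasets:

#include <hdf5.h>

/* Sketch: create a chunked, deflate-compressed 1-D dataset of 'nelem'
 * doubles.  'file_id' is an already-open file; the dataset name and the
 * chunk size are illustrative and should be tuned for the real access
 * pattern. */
static hid_t create_gzip_dataset(hid_t file_id, const char *name, hsize_t nelem)
{
    hsize_t dims[1]  = { nelem };
    hsize_t chunk[1] = { nelem < 4096 ? nelem : 4096 };
    hid_t   space_id, dcpl_id, dset_id;

    if ( (space_id = H5Screate_simple(1, dims, NULL)) < 0 )
        return -1;

    dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 1, chunk);   /* the deflate filter requires chunking */
    H5Pset_deflate(dcpl_id, 6);        /* zlib level 6, same as gzip's default */

    /* 1.6-style call; with the 1.8 API this would be H5Dcreate2 with two
     * extra property-list arguments */
    dset_id = H5Dcreate(file_id, name, H5T_NATIVE_DOUBLE, space_id, dcpl_id);

    H5Pclose(dcpl_id);
    H5Sclose(space_id);
    return dset_id;
}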

On Tuesday 27 November 2007 13:39:51 Quincey Koziol wrote:

> [snip]



--
Dominik Szczerba, Ph.D.
Computer Vision Lab CH-8092 Zurich
http://www.vision.ee.ethz.ch/~domi


Writing to a dataset with 'wrong' chunksize

Francesc Altet
Hi Quincey,

On Tuesday 27 November 2007, Quincey Koziol wrote:
[snip]

>
> Is this with the 1.6.x library code?  If so, it would be worthwhile
> checking with the 1.8.0 code, which is designed to do all the I/O on
> each chunk at once and then proceed to the next chunk.  However, it
> does build information about the selection on each chunk to update
> and if the I/O operation will update 240,000 chunks, that could be a
> large amount of memory...

Yes, this was using the 1.6.x library.  I've directed the user to compile
PyTables against the latest 1.8.0 (beta 5) library (with
the "--with-default-api-version=v16" flag), but he is reporting
problems.  Here is the relevant excerpt of the traceback:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210186064 (LWP 12304)]
0xb7b578b1 in H5S_close (ds=0xbfb11178) at H5S.c:464
464         H5S_SELECT_RELEASE(ds);
(gdb) bt
#0  0xb7b578b1 in H5S_close (ds=0xbfb11178) at H5S.c:464
#1  0xb7a0ab4e in H5D_destroy_chunk_map (fm=0xbfb0fff8) at H5Dio.c:2651
#2  0xb7a0b04c in H5D_create_chunk_map (fm=0xbfb0fff8,
    io_info=<value optimized out>, nelmts=1440000, file_space=0x84bd140,
    mem_space=0x84b40f0, mem_type=0x8363000) at H5Dio.c:2556
#3  0xb7a0cd1a in H5D_chunk_write (io_info=0xbfb13c24, nelmts=1440000,
    mem_type=0x8363000, mem_space=0x84b40f0, file_space=0x84bd140,
    tpath=0x8363e30, src_id=50331970, dst_id=50331966, buf=0xb57b8008)
    at H5Dio.c:1765
#4  0xb7a106f9 in H5D_write (dataset=0x840a418, mem_type_id=50331970,
    mem_space=0x84b40f0, file_space=0x84bd140, dxpl_id=167772168,
    buf=0xb57b8008) at H5Dio.c:732
#5  0xb7a117aa in H5Dwrite (dset_id=83886080, mem_type_id=50331970,
    mem_space_id=67108874, file_space_id=67108875, plist_id=167772168,
    buf=0xb57b8008) at H5Dio.c:434

We don't have time right now to look into it, but it could be a problem
with the PyTables code (although, if the "--with-default-api-version=v16"
flag is working properly, this should not be the case).  It is strange,
because PyTables used to work perfectly up to HDF5 1.8.0 beta 3 (i.e.,
all tests passed).
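
For reference, a sketch of the API difference that the compatibility flag
papers over, with illustrative identifiers only ('file_id' and 'space_id'
are assumed to exist elsewhere):

/* The same dataset created through the 1.6-style API (what
 * "--with-default-api-version=v16" maps the unversioned names to) and
 * through the versioned 1.8 API. */
static void create_both_ways(hid_t file_id, hid_t space_id)
{
    /* 1.6-style call: the dataset creation property list is the last argument */
    hid_t d1 = H5Dcreate1(file_id, "array_v16", H5T_NATIVE_DOUBLE,
                          space_id, H5P_DEFAULT);

    /* 1.8-style call: link-creation, dataset-creation and dataset-access
     * property lists are passed explicitly */
    hid_t d2 = H5Dcreate2(file_id, "array_v18", H5T_NATIVE_DOUBLE, space_id,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dclose(d1);
    H5Dclose(d2);
}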

If we make more progress on this issue, I'll let you know.

Thanks!

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


Writing to a dataset with 'wrong' chunksize

Quincey Koziol
Hi Francesc,

On Nov 28, 2007, at 10:44 AM, Francesc Altet wrote:

> [snip]
>
> We don't have time right now to look into it, but it could be a problem
> with the PyTables code (although, if the "--with-default-api-version=v16"
> flag is working properly, this should not be the case).  It is strange,
> because PyTables used to work perfectly up to HDF5 1.8.0 beta 3 (i.e.,
> all tests passed).

        Hmm, I have been working on that section of code a lot; it's
certainly possible that I've introduced a bug. :-/

> If we make more progress on this issue, I'll let you know.

        If you can characterize it in a standalone program, that would be  
really great!

        Thanks,
                Quincey


Writing to a dataset with 'wrong' chunksize

Francesc Altet
Quincey,

On Thursday 29 November 2007, Quincey Koziol wrote:
> Hmm, I have been working on that section of code a lot; it's
> certainly possible that I've introduced a bug. :-/
>
> If you can characterize it in a standalone program, that would be
> really great!

I've done this in the attached program.  It works as it is, but set N to
600 and you will get the segfault using 1.8.0 beta5 (sorry, I'm in a
hurry and don't have time to check other HDF5 versions).
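
The attachment itself did not survive in the archive (see the note below).
Purely as an illustration, and not the original write-bug.c, a standalone
program exercising the same pattern (a 6-D chunked dataset updated through
a hyperslab that fixes the two trailing indices) might look roughly like
this:

#include <stdlib.h>
#include <hdf5.h>

/* Illustrative sketch only, not the scrubbed write-bug.c attachment.
 * With N = 600 the selection below crosses 2*2*600*100 = 240000 chunks. */
#define N 6   /* set to 600 to reproduce the reported problem */

int main(void)
{
    hsize_t dims[6]  = { 2, 2, N, N, 50, 50 };
    hsize_t chunk[6] = { 1, 1, 1, 6, 50, 50 };
    hsize_t start[6] = { 0, 0, 0, 0, 0, 0 };      /* j = i = 0 */
    hsize_t count[6] = { 2, 2, N, N, 1, 1 };
    hsize_t nelem    = 2 * 2 * (hsize_t)N * N;
    hid_t   file_id, space_id, fspace_id, mem_space_id, dcpl_id, dset_id;
    double *buf;
    hsize_t i;

    if ( (buf = malloc(nelem * sizeof(double))) == NULL )
        return 1;
    for (i = 0; i < nelem; i++)
        buf[i] = 1.0;

    file_id  = H5Fcreate("foo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    space_id = H5Screate_simple(6, dims, NULL);
    dcpl_id  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 6, chunk);
    /* 1.6-style create call (the v16 compatibility API discussed above) */
    dset_id  = H5Dcreate(file_id, "huge_array", H5T_NATIVE_DOUBLE,
                         space_id, dcpl_id);

    /* One update of huge_array[:,:,:,:,j,i] */
    mem_space_id = H5Screate_simple(6, count, NULL);
    fspace_id    = H5Dget_space(dset_id);
    H5Sselect_hyperslab(fspace_id, H5S_SELECT_SET, start, NULL, count, NULL);
    H5Dwrite(dset_id, H5T_NATIVE_DOUBLE, mem_space_id, fspace_id,
             H5P_DEFAULT, buf);

    H5Sclose(fspace_id);
    H5Sclose(mem_space_id);
    H5Dclose(dset_id);
    H5Pclose(dcpl_id);
    H5Sclose(space_id);
    H5Fclose(file_id);
    free(buf);
    return 0;
}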

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: write-bug.c
Type: text/x-csrc
Size: 2362 bytes
Desc: not available
URL: <http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20071201/5a20ebd0/attachment.bin>