[hdf-forum] Transposing large array


[hdf-forum] Transposing large array

Douglas Eck
Hi all,
I have an hdf file containing this array :

 DATASET "bfeat_mcmc_array" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 360, 6109666 ) / ( 360, H5S_UNLIMITED ) }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
               STRSIZE 7;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
         DATASPACE  SCALAR
      }
   }

I need to access it both row-wise and column-wise.  I would like to store a
transposed version (size 6109666 x 360) to make it easier to read out a
single vector of size 6109666 x 1.

What's the best way to do this?  I'd love to see some magic utility such as:

hdtranspose --dataset bfeat_mcmc_array in.h5 out.h5

:-)

I'm using pytables but could also do a quick .c program if necessary.
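
(For what it's worth, a brute-force version of such a utility can be sketched with the PyTables 2.x API used in this thread.  This is only an illustrative sketch, not an existing tool; the function name and block size are made up, and as the replies below suggest, re-chunking a copy may serve better than an actual transpose:)

import numpy as np
import tables

def transpose_dataset(in_fn, out_fn, name, blocksize=10000):
    """Copy a (nrows, ncols) float64 dataset into out_fn as (ncols, nrows),
    reading one block of source columns at a time."""
    h5in = tables.openFile(in_fn, 'r')
    h5out = tables.openFile(out_fn, 'w')
    src = h5in.getNode(h5in.root, name)
    nrows, ncols = src.shape
    dst = h5out.createEArray(h5out.root, name, tables.Float64Atom(),
                             shape=(0, nrows), expectedrows=ncols)
    for start in xrange(0, ncols, blocksize):
        stop = min(start + blocksize, ncols)
        block = src[:, start:stop]                 # (nrows, stop-start) slab
        dst.append(np.ascontiguousarray(block.T))  # append as rows of the output
    h5in.close()
    h5out.close()

# transpose_dataset('in.h5', 'out.h5', 'bfeat_mcmc_array')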

Thanks,
Doug Eck


Dr. Douglas Eck, Associate Professor
Université de Montréal, Department of Computer Science / BRAMS
CP 6128, Succ. Centre-Ville, Montréal, Québec H3C 3J7  CANADA
Office: 3253 Pavillon André-Aisenstadt
Phone: 1-514-343-6111 ext 3520  Fax: 1-514-343-5834
http://www.iro.umontreal.ca/~eckdoug
Research Areas: Machine Learning and Music Cognition


[hdf-forum] Transposing large array

George N. White III
On Tue, Nov 4, 2008 at 4:25 PM, Douglas Eck <eckdoug at iro.umontreal.ca> wrote:

> [original question and dataset dump snipped]
> I'm using pytables but could also do a quick .c program if necessary.

Do you mean quick as in runs fast, because you have a bunch of those arrays,
or quick as in easy to code, because you have one array and want to spend
less time coding than it will take to run the program for that one array?

There has been a lot of work on implementing transposes for arrays larger
than the working set (on parallel hardware).  IDL and Matlab use such
approaches, but I'm not sure how widespread they are in free tools.  For
specific architectures you can find very low-level libraries (Intel TBB, IBM
HTA) that can be used to (slowly) build quick-running programs.

--
George N. White III <aa056 at chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia






[hdf-forum] Transposing large array

Francesc Alted
In reply to this post by Douglas Eck
Hi Douglas,

On Tuesday, 04 November 2008, Douglas Eck wrote:

> [original dataset dump snipped]
>
> I need to access it both row-wise and column-wise.  I would like to
> store a transposed version (size 6109666 x 360) to make it easier to
> read out a single vector of size 6109666 x 1
>
> What's the best way to do this?

I don't think you need to transpose the dataset: why not just copy it with a sensible chunkshape in the destination?  As you are using PyTables, in the forthcoming 2.1 version [1] the .copy() method of leaves will support the 'chunkshape' argument, so this can be done quite easily:

# 'a' is the original 'dim1xdim2' dataset
newchunkshape = (1, a.chunkshape[0]*a.chunkshape[1])
b = a.copy(f.root, "b", chunkshape=newchunkshape)
# 'b' contains the dataset with an optimized chunkshape for reading rows
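
(As a self-contained sketch of the same idea -- the output node name here is a placeholder, and PyTables >= 2.1 is assumed for the chunkshape argument of copy():)

import tables

f = tables.openFile('bfeat_mcmc.h5', 'a')             # open the file for appending
a = f.getNode(f.root, 'bfeat_mcmc_array')             # the original (360, 6109666) EArray
newchunkshape = (1, a.chunkshape[0] * a.chunkshape[1])
b = a.copy(f.root, 'bfeat_mcmc_rowwise', chunkshape=newchunkshape)
f.close()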

And that's all.  As I was curious about the improvement from this approach, I created a small benchmark (attached); here are the results for your dataset:

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 279.564 sec (60.0 MB/s)
Time to read ten rows in original array: 945.315 sec (0.5 MB/s)
================================
Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 611.177 sec (27.5 MB/s)
Time to read with a row-wise chunkshape: 33.877 sec (13.8 MB/s)
================================
Speed-up with a row-wise chunkshape: 27.9

Hmm, it seems I'm not getting the most out of my disk here (14 MB/s is too low).  Perhaps this is the effect of the large HDF5 hash table needed to access the actual data.  To confirm this, I chose a chunkshape 10x larger, up to 1.2 MB (which makes the HDF5 hash table smaller).  Here is the new result:

<snip>
Chunkshape for row-wise chunkshape array: (1, 162000)
Time to copy the original array: 379.388 sec (44.2 MB/s)
Time to read with a row-wise chunkshape: 8.469 sec (55.0 MB/s)
================================
Speed-up with a row-wise chunkshape: 111.6

OK, now I'm getting decent performance for the new dataset.  It is also worth noting that the copy speed has been accelerated by 60% (I'd say it is pretty much optimal now).  I've also tried out bigger chunksizes, but the performance drops quite a lot.  Definitely, a chunkshape of (1, 162000), allowing for more than a 100x speed-up over the original setting, seems good enough for this case.

Incidentally, you could always do the copy manually:

# 'a' is the original 'dim1xdim2' dataset
b = f.createEArray(f.root, "b", tables.Float64Atom(),
                   shape = (0, dim2), chunkshape=(1, 162000))
for i in xrange(dim1):
    b.append([a[i]])
# 'b' contains the dataset with an optimized chunkshape for reading rows

but this method is much more expensive (perhaps more than 10x) than using the .copy() method (this is because the I/O is optimized during copy operations).  However, I'd say that the read throughput would be similar.

> I'd love to see some magic utility such as
>
>
> hdtranspose --dataset bfeat_mcmc_array in.h5 out.h5

In forthcoming PyTables 2.1 [1] you will be able to do:

$ ptrepack  /tmp/test.h5:/a /tmp/test2.h5:/a
$ ptrepack --chunkshape='(1, 162000)' /tmp/test.h5:/a /tmp/test2.h5:/b

and then use the test2.h5 for your purposes.

[1] http://www.pytables.org/download/preliminary/

Hope that helps,

--
Francesc Alted
Attachment: bench-chunksize.py (application/x-python, 1643 bytes)
<http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20081105/25fbfbaa/attachment.bin>


[hdf-forum] Transposing large array

Francesc Alted
On Wednesday, 05 November 2008, Francesc Alted wrote:
[snip]

> [previous benchmark results snipped; the best chunkshape so far was (1, 162000)]

I was hooked on this and tried yet another chunkshape, (1, 324000), which
doubles the size of the previous one.  Here are the results:

================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 275.9 sec (60.8 MB/s)
Time to read ten rows in original array: 945.315 sec (0.5 MB/s)
================================
Chunkshape for row-wise chunkshape array: (1, 324000)
Time to copy the original array: 284.88 sec (58.9 MB/s)
Time to read with a row-wise chunkshape: 3.508 sec (132.9 MB/s)
================================
Speed-up with a row-wise chunkshape: 269.5

So, by doubling the chunk size you get a further improvement of almost 3x in
read speed.  Also, the copy is very efficient now (and very close to the
speed of creating the dataset anew, which is a bit counter-intuitive :-/).

Most probably you could find a better figure by playing with other
values.  This is to say that, although PyTables provides its own
guesses for chunkshapes (based on the estimated sizes of datasets), in
general there is no replacement for running your own experiments in
order to determine the chunkshape that works best for you.
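
(A trial-and-error loop along these lines could look roughly like the sketch below; this is not the attached bench-chunksize.py, and the file and node names are placeholders:)

import time
import tables

def time_row_read(fn, node, chunkshape):
    """Copy `node` with the given chunkshape and time reading one full row."""
    f = tables.openFile(fn, 'a')
    a = f.getNode(f.root, node)
    b = a.copy(f.root, 'trial', chunkshape=chunkshape)
    t0 = time.time()
    row = b[0, :]                     # read a complete row through the new chunking
    elapsed = time.time() - t0
    f.removeNode(f.root, 'trial')     # discard the trial copy
    f.close()
    return elapsed

for cs in [(1, 16200), (1, 162000), (1, 324000)]:
    print cs, time_row_read('test.h5', 'a', cs)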

Cheers,

--
Francesc Alted






[hdf-forum] Transposing large array

Douglas Eck
In reply to this post by Francesc Alted
Hi Francesc, hi forum,

First, thank you very much for your help.  I am now beginning to understand
chunking, thanks to you!

I am trying your recommendations and running into some problems.  I am trying
to copy bfeat_mcmc.h5 to bfeat_mcmc_ptrepack.h5 using your recommended
approach.  The file bfeat_mcmc.h5 contains a table bfeat_mcmc_table and an
array bfeat_mcmc_array.  The table is small and is not an issue here.  The
array looks like this:

Existing array bfeat_mcmc_array is type <class 'tables.earray.EArray'> shape (360, 6109666) chunkshape (360, 2)

I don't know why the chunkshape is (360, 2), but I now understand that it is
not a good chunkshape.  The attached file bfeat_mcmc.txt contains the output
from h5dump -H bfeat_mcmc.h5.

I installed v2.1rc1
629 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>ipython
In [1]: import tables
In [2]: tables.__version__
Out[2]: '2.1rc1'

When I run ptrepack I get this error:

638 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>ptrepack --overwrite-nodes --chunkshape='(1,324000)' bfeat_mcmc.h5:/bfeat_mcmc_array bfeat_mcmc_ptrepack.h5:/bfeat_mcmc_array
Problems doing the copy from 'bfeat_mcmc.h5:/bfeat_mcmc_array' to 'bfeat_mcmc_ptrepack.h5:/bfeat_mcmc_array'
The error was --> <type 'exceptions.TypeError'>: _g_copyWithStats() got an unexpected keyword argument 'propindexes'
The destination file looks like:
bfeat_mcmc_ptrepack.h5 (File) ''
Last modif.: 'Thu Nov  6 10:05:57 2008'
Object Tree:
/ (RootGroup) ''

Traceback (most recent call last):
  File "/u/eckdoug/share/bin/ptrepack", line 3, in <module>
    main()
  File "/u/eckdoug/share/lib64/python2.5/site-packages/tables/scripts/ptrepack.py", line 483, in main
    upgradeflavors=upgradeflavors)
  File "/u/eckdoug/share/lib64/python2.5/site-packages/tables/scripts/ptrepack.py", line 140, in copyLeaf
    raise RuntimeError, "Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired."
RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired.

When I run the following code in Python, I let it run for ten hours and it
still hadn't written even 0.1% of the new array.  I checked memory usage; it
wasn't a problem.  So I suppose I was doing something wrong:

import tables

def rechunk_cache_file_h5(h5_in_fn, h5_out_fn, feattype, chunkshape=(1, 324000)):
    """Rechunks an existing file to enable fast column rather than row lookups."""
    print 'Working with tables', tables.__file__, 'version', tables.__version__
    h5in = tables.openFile(h5_in_fn, 'r')
    array_name = '%s_array' % feattype
    table_name = '%s_table' % feattype
    arr = h5in.getNode(h5in.root, array_name)
    tbl = h5in.getNode(h5in.root, table_name)
    print 'Opening', h5_out_fn
    h5out = tables.openFile(h5_out_fn, 'w')
    print 'Copying table', table_name
    newtbl = tbl.copy(h5out.root, table_name)
    print 'Existing array', array_name, 'is type', type(arr), 'shape', arr.shape, 'chunkshape', arr.chunkshape
    print 'Copying to', h5_out_fn, 'with chunkshape', chunkshape
    newarr = arr.copy(h5out.root, array_name, chunkshape=chunkshape)
    h5in.close()
    h5out.close()


I also tried recreating the original bfeat_mcmc.h5 with an appropriate
chunkshape.  This was also slow, though I haven't checked closely how long it
will take.  Here's the code I used for that.  This code is called in a loop;
the h5 file is opened outside of the loop and the file handle is passed in as
"h5".  Each call of this function should write a matrix of size
[fcount, fdim=360] to the array '%s_array' % feattype and also update a
table which stores indexes.
def write_cache_h5(h5, tid, dat, feattype='sfeat', chunkshape=(1, 324000), force=False):
    import tables
    (fcount, fdim) = shape(dat)
    array_name = '%s_array' % feattype
    table_name = '%s_table' % feattype
    if h5.root.__contains__(array_name):
        arr = h5.getNode(h5.root, array_name)
    else:
        filters = tables.Filters(complevel=1, complib='lzo')
        arr = h5.createEArray(h5.root, array_name, tables.FloatAtom(),
                              (fdim, 0), feattype, filters=filters, chunkshape=chunkshape)

    if h5.root.__contains__(table_name):
        tbl = h5.getNode(h5.root, table_name)
    else:
        tbl = h5.createTable('/', table_name, SFeat, table_name)

    idxs = tbl.getWhereList('tid == %i' % tid)
    if len(idxs) > 0:
        if force:
            # don't do anything here; we always look at the last record when
            # multiple are present for a tid
            pass
        else:
            # print 'TID', tid, 'is already present in', h5.filename, 'but force=False so not writing data'
            return
    rec = tbl.row
    rec['tid'] = tid
    rec['start'] = arr.shape[1]
    rec['fcount'] = fcount
    rec.append()
    for f in range(fcount):
        arr.append(dat[f, :].reshape([fdim, 1]))
    tbl.flush()
    arr.flush()

class SFeat(tables.IsDescription):
    tid = tables.Int32Col()     # track id
    start = tables.Int32Col()   # start idx
    fcount = tables.Int32Col()  # number of frames
    def __init__(self):
        self.tid.createIndex()


Thanks for any help!


On Wed, Nov 5, 2008 at 9:02 AM, Francesc Alted <faltet at pytables.com> wrote:

> [previous message quoted in full; snipped]
Attachment: bfeat_mcmc.txt
<http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/attachments/20081106/b98fb723/attachment.txt>


[hdf-forum] Transposing large array

Douglas Eck
OK, I have some comparison times.  If I create an array with chunkshape=(360, 100):

fdim = 360
arr = h5.createEArray(h5.root, array_name, tables.FloatAtom(), (fdim, 0), feattype, filters=filters, chunkshape=(360, 100))

0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
/u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead
  % (complib, default_complib), FiltersWarning )
File 1 of 100.  Chunkshape (360, 100) processed 2 of size (52, 360) in 0.0122499465942
File 11 of 100.  Chunkshape (360, 100) processed 12 of size (41, 360) in 0.0148320198059
File 21 of 100.  Chunkshape (360, 100) processed 22 of size (62, 360) in 0.0179829597473
File 31 of 100.  Chunkshape (360, 100) processed 32 of size (68, 360) in 0.016970872879
File 41 of 100.  Chunkshape (360, 100) processed 42 of size (38, 360) in 0.00971698760986
File 51 of 100.  Chunkshape (360, 100) processed 52 of size (59, 360) in 0.0128350257874
File 61 of 100.  Chunkshape (360, 100) processed 62 of size (45, 360) in 0.00909495353699
File 71 of 100.  Chunkshape (360, 100) processed 72 of size (40, 360) in 0.00970888137817
File 81 of 100.  Chunkshape (360, 100) processed 82 of size (38, 360) in 0.00800800323486
File 91 of 100.  Chunkshape (360, 100) processed 92 of size (33, 360) in 0.0119268894196
100 Processing /part/02/sans-bkp/sitm/data/features/1000/T101.mp3.h5
Breaking after 100
Total time 1.661028862

By contrast, if I create the file with chunkshape=(1, 324000) the times are
much worse:

653 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>cache_feat.py bfeat_mcmc testing4.h5
0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
/u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258: FiltersWarning: compression library ``lzo`` is not available; using ``zlib`` instead
  % (complib, default_complib), FiltersWarning )
File 1 of 100.  Chunkshape (1, 324000) processed 2 of size (52, 360) in 448.759572029

As a baseline, here is the output from bench-chunksize.py:

596 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>python ~/test/bench-chunksize.py
Using tables from /u/eckdoug/test/tables/__init__.pyc version 2.1rc1
================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 323.5 sec (51.9 MB/s)
Time to read ten rows in original array: 1580.027 sec (0.3 MB/s)
================================
Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 1253.463 sec (13.4 MB/s)
Time to read with a row-wise chunkshape: 15.353 sec (30.4 MB/s)
================================
Speed-up with a row-wise chunkshape: 102.9

I seem to be doing something wrong but I can't see what....

Thanks!
Doug Eck


[hdf-forum] Transposing large array

Francesc Alted
In reply to this post by Douglas Eck
On Thursday, 06 November 2008, Douglas Eck wrote:

> [greetings snipped]
> Existing array bfeat_mcmc_array is type <class 'tables.earray.EArray'>
> shape (360, 6109666) chunkshape (360, 2)
> I don't know why the chunkshape is (360, 2) but I now understand that
> this is not a good chunkshape.

Maybe you forgot to pass the 'expectedrows' parameter to inform PyTables
about the expected size of the EArray (this is critical for allowing a decent
selection of the chunkshape).  See bench-chunksize.py and you will see how
it is used.
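
(A minimal sketch of passing expectedrows at creation time -- the file name is a placeholder, and the printed chunkshape is whatever PyTables decides to pick:)

import tables

h5 = tables.openFile('bfeat_mcmc.h5', 'w')
# Declaring ~6.1 million entries along the enlargeable dimension lets PyTables
# pick a much larger default chunkshape than the (360, 2) seen above.
arr = h5.createEArray(h5.root, 'bfeat_mcmc_array', tables.Float64Atom(),
                      (360, 0), expectedrows=6109666)
print arr.chunkshape
h5.close()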

> I installed v2.1rc1.  When I run ptrepack I get this error:
>
> The error was --> <type 'exceptions.TypeError'>: _g_copyWithStats() got an
> unexpected keyword argument 'propindexes'
>
> [traceback snipped; see previous message]

Yeah, I ran into this too.  This is a bug that I solved yesterday:

http://www.pytables.org/trac/ticket/195

You can either download the trunk version, or apply the patch in:

http://www.pytables.org/trac/changeset/3893

> When I run the following code in python I let it run for ten hours
> and still hadn't written even 0.1% of the new array.  I checked
> memory usage.  It wasn't a problem.  So I was doing something wrong I
> suppose:
>
> [rechunk_cache_file_h5() code snipped; see previous message]

I don't see why this has to be slow.  Perhaps you are using compression?
Also, which version of the HDF5 library are you using?
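
(One quick way to report both, assuming a standard PyTables install:)

import tables

# Prints the versions of PyTables, the HDF5 library, NumPy and the
# available compressors in one go.
tables.print_versions()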

> I also tried recreating the original bfeat_mcmc.h5 using an appropriate
> chunkshape.  This was also slow, though I haven't checked closely how
> long it will take.
>
> [write_cache_h5() and SFeat code snipped; see previous message]

Again, I don't see why this would be slow.  Could you check whether you can
run my bench-chunksize.py at decent speeds on your system?

Cheers,

--
Francesc Alted






[hdf-forum] Transposing large array

Douglas Eck
> Maybe you forgot to pass the 'expectedrows' parameter in order to inform
> PyTables about the expected size of the EArray (this is critical for
> allowing for a decent selection of the chunkshape).  See the
> bench-chunksize.py and you will see that this is used.
>

I tried with expectedrows and got similar results.


> Yeah, I ran into this too.  This is a bug that I solved yesterday:
>
> http://www.pytables.org/trac/ticket/195
>

Great!  I'll grab source from trunk.

> I don't see why this has to be slow.  Perhaps you are using
> compression?  Also, which version of the HDF5 library are you using?

I am using compression. I thought that was a good thing (?)  Learning more
and more....


> Again, I don't see why this would be slow.  Could you check whether you
> can run my bench-chunksize.py at decent speeds on your system?

596 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>python ~/test/bench-chunksize.py
Using tables from /u/eckdoug/test/tables/__init__.pyc version 2.1rc1
================================
Chunkshape for original array: (360, 45)
Time to append 6109666 rows: 323.5 sec (51.9 MB/s)
Time to read ten rows in original array: 1580.027 sec (0.3 MB/s)
================================
Chunkshape for row-wise chunkshape array: (1, 16200)
Time to copy the original array: 1253.463 sec (13.4 MB/s)
Time to read with a row-wise chunkshape: 15.353 sec (30.4 MB/s)
================================
Speed-up with a row-wise chunkshape: 102.9


[hdf-forum] Transposing large array

Francesc Alted
In reply to this post by Douglas Eck
On Thursday, 06 November 2008, Douglas Eck wrote:

> Ok I have some comparison times.  If I create an array with
> chunkshape=(360,100)
> fdim=360
> arr = h5.createEArray(h5.root, array_name, tables.FloatAtom(),
> (fdim,0), feattype, filters=filters, chunkshape=(360,100))
> 0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
> /u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258:
> FiltersWarning: compression library ``lzo`` is not available; using
> ``zlib`` instead
>   % (complib, default_complib), FiltersWarning )

Aha.  So you were using compression.

> [per-file timings with chunkshape (360, 100) snipped; each file took about 0.01 sec]
>
> By contrast, if I create the file with chunkshape=(1, 324000) the times
> are much worse:
> 653 eckdoug at cerveau
> /part/02/sans-bkp/sitm/data/features>cache_feat.py bfeat_mcmc
> testing4.h5
> 0 Processing /part/02/sans-bkp/sitm/data/features/1000/T1.mp3.h5
> /u/eckdoug/share/lib64/python2.5/site-packages/tables/filters.py:258:
> FiltersWarning: compression library ``lzo`` is not available; using
> ``zlib`` instead
>   % (complib, default_complib), FiltersWarning )
> File 1 of 100.  Chunkshape (1, 324000) processed 2 of size (52, 360)
> in 448.759572029

Well, I'd say that the problem is the compression (all my previous
benchmarks were made without compression).  More specifically, when
compression is used in PyTables the 'shuffle' filter is activated, and this
filter is *extremely expensive* when you have very large chunkshapes (as in
your case).  Try deactivating it with:

Filters(complevel=1, complib='zlib', shuffle=False)

and re-run your tests.  If you still find that 'zlib' is too slow, you may
want to use the 'lzo' compressor (but you should first install it on your
system so that PyTables can use it), which is far faster, especially during
the compression phase (decompression is also faster than zlib's, but the
difference is not as large, especially on modern processors).

If speed is still not good enough, you should try suppressing compression
completely (Filters(complevel=0)), which is the default.
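
(A minimal sketch of the three settings just mentioned, applied to a hypothetical EArray creation -- the file name is a placeholder, and 'lzo' assumes the LZO library is actually installed:)

import tables

no_shuffle  = tables.Filters(complevel=1, complib='zlib', shuffle=False)
lzo_filters = tables.Filters(complevel=1, complib='lzo', shuffle=False)   # needs LZO installed
no_compress = tables.Filters(complevel=0)                                 # the default

h5 = tables.openFile('features.h5', 'w')
arr = h5.createEArray(h5.root, 'bfeat_mcmc_array', tables.Float64Atom(),
                      (360, 0), filters=no_shuffle,
                      expectedrows=6109666, chunkshape=(1, 324000))
h5.close()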

>
> As a baseline here is the output from bench-chunksize.py
> 596 eckdoug at cerveau /part/02/sans-bkp/sitm/data/features>python
> ~/test/bench-chunksize.py
> Using tables from /u/eckdoug/test/tables/__init__.pyc version 2.1rc1
> ================================
> Chunkshape for original array: (360, 45)
> Time to append 6109666 rows: 323.5 sec (51.9 MB/s)
> Time to read ten rows in original array: 1580.027 sec (0.3 MB/s)
> ================================
> Chunkshape for row-wise chunkshape array: (1, 16200)
> Time to copy the original array: 1253.463 sec (13.4 MB/s)
> Time to read with a row-wise chunkshape: 15.353 sec (30.4 MB/s)
> ================================
> Speed-up with a row-wise chunkshape: 102.9

Yeah, seems good (the read speed in the case of the row-wise chunkshape
can be increased if you set a larger chunkshape, as I pointed out in a
previous message, but your system seems ok).

Cheers,

--
Francesc Alted






[hdf-forum] Transposing large array

Douglas Eck
In reply to this post by Douglas Eck
Turning off compression helps immensely.  I will generate the full file with
the recommended chunkshape and get back to the list later with some details.

Thanks!
Doug



Dr. Douglas Eck, Associate Professor
Université de Montréal, Department of Computer Science / BRAMS
CP 6128, Succ. Centre-Ville, Montréal, Québec H3C 3J7  CANADA
Office: 3253 Pavillon André-Aisenstadt
Phone: 1-514-343-6111 ext 3520  Fax: 1-514-343-5834
http://www.iro.umontreal.ca/~eckdoug
Research Areas: Machine Learning and Music Cognition


On Thu, Nov 6, 2008 at 11:52 AM, Douglas Eck <eckdoug at iro.umontreal.ca> wrote:

> [previous message quoted in full; snipped]


[hdf-forum] Transposing large array

Francesc Alted
On Thursday, 06 November 2008, Douglas Eck wrote:
> Turning off compression helps immensely.

I don't think you need to turn off compression completely, just the shuffle
filter.  In particular, the lzo compressor should still get you better
performance (unless its performance degrades severely with chunksizes as
huge as the ones you are using).

> I will generate the full
> file with recommended chunkshape and get back later with some details
> for the list.

Great.




--
Francesc Alted
