hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

Sjaardema, Gregory D

I am seeing failures when running the hdf5-1.10.0-patch1 parallel test testphdf5.  The t_mpi test passes with no issues.

Many of the failures have PMPI_File_set_view in the call stack, reached from H5FD_write.  I am using gcc-4.7.2 and openmpi-1.6.4 on a RHEL6 system.  I am also getting failures on OSX El Capitan with gcc-4.9.4 and openmpi.

On RHEL6, eidsetw2 is one of the failing tests.  The backtrace is:

 

[...] *** Process received signal ***
[...] Signal: Segmentation fault (11)
[...] Signal code: Address not mapped (1)
[...] Failing at address: (nil)
[...] [ 0] /lib64/libpthread.so.0() [0x3481a0f710]
[...] [ 1] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIOI_Flatten+0x450) [0x7f65d968e5e0]
[...] [ 2] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIOI_Flatten_datatype+0xc5) [0x7f65d9690495]
[...] [ 3] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(ADIO_Set_view+0x1da) [0x7f65d96852ca]
[...] [ 4] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/openmpi/mca_io_romio.so(mca_io_romio_dist_MPI_File_set_view+0x172) [0x7f65d9695db2]
[...] [ 5] ....openmpi/1.6.4-gcc-4.7.2-RHEL6/lib/libmpi.so.1(MPI_File_set_view+0x107) [0x7f65e1ae3e77]
[...] [ 6] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x5f899f) [0x7f65e242599f]
[...] [ 7] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5FD_write+0x4e0) [0x7f65e203d232]
[...] [ 8] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5F__accum_write+0x184a) [0x7f65e1ffbd84]
[...] [ 9] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5F_block_write+0x40c) [0x7f65e20023dd]
[...] [10] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x118fcf) [0x7f65e1f45fcf]
[...] [11] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__chunk_allocate+0x1af6) [0x7f65e1f443f5]
[...] [12] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x152fd3) [0x7f65e1f7ffd3]
[...] [13] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__alloc_storage+0x665) [0x7f65e1f7f95f]
[...] [14] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__layout_oh_create+0x57a) [0x7f65e1f8dee0]
[...] [15] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x14bc99) [0x7f65e1f78c99]
[...] [16] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__create+0x1162) [0x7f65e1f7a5e8]
[...] [17] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x165662) [0x7f65e1f92662]
[...] [18] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5O_obj_create+0x2ec) [0x7f65e21438e2]
[...] [19] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x2f1930) [0x7f65e211e930]
[...] [20] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x27eba0) [0x7f65e20abba0]
[...] [21] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5G_traverse+0x4ff) [0x7f65e20acd6b]
[...] [22] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(+0x2f259a) [0x7f65e211f59a]
[...] [23] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5L_link_object+0x1d3) [0x7f65e211e6ae]
[...] [24] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5D__create_named+0x3d1) [0x7f65e1f75f17]
[...] [25] hdf5-1.10.0-patch1/src/.libs/libhdf5.so.100(H5Dcreate2+0x68f) [0x7f65e1f1e703]
[...] [26] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(extend_writeInd2+0x598) [0x416512]
[...] [27] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(PerformTests+0x1ab) [0x45addc]
[...] [28] hdf5-1.10.0-patch1/testpar/.libs/lt-testphdf5(main+0x94c) [0x408949]
[...] [29] /lib64/libc.so.6(__libc_start_main+0xfd) [0x348161ed5d]
[...] *** End of error message ***

 

I’m not really asking for anyone to debug this for me, just wondering if anyone else is having issues running the parallel tests with hdf5-1.10.0-patch1.

 

Thanks,

..Greg

 

-- 
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems"

 


_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Re: [EXTERNAL] hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

Sjaardema, Gregory D

I think the issue is related to using an older openmpi (or maybe just using openmpi).  In hdf5-1.8.16, H5Dchunk.c, there is a comment about working around a bug involving MPI_Type_create_hindexed_block().  The comment says that “should not have a special case for blocks == 0, but ompi (as of 1.8.1) has a bug in file_set_view when a zero size datatype is create with hindexed or hvector.”

This workaround is not in hdf5-1.10.0-patch1.  My cases are failing (with openmpi-1.6.4 and openmpi-1.8.1) on processors where blocks == 0, and they are failing with MPI_File_set_view in the backtrace.  If I pull the workaround from H5Dchunk.c in 1.8.16 into 1.10.0-patch1, the code makes it past this point (but then fails an assert later in the test).

 

..Greg

 

From: Hdf-forum <[hidden email]> on behalf of "Sjaardema, Gregory D" <[hidden email]>
Reply-To: HDF Users Discussion List <[hidden email]>
Date: Tuesday, October 25, 2016 at 1:20 PM
To: "[hidden email]" <[hidden email]>
Subject: [EXTERNAL] [Hdf-forum] hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

 

Re: [EXTERNAL] hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

Rob Latham


On 10/25/2016 06:41 PM, Sjaardema, Gregory D wrote:

> I think the issue is related to using an older openmpi (or maybe just
> using openmpi).  In hdf5-1.8.16, H5Dchunk.c, there is a comment about
> working around a bug for MPI_Type_create_hindexed_block().  The comment
> says that “should not have a special case for blocks == 0, but ompi (as
> of 1.8.1) has a bug in file_set_view when a zero size datatype is create
> with hindexed or hvector.”
>
>
>
> This fix is not in hdf5-1.10.0-patch1.  My cases are failing (with
> openmpi-1.6.4 and openmpi-1.8.1) on processors where blocks == 0 and
> they are failing with MPI_File_set_view in the backtrace. If I pull the
> workaround from 1.8.16 in H5Dchunk.c into 1.10.0-patch1, then the code
> makes it past this point (but then fails an assert at a later point in
> the test).

Good hunch about ompi.  OpenMPI fixed this bug a couple years back.

==rob

Re: [EXTERNAL] hdf5-1.10.0-patch1 -- parallel tests failing in PMPI_File_set_view

Sjaardema, Gregory D
On 11/4/16, 2:38 PM, "Hdf-forum on behalf of Rob Latham" <[hidden email] on behalf of [hidden email]> wrote:

   
   
    Good hunch about ompi.  OpenMPI fixed this bug a couple years back.
   
    ==rob

Do you happen to know the version this was fixed in?  I can look, but if you know off-hand, it would save me some searching.
..Greg

   