Collective IO and filters

Collective IO and filters

Michael K. Edwards
I'm trying to write an HDF5 file with dataset compression from an MPI
job.  (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
After running into the "Parallel I/O does not support filters yet"
error message in release versions of HDF5, I have turned to the
develop branch.  Clearly there has been much work towards collective
filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
it is not quite ready for prime time yet.  So far I've encountered a
livelock scenario with ZFP, reproduced it with SZIP, and, with no
filters at all, obtained this nifty error message:

ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
`fm->m_ndims==fm->f_ndims' failed.
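
For reference, the write path on my side boils down to roughly the following.  This is a stripped-down sketch, not the actual PETSc code: the dataset name, sizes, row decomposition, and the deflate call (standing in for the ZFP/SZIP filter setup) are all placeholders.

/* Stripped-down sketch (placeholder names/sizes): chunked + filtered dataset,
 * collective MPI-IO write, each rank owning a contiguous block of rows. */
#include <mpi.h>
#include <hdf5.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[2]  = {1024, 64};                 /* placeholder sizes        */
    hsize_t chunk[2] = {64, 64};
    hid_t fspace = H5Screate_simple(2, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 6);                       /* stand-in for ZFP/SZIP    */

    hid_t dset = H5Dcreate2(file, "u", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Per-rank hyperslab: a block of rows, remainder going to the last rank. */
    hsize_t rows     = dims[0] / (hsize_t)nprocs;
    hsize_t start[2] = {(hsize_t)rank * rows, 0};
    hsize_t count[2] = {(rank == nprocs - 1) ? dims[0] - start[0] : rows, dims[1]};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(2, count, NULL);
    double *buf  = calloc((size_t)(count[0] * count[1]), sizeof(double));

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(mspace); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(fspace); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}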

Has anyone on this list been able to write parallel HDF5 using a
recent state of the develop branch, with or without filters
configured?

Thanks,
- Michael

Re: Collective IO and filters

Miller, Mark C.

Hi Michael,

I have not tried this in parallel yet. That said, what scale are you trying to do this at? 1,000 ranks or 1,000,000 ranks? Something in between?

My understanding is that there are some known scaling issues out past maybe 10,000 ranks, but I have not heard of outright assertion failures there.

Mark

Re: Collective IO and filters

Michael K. Edwards
Closer to 1000 ranks initially.  There's a bug in handling the case
where some of the writers don't have any data to contribute (because
there's a dimension smaller than the number of ranks), which I have
worked around like this:

diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
index af6599a..9522478 100644
--- a/src/H5Dchunk.c
+++ b/src/H5Dchunk.c
@@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
         /* Indicate that the chunk's memory space is shared */
         chunk_info->mspace_shared = TRUE;
     } /* end if */
+    else if(H5SL_count(fm->sel_chunks)==0) {
+        /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
+    } /* end else if */
     else {
         /* Get bounding box for file selection */
         if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)

That makes the assert go away.  Now I'm investigating a hang in the
chunk redistribution logic in rank 0, with a backtrace that looks like
this:

#0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1  0x00007f4bd5d3b341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2  0x00007f4bd5d3012d in MPID_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
local_chunk_array=0x17f0f80,
    local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
#5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00)
    at H5Dmpio.c:2794
#6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
dx_plist=0x16f7230) at H5Dmpio.c:1447
#7  0x00007f4bd81a027d in H5D__chunk_collective_io
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at
H5Dmpio.c:933
#8  0x00007f4bd81a0968 in H5D__chunk_collective_write
(io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104,
file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at
H5Dmpio.c:1018
#9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010,
mem_type_id=216172782113783851, mem_space=0x17dc770,
file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at
H5Dio.c:835
#10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384,
buf=0x17d6240)
    at H5Dio.c:394
#11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
    buf=0x17d6240) at H5Dio.c:318

The other ranks have moved past this and are hanging here:

#0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
#1  0x00007feb6fe25341 in psm_progress_wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#2  0x00007feb6fdd8975 in MPIC_Wait () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#3  0x00007feb6fdd918b in MPIC_Sendrecv () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#7  0x00007feb6fca1b93 in PMPI_Allreduce () from
/usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
#8  0x00007feb72287c2a in H5D__mpio_array_gatherv
(local_array=0x125f2d0, local_array_num_entries=0,
array_entry_size=368, _gathered_array=0x7ffff083f1d8,
    _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4,
allgather=true, root=0, comm=-1006632952, sort_func=0x0) at
H5Dmpio.c:479
#9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280,
dx_plist=0x11cf240) at H5Dmpio.c:1479
#10 0x00007feb7228a27d in H5D__chunk_collective_io
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at
H5Dmpio.c:933
#11 0x00007feb7228a968 in H5D__chunk_collective_write
(io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74,
file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at
H5Dmpio.c:1018
#12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0,
mem_type_id=216172782113783851, mem_space=0x124b450,
file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at
H5Dio.c:835
#13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0,
direct_write=false, mem_type_id=216172782113783851,
mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384,
buf=0x1244e80)
    at H5Dio.c:394
#14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680,
mem_type_id=216172782113783851, mem_space_id=288230376151711749,
file_space_id=288230376151711750, dxpl_id=720575940379279384,
    buf=0x1244e80) at H5Dio.c:318

(I'm currently running with this patch atop commit bf570b1, on an
earlier theory that the crashing bug may have crept in after Jordan's
big merge.  I'll rebase on current develop but I doubt that'll change
much.)

The hang may or may not be directly related to the workaround being a
bit of a hack.  I can set you up with full reproduction details if you
like; I seem to be getting some traction on it, but more eyeballs are
always good, especially if they're better set up for MPI tracing than
I am right now.


Re: Collective IO and filters

Jordan Henderson

Hi Michael,


During the design phase of this feature I tried to both account for and test the case where some of the writers do not have any data to contribute. However, it seems your use case falls outside of what I have tested (perhaps I have not used enough ranks?). In particular, my test cases were small and simply had some of the ranks call H5Sselect_none(), which doesn't seem to trigger this particular assertion failure. Is this how you're approaching these particular ranks in your code, or is there a different way you are having them participate in the write operation?
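
For reference, the pattern my tests use looks roughly like the snippet below. It is a minimal sketch with placeholder names, not the actual test code, and it assumes the usual collective-write setup (file and memory dataspaces, a collective transfer property list) around it.

#include <hdf5.h>
#include <stdbool.h>

/* Sketch only: how "empty" ranks still participate in the collective write.
 * fspace/mspace are the file and memory dataspaces; start/count describe the
 * local slab when there is one. */
static void select_local_region(hid_t fspace, hid_t mspace, bool have_local_data,
                                const hsize_t *start, const hsize_t *count)
{
    if (have_local_data) {
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    } else {
        /* The rank still calls H5Dwrite() collectively, but selects nothing. */
        H5Sselect_none(fspace);
        H5Sselect_none(mspace);
    }
}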


As for the hanging issue, it looks as though rank 0 is waiting to receive some modification data from another rank for a particular chunk. Whether there is actually valid data that rank 0 should be waiting for, I cannot easily tell without tracing it through. The other ranks have finished modifying their particular sets of chunks and have moved on; they are now waiting for everyone to get together and broadcast their new chunk sizes so that free space in the file can be collectively re-allocated, but of course rank 0 is not proceeding forward. My best guess is that either:


  • The "num_writers" field of the chunk struct for the particular chunk rank 0 is working on has been set too high, so rank 0 believes more ranks are writing to that chunk than actually are and waits forever for an MPI message that will never arrive


or


  • The "new_owner" field of the chunk struct for this chunk was set incorrectly on the other ranks, so they never issue an MPI_Isend to rank 0, again leaving rank 0 waiting for a message that will never arrive (a toy sketch of the receive loop involved follows below)
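
To make that failure mode concrete, here is a toy illustration of the receive side. This is not the actual H5Dmpio.c code, just the pattern: the chunk's owner keeps probing until it has seen one message per other writer, so an overcounted num_writers, or a writer that never sends, leaves it blocked in MPI_Mprobe indefinitely.

#include <mpi.h>
#include <stdlib.h>

/* Toy sketch: the owner of a chunk collects modification data from the other
 * writers of that chunk.  num_writers counts all writers, including the owner. */
static void receive_chunk_updates(int num_writers, MPI_Comm comm)
{
    for (int i = 0; i < num_writers - 1; i++) {
        MPI_Message msg;
        MPI_Status  status;
        int         nbytes;

        /* Blocks until a matching message exists; if a sender never sends
         * (or num_writers is too large), this call never returns. */
        MPI_Mprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &msg, &status);
        MPI_Get_count(&status, MPI_BYTE, &nbytes);

        char *buf = malloc((size_t)nbytes);
        MPI_Mrecv(buf, nbytes, MPI_BYTE, &msg, MPI_STATUS_IGNORE);
        /* ... apply the received chunk modification data here ... */
        free(buf);
    }
}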

This feature should still be regarded as being in beta, and its complexity can lead to difficult-to-track-down bugs such as the ones you are currently encountering. That said, your feedback is very useful and will help push this feature towards production-ready quality. Also, if it is feasible to come up with a minimal example that reproduces the issue, that would make it much easier to diagnose exactly why these failures are occurring.

Thanks,
Jordan


Re: Collective IO and filters

Michael K. Edwards
Thanks, Jordan.  I recognize that this is very recent feature work and
my goal is to help push it forward.

My current use case is relatively straightforward, though there are a
couple of layers on top of HDF5 itself.  The problem can be reproduced
by building PETSc 3.8.1 against libraries built from the develop
branch of HDF5, adding in the H5Pset_filter() calls, and running an
example that exercises them.  (I'm using
src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to
induce HDF5 writes.)  If you want, I can supply full details for you
to reproduce it locally, or I can do any experiments you'd like me to
within this setup.  (It also involves patches to the out-of-tree H5Z
plugins to make them use H5MM_malloc/H5MM_xfree rather than raw
malloc/free, which in turn involves exposing H5MMprivate.h to the
plugins.  Is this something you've solved in a different way?)
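
For concreteness, the part of a plugin that the allocator change touches is the filter callback's buffer handoff. Below is a skeleton with placeholder names (the filter id and the pass-through "codec" are made up, and FILTER_MALLOC/FILTER_FREE stand in for whichever allocator the plugin ends up using); it is not any particular plugin's code.

#include <hdf5.h>
#include <stdlib.h>
#include <string.h>

#define EXAMPLE_FILTER_ID 256        /* placeholder id in the testing range   */

/* In my patched plugins these map to H5MM_malloc/H5MM_xfree; with stock
 * headers they would have to remain plain malloc/free. */
#define FILTER_MALLOC(n) malloc(n)
#define FILTER_FREE(p)   free(p)

static size_t example_filter(unsigned int flags, size_t cd_nelmts,
                             const unsigned int cd_values[], size_t nbytes,
                             size_t *buf_size, void **buf)
{
    (void)flags; (void)cd_nelmts; (void)cd_values;

    /* A real codec would compress or decompress here; this skeleton copies. */
    void *out = FILTER_MALLOC(nbytes);
    if (out == NULL)
        return 0;                    /* returning 0 signals failure           */
    memcpy(out, *buf, nbytes);

    FILTER_FREE(*buf);               /* replace the buffer the library owns   */
    *buf      = out;
    *buf_size = nbytes;
    return nbytes;                   /* number of valid bytes now in *buf     */
}

const H5Z_class2_t EXAMPLE_FILTER_CLASS = {
    H5Z_CLASS_T_VERS, EXAMPLE_FILTER_ID,
    1, 1, "example pass-through filter",
    NULL, NULL, example_filter
};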


Re: Collective IO and filters

Michael K. Edwards
It's not even clear to me yet whether this is the same dataset that
triggered the assert.  Working on getting complete details.  But FWIW
the PETSc code does not call H5Sselect_none().  It calls
H5Sselect_hyperslab() in all ranks, and that's why the ranks in which
the slice is zero columns wide hit the "empty sel_chunks" pathway I
added to H5D__create_chunk_mem_map_hyper().
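
The shape of the calls, as I understand the PETSc path, is roughly the following (a sketch, not the actual PETSc source; the names and the 2-D decomposition are placeholders). Every rank issues the same sequence; a rank that owns no columns simply ends up with a zero count, i.e. an empty hyperslab selection, without H5Sselect_none() ever being called.

#include <hdf5.h>

/* Sketch: my_ncols is 0 on ranks that own no part of the small dimension. */
static herr_t write_local_columns(hid_t dset, hid_t fspace, hid_t dxpl,
                                  hsize_t nrows, hsize_t my_first_col,
                                  hsize_t my_ncols, const double *buf)
{
    hsize_t start[2] = {0, my_first_col};
    hsize_t count[2] = {nrows, my_ncols};              /* may select 0 columns */

    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(2, count, NULL);   /* 0-element mem space  */
    herr_t ret = H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);
    H5Sclose(mspace);
    return ret;
}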


Re: Collective IO and filters

Jordan Henderson

For ease of development I currently use the in-tree filters in my tests, so I haven't had to deal with the issue of H5MM_ versus raw memory routines inside the filters, though I don't expect that to make a difference here anyway.


I had suspected that the underlying code might be approaching the write in a different way, and that will certainly need to be addressed. I am surprised, however, that this behavior hasn't been seen before: the code involved is legacy code, so parallel HDF5 operations that did not use filters should have hit it in the library even before my merge of the new code. This is worth looking into.


I should be able to look into building PETSc against HDF5 with MVAPICH2, but if there are any "gotchas" I should be aware of beforehand, please let me know. Also, if you come to any revelations about the behavior you're seeing, I'd be happy to discuss them and see what arises in the way of a workable solution.



Re: Collective IO and filters

Michael K. Edwards
Also, I should add that the HDF5 files appear to be written properly
when run under "mpiexec -n 1", and valgrind doesn't report any bogus
malloc/free calls or wild pointers.  So I don't think it's a problem
with how I've massaged the H5Z plugins or the PETSc code.


On Wed, Nov 8, 2017 at 12:22 PM, Michael K. Edwards
<[hidden email]> wrote:

> It's not even clear to me yet whether this is the same dataset that
> triggered the assert.  Working on getting complete details.  But FWIW
> the PETSc code does not call H5Sselect_none().  It calls
> H5Sselect_hyperslab() in all ranks, and that's why the ranks in which
> the slice is zero columns wide hit the "empty sel_chunks" pathway I
> added to H5D__create_chunk_mem_map_hyper().
>
>
> On Wed, Nov 8, 2017 at 12:02 PM, Michael K. Edwards
> <[hidden email]> wrote:
>> Thanks, Jordan.  I recognize that this is very recent feature work and
>> my goal is to help push it forward.
>>
>> My current use case is relatively straightforward, though there are a
>> couple of layers on top of HDF5 itself.  The problem can be reproduced
>> by building PETSc 3.8.1 against libraries built from the develop
>> branch of HDF5, adding in the H5Dset_filter() calls, and running an
>> example that exercises them.  (I'm using
>> src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to
>> induce HDF5 writes.)  If you want, I can supply full details for you
>> to reproduce it locally, or I can do any experiments you'd like me to
>> within this setup.  (It also involves patches to the out-of-tree H5Z
>> plugins to make them use H5MM_malloc/H5MM_xfree rather than raw
>> malloc/free, which in turn involves exposing H5MMprivate.h to the
>> plugins.  Is this something you've solved in a different way?)
>>
>>
>> On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson
>> <[hidden email]> wrote:
>>> Hi Michael,
>>>
>>>
>>> during the design phase of this feature I tried to both account for and test
>>> the case where some of the writers do not have any data to contribute.
>>> However, it seems like your use case falls outside of what I have tested
>>> (perhaps I have not used enough ranks?). In particular my test cases were
>>> small and simply had some of the ranks call H5Sselect_none(), which doesn't
>>> seem to trigger this particular assertion failure. Is this how you're
>>> approaching these particular ranks in your code or is there a different way
>>> you are having them participate in the write operation?
>>>
>>>
>>> As for the hanging issue, it looks as though rank 0 is waiting to receive
>>> some modification data from another rank for a particular chunk. Whether or
>>> not there is actually valid data that rank 0 should be waiting for, I cannot
>>> easily tell without being able to trace it through. As the other ranks have
>>> finished modifying their particular sets of chunks, they have moved on and
>>> are waiting for everyone to get together and broadcast their new chunk sizes
>>> so that free space in the file can be collectively re-allocated, but of
>>> course rank 0 is not proceeding forward. My best guess is that either:
>>>
>>>
>>> The "num_writers" field for the chunk struct corresponding to the particular
>>> chunk that rank 0 is working on has been incorrectly set, causing rank 0 to
>>> think that there are more ranks writing to the chunk than the actual amount
>>> and consequently causing rank 0 to wait forever for a non-existent MPI
>>> message
>>>
>>>
>>> or
>>>
>>>
>>> The "new_owner" field of the chunk struct for this chunk was incorrectly set
>>> on the other ranks, causing them to never issue an MPI_Isend to rank 0, also
>>> causing rank 0 to wait for a non-existent MPI message
>>>
>>>
>>> This feature should still be regarded as being in beta and its complexity
>>> can lead to difficult to track down bugs such as the ones you are currently
>>> encountering. That being said, your feedback is very useful and will help to
>>> push this feature towards a production-ready level of quality. Also, if it
>>> is feasible to come up with a minimal example that reproduces this issue, it
>>> would be very helpful and would make it much easier to diagnose why exactly
>>> these failures are occurring.
>>>
>>> Thanks,
>>> Jordan
>>>
>>> ________________________________
>>> From: Hdf-forum <[hidden email]> on behalf of Michael
>>> K. Edwards <[hidden email]>
>>> Sent: Wednesday, November 8, 2017 11:23 AM
>>> To: Miller, Mark C.
>>> Cc: HDF Users Discussion List
>>> Subject: Re: [Hdf-forum] Collective IO and filters
>>>
>>> Closer to 1000 ranks initially.  There's a bug in handling the case
>>> where some of the writers don't have any data to contribute (because
>>> there's a dimension smaller than the number of ranks), which I have
>>> worked around like this:
>>>
>>> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
>>> index af6599a..9522478 100644
>>> --- a/src/H5Dchunk.c
>>> +++ b/src/H5Dchunk.c
>>> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t
>>> *fm)
>>>          /* Indicate that the chunk's memory space is shared */
>>>          chunk_info->mspace_shared = TRUE;
>>>      } /* end if */
>>> +    else if(H5SL_count(fm->sel_chunks)==0) {
>>> +        /* No chunks, because no local data; avoid
>>> HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
>>> +    } /* end else if */
>>>      else {
>>>          /* Get bounding box for file selection */
>>>          if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end)
>>> < 0)
>>>
>>> That makes the assert go away.  Now I'm investigating a hang in the
>>> chunk redistribution logic in rank 0, with a backtrace that looks like
>>> this:
>>>
>>> #0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007f4bd5d3b341 in psm_progress_wait () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007f4bd5d3012d in MPID_Mprobe () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks
>>> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
>>> local_chunk_array=0x17f0f80,
>>>     local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
>>> #5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list
>>> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
>>> chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00)
>>>     at H5Dmpio.c:2794
>>> #6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io
>>> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0,
>>> dx_plist=0x16f7230) at H5Dmpio.c:1447
>>> #7  0x00007f4bd81a027d in H5D__chunk_collective_io
>>> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at
>>> H5Dmpio.c:933
>>> #8  0x00007f4bd81a0968 in H5D__chunk_collective_write
>>> (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104,
>>> file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at
>>> H5Dmpio.c:1018
>>> #9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010,
>>> mem_type_id=216172782113783851, mem_space=0x17dc770,
>>> file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at
>>> H5Dio.c:835
>>> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010,
>>> direct_write=false, mem_type_id=216172782113783851,
>>> mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384,
>>> buf=0x17d6240)
>>>     at H5Dio.c:394
>>> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680,
>>> mem_type_id=216172782113783851, mem_space_id=288230376151711749,
>>> file_space_id=288230376151711750, dxpl_id=720575940379279384,
>>>     buf=0x17d6240) at H5Dio.c:318
>>>
>>> The other ranks have moved past this and are hanging here:
>>>
>>> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007feb6fe25341 in psm_progress_wait () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007feb6fdd8975 in MPIC_Wait () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from
>>> /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv
>>> (local_array=0x125f2d0, local_array_num_entries=0,
>>> array_entry_size=368, _gathered_array=0x7ffff083f1d8,
>>>     _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4,
>>> allgather=true, root=0, comm=-1006632952, sort_func=0x0) at
>>> H5Dmpio.c:479
>>> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io
>>> (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280,
>>> dx_plist=0x11cf240) at H5Dmpio.c:1479
>>> #10 0x00007feb7228a27d in H5D__chunk_collective_io
>>> (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at
>>> H5Dmpio.c:933
>>> #11 0x00007feb7228a968 in H5D__chunk_collective_write
>>> (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74,
>>> file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at
>>> H5Dmpio.c:1018
>>> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0,
>>> mem_type_id=216172782113783851, mem_space=0x124b450,
>>> file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at
>>> H5Dio.c:835
>>> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0,
>>> direct_write=false, mem_type_id=216172782113783851,
>>> mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384,
>>> buf=0x1244e80)
>>>     at H5Dio.c:394
>>> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680,
>>> mem_type_id=216172782113783851, mem_space_id=288230376151711749,
>>> file_space_id=288230376151711750, dxpl_id=720575940379279384,
>>>     buf=0x1244e80) at H5Dio.c:318
>>>
>>> (I'm currently running with this patch atop commit bf570b1, on an
>>> earlier theory that the crashing bug may have crept in after Jordan's
>>> big merge.  I'll rebase on current develop but I doubt that'll change
>>> much.)
>>>
>>> The hang may or may not be directly related to the workaround being a
>>> bit of a hack.  I can set you up with full reproduction details if you
>>> like; I seem to be getting some traction on it, but more eyeballs are
>>> always good, especially if they're better set up for MPI tracing than
>>> I am right now.

Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Jordan Henderson
The raw malloc/free calls inside the out-of-tree filters definitely
break with the develop branch.  The buffer pointer that the caller
allocates with H5MM_malloc() and passes into H5Z_filter_zfp(), expecting
the filter to replace it with a newly allocated buffer of a different
size, cannot be manipulated with raw free/malloc.
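
To make the allocator mismatch concrete, here is a minimal sketch of the
pattern that breaks; apart from the standard H5Z callback signature, the
names are placeholders and a plain copy stands in for the real compression
step:

#include <stdlib.h>
#include <string.h>

/* Sketch of the failing pattern.  The library hands the filter a buffer it
 * allocated with H5MM_malloc() and later releases whatever pointer the
 * filter leaves behind with its own internal allocator, so raw malloc/free
 * cross allocators in both directions. */
size_t
broken_filter_sketch(unsigned int flags, size_t cd_nelmts,
                     const unsigned int cd_values[], size_t nbytes,
                     size_t *buf_size, void **buf)
{
    size_t out_nbytes = nbytes;               /* a real filter computes the compressed size */
    void  *out_buf    = malloc(out_nbytes);   /* raw malloc: not the library's allocator    */

    (void)flags; (void)cd_nelmts; (void)cd_values;  /* unused in this sketch */
    if (out_buf == NULL)
        return 0;                             /* returning 0 reports failure to the pipeline */

    memcpy(out_buf, *buf, nbytes);            /* placeholder for the compression step */

    free(*buf);           /* frees memory that H5MM_malloc() allocated                 */
    *buf      = out_buf;  /* and the library will later H5MM_xfree() raw-malloc'd memory */
    *buf_size = out_nbytes;
    return out_nbytes;
}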

On Wed, Nov 8, 2017 at 12:32 PM, Jordan Henderson
<[hidden email]> wrote:

> For ease of development I currently use the in-tree filters in my tests so I
> haven't had to deal with the issue of H5MM_ vs raw memory routines inside
> the filters, though I don't suspect this should make a difference anyway.
>
>
> I had suspected that the underlying code might be approaching the write in a
> different way and certainly this will need to be addressed. I am surprised
> however that this kind of behavior hasn't been seen before, as it is legacy
> code and should have still been hit in the library before my merge of the
> new code during parallel HDF5 operations which did not use filters; this is
> worth looking into.
>
>
> I should be able to look into building HDF5 against PETSc with MVAPICH2, but
> if there are any "gotchas" I should be aware of beforehand, please let me
> know. Also, if you happen to run into any revelations on the behavior you're
> seeing, I'd also be happy to discuss them and see what arises in the way of
> a workable solution.
>
>


Re: Collective IO and filters

Michael K. Edwards
In case it helps, here's an example of a patch to an out-of-tree
compressor plugin.  It's not the right solution, because H5MMprivate.h
(and its dependencies) ought to stay private.  Presumably plugins will
either need an isolated header with these two functions in it, or a
variant API that passes in a pair of function pointers.


Attachment: H5Z-zfp.patch (3K)

Re: Collective IO and filters

Jordan Henderson

Ah yes, I can see how the difference between these allocation routines causes issues between in-tree and out-of-tree plugins. It makes sense to allocate the chunk data buffers using the H5MM_ routines, to stay compliant with the standards of HDF5 library development, but that causes issues for plugins which use the raw memory routines. Conversely, if the chunk buffers were allocated using the raw routines, that would break compatibility with the in-tree filters. Thank you for bringing this to my attention; I will need to think on this one, as there are a few different ways of approaching the problem, some more "correct" than others.



Re: Collective IO and filters

Dana Robinson

The public H5allocate/resize/free_memory() API calls use the library's memory allocator to manage memory, if that is what you are looking for.

 

https://support.hdfgroup.org/HDF5/doc/RM/RM_H5.html
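
For reference, a tiny standalone example of those three calls (error
handling kept minimal):

#include "hdf5.h"

int main(void)
{
    /* Allocate 1 KiB, zero-filled, from the library's allocator. */
    void *buf = H5allocate_memory(1024, 1);
    if (buf == NULL)
        return 1;

    /* Grow it to 2 KiB; same allocator, analogous to realloc(). */
    void *bigger = H5resize_memory(buf, 2048);
    if (bigger == NULL) {
        H5free_memory(buf);
        return 1;
    }

    /* Release it with the matching routine, never raw free(). */
    H5free_memory(bigger);
    return 0;
}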

 

Dana Robinson

Software Developer

The HDF Group

 


Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Jordan Henderson
I'm reasonably confident now that this hang is unrelated to the
"writers contributing zero data" workaround.  The three ranks that
have made it to H5Dmpio.c:1479 all have nonzero nelmts in the call to
H5D__chunk_collective_write() up the stack.  (And I did check that
they're all still trying to write to the same dataset.)

Here's what I see in rank 0:

(gdb) p *chunk_entry
$5 = {index = 0, scaled = {0, 0, 0, 18446744073709551615 <repeats 30
times>}, full_overwrite = false, num_writers = 4, io_size = 832, buf =
0x0, chunk_states = {chunk_current = {offset = 4720,
      length = 6}, new_chunk = {offset = 4720, length = 6}}, owners =
{original_owner = 0, new_owner = 0}, async_info =
{receive_requests_array = 0x30c2870, receive_buffer_array = 0x30c2f20,
    num_receive_requests = 3}}

And here's what I see in rank 3:

(gdb) p *chunk_list
$3 = {index = 0, scaled = {0 <repeats 33 times>}, full_overwrite =
false, num_writers = 4, io_size = 592, buf = 0x0, chunk_states =
{chunk_current = {offset = 4720, length = 6}, new_chunk = {
      offset = 4720, length = 6}}, owners = {original_owner = 3,
new_owner = 0}, async_info = {receive_requests_array = 0x0,
receive_buffer_array = 0x0, num_receive_requests = 0}}

The loop index "j" in the receive loop in rank 0 is still 0, which
suggests that it has not received any messages from the other ranks.
The breakage could certainly be down in the MPI implementation.  I am
running Intel's build of MVAPICH2 2.2 (as bundled with their current
Omni-Path release blob), and it visibly has performance "issues" in my
dev environment.  It's not implausible that it is simply failing to
deliver these messages.  It's just odd that it manages to slog through
in the unfiltered case but not in this filtered case.




Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Dana Robinson
Thank you, Dana!  Do you think it would be appropriate (not just as of
the current implementation, but in terms of the interface contract) to
use H5free_memory() on the buffer passed into an H5Z plugin, replacing
it with a new (post-compression) buffer allocated via H5allocate_memory()?


Re: Collective IO and filters

Dana Robinson
Yes. We already do this in our test harness. See test/dynlib3.c in the source distribution. It's a very short source file and should be easy to understand.
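
For anyone following along, the buffer-swap pattern there boils down to
something like this sketch (the function name is a placeholder and a plain
copy stands in for real compression; it is not the literal dynlib3.c code):

#include <string.h>
#include "hdf5.h"

/* Filter callback that replaces the incoming chunk buffer using the public
 * memory routines, so both buffers stay inside the library's allocator. */
size_t
filter_sketch(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
              size_t nbytes, size_t *buf_size, void **buf)
{
    size_t out_nbytes = nbytes;                            /* real filter: (de)compressed size */
    void  *out_buf    = H5allocate_memory(out_nbytes, 0);  /* library allocator */

    (void)flags; (void)cd_nelmts; (void)cd_values;         /* unused in this sketch */
    if (out_buf == NULL)
        return 0;                                          /* 0 signals failure to the pipeline */

    memcpy(out_buf, *buf, nbytes);                         /* placeholder for the (de)compression step */

    H5free_memory(*buf);                                   /* matches the allocator that produced *buf */
    *buf      = out_buf;
    *buf_size = out_nbytes;
    return out_nbytes;
}

The decompress (H5Z_FLAG_REVERSE) path would do the same swap in the other
direction.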

Dana


Re: Collective IO and filters

Michael K. Edwards
Great.  What's the best way to communicate this to plugin developers,
so that their code gets updated appropriately in advance of the 1.12
release?


Re: Collective IO and filters

Jordan Henderson
In reply to this post by Michael K. Edwards

Dana,


would it then make sense for all outside filters to use these routines? Because of the way Parallel Compression works internally, it uses buffers allocated via the H5MM_ routines to collect and scatter data, which works fine for the internal filters like deflate, since they use those routines as well. However, since some of the outside filters use the raw malloc/free routines, causing issues, I'm wondering if having all outside filters use the public H5*_memory() routines is the cleanest solution.


Michael,


Based on the "num_writers: 4" field, the NULL "receive_requests_array" and the fact that for the same chunk, rank 0 shows "original owner: 0, new owner: 0" and rank 3 shows "original owner: 3, new_owner: 0", it seems as though everyone IS interested in the chunk the rank 0 is now working on, but now I'm more confident that at some point either the messages may have failed to send or rank 0 is having problems finding the messages.


Since the unfiltered case doesn't hit this particular code path, I'm not surprised that it succeeds. If I had to make another guess based on this, I would be inclined to think that rank 0 is hanging in MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the chunk as the tag for each message in order to funnel specific messages to the correct rank for the correct chunk during the last part of the chunk redistribution, so if rank 0 can't match the tag it of course won't find the message. Why that might be happening, I'm not entirely certain currently.
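
For clarity, the receive side of that scheme looks roughly like the sketch
below; it is a stripped-down illustration using a matched probe keyed on the
chunk index, with made-up names, not the actual H5Dmpio.c code:

#include <stdlib.h>
#include <mpi.h>

/* Receive one chunk-modification message: the sender used the chunk index as
 * the MPI tag, so only messages destined for this chunk can match here. */
void *recv_chunk_mods(int chunk_index, MPI_Comm comm, int *out_nbytes)
{
    MPI_Message msg;
    MPI_Status  status;
    int         nbytes = 0;
    void       *buf;

    /* Block until a message carrying this chunk's tag arrives from any rank. */
    MPI_Mprobe(MPI_ANY_SOURCE, chunk_index, comm, &msg, &status);
    MPI_Get_count(&status, MPI_BYTE, &nbytes);

    buf = malloc(nbytes > 0 ? (size_t)nbytes : 1);
    if (buf == NULL)
        MPI_Abort(comm, 1);

    /* Matched receive: consumes exactly the message that was probed above. */
    MPI_Mrecv(buf, nbytes, MPI_BYTE, &msg, &status);

    *out_nbytes = nbytes;
    return buf;   /* caller frees */
}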



Re: Collective IO and filters

Dana Robinson

Yes. All outside code that frees, allocates, or reallocates memory created inside the library (or that will be passed back into the library, where it could be freed or reallocated) should use these functions. This includes filters.

 

Dana

 


Re: Collective IO and filters

Dana Robinson

A bit of historical background:

 

The H5*_memory() API calls were added primarily to help Windows users. In Windows, the C run-time is implemented in shared libraries tied to a particular version of Visual Studio. Even for a given version of Visual Studio, there are independent debug and release libraries. This caused problems when people allocated memory in, say, a release version of the HDF5 library and freed it in their debug version application since the different C runtimes have different memory allocator state that is not shared. Users on other systems care less about this problem since there is rarely a plethora of C libraries to link to (though it can be a problem when people use debug memory allocators).

 

H5free_memory() was initially introduced because a few of our API calls return buffers that the user must free. The allocate/reallocate calls came later, for use in filters on Windows. It's interesting that those functions will now be needed for parallel compression.

 

Dana

 


Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Dana Robinson
I see that you're re-sorting by owner using a comparator called
H5D__cmp_filtered_collective_io_info_entry_owner(), which does not sort
by a secondary key among entries with equal owners.  That, together
with a sort that isn't stable (which HDqsort() probably isn't on most
platforms; quicksort/introsort is not stable), will scramble the order
in which different ranks traverse their local chunk arrays.  That will
cause deadly embraces between ranks that are waiting for each other's
chunks to be sent.  To fix it, it's probably sufficient to use the
chunk offset as a secondary sort key in that comparator.
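
Something along these lines, for illustration; the struct here is a guess at
just the fields the comparator needs, not the actual H5Dmpio.c definitions:

#include <stdint.h>

/* Illustrative chunk-entry shape: only the fields the comparator looks at. */
struct chunk_entry {
    int      owner;          /* rank the existing comparator keys on          */
    uint64_t chunk_offset;   /* chunk's offset in the file, unique per chunk  */
};

/* Sort primarily by owner and break ties by chunk offset, so every rank walks
 * entries that share an owner in the same order even though qsort() is not
 * guaranteed to be stable. */
int
cmp_owner_then_offset(const void *a, const void *b)
{
    const struct chunk_entry *ea = (const struct chunk_entry *)a;
    const struct chunk_entry *eb = (const struct chunk_entry *)b;

    if (ea->owner != eb->owner)
        return (ea->owner < eb->owner) ? -1 : 1;
    if (ea->chunk_offset != eb->chunk_offset)
        return (ea->chunk_offset < eb->chunk_offset) ? -1 : 1;
    return 0;
}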

That's not the root cause of the hang I'm currently experiencing,
though.  Still digging into that.

