Collective IO and filters


Re: Collective IO and filters

Michael K. Edwards
Oddly enough, it is not the tag that is mismatched between receiver
and senders; it is io_info->comm.  Something is decidedly out of whack
here.

Rank 0, owner 0 probing with tag 0 on comm -1006632942
Rank 2, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 3, owner 0 sent with tag 0 to comm -1006632952 as request 0
Rank 1, owner 0 sent with tag 0 to comm -1006632952 as request 0


On Wed, Nov 8, 2017 at 2:51 PM, Michael K. Edwards
<[hidden email]> wrote:

>
> I see that you're re-sorting by owner using a comparator called
> H5D__cmp_filtered_collective_io_info_entry_owner() which does not sort
> by a secondary key within items with equal owners.  That, together
> with a sort that isn't stable (which HDqsort() probably isn't on most
> platforms; quicksort/introsort is not stable), will scramble the order
> in which different ranks traverse their local chunk arrays.  That will
> cause deadly embraces between ranks that are waiting for each other's
> chunks to be sent.  To fix that, it's probably sufficient to use the
> chunk offset as a secondary sort key in that comparator.
>
> That's not the root cause of the hang I'm currently experiencing,
> though.  Still digging into that.
>
>
> On Wed, Nov 8, 2017 at 1:50 PM, Dana Robinson <[hidden email]> wrote:
> > Yes. All outside code that frees, allocates, or reallocates memory created
> > inside the library (or that will be passed back into the library, where it
> > could be freed or reallocated) should use these functions. This includes
> > filters.
> >
> >
> >
> > Dana
> >
> >
> >
> > From: Jordan Henderson <[hidden email]>
> > Date: Wednesday, November 8, 2017 at 13:46
> > To: Dana Robinson <[hidden email]>, "[hidden email]"
> > <[hidden email]>, HDF List <[hidden email]>
> > Subject: Re: [Hdf-forum] Collective IO and filters
> >
> >
> >
> > Dana,
> >
> >
> >
> > Would it then make sense for all outside filters to use these routines? Due
> > to Parallel Compression's internal nature, it uses buffers allocated via
> > H5MM_ routines to collect and scatter data, which works fine for the
> > internal filters like deflate, since they use these as well. However, since
> > some of the outside filters use the raw malloc/free routines, causing
> > issues, I'm wondering if having all outside filters use the H5_ routines is
> > the cleanest solution.
> >
> >
> >
> > Michael,
> >
> >
> >
> > Based on the "num_writers: 4" field, the NULL "receive_requests_array", and
> > the fact that for the same chunk, rank 0 shows "original owner: 0, new
> > owner: 0" and rank 3 shows "original owner: 3, new_owner: 0", it seems as
> > though everyone IS interested in the chunk that rank 0 is now working on, but
> > now I'm more confident that at some point either the messages failed to
> > send or rank 0 is having problems finding them.
> >
> >
> >
> > Since the unfiltered case won't hit this particular code path, I'm not
> > surprised that it succeeds. If I had to make another guess based on
> > this, I would be inclined to think that rank 0 must be hanging on the
> > MPI_Mprobe due to a mismatch in the "tag" field. I use the index of the
> > chunk as the tag for the message in order to funnel specific messages to the
> > correct rank for the correct chunk during the last part of the chunk
> > redistribution, and if rank 0 can't match the tag it of course won't find
> > the message. Why this might be happening, I'm not entirely certain at the moment.


Re: Collective IO and filters

Michael K. Edwards
Replacing Intel's build of MVAPICH2 2.2 with a fresh build of MVAPICH2
2.3b got me farther along.  The comm mismatch does not seem to have been
a problem after all.  I am guessing that the root cause was whatever bug
is listed in http://mvapich.cse.ohio-state.edu/static/media/mvapich/MV2_CHANGELOG-2.3b.txt
as:

    - Fix hang in MPI_Probe
        - Thanks to John Westlund@Intel for the report

I fixed the H5D__cmp_filtered_collective_io_info_entry_owner
comparator, and now I'm back to fixing things about my patch to PETSc.
I seem to be trying to filter a dataset that I shouldn't be.
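For reference, a minimal sketch of the two-key comparator change (owner
first, chunk file offset as the tie-breaker, so the per-owner order is
deterministic even with an unstable HDqsort()).  The struct and field
names below are placeholders, not the actual layout of HDF5's filtered
collective chunk entry type:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int      new_owner;     /* rank that will process this chunk entry */
    uint64_t chunk_offset;  /* chunk's offset/address in the file      */
} chunk_entry_t;

static int
cmp_entry_owner_then_offset(const void *a, const void *b)
{
    const chunk_entry_t *e1 = (const chunk_entry_t *)a;
    const chunk_entry_t *e2 = (const chunk_entry_t *)b;

    if (e1->new_owner != e2->new_owner)
        return (e1->new_owner < e2->new_owner) ? -1 : 1;

    /* Secondary key: entries with the same owner end up in the same
     * order on every rank, regardless of qsort()'s (in)stability. */
    if (e1->chunk_offset != e2->chunk_offset)
        return (e1->chunk_offset < e2->chunk_offset) ? -1 : 1;

    return 0;
}

Sorting each rank's local chunk array with qsort(chunk_list, num_entries,
sizeof(chunk_entry_t), cmp_entry_owner_then_offset) then yields the same
traversal order for a given owner on every rank.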

HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 831 in H5D__write(): unable to adjust I/O info
for parallel I/O
    major: Dataset
    minor: Unable to initialize object
  #003: H5Dio.c line 1264 in H5D__ioinfo_adjust(): Can't perform
independent write with filters in pipeline.
    The following caused a break from collective I/O:
        Local causes:
        Global causes: one of the dataspaces was neither simple nor scalar
    major: Low-level I/O
    minor: Can't perform independent IO




Re: Collective IO and filters

Michael K. Edwards
And that's because of this logic up in PETSc:

  if (n > 0) {
    PetscStackCallHDF5Return(memspace,H5Screate_simple,(dim, count, NULL));
  } else {
    /* Can't create dataspace with zero for any dimension, so create null dataspace. */
    PetscStackCallHDF5Return(memspace,H5Screate,(H5S_NULL));
  }

where n is the number of elements in the rank's slice of the data.  I
think.  There is a corresponding branch later in the code:

  if (n > 0) {
    PetscStackCallHDF5Return(filespace,H5Dget_space,(dset_id));
    PetscStackCallHDF5(H5Sselect_hyperslab,(filespace, H5S_SELECT_SET, offset, NULL, count, NULL));
  } else {
    /* Create null filespace to match null memspace. */
    PetscStackCallHDF5Return(filespace,H5Screate,(H5S_NULL));
  }

It seems clear that PETSc is mishandling this situation, but I'm not
sure how to fix it if the comment is right.  Advice?




Re: Collective IO and filters

Jordan Henderson
In reply to this post by Michael K. Edwards

It seems you're discovering the issues right as I'm typing this!


I'm glad you were able to solve the hanging issue. I was starting to suspect a problem with the MPI implementation, but that's usually the last thing on the list after inspecting the code itself.


As you've seen, it seems that PETSc is creating a NULL dataspace for the ranks which are not contributing, instead of creating a scalar/simple dataspace on all ranks and calling H5Sselect_none() for those that don't participate. This would most likely explain why you saw the assertion failure in the non-filtered case, as the legacy code probably was not expecting to receive a NULL dataspace. On top of that, the NULL dataspace appears to be causing the parallel operation to break collective mode, which is not allowed when filters are involved. I would need to do some research into why this happens before deciding whether it's more appropriate to modify this in HDF5 or to have PETSc not use NULL dataspaces.


Avoiding deadlock from the final sort is an issue I have had to re-tackle a few times due to the complexity of the code, but I will investigate using the chunk offset as a secondary sort key and see if it runs into problems in any other cases. Ideally, the chunk redistribution would be updated in the future to involve all ranks in the operation instead of just rank 0, which would also allow improvements to the redistribution algorithm that may solve these problems, but for the time being this may be sufficient.



Re: Collective IO and filters

Michael K. Edwards
Thank you for the validation, and for the suggestion to use
H5Sselect_none().  That is probably the right thing for the file dataspace.
I'm not quite sure what to do about the memspace, though; the comment is
correct that we crash if any of the dimensions is zero.



Re: Collective IO and filters

Michael K. Edwards
Apparently this has been reported before as a problem with PETSc/HDF5
integration:  https://lists.mcs.anl.gov/pipermail/petsc-users/2012-January/011980.html

On Thu, Nov 9, 2017 at 8:37 AM, Michael K. Edwards
<[hidden email]> wrote:

> Thank you for the validation, and for the suggestion to use
> H5Sselect_none().  That is probably the right thing for the dataspace.
> Not quite sure what to do about the memspace, though; the comment is
> correct that we crash if any of the dimensions is zero.
>
> On Thu, Nov 9, 2017 at 8:34 AM, Jordan Henderson
> <[hidden email]> wrote:
>> It seems you're discovering the issues right as I'm typing this!
>>
>>
>> I'm glad you were able to solve the issue with the hanging. I was starting
>> to suspect an issue with the MPI implementation but it's usually the last
>> thing on the list after inspecting the code itself.
>>
>>
>> As you've seen, it seems that PETSc is creating a NULL dataspace for the
>> ranks which are not contributing, instead of creating a Scalar/Simple
>> dataspace on all ranks and calling H5Sselect_none() for those that don't
>> participate. This would most likely explain the reason you saw the assertion
>> failure in the non-filtered case, as the legacy code probably was not
>> expecting to receive a NULL dataspace. On top of that, the NULL dataspace
>> seems like it is causing the parallel operation to break collective mode,
>> which is not allowed when filters are involved. I would need to do some
>> research as to why this happens before deciding whether it's more
>> appropriate to modify this in HDF5 or to have PETSc not use NULL dataspaces.
>>
>>
>> Avoiding deadlock from the final sort has been an issue I had to re-tackle a
>> few different times due to the nature of the code's complexity, but I will
>> investigate using the chunk offset as a secondary sort key and see if it
>> will run into problems in any other cases. Ideally, the chunk redistribution
>> might be updated in the future to involve all ranks in the operation instead
>> of just rank 0, also allowing for improvements to the redistribution
>> algorithm that may solve these problems, but for the time being this may be
>> sufficient.


Re: Collective IO and filters

Michael K. Edwards
Actually, it's not the H5Screate() that crashes; that works fine since
HDF5 1.8.7.  It's a zero-sized malloc somewhere inside the call to
H5Dwrite(), possibly in the filter.  I think this is close to
resolution; just have to get tools on it.



Re: Collective IO and filters

Dana Robinson
In develop, H5MM_malloc() and H5MM_calloc() will throw an assert if the size is zero. That should not be there, and the function docs even say that we return NULL on a zero size.

The bad lines are 271 and 360 in H5MM.c, if you want to try yanking those out and rebuilding.
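For what it's worth, the documented zero-size behavior is just the usual
malloc() semantics; a purely illustrative sketch, not the actual H5MM.c
source (which also layers in HDF5's memory sanity checking):

#include <stdlib.h>

void *
H5MM_malloc(size_t size)
{
    /* Documented behavior: return NULL on a zero-byte request instead
     * of asserting. */
    return (0 == size) ? NULL : malloc(size);
}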

Dana



Re: Collective IO and filters

Michael K. Edwards
Thank you.  That got me farther along.  The crash is now in the
H5Z-blosc filter glue, and should be easy to fix.  It's interesting
that the filter is applied on a per-chunk basis, including on
zero-sized chunks; it's possible that something is wrong higher up the
stack.  I haven't really thought about collective read with filters
yet.  Jordan, can you fill me in on how that's supposed to work,
especially if the reader has a different number of MPI ranks than the
writer had?

HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #008: /home/centos/blosc/hdf5-blosc/src/blosc_filter.c line 250 in
blosc_filter(): Can't allocate decompression buffer
    major: Data filters
    minor: Callback failed
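For reference, the kind of guard the glue code needs on the decompression
path might look like the sketch below.  This is not the actual hdf5-blosc
blosc_filter.c code; decompress_chunk() is a hypothetical helper, and the
early return on a zero-sized input only helps if the caller avoids
treating an empty chunk as a pipeline failure.

#include <stdlib.h>
#include <blosc.h>

/* Hypothetical helper: decompress one filtered chunk, guarding the
 * degenerate zero-sized case before any allocation is attempted. */
static int
decompress_chunk(const void *cbuf, size_t cbuf_size,
                 void **out, size_t *out_size)
{
    size_t nbytes = 0, cbytes = 0, blocksize = 0;

    *out      = NULL;
    *out_size = 0;

    if (NULL == cbuf || 0 == cbuf_size)
        return 0;                    /* empty chunk: nothing to do */

    /* Ask blosc for the true uncompressed size from the buffer header
     * instead of trusting the caller's (possibly zero) size hint. */
    blosc_cbuffer_sizes(cbuf, &nbytes, &cbytes, &blocksize);
    if (0 == nbytes)
        return 0;

    if (NULL == (*out = malloc(nbytes)))
        return -1;                   /* "Can't allocate decompression buffer" */

    if (blosc_decompress(cbuf, *out, nbytes) <= 0) {
        free(*out);
        *out = NULL;
        return -1;                   /* corrupt or truncated chunk */
    }

    *out_size = nbytes;
    return 0;
}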



Re: Collective IO and filters

Jordan Henderson

Since Parallel Compression operates by applying the filter on a per-chunk basis, this should be consistent with what you're seeing. However, zero-sized chunks are a case I had not actually considered yet, and I could reasonably see blosc failing due to a zero-sized allocation.


Since reading in the parallel case with filters doesn't affect the metadata, the H5D__construct_filtered_io_info_list() function will simply cause each rank to construct a local list of all the chunks they have selected in the read operation, read their respective chunks into locally-allocated buffers, and decompress the data on a chunk-by-chunk basis, scattering it to the read buffer along the way. Writing works the same way in that each rank works on their own local list of chunks, with the exception that some of the chunks may get shifted around before the actual write operation of "pull data from the read buffer, decompress the chunk, update the chunk, re-compress the chunk and write it" happens. In general, it shouldn't cause an issue that you're reading the Dataset with a different number of MPI ranks than it was written with.



Re: Collective IO and filters

Michael K. Edwards
Would it be better for that read-decompress-update-recompress-write
operation to skip zero-sized chunks?  I imagine it's a bit tricky if
the lowest-indexed rank's contribution to the chunk is zero-sized; but
can that happen?  Doesn't ownership move to the rank that has the
largest contribution to the chunk that's being written?



Re: Collective IO and filters

Michael K. Edwards
It does appear as though it's the "update" chunk that is zero-sized.
Is there any way to know that before decompressing, and to skip the
update higher up in the stack (perhaps in
H5D__link_chunk_filtered_collective_io())?



Re: Collective IO and filters

Jordan Henderson
By zero-sized chunks, do you mean that the actual chunks in the dataset are zero-sized, or that the data going into the write is zero-sized? It would seem odd to me if you were writing to an essentially zero-sized dataset composed of zero-sized chunks.

On the other hand, ranks that aren't participating should never construct a list of chunks in the H5D__construct_filtered_io_info_list() function, and thus should never participate in any chunk updating, only in the collective file space re-allocations and the re-insertion of chunks into the chunk index. That being said, if you are indeed seeing zero-sized malloc calls in the chunk update function, something must be wrong somewhere.

While it is true that chunks currently move to the rank with the largest contribution to the chunk which ALSO has the fewest chunks currently assigned to it (to try to get a more even distribution of chunks among all the ranks), any rank which has a zero-sized contribution to a chunk should never have created a chunk struct entry for that chunk and thus should not be participating in the chunk updating loop (lines 1471-1474 in the current develop branch). Such ranks should skip that loop and wait at the subsequent H5D__mpio_array_gatherv() until the other ranks are done processing. Again, this is what should happen, but it may not be what is actually happening in your case.


Re: Collective IO and filters

Michael K. Edwards
I added a debug printf (I am currently running a test with 4 ranks on
the same host), and here is what I see.  The "M of N" numbers reflect
the size of the memspace and filespace respectively.  The printf is
inserted immediately before H5Dwrite() in my modified version of
ISView_General_HDF5() (in PETSc's
src/vec/is/is/impls/general/general.c).
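A sketch of roughly what that printf looks like (the actual patch is not
reproduced here; H5Sget_simple_extent_npoints() is just one way to get
the two extents being printed, and memspace/filespace are the dataspace
handles from the surrounding function):

/* "M of N" = memspace extent vs. filespace extent, printed just before
 * the collective H5Dwrite() call. */
hssize_t mem_n  = H5Sget_simple_extent_npoints(memspace);
hssize_t file_n = H5Sget_simple_extent_npoints(filespace);
printf("About to write %lld of %lld\n", (long long)mem_n, (long long)file_n);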

About to write 148 of 636
About to write 176 of 636
About to write 163 of 636
About to write 149 of 636
About to write 176 of 636
About to write 148 of 636
About to write 149 of 636
About to write 163 of 636
About to write 310 of 1136
About to write 266 of 1136
About to write 258 of 1136
About to write 302 of 1136
About to write 310 of 1136
About to write 266 of 1136
About to write 258 of 1136
About to write 302 of 1136
About to write 124 of 520
About to write 120 of 520
About to write 140 of 520
About to write 136 of 520
About to write 23 of 80
About to write 19 of 80
About to write 14 of 80
About to write 24 of 80
About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3277 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed

I'm trying to do this in the way you suggested, where non-contributing
ranks create a zero-sized memspace (with the appropriate dimensions)
and call H5Sselect_none() on the filespace, then call H5Dwrite() in
the usual way to participate in the collective write.  Where in the
code would you expect the test that filters out zero-sized chunks to
be?
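For concreteness, the shape of the change on the PETSc side is roughly
the sketch below, reusing n, dim, count, offset and dset_id from the
snippet earlier in the thread.  memtype, plist_id and arr stand in for
the actual datatype, transfer property list and buffer, and the
PetscStackCall wrappers and error checking are omitted.

hid_t   memspace, filespace;
hsize_t zero_dims[H5S_MAX_RANK] = {0};   /* all-zero extents for idle ranks */

if (n > 0) {
    memspace  = H5Screate_simple(dim, count, NULL);
    filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
} else {
    /* Zero-sized (but simple, not NULL) memspace plus an empty file
     * selection: this rank contributes nothing but still participates. */
    memspace  = H5Screate_simple(dim, zero_dims, NULL);
    H5Sselect_none(memspace);
    filespace = H5Dget_space(dset_id);
    H5Sselect_none(filespace);
}

/* Every rank calls H5Dwrite() so the operation stays collective. */
H5Dwrite(dset_id, memtype, memspace, filespace, plist_id, arr);
H5Sclose(filespace);
H5Sclose(memspace);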




Re: Collective IO and filters

Jordan Henderson
In the H5D__link_chunk_filtered_collective_io() function, all ranks (after some initialization work) should first hit H5D__construct_filtered_io_info_list(). Inside that function, at line 2741, each rank counts the number of chunks it has selected; only if a rank has any selected should it then proceed with building its local list of chunks. The ranks which aren't participating should skip this and wait for the other ranks to finish before everyone participates in the chunk redistribution.

After the redistribution, the non-participating ranks shouldn't have any chunks assigned to them, since they could not be considered among the ranks writing the most to any of the chunks. They should then return from the function back to H5D__link_chunk_filtered_collective_io(), with chunk_list_num_entries telling them that they have no chunks to work on. At that point they should skip the loop at lines 1471-1474 and wait for the others. The only case I can currently imagine where the chunk redistribution could get confused would be one where no rank at all is writing anything; Multi-Chunk I/O specifically handles this, but I'm not sure whether Link-Chunk I/O handles that case as well as Multi-Chunk does.

This all assumes, of course, that I understand what you mean by the zero-sized chunks, which I believe I do, given that your file space for the chunks is positive in size.


Re: Collective IO and filters

Michael K. Edwards
So I think the distinction here is between "participating" for
synchronization purposes and having a nonzero slice of data locally.
I think (correct me if I'm wrong) that all ranks have to call
H5Dwrite() even if they have called H5Sselect_none() on the filespace.
That will cause them to send metadata describing their zero-sized
contributions to shared chunks to the rank 0 coordinator.  They won't
get chosen as the new owner, but their metadata will be included in
the chunk_entry list sent from rank 0 to the new owner, which means
they will be expected to send chunks to the new owner.  The crash
happens when these zero-sized chunks are decoded by the filter plugin;
even if I stop the plugin itself from crashing, it has to return size
0 to H5Z_pipeline(), which interprets that as filter failure and
crashes out in H5Z.c line 1256.

That's something I can probably work around, but before I go too far
down that road, I'd love it if you could correct any misapprehensions
in this.  Is it the case that all ranks have to call H5Dwrite()?  Is
there a way to know what the uncompressed data size will be, and skip
the zero-sized chunk_entry units somewhere up the stack?





Re: Collective IO and filters

Jordan Henderson

For the purpose of collective I/O it is true that all ranks must call H5Dwrite() so that they can participate in those collective operations that are necessary (the file space re-allocation and so on). However, even though they called H5Dwrite() with a valid memspace, the fact that they have a NONE selection in the given file space should cause their chunk-file mapping struct (see lines 357-385 of H5Dpkg.h for the struct's definition and the code for H5D__link_chunk_filtered_collective_io() to see how it uses this built up list of chunks selected in the file) to contain no entries in the "fm->sel_chunks" field. That alone should mean that during the chunk redistribution, they will not actually send anything at all to any of the ranks. They only participate there for the sake that, were the method of redistribution modified, ranks which previously had no chunks selected could potentially be given some chunks to work on.


For all practical purposes, every single chunk_entry seen in the list from rank 0's perspective should be a valid I/O caused by some rank writing some positive amount of bytes to the chunk. On rank 0's side, you should be able to check the io_size field of each of the chunk_entry entries and see how big the I/O is from the "original_owner" to that chunk. If any of these are 0, something is likely very wrong. If that is indeed the case, you could likely pull a hacky workaround by manually removing them from the list, but I'd be more concerned about the root of the problem if there are zero-size I/O chunk_entry entries being added to the list.
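Something along the lines of the sketch below on rank 0, right after the
chunk_entry list has been gathered; chunk_list and num_chunk_entries are
placeholder names for whatever the gathered list is called in H5Dmpio.c.

/* Rank-0 sanity check: flag any gathered chunk_entry whose I/O size is
 * zero, which should never happen for a valid contribution. */
for (size_t i = 0; i < num_chunk_entries; i++) {
    if (0 == chunk_list[i].io_size)
        fprintf(stderr, "chunk_entry %zu: zero-sized I/O from original owner %d\n",
                i, chunk_list[i].original_owner);
}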


Re: Collective IO and filters

Jordan Henderson
In reply to this post by Michael K. Edwards

Also, now that the hanging issue has been resolved, would it be possible to try this same code again with a different filter, perhaps within the gzip/szip family? I'm curious as to whether the filter has anything to do with this issue or not.



Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Jordan Henderson
Thank you for the explanation.  That's consistent with what I see when
I add a debug printf into H5D__construct_filtered_io_info_list().  So
I'm now looking into the filter situation.  It's possible that the
H5Z-blosc glue is mishandling the case where the compressed data is
larger than the uncompressed data.
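In particular, blosc needs up to nbytes + BLOSC_MAX_OVERHEAD bytes of
destination space to guarantee success, so glue that sizes the output
buffer as exactly the input size can fail (or appear to fail) on
incompressible chunks.  A sketch of a compression helper that always
leaves room for the worst case; compress_chunk() is illustrative, not
the actual hdf5-blosc code:

#include <stdlib.h>
#include <blosc.h>

static void *
compress_chunk(const void *src, size_t nbytes, size_t typesize,
               int clevel, size_t *out_bytes)
{
    size_t destsize = nbytes + BLOSC_MAX_OVERHEAD;   /* worst case */
    void  *dest;
    int    csize;

    *out_bytes = 0;
    if (NULL == src || 0 == nbytes)
        return NULL;                 /* nothing to compress */

    if (NULL == (dest = malloc(destsize)))
        return NULL;

    csize = blosc_compress(clevel, 1 /* shuffle */, typesize,
                           nbytes, src, dest, destsize);
    if (csize <= 0) {                /* error, or output did not fit */
        free(dest);
        return NULL;
    }

    *out_bytes = (size_t)csize;
    return dest;
}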

About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
Rank 0 selected 12 of 20
Rank 1 selected 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't
finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in
H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3278 in
H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for
modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed





Re: Collective IO and filters

Michael K. Edwards
In reply to this post by Jordan Henderson
I don't think szip will work, because it bombs out when there isn't
enough data to reach its minimal compression unit (usually configured
as 32 bytes).  I can try zlib (deflate).
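Switching the test to deflate should just be a change on the dataset
creation property list; a sketch, where dcpl_id is assumed to be the
DCPL that already carries the H5Pset_chunk() call:

/* Use the built-in deflate filter instead of the registered blosc
 * filter (level 6 is an arbitrary choice). */
if (H5Pset_deflate(dcpl_id, 6) < 0)
    fprintf(stderr, "failed to enable the deflate filter\n");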


_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5