HDF5 library hang in H5DWrite_f in collective mode

HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
Hi,

I have an MPI application in which each process samples some data. Each
process can have an arbitrary number of sampling points (or no points at
all). During the simulation, each process buffers the sample values in
local memory until the buffer is full. At that point each process sends
its data to designated IO processes, and the IO processes open an HDF5
file, extend a dataset and write the data into the file.

The filespace can be quite complicated, constructed with numerous calls
to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
block of data. The chunk size is equal to the buffer size, i.e. each
time the dataset is extended, it is extended by exactly one chunk.
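For reference, the extend-then-write step described above can be sketched with the HDF5 C API. This is a minimal, hedged fragment under assumed names (`dset`, `cur_dims`, `chunk`, `buf` are placeholders, not the poster's actual code), showing a 1-D chunked dataset grown by exactly one chunk and written collectively:

```c
/* Hypothetical fragment: extend a 1-D chunked dataset by one chunk and
 * write that chunk collectively. All names here are assumptions. */
hsize_t new_dims[1] = { cur_dims[0] + chunk };        /* grow by one chunk */
H5Dset_extent(dset, new_dims);

hid_t fspace = H5Dget_space(dset);                    /* re-read the filespace */
hsize_t start[1] = { cur_dims[0] }, count[1] = { chunk };
H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

hid_t mspace = H5Screate_simple(1, count, NULL);      /* contiguous memspace */

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);         /* collective transfer */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);
```

Switching `H5FD_MPIO_COLLECTIVE` to `H5FD_MPIO_INDEPENDENT` on the transfer property list is the toggle referred to below as "turning off collective IO".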

The problem is that in some cases the application hangs in h5dwrite_f
(it is a Fortran application), and I cannot see why. It happens on
multiple systems with different MPI implementations, so I believe the
problem is in my application or in the HDF5 library, not in the MPI
implementation or at the system level.

The problem disappears if I turn off collective IO.
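One generic pattern worth double-checking whenever a collective write hangs (a general requirement of collective HDF5 transfers, not a diagnosis of this particular case): every rank in the file's communicator must still call H5Dwrite, even when it has no points to contribute, otherwise the ranks that do call it block forever. A hedged C sketch, with placeholder names:

```c
/* Hypothetical fragment: ranks with no sample points must still take part
 * in the collective H5Dwrite, with an empty selection on both dataspaces. */
if (nelems == 0) {
    H5Sselect_none(fspace);   /* empty file-side selection */
    H5Sselect_none(mspace);   /* empty memory-side selection */
}
/* Every rank in the file's communicator makes this call: */
H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);
```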

I have tried to compile HDF5 with as much error checking as possible
(--enable-debug=all --disable-production) and I do not get any errors or
warnings from the HDF5 library.

I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.

Does anyone have any idea on how to continue debugging this problem?

I currently use HDF5 version 1.8.17.

Best regards,
Håkon Strandenes

_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Attachment: Backtrace HDF5 err.png (79K)

Re: HDF5 library hang in H5DWrite_f in collective mode

Quincey Koziol-3
Hi Håkon,
        Actually, given this behavior, it’s reasonably possible that you have found a bug in the MPI implementation that you have, so I wouldn’t rule that out.  What implementation and version of MPI are you using?

        Quincey




Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
(sorry, forgot to cc mailing list in prev. mail)

A standalone test program would be quite an effort, but I will think
about it. I know that at least all simple test cases pass, so I need a
"complicated" problem to trigger the error.

One thing I wonder about: are the requirements for collective IO in
this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?

The reason I ask is that my filespace is complicated. Each IO process
creates the filespace with MANY calls to select_hyperslab. Hence it is
neither regular nor singular, and according to the above-mentioned
document the HDF5 library should not be able to do collective IO in this
case. Still, it seems to hang in some collective writing routine.
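For context, an irregular selection of the kind described is typically built by OR-ing hyperslabs together. A hedged sketch in the HDF5 C API (the actual code is Fortran; `start`, `count` and `n` are placeholders):

```c
/* Hypothetical fragment: build an irregular filespace selection from n
 * disjoint blocks by combining hyperslabs with H5S_SELECT_OR. */
H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start[0], NULL, count[0], NULL);
for (int i = 1; i < n; i++)
    H5Sselect_hyperslab(fspace, H5S_SELECT_OR, start[i], NULL, count[i], NULL);
```

The resulting selection is neither "regular" nor "singular" in the sense of the hints document cited above.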

Am I onto something? Could this be a problem?

Regards,
Håkon


On 05/19/2017 04:46 PM, Quincey Koziol wrote:

> Hmm, sounds like you’ve varied a lot of things, which is good.  But, the constant seems to be your code now. :-/  Can you replicate the error with a small standalone C test program?
>
> Quincey
>
>
>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[hidden email]> wrote:
>>
>> The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but that is not as well tested on the systems we are using as the previously mentioned ones.
>>
>> I also tested and can confirm that the problem is there as well with HDF5 1.10.1.
>>
>> Regards,
>> Håkon


Re: HDF5 library hang in H5DWrite_f in collective mode

Quincey Koziol-3
Hi Håkon,

> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[hidden email]> wrote:
>
> (sorry, forgot to cc mailing list in prev. mail)
>
> A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to generate the error.

        Yeah, that’s usually the case with these kind of issues.  :-/


> One thing I wonder about is:
> Is the requirements for collective IO in this document:
> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
> still valid and accurate?
>
> The reason I ask is that my filespace is complicated. Each IO process create the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems like it hangs in some collective writing routine.
>
> Am I onto something? Could this be a problem?

        Fortunately, we’ve expanded the feature set for collective I/O now and it supports arbitrary selections on chunked datasets.  There’s always the chance for a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…

                Quincey



Re: HDF5 library hang in H5DWrite_f in collective mode

Scot Breitenfeld
Can you try it with 1.10.1 and see if you still have the issue?

Scot


Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
Yes, the issue is still there.

I will try to make a dummy program that demonstrates the error. It
might be the easiest thing to debug in the long run.

Regards,
Håkon



Re: HDF5 library hang in H5DWrite_f in collective mode

Quincey Koziol-3

> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[hidden email]> wrote:
>
> Yes, the issue is still there.
>
> I will try to make a dummy program to demonstrate the error. It might be the easiest thing to debug on in the long run.

        That would be very helpful, thanks,
                Quincey


Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
I have managed to prepare an example program. I got rid of a lot of
non-essential stuff by preparing some data files in advance. The
example is for 20 processes *only*.

I reported earlier that I also found the bug on a system with SGI MPT;
this example runs fine on that system, so let's disregard that for the
moment.

The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.

I tested for instance:
HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

And the following does not work:
HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

Does anyone have any idea on how to proceed in the debugging? Does
anyone see any obvious flaws in my example program?

Thanks for all help.

Regards,
Håkon Strandenes



h5HangDbg.tar.gz (2M) Download Attachment

Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
One correction:

The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort
2016.3.210" is actually a different problem, a segmentation fault.

To avoid confusion, I repeat the working/not working cases I tried:

HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault (a different
problem, possibly with the HDF5 installation)

HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

I also tested on another cluster with a GPFS parallel file system
(instead of Lustre):

Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING

So the common denominator seems to be Intel MPI 2017.

Regards,
Håkon
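
For context, the collective/independent switch being discussed in this
thread is set on the dataset transfer property list passed to
h5dwrite_f. A minimal Fortran sketch (identifiers such as dset_id,
memspace, filespace and buf are placeholders, not names from the actual
example program):

```fortran
! Sketch: choose the MPI-IO transfer mode for a parallel write.
! dset_id, memspace, filespace, buf and dims are assumed to exist.
integer(hid_t) :: xfer_plist
integer :: hdferr

call h5pcreate_f(H5P_DATASET_XFER_F, xfer_plist, hdferr)
! Use H5FD_MPIO_INDEPENDENT_F here to turn collective IO off:
call h5pset_dxpl_mpio_f(xfer_plist, H5FD_MPIO_COLLECTIVE_F, hdferr)

call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, hdferr, &
                mem_space_id=memspace, file_space_id=filespace, &
                xfer_prp=xfer_plist)

call h5pclose_f(xfer_plist, hdferr)
```

Switching the mode to H5FD_MPIO_INDEPENDENT_F is the "turn off
collective IO" workaround mentioned earlier in the thread.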



Re: HDF5 library hang in H5DWrite_f in collective mode

Scot Breitenfeld
I tried your example using both HDF5 1.8.18 and our develop branch (basically 1.10.1) on a CentOS 7 system, and your program completes successfully with Intel 17.0.4.

mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
Copyright(C) 2003-2017, Intel Corporation.  All rights reserved.
ifort version 17.0.4

Can you verify if ‘make test’ passes in testpar and fortran/testpar for your installation?

Thanks,
Scot  
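
The parallel test suites Scot refers to can be run from the HDF5 build
tree roughly as follows (the build directory path is illustrative):

```shell
# Illustrative paths; run from your own HDF5 build directory.
cd /path/to/hdf5-build
make -C testpar test            # C parallel tests
make -C fortran/testpar test    # Fortran parallel tests
```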


Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
Thanks for trying my example. I will try the tests.

However, when trying my own example again I realized that the error
does not occur when running on a single compute node or workstation. I
tested 20 processes on a single node, both on a node-local filesystem
(local scratch) and on a parallel networked filesystem, and both
worked. Running five processes on each of four nodes leads to the
error/hanging condition.

Regards,
Håkon Strandenes


On 05/25/2017 04:30 PM, Scot Breitenfeld wrote:

> I tried your example using both HDF5 1.8.18 and our develop branch (basically 1.10.1) on a CentOS 7 system and your program completes successfully using Intel 17.0.4.
>
> mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
> Copyright(C) 2003-2017, Intel Corporation.  All rights reserved.
> ifort version 17.0.4
>
> Can you verify if ‘make test’ passes in testpar and fortran/testpar for your installation?
>
> Thanks,
> Scot
>
>> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[hidden email]> wrote:
>>
>> One correction:
>>
>> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210" are another problem with a segmentation fault.
>>
>> To avioud confusion, I repeat the working/not working cases I tried:
>>
>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>
>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - other problem, maybe with HDF5 installation
>>
>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>
>> I also tested on another cluster with GPFS parallel file system (instead of LUSTRE):
>>
>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>>
>> So the common denominator seems to be Intel MPI 2017.
>>
>> Regards,
>> Håkon
>>
>>
>> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>>> I have managed to prepare an example program. I got away a
>>> lot of non-essential stuff, by preparing some datafiles in advance. The example is for 20 processes *only*.
>>> I reported earlier that I also found the bug on a system with SGI MPT, this example runs fine on this system, so let's for the moment disregard that.
>>> The problem occur with combinations of "newer" Intel MPI with "newer" HDF5.
>>> I tested for instance:
>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>> And the following does not work:
>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>> Does anyone have any idea on how to proceed in the debugging? Does anyone see any obvious flaws in my example program?
>>> Thanks for all help.
>>> Regards,
>>> Håkon Strandenes
>>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>>
>>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[hidden email]> wrote:
>>>>>
>>>>> Yes, the issue is still there.
>>>>>
>>>>> I will try to make a dummy program to demonstrate the error. It might be the easiest thing to debug on in the long run.
>>>>
>>>>     That would be very helpful, thanks,
>>>>         Quincey
>>>>
>>>>>
>>>>> Regards,
>>>>> Håkon
>>>>>
>>>>>
>>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>>> Can you try it with 1.10.1 and see if you still have an issue.
>>>>>> Scot
>>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hi Håkon,
>>>>>>>
>>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> (sorry, forgot to cc mailing list in prev. mail)
>>>>>>>>
>>>>>>>> A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to generate the error.
>>>>>>>
>>>>>>>     Yeah, that’s usually the case with these kind of issues.  :-/
>>>>>>>
>>>>>>>
>>>>>>>> One thing I wonder about is:
>>>>>>>> Is the requirements for collective IO in this document:
>>>>>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>>> still valid and accurate?
>>>>>>>>
>>>>>>>> The reason I ask is that my filespace is complicated. Each IO process create the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems like it hangs in some collective writing routine.
>>>>>>>>
>>>>>>>> Am I onto something? Could this be a problem?
>>>>>>>
>>>>>>>     Fortunately, we’ve expanded the feature set for collective I/O now and it supports arbitrary selections on chunked datasets.  There’s always the chance for a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…
>>>>>>>
>>>>>>>         Quincey
>>>>>>>
>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Håkon
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>>> Hmm, sounds like you’ve varied a lot of things, which is good.  But, the constant seems to be your code now. :-/  Can you replicate the error with a small standalone C test program?
>>>>>>>>>     Quincey
>>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>> The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but that is not as well tested on the systems we are using as the previously mentioned ones.
>>>>>>>>>>
>>>>>>>>>> I also tested and can confirm that the problem is there as well with HDF5 1.10.1.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Håkon
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>>> Hi Håkon,
>>>>>>>>>>>     Actually, given this behavior, it’s reasonably possible that you have found a bug in the MPI implementation that you have, so I wouldn’t rule that out.  What implementation and version of MPI are you using?
>>>>>>>>>>>     Quincey
>>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[hidden email]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I have an MPI application where each process sample some data. Each
>>>>>>>>>>>> process can have an arbitrary number of sampling points (or no points at
>>>>>>>>>>>> all). During the simulation each process buffer the sample values in
>>>>>>>>>>>> local memory until the buffer is full. At that point each process send
>>>>>>>>>>>> its data to designated IO processes, and the IO processes open a HDF5
>>>>>>>>>>>> file, extend a dataset and write the data into the file.
>>>>>>>>>>>>
>>>>>>>>>>>> The filespace can be quite compicated, constructed with numerous calls
>>>>>>>>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
>>>>>>>>>>>> block of data. The chunk size is equal to the buffer size, i.e. each
>>>>>>>>>>>> time the dataset is extended it is extended by exactly one chunk.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is that in some cases, the application hangs in h5dwrite_f
>>>>>>>>>>>> (Fortran application). I cannot see why. It happens on multiple systems
>>>>>>>>>>>> with different MPI implementations, so I believe that the problem is in
>>>>>>>>>>>> my application or in the HDF5 library, not in the MPI implementation or
>>>>>>>>>>>> on the system level.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>>
>>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as possible
>>>>>>>>>>>> (--enable-debug=all --disable-production) and I do not get any errors or
>>>>>>>>>>>> warnings from the HDF5 library.
>>>>>>>>>>>>
>>>>>>>>>>>> I ran the code through TotalView, and got the attached backtrace for the
>>>>>>>>>>>> 20 processes that participate in the IO communicator.
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>>>>>>>>
>>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Håkon Strandenes
>>>>>>>>>>>> <Backtrace HDF5 err.png>_______________________________________________
>>>>>>>>>>>> Hdf-forum is for HDF software users discussion.
>>>>>>>>>>>> [hidden email]
>>>>>>>>>>>> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>>>>>>>>>>>> Twitter: https://twitter.com/hdf5

Re: HDF5 library hang in H5DWrite_f in collective mode

Håkon Strandenes-2
I am sorry for the delay; my health has not been cooperating in
debugging this problem over the last week.

I now tried HDF5 1.8.18 with Intel MPI 2017.1.132, and all tests pass,
both serial and parallel, C and Fortran.

The example still fails when run over more than one compute node. When
the dataset transfer mode is H5FD_MPIO_INDEPENDENT_F it succeeds.
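One generic thing worth ruling out here (an illustration of a common pitfall, not a claim about this particular bug): with H5FD_MPIO_COLLECTIVE_F every process in the file's communicator must issue the same number of h5dwrite_f calls, and a rank with nothing to write still has to participate (selecting nothing with h5sselect_none_f). A toy Python model, with made-up rank counts, of how one missing collective call becomes a hang for everyone else:

```python
# Toy model only -- no HDF5 involved. It illustrates the matching rule
# for collective calls: every rank must issue the same number of
# collective writes, or the ranks that keep going block forever.

def blocked_ranks(calls_per_rank):
    """Given {rank: number of collective writes issued}, return the set
    of ranks left waiting inside a collective call that never completes."""
    counts = set(calls_per_rank.values())
    if len(counts) <= 1:
        return set()          # every rank agrees: no deadlock
    # Only the first min(counts) collective calls have full participation;
    # any rank attempting a call beyond that waits forever.
    completed = min(counts)
    return {rank for rank, n in calls_per_rank.items() if n > completed}

# 20 IO ranks, as in this thread; imagine rank 7's buffer never filled,
# so it issued one collective write fewer than the others.
calls = {rank: 3 for rank in range(20)}
calls[7] = 2
print(sorted(blocked_ranks(calls)))   # all ranks except 7 hang in the 3rd call
```

This only models the participation rule; whether it applies here depends on how the IO processes coordinate their buffer flushes.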

My next step will be to try building HDF5 with CMake instead of
configure, to see if this changes anything.
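For reference, a sketch of both build paths. The configure flags are the ones used earlier in this thread plus parallel and Fortran support; the CMake option names are assumed from the standard HDF5 CMake setup and may differ between releases:

```shell
# Autotools build, from the HDF5 source tree, with the debug checking
# mentioned earlier in this thread plus parallel and Fortran support:
CC=mpicc FC=mpif90 ./configure --enable-parallel --enable-fortran \
    --enable-debug=all --disable-production

# Roughly equivalent CMake build (option names assumed; check your release):
mkdir build && cd build
cmake -DHDF5_ENABLE_PARALLEL:BOOL=ON -DHDF5_BUILD_FORTRAN:BOOL=ON \
      -DCMAKE_BUILD_TYPE=Debug ..
```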

Regards,
Håkon


On 05/26/2017 04:03 PM, Håkon Strandenes wrote:

> Thanks for trying my example. I will try the tests.
>
> However, when trying my own example again I realized that the error does
> not occur when running on one compute node or a single workstation. I
> tested 20 processes on a single node, both running on a node-local
> filesystem (local scratch) and a parallel networked filesystem, and that
> worked. Running five processes each on four nodes leads to the
> error/hanging condition.
>
> Regards,
> Håkon Strandenes
>
>
> On 05/25/2017 04:30 PM, Scot Breitenfeld wrote:
>> I tried your example using both HDF5 1.8.18 and our develop branch
>> (basically 1.10.1) on a CentOS 7 system and your program completes
>> successfully using Intel 17.0.4.
>>
>> mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
>> Copyright(C) 2003-2017, Intel Corporation.  All rights reserved.
>> ifort version 17.0.4
>>
>> Can you verify if ‘make test’ passes in testpar and fortran/testpar
>> for your installation?
>>
>> Thanks,
>> Scot
>>
>>> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[hidden email]>
>>> wrote:
>>>
>>> One correction:
>>>
>>> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort
>>> 2016.3.210" is a different problem, involving a segmentation fault.
>>>
>>> To avoid confusion, I repeat the working/not working cases I tried:
>>>
>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>
>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - other problem,
>>> maybe with HDF5 installation
>>>
>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>
>>> I also tested on another cluster with GPFS parallel file system
>>> (instead of LUSTRE):
>>>
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>>>
>>> So the common denominator seems to be Intel MPI 2017.
>>>
>>> Regards,
>>> Håkon
>>>
>>>
>>> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>>>> I have managed to prepare an example program. I stripped away a
>>>> lot of non-essential stuff by preparing some datafiles in advance.
>>>> The example is for 20 processes *only*.
>>>> I reported earlier that I also found the bug on a system with SGI
>>>> MPT; this example runs fine on that system, so let's disregard it
>>>> for the moment.
>>>> The problem occurs with combinations of "newer" Intel MPI with
>>>> "newer" HDF5.
>>>> I tested for instance:
>>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>> And the following does not work:
>>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>> Does anyone have any idea on how to proceed in the debugging? Does
>>>> anyone see any obvious flaws in my example program?
>>>> Thanks for all help.
>>>> Regards,
>>>> Håkon Strandenes
>>>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>>>
>>>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes
>>>>>> <[hidden email]> wrote:
>>>>>>
>>>>>> Yes, the issue is still there.
>>>>>>
>>>>>> I will try to make a dummy program to demonstrate the error. It
>>>>>> might be the easiest thing to debug on in the long run.
>>>>>
>>>>>     That would be very helpful, thanks,
>>>>>         Quincey
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Håkon
>>>>>>
>>>>>>
>>>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>>>> Can you try it with 1.10.1 and see if you still have an issue.
>>>>>>> Scot
>>>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> Hi Håkon,
>>>>>>>>
>>>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes
>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>
>>>>>>>>> (sorry, forgot to cc mailing list in prev. mail)
>>>>>>>>>
>>>>>>>>> A standalone test program would be quite an effort, but I will
>>>>>>>>> think about it. I know that at least all simple test cases
>>>>>>>>> pass, so I need a "complicated" problem to generate the error.
>>>>>>>>
>>>>>>>>     Yeah, that’s usually the case with these kind of issues.  :-/
>>>>>>>>
>>>>>>>>
>>>>>>>>> One thing I wonder about is:
>>>>>>>>> Is the requirements for collective IO in this document:
>>>>>>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>>>> still valid and accurate?
>>>>>>>>>
>>>>>>>>> The reason I ask is that my filespace is complicated. Each IO
>>>>>>>>> process creates the filespace with MANY calls to
>>>>>>>>> select_hyperslab. Hence it is neither regular nor singular, and
>>>>>>>>> according to the above-mentioned document the HDF5 library
>>>>>>>>> should not be able to do collective IO in this case. Still, it
>>>>>>>>> seems like it hangs in some collective writing routine.
>>>>>>>>>
>>>>>>>>> Am I onto something? Could this be a problem?
>>>>>>>>
>>>>>>>>     Fortunately, we’ve expanded the feature set for collective
>>>>>>>> I/O now and it supports arbitrary selections on chunked
>>>>>>>> datasets.  There’s always the chance for a bug of course, but it
>>>>>>>> would have to be very unusual, since we are pretty thorough
>>>>>>>> about the regression testing…
>>>>>>>>
>>>>>>>>         Quincey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Håkon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>>>> Hmm, sounds like you’ve varied a lot of things, which is
>>>>>>>>>> good.  But, the constant seems to be your code now. :-/  Can
>>>>>>>>>> you replicate the error with a small standalone C test program?
>>>>>>>>>>     Quincey
>>>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes
>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The behavior is there both with SGI MPT and Intel MPI. I can
>>>>>>>>>>> try OpenMPI as well, but that is not as well tested on the
>>>>>>>>>>> systems we are using as the previously mentioned ones.
>>>>>>>>>>>
>>>>>>>>>>> I also tested and can confirm that the problem is there as
>>>>>>>>>>> well with HDF5 1.10.1.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Håkon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>>>> Hi Håkon,
>>>>>>>>>>>>     Actually, given this behavior, it’s reasonably possible
>>>>>>>>>>>> that you have found a bug in the MPI implementation that you
>>>>>>>>>>>> have, so I wouldn’t rule that out.  What implementation and
>>>>>>>>>>>> version of MPI are you using?
>>>>>>>>>>>>     Quincey
>>>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes
>>>>>>>>>>>>> <[hidden email]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an MPI application where each process samples some
>>>>>>>>>>>>> data. Each process can have an arbitrary number of sampling
>>>>>>>>>>>>> points (or no points at all). During the simulation each
>>>>>>>>>>>>> process buffers the sample values in local memory until the
>>>>>>>>>>>>> buffer is full. At that point each process sends its data to
>>>>>>>>>>>>> designated IO processes, and the IO processes open an HDF5
>>>>>>>>>>>>> file, extend a dataset and write the data into the file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The filespace can be quite complicated, constructed with
>>>>>>>>>>>>> numerous calls
>>>>>>>>>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple
>>>>>>>>>>>>> contiguous
>>>>>>>>>>>>> block of data. The chunk size is equal to the buffer size,
>>>>>>>>>>>>> i.e. each
>>>>>>>>>>>>> time the dataset is extended it is extended by exactly one
>>>>>>>>>>>>> chunk.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is that in some cases, the application hangs in
>>>>>>>>>>>>> h5dwrite_f
>>>>>>>>>>>>> (Fortran application). I cannot see why. It happens on
>>>>>>>>>>>>> multiple systems
>>>>>>>>>>>>> with different MPI implementations, so I believe that the
>>>>>>>>>>>>> problem is in
>>>>>>>>>>>>> my application or in the HDF5 library, not in the MPI
>>>>>>>>>>>>> implementation or
>>>>>>>>>>>>> on the system level.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as
>>>>>>>>>>>>> possible
>>>>>>>>>>>>> (--enable-debug=all --disable-production) and I do not get
>>>>>>>>>>>>> any errors or
>>>>>>>>>>>>> warnings from the HDF5 library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the code through TotalView, and got the attached
>>>>>>>>>>>>> backtrace for the
>>>>>>>>>>>>> 20 processes that participate in the IO communicator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone have any idea on how to continue debugging this
>>>>>>>>>>>>> problem?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Håkon Strandenes