[FLASH-USERS] MPI failure

Peter Vitello vitello at llnl.gov
Mon Jun 11 16:03:14 CDT 2007


Alan,

   Thanks for the response.  I don't think this is a MAXBLOCKS problem.
I am running on 45 nodes with MAXBLOCKS=8000 specifically to avoid it.
My actual min_blocks and max_blocks just before an abort are 564 and
651, which are well below MAXBLOCKS.

Peter

At 12:41 PM 6/11/2007, you wrote:

>Peter-
>
>Perhaps you have checked, but the "Out of memory" suggests exactly
>that. Was the block count on each processor approaching MAXBLOCKS?
>I have seen crashes with non-intuitive MPI errors when the number of
>blocks on a processor gets close to the maximum and the code
>runs out of memory. You might try the run again on more processors.
>
>Hope this helps,
>
>Alan
>
>On Mon, 11 Jun 2007, Peter Vitello wrote:
>
>>
>>I have a 2D hydrodynamic calculation using FLASH2.5 which runs for 
>>a while and then fails with a number of MPI error messages.
>>
>>I am using the pgF90 compiler with -fast settings and
>>MPICH2-1.0.4p1 on a Linux cluster.
>>
>>Any suggestions?  I would appreciate any help.
>>
>>Peter Vitello
>>LLNL
>>
>>[cli_20]: aborting job:
>>Fatal error in MPI_Irecv: Other MPI error, error stack:
>>MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, 
>>MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD, 
>>request=0x137dab7c) failed
>>MPID_Irecv(74): Out of memory
>>[cli_0]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=23)
>>[cli_4]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=10)
>>[cli_12]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=8)
>>[cli_16]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=3)
>>[cli_24]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=6)
>>[cli_19]: aborting job:
>>Fatal error in MPI_Waitall: Other MPI error, error stack:
>>MPI_Waitall(242)..........................: MPI_Waitall(count=528, 
>>req_array=0x137da480, status_array=0x13653a80) failed
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=2)
>>[cli_8]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=17)
>>[cli_28]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(286).......................:
>>MPIC_Sendrecv(161)........................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=5)
>>[cli_21]: aborting job:
>>Fatal error in MPI_Ssend: Other MPI error, error stack:
>>MPI_Ssend(167)............................: 
>>MPI_Ssend(buf=0x12c41820, count=12, MPI_DOUBLE_PRECISION, dest=20, 
>>tag=291, MPI_COMM_WORLD) failed
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(670)..............: connection failure 
>>(set=0,sock=1,errno=104:Connection reset by peer)
>>[cli_18]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=4)
>>[cli_6]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=8)
>>[cli_22]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=1)
>>[cli_17]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(286).......................:
>>MPIC_Sendrecv(161)........................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by 
>>peer (set=0,sock=12)
>>rank 20 in job 7  n06.llnl.gov_14457   caused collective abort of all ranks
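
In the log above, only rank 20 reports "Out of memory"; every other
rank reports that a peer closed or reset its connection, and the last
line names rank 20 as the cause of the collective abort. With this many
ranks it can help to sort the abort messages mechanically. The Python
sketch below is one way to do that. The function name classify_aborts,
the SAMPLE excerpt (abridged from the log above), and the rule "a block
that mentions a closed or reset peer connection is a cascade" are
assumptions made for illustration, not something MPICH2 documents.

```python
import re

# Header printed when a rank aborts, e.g. "[cli_20]: aborting job:".
HEADER = re.compile(r"\[cli_(\d+)\]: aborting job:")
# The MPI call that failed, e.g. "Fatal error in MPI_Irecv:".
CALL = re.compile(r"Fatal error in (MPI_\w+):")
# Assumption: these phrases mean the rank failed because a peer was already gone.
CASCADE = ("connection closed by peer", "Connection reset by peer")


def classify_aborts(log_text):
    """Return {rank: (mpi_call, is_cascade)} for each abort block in log_text."""
    # Drop mail quote markers, then rejoin lines that the mailer wrapped.
    lines = [line.lstrip("> ").rstrip() for line in log_text.splitlines()]
    flat = re.sub(r"\s+", " ", " ".join(lines))
    headers = list(HEADER.finditer(flat))
    result = {}
    for i, header in enumerate(headers):
        # A block runs from this header to the next one (or to the end).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(flat)
        block = flat[header.end():end]
        call = CALL.search(block)
        is_cascade = any(phrase in block for phrase in CASCADE)
        result[int(header.group(1))] = (call.group(1) if call else None, is_cascade)
    return result


# Abridged excerpt of the log above (three of the abort blocks).
SAMPLE = """\
>>[cli_20]: aborting job:
>>Fatal error in MPI_Irecv: Other MPI error, error stack:
>>MPID_Irecv(74): Out of memory
>>[cli_0]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=23)
>>[cli_21]: aborting job:
>>Fatal error in MPI_Ssend: Other MPI error, error stack:
>>MPIDU_Socki_handle_read(670)..............: connection failure
>>(set=0,sock=1,errno=104:Connection reset by peer)
"""

if __name__ == "__main__":
    for rank, (call, is_cascade) in sorted(classify_aborts(SAMPLE).items()):
        kind = "cascade (peer already gone)" if is_cascade else "possible root cause"
        print(f"rank {rank:2d}: {str(call):14s} {kind}")
```

On the excerpt this flags rank 20 (MPI_Irecv) as the only block that is
not a cascade. Headers that were interleaved in the output, such as
"[cli_22]: abort[cli_6]: aborting job:", are only partly recognized, so
a real log may need those untangled by hand first.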


