[FLASH-USERS] MPI failure
Peter Vitello
vitello at llnl.gov
Mon Jun 11 16:03:14 CDT 2007
Alan,
Thanks for the response. I don't think that this is a MAXBLOCKS
problem. I am running with 45 nodes and MAXBLOCKS=8000 to avoid
exactly that. My actual min_blocks and max_blocks just before an
abort are 564 and 651, which are well below MAXBLOCKS.
Peter
At 12:41 PM 6/11/2007, you wrote:
>Peter-
>
>Perhaps you have checked, but the "Out of memory" suggests exactly
>that. Was the block count on each processor approaching MAXBLOCKS?
>I have seen crashes with non-intuitive MPI errors when the number of
>blocks on the processors gets close to the maximum and the code
>runs out of memory. You might try the run again on more processors.
>
>Hope this helps,
>
>Alan
>
>On Mon, 11 Jun 2007, Peter Vitello wrote:
>
>>
>>I have a 2D hydrodynamic calculation using FLASH2.5 which runs for
>>a while and then fails with a number of MPI error messages.
>>
>>I am using the pgF90 compiler with -fast settings and
>>MPICH2-1.0.4p1 on a Linux cluster.
>>
>>Any suggestions? I would appreciate any help.
>>
>>Peter Vitello
>>LLNL
>>
>>[cli_20]: aborting job:
>>Fatal error in MPI_Irecv: Other MPI error, error stack:
>>MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12,
>>MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD,
>>request=0x137dab7c) failed
>>MPID_Irecv(74): Out of memory
>>[cli_0]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=23)
>>[cli_4]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=10)
>>[cli_12]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=8)
>>[cli_16]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=3)
>>[cli_24]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=6)
>>[cli_19]: aborting job:
>>Fatal error in MPI_Waitall: Other MPI error, error stack:
>>MPI_Waitall(242)..........................: MPI_Waitall(count=528,
>>req_array=0x137da480, status_array=0x13653a80) failed
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=2)
>>[cli_8]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=17)
>>[cli_28]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(286).......................:
>>MPIC_Sendrecv(161)........................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=5)
>>[cli_21]: aborting job:
>>Fatal error in MPI_Ssend: Other MPI error, error stack:
>>MPI_Ssend(167)............................:
>>MPI_Ssend(buf=0x12c41820, count=12, MPI_DOUBLE_PRECISION, dest=20,
>>tag=291, MPI_COMM_WORLD) failed
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(670)..............: connection failure
>>(set=0,sock=1,errno=104:Connection reset by peer)
>>[cli_18]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=4)
>>[cli_6]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=8)
>>[cli_22]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(464).......................:
>>MPIC_Recv(98).............................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=1)
>>[cli_17]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPI_Allreduce(701)........................:
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
>>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
>>MPIR_Allreduce(286).......................:
>>MPIC_Sendrecv(161)........................:
>>MPIC_Wait(324)............................:
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
>>MPIDI_CH3I_Progress_handle_sock_event(175):
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=12)
>>rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks
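
Most of the quoted messages above are "connection closed by peer" reports from ranks whose neighbour had already gone away, so they are probably consequences rather than the cause. The rough Python sketch below (not part of FLASH or MPICH2; the function name `triage`, the `SECONDARY` phrase list and the sample text are assumptions made up for illustration) groups such output by rank and sets aside the ranks that only report a lost connection, leaving the ones that report something else, such as "Out of memory".

```python
import re
from collections import defaultdict

# Phrases assumed to mean "my peer went away", i.e. a secondary failure.
SECONDARY = ("connection closed by peer", "Connection reset by peer")

def triage(log_text):
    """Group MPICH2-style abort output by rank and split the ranks into
    likely primary failures and secondary (lost-connection) failures."""
    blocks = defaultdict(list)
    rank = None
    for raw in log_text.splitlines():
        line = raw.lstrip("> ").rstrip()      # drop mail quote markers
        m = re.search(r"\[cli_(\d+)\]: aborting job:", line)
        if m:
            rank = int(m.group(1))            # a new per-rank block starts
            continue
        if line.startswith("rank "):          # launcher summary line, not a stack
            rank = None
            continue
        if rank is not None and line:
            blocks[rank].append(line)

    primary, secondary = {}, []
    for r, lines in blocks.items():
        text = " ".join(lines)                # rejoin messages wrapped by the mailer
        if any(phrase in text for phrase in SECONDARY):
            secondary.append(r)
        else:
            primary[r] = lines[-1]            # innermost frame of the error stack
    return primary, sorted(secondary)

# A shortened excerpt of the output quoted in this thread.
SAMPLE = """\
>>[cli_20]: aborting job:
>>Fatal error in MPI_Irecv: Other MPI error, error stack:
>>MPID_Irecv(74): Out of memory
>>[cli_0]: aborting job:
>>Fatal error in MPI_Allreduce: Other MPI error, error stack:
>>MPIDU_Socki_handle_read(644)..............: connection closed by
>>peer (set=0,sock=23)
>>rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks
"""

if __name__ == "__main__":
    primary, secondary = triage(SAMPLE)
    print("likely primary failures:", primary)
    print("ranks that only lost a connection:", secondary)
```

This is only a text filter. It does not untangle output that several ranks wrote at the same moment, and it says nothing about why the memory allocation failed on the rank it singles out.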