[FLASH-USERS] MPI failure
Alan Calder
calder at flash.uchicago.edu
Mon Jun 11 14:41:49 CDT 2007
Peter-
Perhaps you have already checked, but the "Out of memory" message from
MPID_Irecv suggests exactly that: the job ran out of memory. Was the
block count on each processor approaching MAXBLOCKS? I have seen crashes
with non-intuitive MPI errors when the number of blocks on the
processors gets close to the maximum and the code runs out of memory.
You might try the run again on more processors, which should lower the
number of blocks each one has to hold.
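
When a job dies like this, most of the messages are follow-on failures
from ranks whose peer has already gone away, so it helps to pick out the
rank that reported the memory failure first. Below is a minimal sketch in
Python, assuming the job output has been saved as text. The function name
classify_aborts is hypothetical and is not part of FLASH or MPICH2, and
the matching is based only on the message strings in the output quoted
below.

```python
import re

# Rank tag that MPICH2 prints at the start of each abort report, e.g. "[cli_20]:".
RANK_TAG = re.compile(r"\[cli_(\d+)\]:")


def classify_aborts(log_text):
    """Return (likely_root_ranks, likely_follow_on_ranks) from an MPICH2 abort log.

    Assumption: a report containing "Out of memory" marks a likely root cause,
    while "connection closed by peer" or "Connection reset by peer" marks a rank
    that failed only because another rank had already died.
    """
    reports = {}
    current = None
    for raw in log_text.splitlines():
        # Drop email quote markers ("> ") and trailing whitespace.
        line = raw.lstrip("> ").rstrip()
        tag = RANK_TAG.match(line)
        if tag:
            # A new per-rank report starts here.
            current = int(tag.group(1))
            reports[current] = []
        elif current is not None:
            reports[current].append(line)

    root, follow_on = [], []
    for rank, lines in reports.items():
        body = " ".join(lines)
        if "Out of memory" in body:
            root.append(rank)
        elif ("connection closed by peer" in body
              or "Connection reset by peer" in body):
            follow_on.append(rank)
    return sorted(root), sorted(follow_on)
```

Reports whose output was interleaved by two ranks writing at once are
attributed to whichever rank tag appears first on the line, so treat the
result as a starting point rather than a definitive diagnosis.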
Hope this helps,
Alan
On Mon, 11 Jun 2007, Peter Vitello wrote:
>
> I have a 2D hydrodynamic calculation using FLASH2.5 which runs for a while
> and then fails with a number of MPI error messages.
>
> I am using the pgF90 compiler with -fast settings and MPICH2-1.0.4p1 on a
> linux cluster.
>
> Any suggestions? I would appreciate any help.
>
> Peter Vitello
> LLNL
>
> [cli_20]: aborting job:
> Fatal error in MPI_Irecv: Other MPI error, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION,
> src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed
> MPID_Irecv(74): Out of memory
> [cli_0]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=23)
> [cli_4]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=10)
> [cli_12]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=8)
> [cli_16]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=3)
> [cli_24]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=6)
> [cli_19]: aborting job:
> Fatal error in MPI_Waitall: Other MPI error, error stack:
> MPI_Waitall(242)..........................: MPI_Waitall(count=528,
> req_array=0x137da480, status_array=0x13653a80) failed
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=2)
> [cli_8]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=17)
> [cli_28]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(286).......................:
> MPIC_Sendrecv(161)........................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=5)
> [cli_21]: aborting job:
> Fatal error in MPI_Ssend: Other MPI error, error stack:
> MPI_Ssend(167)............................: MPI_Ssend(buf=0x12c41820,
> count=12, MPI_DOUBLE_PRECISION, dest=20, tag=291, MPI_COMM_WORLD) failed
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(670)..............: connection failure
> (set=0,sock=1,errno=104:Connection reset by peer)
> [cli_18]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=4)
> [cli_6]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=8)
> [cli_22]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(464).......................:
> MPIC_Recv(98).............................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=1)
> [cli_17]: aborting job:
> Fatal error in MPI_Allreduce: Other MPI error, error stack:
> MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510,
> rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
> MPIR_Allreduce(286).......................:
> MPIC_Sendrecv(161)........................:
> MPIC_Wait(324)............................:
> MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
> MPIDI_CH3I_Progress_handle_sock_event(175):
> MPIDU_Socki_handle_read(644)..............: connection closed by peer
> (set=0,sock=12)
> rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks
>