[FLASH-USERS] MPI failure
Peter Vitello
vitello at llnl.gov
Mon Jun 11 13:53:40 CDT 2007
I have a 2D hydrodynamic calculation using FLASH2.5 that runs for a
while and then fails with a number of MPI error messages (shown below).
I am using the pgF90 compiler with the -fast option and
MPICH2-1.0.4p1 on a Linux cluster.
Any suggestions? I would appreciate any help.
Peter Vitello
LLNL
[cli_20]: aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12,
MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD,
request=0x137dab7c) failed
MPID_Irecv(74): Out of memory
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=23)
[cli_4]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=10)
[cli_12]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=8)
[cli_16]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=3)
[cli_24]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=6)
[cli_19]: aborting job:
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..........................: MPI_Waitall(count=528,
req_array=0x137da480, status_array=0x13653a80) failed
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=2)
[cli_8]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=17)
[cli_28]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(286).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=5)
[cli_21]: aborting job:
Fatal error in MPI_Ssend: Other MPI error, error stack:
MPI_Ssend(167)............................: MPI_Ssend(buf=0x12c41820,
count=12, MPI_DOUBLE_PRECISION, dest=20, tag=291, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(670)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)
[cli_18]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=4)
[cli_22]: abort[cli_6]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=8)
ing job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=1)
[cli_17]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................:
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30,
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(286).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=12)
rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks
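
In the log above, only cli_20 reports "Out of memory" (in MPI_Irecv); the other ranks report closed or reset connections, which is consistent with them failing after rank 20 went away, and the final line names rank 20 as the cause of the collective abort. As a triage aid for logs of this shape, here is a minimal sketch in Python that groups the abort blocks by rank. The function names (summarize, likely_origin) and the list of reason strings are hypothetical, chosen only to match the messages seen in this log; they are not part of FLASH or MPICH2. Interleaved output, such as the "[cli_22]: abort[cli_6]: aborting job:" line, is not untangled: the lines that follow are attributed to the last header found.

```python
import re

# Header printed by each aborting process in the log above, e.g. "[cli_20]: aborting job:".
HEADER = re.compile(r"\[cli_(\d+)\]: aborting job:")
# First line of each error stack, e.g. "Fatal error in MPI_Irecv: ...".
FATAL = re.compile(r"Fatal error in (MPI_\w+):")
# Assumption: only the reason strings that appear in this particular log.
REASONS = (
    "Out of memory",
    "connection closed by peer",
    "connection reset by peer",
)


def summarize(log_text):
    """Return {rank: {"call": ..., "reason": ...}} for each abort block."""
    summary = {}
    rank = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            rank = int(header.group(1))
            summary[rank] = {"call": None, "reason": None}
            continue
        if rank is None:
            continue  # text before the first abort block
        fatal = FATAL.search(line)
        if fatal:
            summary[rank]["call"] = fatal.group(1)
        for reason in REASONS:
            if reason.lower() in line.lower():
                summary[rank]["reason"] = reason
    return summary


def likely_origin(summary):
    """Ranks that failed for a reason other than a lost connection."""
    return sorted(
        rank for rank, info in summary.items()
        if info["reason"] == "Out of memory"
    )
```

Calling likely_origin(summarize(text)) on the saved output gives the ranks worth looking at first; the ranks that only lost a connection are most likely secondary failures.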