[FLASH-USERS] MPI failure

Peter Vitello vitello at llnl.gov
Mon Jun 11 13:53:40 CDT 2007


I have a 2D hydrodynamic calculation using FLASH2.5 which runs for a 
while and then fails with a number of MPI error messages.

I am using the pgF90 compiler with the -fast option and 
MPICH2-1.0.4p1 on a Linux cluster.

Any suggestions?  I would appreciate any help.

Peter Vitello
LLNL

[cli_20]: aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, 
MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD, 
request=0x137dab7c) failed
MPID_Irecv(74): Out of memory
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=23)
[cli_4]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=10)
[cli_12]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=8)
[cli_16]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=3)
[cli_24]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=6)
[cli_19]: aborting job:
Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(242)..........................: MPI_Waitall(count=528, 
req_array=0x137da480, status_array=0x13653a80) failed
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=2)
[cli_8]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=17)
[cli_28]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(286).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=5)
[cli_21]: aborting job:
Fatal error in MPI_Ssend: Other MPI error, error stack:
MPI_Ssend(167)............................: MPI_Ssend(buf=0x12c41820, 
count=12, MPI_DOUBLE_PRECISION, dest=20, tag=291, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(670)..............: connection failure 
(set=0,sock=1,errno=104:Connection reset by peer)
[cli_18]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=4)
[cli_6]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=8)
[cli_22]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(464).......................:
MPIC_Recv(98).............................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=1)
[cli_17]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPI_Allreduce(701)........................: 
MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, 
MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce(286).......................:
MPIC_Sendrecv(161)........................:
MPIC_Wait(324)............................:
MPIDI_CH3I_Progress(158)..................: handle_sock_op failed
MPIDI_CH3I_Progress_handle_sock_event(175):
MPIDU_Socki_handle_read(644)..............: connection closed by peer 
(set=0,sock=12)
rank 20 in job 7  n06.llnl.gov_14457   caused collective abort of all ranks
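
In the trace above, only cli_20 reports something other than a dropped connection (`MPID_Irecv(74): Out of memory`), and the final line names rank 20 as the cause of the abort. For longer traces it can help to sort the per-process reports mechanically. The following is a minimal Python sketch of that triage, not part of FLASH or MPICH2: the function names and the list of "peer went away" phrases are assumptions made for illustration, and it does not try to untangle reports whose output was interleaved by two processes.

```python
import re
from collections import defaultdict

# Header that opens each per-process report, e.g. "[cli_20]: aborting job:"
HEADER = re.compile(r"\[cli_(\d+)\]: aborting job:")
# Summary line printed once the whole job has been torn down
SUMMARY = re.compile(r"^rank \d+ in job ")
# Assumed phrases marking a process that failed only because a peer disappeared
SECONDARY = ("connection closed by peer", "Connection reset by peer")


def group_by_rank(log_text):
    """Map each cli_N rank to the non-empty lines of its error report."""
    reports = defaultdict(list)
    rank = None
    for line in log_text.splitlines():
        match = HEADER.search(line)
        if match:
            rank = int(match.group(1))
            continue
        if SUMMARY.match(line):
            rank = None  # the summary line belongs to no single report
            continue
        if rank is not None and line.strip():
            reports[rank].append(line.strip())
    return dict(reports)


def primary_failures(reports):
    """Keep only the ranks whose report has no peer-disconnect message."""
    return {
        rank: lines
        for rank, lines in reports.items()
        if not any(phrase in " ".join(lines) for phrase in SECONDARY)
    }


# Short excerpt of the trace in this message, used as sample input
SAMPLE = """\
[cli_20]: aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPID_Irecv(74): Out of memory
[cli_0]: aborting job:
Fatal error in MPI_Allreduce: Other MPI error, error stack:
MPIDU_Socki_handle_read(644)..............: connection closed by peer
(set=0,sock=23)
rank 20 in job 7  n06.llnl.gov_14457   caused collective abort of all ranks
"""

if __name__ == "__main__":
    for rank, lines in primary_failures(group_by_rank(SAMPLE)).items():
        # Print the innermost (last) entry of each remaining error stack
        print(rank, lines[-1])
```

Saving the full job output to a file and passing its contents to `group_by_rank` in place of `SAMPLE` would apply the same filter to the complete trace.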


