Re: [FLASH-BUGS] wallclock checkpoint bug

From: Mike Zingale (zingale@flash.uchicago.edu)
Date: Wed Feb 05 2003 - 18:13:39 CST


    Sean, I think you are right about this one. I don't believe that any of
    us have run into this before, but it could be a clock synchronization
    issue. In any case, I'll change this to do the time computation on the
    master processor shortly.

    Mike

    ------------------------------------------------------------------------------
    Michael Zingale
    UCO/Lick Observatory
    UCSC
    Santa Cruz, CA 95064

    phone: (831) 459-5246
    fax: (831) 459-5265
    e-mail: zingale@ucolick.org
    web: http://www.ucolick.org/~zingale

    ``What an awful dream -- ones and zeros everywhere. I thought I saw a two''
       -- Bender

    On Wed, 5 Feb 2003, Sean Matt wrote:

    > Hi,
    >
    > We've been having problems with our large and relatively long
    > simulations (that is, those that run on tens of processors for several
    > hours). The simulations all hang (they stop producing logfile output
    > but keep using CPU cycles) at a time close to an integer multiple of
    > the wallclock checkpoint interval (in our case, 3600 seconds).
    > The last time this happened, we used TotalView to track down the
    > problem, and we believe it is a bug. It turns out that, when the run
    > hangs up, some of the processors are waiting on an MPI_Bcast around line
    > 371 of the output subroutine ("/source/io/output.F90"), while the others
    > are waiting at an MPI_Reduce within the restrict_tree subroutine that is
    > called from the output subroutine around line 312. The restrict_tree in
    > question is called when the wallclock time is right for a checkpoint file
    > to be written. So the problem is that some of the processors think it's
    > time to write, and others do not.
    > We believe that this is most likely caused by the way FLASH checks
    > the wallclock time. Around line 303,
    >
    > "dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint"
    >
    > is executed by EACH processor. The time elapsed since the last MPI
    > synchronization will not always be the same on every processor. In some
    > cases, the following if statement ("if (dt_checkpoint >
    > wall_clock_checkpoint) then") may be true for some processors, but not
    > others. We believe this if statement should be done by one processor
    > only, and the result should be broadcast. FLASH already does something
    > similar to our suggestion for checking for the ".dump_restart" file near
    > the end of the output subroutine.
    >
    >
    > -Sean
    >
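
The fix discussed in this thread (one processor evaluates the wallclock test and broadcasts the result) might look roughly like the sketch below. It is not FLASH's actual code: only dt_checkpoint, lastWallClockCheckpoint, wall_clock_checkpoint and the MPI calls come from the messages above, while MASTER_PE, do_checkpoint, the loop, the 2-second interval, and the busy-wait standing in for a time step are assumptions made for illustration.

```fortran
! Sketch only, not FLASH source. Hypothetical names: MASTER_PE,
! do_checkpoint, step, t0. The interval is shortened to 2 s for the demo.
program wallclock_checkpoint_sketch
  use mpi
  implicit none

  integer, parameter :: MASTER_PE = 0            ! assumed master rank
  integer :: ierr, my_pe, step
  logical :: do_checkpoint
  double precision :: lastWallClockCheckpoint, dt_checkpoint
  double precision :: wall_clock_checkpoint, t0

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, my_pe, ierr)

  wall_clock_checkpoint = 2.0d0                  ! 3600 s in the report
  lastWallClockCheckpoint = MPI_Wtime()

  do step = 1, 10
     ! Stand-in for one simulation step: spin for about half a second.
     t0 = MPI_Wtime()
     do while (MPI_Wtime() - t0 < 0.5d0)
     end do

     ! Only the master consults its clock, so there is exactly one answer.
     do_checkpoint = .false.
     if (my_pe == MASTER_PE) then
        dt_checkpoint = MPI_Wtime() - lastWallClockCheckpoint
        do_checkpoint = (dt_checkpoint > wall_clock_checkpoint)
     end if

     ! Every rank receives the master's decision before branching.
     call MPI_Bcast(do_checkpoint, 1, MPI_LOGICAL, MASTER_PE, &
                    MPI_COMM_WORLD, ierr)

     if (do_checkpoint) then
        ! All ranks reach this branch together, so collective calls made
        ! while writing the checkpoint are matched on every processor.
        if (my_pe == MASTER_PE) then
           print *, 'checkpoint at step', step
           lastWallClockCheckpoint = MPI_Wtime()  ! only the master needs it
        end if
     end if
  end do

  call MPI_Finalize(ierr)
end program wallclock_checkpoint_sketch
```

Because every rank branches on the broadcast value rather than on its own clock, no processor can enter the checkpoint path while another skips it, which is the mismatch between MPI_Reduce and MPI_Bcast described in the report. The price is one small broadcast per step.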


