[FLASH-BUGS] sedov_sph with Flash2.4 (II)

From: Peter Woitke (woitke@strw.leidenuniv.nl)
Date: Thu Oct 21 2004 - 09:14:22 CDT


    Dear developers,

    -- see also the first submission, "sedov_sph with Flash2.4", by Erik-Jan Rijkhorst --

    We are still having trouble running 1D models (sedov, sedov_sph)
    on the parallel computers at our disposal (ASTER/TERAS, see
    http://www.sara.nl). In the meantime we have talked to the operators at
    the computing center and have some more information. The problem is
    still unsolved, however.

    Description of the problem
    ==========================
    Error in mesh_prolong.F90 or subordinate routines: after the prolongation,
    one or several new child blocks contain wrong solnData; see the attached
    output from files created before (5003) and after (5005) the call to
    mesh_prolong in source/mesh/amr/update_grid_refinement.F90.

    The following details refer to sedov_sph -1d with 8 PEs:
    ========================================================
    - The same error occurs with FLASH2.3 and FLASH2.4

    - The error is reproducible

    - The same error occurs on ASTER (64-bit Linux) and TERAS (64-bit SGI Origin),
       but not on our local Linux cluster!

    - The same error occurs with either the mpt module or the mpich-1.2.5
       module loaded (different MPI implementations)

    - The error occurs only for -1d models

    - The error occurs only for certain numbers of PEs as described
       in our first submission

    - The error occurs rarely: for this problem at timestep 1666, but in
       none of the refinements done before.

    Further observations/ideas
    ==========================
    The fact that this error occurs only for -1d models with many processors
    might suggest that the very fast MPI implementation is the problem:
    many MPI operations are performed in quick succession, each with only
    a small amount of data.

    The refinement step at 1666 is a complicated one: one block needs
    to be refined, but since it is surrounded by a row of less refined
    neighbours on the right hand side, 10 new blocks are created.
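
To illustrate how refining a single block can cascade, here is a minimal Python sketch. It assumes that neighbouring leaf blocks may differ by at most one refinement level and that each refined 1D block gets two children. The leaf levels in `leaves` and the helper `refine_with_balance` are hypothetical and are not taken from the actual run or from the FLASH/PARAMESH sources.

```python
def refine_with_balance(levels, marked):
    """Refine leaf `marked` in a 1D list of leaf levels (left to right),
    then refine further leaves until neighbours differ by at most one level.

    Returns (new_levels, number_of_new_child_blocks).
    """
    levels = list(levels)
    new_children = 0

    def split(i):
        nonlocal new_children
        # A 1D block is replaced by two children that are one level finer.
        levels[i:i + 1] = [levels[i] + 1, levels[i] + 1]
        new_children += 2

    split(marked)
    changed = True
    while changed:
        changed = False
        for i in range(len(levels) - 1):
            left, right = levels[i], levels[i + 1]
            if left - right > 1:      # right neighbour is too coarse
                split(i + 1)
                changed = True
                break
            if right - left > 1:      # left neighbour is too coarse
                split(i)
                changed = True
                break
    return levels, new_children


# Hypothetical leaf levels; the first block is the one marked for refinement,
# and its right-hand neighbours become coarser step by step.
leaves = [7, 6, 5, 4, 3]
after, created = refine_with_balance(leaves, marked=0)
print(after, created)
```

With this made-up configuration, the marked block and each of its four right-hand neighbours are split once, which gives 10 new blocks in total.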

    We would be very grateful for any comments from the developers.

    Kind regards,

    Peter Woitke & Erik-Jan Rijkhorst

    PS: details about setup and compilation
    =======================================
    ----------
    setup call
    ----------
    ./setup sedov -1d -auto
    ./setup sedov_sph -1d -auto
       (for sedov and sedov_sph, respectively)

    ----------------------------------
    Makefile.h: (for TERAS SGI-origin)
    ----------------------------------
    HDF5_PATH = /usr/local/opt/hdf5-1.4.4
    FCOMP = f90
    CCOMP = cc
    CPPCOMP = CC
    LINK = f90
       (MIPSpro SGI f90/cc compilers)
    FFLAGS_OPT = -64 -c -r8 -d8 -i4 -cpp -mips4 -O3 -Ofast=ip35
       (we also tried without optimisation - no difference)
    LFLAGS_OPT = -64 -r8 -d8 -i4 -IPA -o
    LIB_HDF5 = -L$(HDF5_PATH)/lib -lhdf5
    LIB_OPT = -lmpi

    --------
    run-call
    --------
    mpirun -np 8 flash2

    --------------------------
    slightly changed flash.par
    --------------------------
    lrefine_min = 1
    lrefine_max = 8
    basenm = "sedov_sph_"
    restart = .false.
    tplot = 0.001
    trstrt = 0.01
    nend = 1700
    tmax = 0.05
    plot_var_1 = "dens"
    plot_var_2 = "pres"
    plot_var_3 = "temp"
    plot_var_4 = "velx"

    --------------------------------------------------
    This is how we created the additional output files
    (in source/mesh/amr/update_grid_refinement.F90)
    --------------------------------------------------
             integer,save:: counter = 0
             ...
             call mark_grid_refinement()
             if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
               call plotfile(5000+counter, time)
               counter = counter + 1
             endif
             call mesh_refine_derefine()
             do block_no = 1,lnblocks
                call grid (block_no)
             end do
             new_child => dBaseTreePtrNewChild()
             if (conserved_var) then
                do block_no = 1, lnblocks
                   if ( .not. new_child(block_no)) then
                      solnData => dBaseGetDataPtrSingleBlock(block_no, GC)
                      ...
                      call convert_var_prim_to_cons( solnData(:,:,:,:) )
                      call dBaseReleaseDataPtrSingleBlock(block_no, solnData)
                   endif
                enddo
             endif
             if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
               call plotfile(5000+counter, time)
               counter = counter + 1
             endif
             call mesh_prolong (MyPE, 1, nguard)
             if ( nstep .eq. 1466 .or. nstep .eq. 1666) then
               call plotfile(5000+counter, time)
               counter = counter + 1
             endif
             call mesh_guardcell (MyPE, 1, nguard, time, 1, 0)
             ...

    Plotfiles 5000, 5001, 5002 are created during an example of a working
    refinement step (nstep=1466);
    plotfiles 5003, 5004, 5005 are created during the refinement step that
    causes the error (nstep=1666).

    ----------------
    The stdout file:
    ----------------
         1662 3.0518E-02 1.1371E-05 | 1.137E-05
         1663 3.0541E-02 1.1372E-05 | 1.137E-05
         1664 3.0564E-02 1.1374E-05 | 1.137E-05
         1665 3.0587E-02 1.1375E-05 | 1.138E-05
      block to be refined: myPE=6, blockno=6 (print from amr_refine_derefine)
      *** Wrote output to sedov_sph_hdf5_plt_cnt_5003 ***
      PE 0 lnblocks 11 (print from amr_refine_derefine)
      PE 1 lnblocks 7
      PE 2 lnblocks 7
      PE 3 lnblocks 9
      PE 4 lnblocks 6
      PE 5 lnblocks 7
      PE 6 lnblocks 7
      PE 7 lnblocks 5
      min_blocks 6 max_blocks 12 tot_blocks 69
      *** Wrote output to sedov_sph_hdf5_plt_cnt_5004 ***
      *** Wrote output to sedov_sph_hdf5_plt_cnt_5005 ***
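
As a quick consistency check on the listing above, the following Python sketch sums the per-PE `lnblocks` values and compares the sum with `tot_blocks`. The log text is copied from the stdout excerpt; the variable names are illustrative only.

```python
import re

# Lines copied from the stdout excerpt above.
log = """\
PE 0 lnblocks 11 (print from amr_refine_derefine)
PE 1 lnblocks 7
PE 2 lnblocks 7
PE 3 lnblocks 9
PE 4 lnblocks 6
PE 5 lnblocks 7
PE 6 lnblocks 7
PE 7 lnblocks 5
min_blocks 6 max_blocks 12 tot_blocks 69
"""

# Collect the block count reported for each PE.
per_pe = [int(n) for n in re.findall(r"PE\s+\d+\s+lnblocks\s+(\d+)", log)]
# Read the total reported on the summary line.
tot = int(re.search(r"tot_blocks\s+(\d+)", log).group(1))

listed = sum(per_pe)
print("sum of lnblocks:", listed)
print("tot_blocks:     ", tot)
print("difference:     ", tot - listed)
```

The per-PE counts add up to 59, while `tot_blocks` reports 69. The difference of 10 would match the 10 newly created blocks if the per-PE counts were printed before the new children were added. That is an assumption; the log itself does not say so.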






