[FLASH-BUGS] Non-member submission from [Erik-Jan Rijkhorst <rijkhorst@strw.leidenuniv.nl>] (fwd)

From: Shawn Needham (shawn@flash.uchicago.edu)
Date: Thu Sep 23 2004 - 14:23:09 CDT

    Date: Thu, 23 Sep 2004 17:56:11 +0200 (MEST)
    From: Erik-Jan Rijkhorst <rijkhorst@strw.leidenuniv.nl>
    X-X-Sender: rijkhors@laak.strw.leidenuniv.nl
    To: flash-bugs@flash.uchicago.edu
    cc: Peter Woitke <woitke@strw.leidenuniv.nl>
    Subject: sedov_sph with Flash2.4

    Dear developers,

    We are having trouble running the 1D spherical sedov_sph setup on the
    parallel machines we have at our disposal. The machines are:

    Aster with Intel efc/ecc version 8.0 compilers
    An SGI Altix 3700 system, consisting of 416 CPUs (Intel Itanium 2, 1.3 GHz)

    and

    Teras with MIPSpro SGI f90/cc compilers
    1024-CPU system consisting of two 512-CPU SGI Origin 3800 systems

    (http://www.sara.nl for more info)

    The problem we encounter is that at some timestep, the variables undergo
    sudden gross changes, e.g. creating new velocity jumps at locations where
    nothing should happen.

    We have used simple testing compilation flags, e.g. without optimisation.

    These problems occur if (and only if) the following three conditions are
    met:

      1) refinement is allowed, i.e. lrefine_max > lrefine_min. We used
         lrefine_min=1 and lrefine_max=8
      2) the module /mesh/amr/paramesh2.0/quadratic_spherical is involved
         (we have similar problems with other applications using this module)
      3) a certain number of processors is used, e.g.

          4 PE => ok
          5 PE => ok
          6 PE => ok
          7 PE => ok
          8 PE => sudden variable jumps at timestep 1666
          9 PE => ok
         10 PE => ok
         12 PE => ok
         14 PE => ok
         16 PE => sudden variable jumps at timestep 444

    The errors are reproducible and similar on Aster and Teras (e.g. they
    occur at the same timestep).

    (However, on Aster with 8 PEs, we get a "Nonconvergence in subroutine
    rieman!"-error at timestep 1666.)
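
One way to pin down the timestep at which a failing run departs from a good one (say, the 8 PE run against the 7 PE run) is to compare a per-step diagnostic from both. The following is a minimal sketch in Python, assuming each run has already been reduced to one number per timestep. The function name, the tolerance, and the sample values are hypothetical illustrations and are not part of FLASH or its output.

```python
import math


def first_divergent_step(reference, candidate, rel_tol=1e-6):
    """Return the 1-based step at which two per-step series first differ.

    Values are compared with a relative tolerance only. Steps beyond the
    end of the shorter series are ignored. Returns None if no compared
    step differs by more than rel_tol.
    """
    for step, (a, b) in enumerate(zip(reference, candidate), start=1):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            return step
    return None


# Made-up illustration data, not taken from any FLASH run.
ok_run = [1.0, 0.9, 0.8, 0.7, 0.6]
bad_run = [1.0, 0.9, 0.8, 2.5, 0.6]

print(first_divergent_step(ok_run, bad_run))
```

Runs on different processor counts may differ at round-off level even when nothing is wrong, so the tolerance would need to be chosen loosely enough to ignore that and tightly enough to catch the sudden jumps described above.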

    Because all three conditions above are needed to trigger this behaviour,
    we currently believe that there is probably a bug somewhere in the
    prolongation routines (or possibly the guard cell exchange routines?) that
    go with the new quadratic/spherical module.

    We have checked some of the 'Flash daily test results' that are on the web
    but couldn't find a test with the new spherical module (like sod_sph or
    sedov_sph) run on more than 2 processors. Have such tests, with for
    example 8 processors, been run on systems similar to ours?

    If needed we can send you more information about the architecture,
    compiler and/or compilation flags used etc. Just tell us what you need.

    Thank you very much for your help,

    Peter Woitke and Erik-Jan Rijkhorst


