From van at astro.ox.ac.uk Mon Jun 4 10:45:54 2007
From: van at astro.ox.ac.uk (Vincenzo Antonuccio)
Date: Mon, 4 Jun 2007 16:45:54 +0100
Subject: [FLASH-USERS] xflash is crashing
Message-ID: <200706041645.54335.van@astro.ox.ac.uk>

I have a problem using xflash. Here is my configuration:

FLASH v. 2.5, using fidlr3
IDL: version 6.3, on
HDF5 v. 1.6.2
> uname -a: Linux uist 2.6.8-3-686 #1 Tue Dec 5 21:26:38 UTC 2006 i686
GNU/Linux

I can successfully launch xflash: the widget does appear. Then I try to
load one of the HDF5 2D sedov outputs. The file is read correctly. But
when I try to plot density or pressure or something else, IDL crashes with a
segmentation fault.

Now: I know that page 231 of the FLASH v. 2.5 UG says that, for IDL
v. 6.0 and higher, there is a problem with HDF5, so one should revert to
fidlr2. AND THIS IS WHAT I DID, but the result is the same: as soon as I try
to plot something using xflash, I get a segmentation fault.

Now, I have tried to debug the problem, and I think that it arises when
TVIMAGE calls CMCONGRID, around this point:
.............
2: BEGIN ; *** TWO DIMENSIONAL ARRAY
IF int THEN BEGIN
srx = float(s(1) - m1) / (x-m1) * findgen(x) + halfx
sry = float(s(2) - m1) / (y-m1) * findgen(y) + halfy
;print, 'check #2', 'ok 0a'
RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)
;print, 'check #2', 'ok 0b'
...........

(The commented-out "print" statements are mine.) What I can see is that IDL
crashes at
"RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)".

Any ideas or suggestions?

Many thanks in advance

******************************************
Vincenzo ANTONUCCIO-DELOGU

Astrophysics, University of Oxford
Denys Wilkinson Building
Keble Road, Oxford OX1 3RH
United Kingdom

and

INAF - Catania Astrophysical Observatory,
Catania, ITALY

Room 555A, Beecroft Inst. for Particle Astrophysics
Tlf.: +44-(0)1865 283019
Fax: +44-(0)1865-273390

e-mail: van at astro.ox.ac.uk
skype: eurocosmo

From gjordan at flash.uchicago.edu Mon Jun 4 12:36:30 2007
From: gjordan at flash.uchicago.edu (George Jordan)
Date: Mon, 4 Jun 2007 12:36:30 -0500 (CDT)
Subject: [FLASH-USERS] xflash is crashing
In-Reply-To: <200706041645.54335.van@astro.ox.ac.uk>
References: <200706041645.54335.van@astro.ox.ac.uk>
Message-ID:

Hi Vincenzo,

I would suggest using the visualization software called VisIt. You can
download it at http://www.llnl.gov/visit/

They have worked with us and support the FLASH data format. It is a very
nice piece of software and I highly recommend it. Its capabilities are far
superior to IDL. You should give it a try.

When launching from the command line type:

visit -default_format FLASH

to tell VisIt to use the FLASH file reader plugin.

Best,
Cal

On Mon, 4 Jun 2007, Vincenzo Antonuccio wrote:

> I have a problem using xflash. Here is my configuration:
>
> FLASH v. 2.5, using fidlr3
> IDL: version 6.3, on
> HDF5 v. 1.6.2
>> uname -a: Linux uist 2.6.8-3-686 #1 Tue Dec 5 21:26:38 UTC 2006 i686
> GNU/Linux
>
> I can successfully launch xflash: I get in fact the widget. Then I try to
> upload one of the HDF5, 2D sedov outputs. The file is correctly read. But
> when I try to plot density or pressure or something else, IDL goes into
> segmentation fault.
>
> Now: I know that at page 231 of the FLASH v. 2.5 UG is written that, for IDL
> v. 6.0 and higher, there is a problem with HDF5, so one should revert to
> fidlr2. AND THIS IS WHAT I DID, but the result is the same: as soon as I try
> to plot something using xflash, I get segmentation fault.
>
> Now, I have tried to debug the problem, and I think that it arises when
> TVIMAGE calls CMCONGRID, around this point:
> .............
> 2: BEGIN ; *** TWO DIMENSIONAL ARRAY
> IF int THEN BEGIN
> srx = float(s(1) - m1) / (x-m1) * findgen(x) + halfx
> sry = float(s(2) - m1) / (y-m1) * findgen(y) + halfy
> ;print, 'check #2', 'ok 0a'
> RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)
> ;print, 'check #2', 'ok 0b'
> ...........
>
> (The commented "print" are mine). What I can see is that IDL crashes at
> "RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)".
>
> Any idea/suggestion?
>
> Many thanks in advance
>
> ******************************************
> Vincenzo ANTONUCCIO-DELOGU
>
> Astrophysics, University of Oxford
> Denys Wilkinson Building
> Keble Road, Oxford OX1 3RH
> United Kingdom
>
> and
>
> INAF - Catania Astrophysical Observatory,
> Catania, ITALY
>
> Room 555A, Beecroft Inst. for Particle Astrophysics
> Tlf.: +44-(0)1865 283019
> Fax: +44-(0)1865-273390
>
> e-mail: van at astro.ox.ac.uk
> skype: eurocosmo
>

From guptasanjib at lanl.gov Mon Jun 4 18:23:07 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Mon, 04 Jun 2007 17:23:07 -0600
Subject: [FLASH-USERS] maximum number of plot files per run
Message-ID: <46649EDB.8080901@lanl.gov>

Hello,

Hopefully this has a very simple resolution. The maximum number of plot
files I get per run is 9999, so all the plot files are named with zero
padding for file # < 9999, e.g.
"hc-rt_hdf5_plt_cnt_0057"
up to "hc-rt_hdf5_plt_cnt_9999". Anything higher comes out as
"hc-rt_hdf5_plt_cnt_****"
which I am sure has to do with an allowed length of string for the
filename.
Do I have to change code in a driver module if I want it to print, say,
1000000 or (many) more plot files? If so, which source file do I modify,
and what is the safest (minimalist) way of doing this without messing up
the rest of the code?
Is there a way of getting this done by changing something in "flash.par"?
I see in "flash.par" the variable
basenm="hc-rt_"
which no doubt is used to construct the plot and check file names.
Also there is
run_number="001"
Is that used to set the length of the file name string somehow?
Thanks much,
Sanjib Gupta

From gjordan at flash.uchicago.edu Mon Jun 4 19:15:25 2007
From: gjordan at flash.uchicago.edu (George Jordan)
Date: Mon, 4 Jun 2007 19:15:25 -0500 (CDT)
Subject: [FLASH-USERS] maximum number of plot files per run
In-Reply-To: <46649EDB.8080901@lanl.gov>
References: <46649EDB.8080901@lanl.gov>
Message-ID:

Hi Sanjib,

How is LANL going?

It looks like the place to change the filename to include more digits is in
the Fortran file io_getOutputName.F90. The two places that need to be
changed are lines #54 and #75. Keep in mind that the filename string is
passed to several .c files that call the HDF5 routines, and that strings,
Fortran, and C when combined can lead to funny results.

If possible I would suggest making several directories and just resetting
the plot file number with each restart (plotFileNumber in flash.par). This
assumes that you would have to restart the simulation before the 10,000th
plot file is created and that you would restart in a new directory.

Best,
Cal

On Mon, 4 Jun 2007, sanjib gupta wrote:

> Hello,
>
> Hopefully this has a very simple resolution - the maximum number of plot
> files I get per run is
> 9999, so all the plot files are named with zero paddings for file # < 9999,
> e.g.
> "hc-rt_hdf5_plt_cnt_0057"
> upto "hc-rt_hdf5_plt_cnt_9999" anything higher comes out as
> "hc-rt_hdf5_plt_cnt_****"
> which I am sure has to do with an allowed length of string for the
> filename........
> Do I have to change code in a driver module if I want it to print , say
> 1000000 or (many) more plot files, if so,
> which source file do I modify, and what is the safest (minimalist) way of
> doing this without messing up
> the rest of the code?
> Is there a way of getting this done by changing something in "flash.par" ? I
> see in "flash.par"
> the variables
> basenm="hc-rt_"
> which no doubt is used to construct the plot and check file names. Also there
> is
> run_number="001"
> is that used to set the length of the file name string somehow?
> Thanks much,
> Sanjib Gupta
>

From mzingale at scotty.ess.sunysb.edu Mon Jun 4 19:22:41 2007
From: mzingale at scotty.ess.sunysb.edu (Mike Zingale)
Date: Mon, 4 Jun 2007 20:22:41 -0400 (EDT)
Subject: [FLASH-USERS] xflash is crashing
In-Reply-To: <200706041645.54335.van@astro.ox.ac.uk>
References: <200706041645.54335.van@astro.ox.ac.uk>
Message-ID:

This is likely a bug in IDL that is interacting with a recently patched
buffer overflow in the xorg server, see:

http://www.ittvis.com/services/techtip.asp?ttid=4177

They recommend you roll back your libx11 or wait for IDL 6.4.

Mike

On Mon, 4 Jun 2007, Vincenzo Antonuccio wrote:

> I have a problem using xflash. Here is my configuration:
>
> FLASH v. 2.5, using fidlr3
> IDL: version 6.3, on
> HDF5 v. 1.6.2
>> uname -a: Linux uist 2.6.8-3-686 #1 Tue Dec 5 21:26:38 UTC 2006 i686
> GNU/Linux
>
> I can successfully launch xflash: I get in fact the widget. Then I try to
> upload one of the HDF5, 2D sedov outputs. The file is correctly read. But
> when I try to plot density or pressure or something else, IDL goes into
> segmentation fault.
>
> Now: I know that at page 231 of the FLASH v. 2.5 UG is written that, for IDL
> v. 6.0 and higher, there is a problem with HDF5, so one should revert to
> fidlr2. AND THIS IS WHAT I DID, but the result is the same: as soon as I try
> to plot something using xflash, I get segmentation fault.
>
> Now, I have tried to debug the problem, and I think that it arises when
> TVIMAGE calls CMCONGRID, around this point:
> .............
> 2: BEGIN ; *** TWO DIMENSIONAL ARRAY
> IF int THEN BEGIN
> srx = float(s(1) - m1) / (x-m1) * findgen(x) + halfx
> sry = float(s(2) - m1) / (y-m1) * findgen(y) + halfy
> ;print, 'check #2', 'ok 0a'
> RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)
> ;print, 'check #2', 'ok 0b'
> ...........
>
> (The commented "print" are mine). What I can see is that IDL crashes at
> "RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)".
>
> Any idea/suggestion?
>
> Many thanks in advance
>
> ******************************************
> Vincenzo ANTONUCCIO-DELOGU
>
> Astrophysics, University of Oxford
> Denys Wilkinson Building
> Keble Road, Oxford OX1 3RH
> United Kingdom
>
> and
>
> INAF - Catania Astrophysical Observatory,
> Catania, ITALY
>
> Room 555A, Beecroft Inst. for Particle Astrophysics
> Tlf.: +44-(0)1865 283019
> Fax: +44-(0)1865-273390
>
> e-mail: van at astro.ox.ac.uk
> skype: eurocosmo
>

-----------------------------------------------------------------------------
Michael Zingale (mzingale at mail.astro.sunysb.edu)
Assistant Professor
Dept. of Physics and Astronomy      office: ESS 440
SUNY Stony Brook                    phone: 631-632-8225
Stony Brook, NY 11794-3800          web: http://www.astro.sunysb.edu/mzingale
-----------------------------------------------------------------------------

From guptasanjib at lanl.gov Mon Jun 4 20:21:05 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Mon, 04 Jun 2007 19:21:05 -0600
Subject: [FLASH-USERS] Benchmarking FLASH/Restart at different speed issue
Message-ID: <4664BA81.8050502@lanl.gov>

Hello,

I've noticed a peculiarity regarding the speed at which FLASH runs.
Say I set up a job on 64 processors and run it for a while.
Then I kill the job and restart it from the last checkpoint file
written in the first run, with no changes in "flash.par" except that
restart=true and cpnumber is specified.
The same job now runs 20-30 times faster, without any change in the number
of processors.
Any hints?
It would be good to know why this is happening, since we plan to
benchmark FLASH runs vs. # of processors used and get a sense of
how fine a resolution we can afford for our turbulent burning
scenarios.

Thanks much,
Sanjib.

From ipatov at dtm.ciw.edu Mon Jun 4 23:20:26 2007
From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu)
Date: Tue, 5 Jun 2007 00:20:26 -0400 (EDT)
Subject: [FLASH-USERS] h5_write.c
Message-ID: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu>

Dear FLASH users.
I have successfully installed and used FLASH2.5 on one computer.
Recently I made the same installation [with the same tar files]
on another computer (of a different type), but
was not able to make test runs on the second computer.
Now after
./setup sedov -auto [same for sod]
gmake
I got:
h5_write.c: in function 'h5_write_tree_info_'
h5_write.c:1004: 'status' undeclared (first use in this function)
h5_write.c:1005: 'H5P_DEFAULT' undeclared (first use in this function)
h5_write.c:1005: 'refine_level' undeclared (first use in this function)
h5_write.c: At top level:
h5_write.c: 1897: parse error before '*' token
h5_write.c: in function 'h5_write_particles_'
And a great number of similar messages.
At the end:
gmake[1]: *** [h5_write.o] Error 1

Then I tried to use hdf5-1.6.4 instead of the previous
hdf5-1.6.2.tar, but got the same errors.

Does somebody know what must be done to solve the problem?

Best regards
Sergei

From nhearn at uchicago.edu Tue Jun 5 09:13:44 2007
From: nhearn at uchicago.edu (Nathan Hearn)
Date: Tue, 5 Jun 2007 09:13:44 -0500
Subject: [FLASH-USERS] h5_write.c
In-Reply-To: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu>
References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu>
Message-ID: <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com>

Hi Sergei,

These are very curious errors, as they would suggest that the HDF5
header file is not being included.
However, I would have expected the
compile to fail at the line

#include

and not get any further. (Also, it is somewhat odd that there seem to
be no errors when the HDF5-related variables are declared, but only
when they are used.)

I would suggest that you inspect the Makefile.h for this build.
(Did you create a new Makefile.h for this machine?) In particular,
make sure that all of the HDF5 settings are correct -- this may simply
involve setting HDF5_PATH to the correct directory, but you will want
to make sure that this is reflected in the CFLAGS_HDF5 and the
LIB_HDF5 statements. Also, if you haven't already, be sure to turn on
all warning messages for your compilers by adding the appropriate
flags to CFLAGS_OPT and FFLAGS_OPT (or, if appropriate, CFLAGS_DEBUG,
CFLAGS_TEST, etc.).

If you are still having problems, could you email the compiler
command that gmake issues for h5_write.c, along with the first few
warning and/or error messages that it generates?

- Nathan

On 6/4/07, ipatov at dtm.ciw.edu wrote:
> Dear FLASH users.
> I have succesfully installed and used FLASH2.5 on one computer.
> Recently I made the same instalation [with the same tar files]
> on another computer (of different type), but
> was not able to make test runs on the second computer.
> Now after
> ./setup sedov -auto [same for sod]
> gmake
> I got:
> h5_write.c: in function 'h5_write_tree_info_'
> h5_write.c:1004: 'status' undecleared (first use in this function)
> h5_write.c:1005: 'H5P_DEFAULT' undecleared (first use in this function)
> h5_write.c:1005: 'refine_level' undecleared (first use in this function)
> h5_write.c: At top level:
> h5_write.c: 1897: parse error before '*' token
> h5_write.c: in function 'h5_write_particles_'
> And a great number of similar messages.
> At the end:
> gmake[1]: *** [h5_write.o] Error 1
>
> Then I tried to use hdf5-1.6.4 instead of the previous
> hdf5-1.6.2.tar, but got the same errors.
>
> Does somebody know what must be done to solve the problem?
>
> Best regards
> Sergei

--
Nathan C. Hearn
nhearn at uchicago.edu

ASC Flash Center
Computational Physics Group
University of Chicago

From van at astro.ox.ac.uk Tue Jun 5 10:32:42 2007
From: van at astro.ox.ac.uk (Vincenzo Antonuccio)
Date: Tue, 5 Jun 2007 16:32:42 +0100
Subject: [FLASH-USERS] xflash is crashing
In-Reply-To:
References: <200706041645.54335.van@astro.ox.ac.uk>
Message-ID: <200706051632.42589.van@astro.ox.ac.uk>

On Tuesday 05 June 2007 01:22, Mike Zingale wrote:
> This is a likely bug in IDL that is interacting with a recently patched
> buffer overflow in the xorg server,
> .................
> They recommend you roll back your libx11 or wait for IDL 6.4
>
> Mike

I can confirm your suspicion, Mike: here in Oxford they have just updated IDL
to 6.4, and the system does not crash anymore. The reason it recently
started to crash is that they had just applied a security patch to X11, and
the latter was "colliding" with IDL.

Many thanks to everybody for the suggestions. I will check VisIt as soon as I
can.

******************************************
Vincenzo ANTONUCCIO-DELOGU

Marie Curie fellow, until June 30th, 2007 at:
Astrophysics, University of Oxford
Denys Wilkinson Building
Keble Road, Oxford OX1 3RH
United Kingdom

(Home Institution:
INAF - Catania Astrophysical Observatory,
Catania, ITALY)

Room 555A, Beecroft Inst. for Particle Astrophysics
Tlf.: +44-(0)1865 283019
Fax: +44-(0)1865-273390

e-mail: van at astro.ox.ac.uk
skype: eurocosmo

'Malheur a l'homme d'etude qui n'est d'aucune coterie, on lui reprochera
jusqu'a de petits succes fort incertains, et la haute vertu triomphera en le
volant.'
Guai all'intellettuale che non appartiene a nessuna consorteria: gli sara'
rimproverato ogni successo, anche il piu' incerto, e la virtu' trionfera'
derubandolo.
(Stendhal, Le Rouge et le Noir)

From richp at flash.uchicago.edu Tue Jun 5 12:44:55 2007
From: richp at flash.uchicago.edu (Paul M. Rich)
Date: Tue, 05 Jun 2007 12:44:55 -0500
Subject: [FLASH-USERS] Benchmarking FLASH/Restart at different speed issue
In-Reply-To: <4664BA81.8050502@lanl.gov>
References: <4664BA81.8050502@lanl.gov>
Message-ID: <4665A117.7010606@flash.uchicago.edu>

Sanjib,

That is a very interesting problem. Could you provide some more
information about it, such as the architecture you are running on, some
of the output, and/or the logfile?

--Paul Rich

sanjib gupta wrote:
> Hello,
>
> I've noticed a peculiarity regarding the speed at which FLASH runs ....
> Say I set up a job on 64 processors and run it for a while.
> Then I kill the job , restart the job from the last checkpoint file
> outputted in the first run, no changes in
> "flash.par" except restart=true and cpnumber is specified....
> The same job now runs 20-30 times faster, without any change of # of
> processors.
> Any hints ?
> Would be good to know why this is happening since we plan to
> benchmark FLASH runs vs. # of processors used and get a sense of
> how fine a resolution we can afford for our turbulent burning
> scenarios...
>
> Thanks much,
> Sanjib.
>
>
>

From ipatov at dtm.ciw.edu Tue Jun 5 16:10:15 2007
From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu)
Date: Tue, 5 Jun 2007 17:10:15 -0400 (EDT)
Subject: [FLASH-USERS] h5_write.c
In-Reply-To: <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com>
References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu>
 <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com>
Message-ID: <33207.10.101.1.203.1181077815.squirrel@www.dtm.ciw.edu>

Dear Nathan Hearn.
Thank you for your e-mail.
The problem below was solved after we changed the file
FLASH2.5/source/sites/Prototypes/Linux/Makefile.h
e.g., instead of
CCOMP = mpicc
we use
CPPCOMP=g++
CCOMP=gcc
but after
mpirun -np 1 flash2
we have
flash2: error loading shared libraries: libhdf5.so.: cannot open shared
object file: No such file or directory

Sergei

> Hi Sergei,
>
> These are very curious errors, as they would suggest that the HDF5
> header file is not being included. However, I would have expected the
> compile to fail at the line
>
> #include
>
> and not get any further. (Also, it is somewhat odd that there seem to
> be no errors when the HDF5-related variables are declared, but only
> when they are used.)
>
> I would suggest that you inspect the Makefile.h for this build.
> (Did you create a new Makefile.h for this machine?) In particular,
> make sure that all of the HDF5 settings are correct -- this may simply
> involve setting HDF5_PATH to the correct directory, but you will want
> to make sure that this is reflected in the CFLAGS_HDF5 and the
> LIB_HDF5 statements. Also, if you haven't already, be sure to turn on
> all warning messages for your compilers by adding the appropriate
> flags to CFLAGS_OPT and FFLAGS_OPT (or, if appropriate, CFLAGS_DEBUG,
> CFLAGS_TEST, etc.).
>
> If you are still having problems, could you email the compiler
> command that gmake issues for h5_write.c, along with the first few
> warning and/or error messages that it generates?
>
>
> - Nathan
>
>
> On 6/4/07, ipatov at dtm.ciw.edu wrote:
>> Dear FLASH users.
>> I have succesfully installed and used FLASH2.5 on one computer.
>> Recently I made the same instalation [with the same tar files]
>> on another computer (of different type), but
>> was not able to make test runs on the second computer.
>> Now after
>> ./setup sedov -auto [same for sod]
>> gmake
>> I got:
>> h5_write.c: in function 'h5_write_tree_info_'
>> h5_write.c:1004: 'status' undecleared (first use in this function)
>> h5_write.c:1005: 'H5P_DEFAULT' undecleared (first use in this function)
>> h5_write.c:1005: 'refine_level' undecleared (first use in this
>> function)
>> h5_write.c: At top level:
>> h5_write.c: 1897: parse error before '*' token
>> h5_write.c: in function 'h5_write_particles_'
>> And a great number of similar messages.
>> At the end:
>> gmake[1]: *** [h5_write.o] Error 1
>>
>> Then I tried to use hdf5-1.6.4 instead of the previous
>> hdf5-1.6.2.tar, but got the same errors.
>>
>> Does somebody know what must be done to solve the problem?
>>
>> Best regards
>> Sergei
>
> --
> Nathan C. Hearn
> nhearn at uchicago.edu
>
> ASC Flash Center
> Computational Physics Group
> University of Chicago
>

From guptasanjib at lanl.gov Tue Jun 5 18:57:43 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Tue, 05 Jun 2007 17:57:43 -0600
Subject: [FLASH-USERS] Details of speedup after restart
Message-ID: <4665F877.2030402@lanl.gov>

Hi,

I am attaching 2 log files: the initial run on 128 processors, then
immediately killing the job and restarting from the first checkpoint
file "hc-rt-hdf5_chk_0000".
Notice about 4 timesteps per second initially, then ~30 timesteps/sec
after restart.

On 64 processors I noticed the gain was higher, but my resolution was
lower (half the number of nblocky, same nblockx, this is a 2D run).
Sorry, I did not keep the logfiles.

However, this "gain" cannot be predicted. Sometimes I don't get it on
the first restart, so I restart a couple of times!
As you all can guess, this plays havoc with any benchmarking efforts,
and we do intend to showcase our results from FLASH soon ... :-)

We compile with Intel Fortran 9.1.033 and Open MPI 1.1 on a Linux cluster,
and HDF5 version 1.6.5. Makefile.h is attached.
Architecture - 64 bit AMD Opteron
running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14

Thanks much for your help/insight/suggestions,
Sanjib.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hc-rt-firstrun_06_05_07.log
Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: hc-rt-restart_06_05_07.log
Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment-0001.pl
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Makefile.h
Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment.h

From nhearn at uchicago.edu Tue Jun 5 20:04:17 2007
From: nhearn at uchicago.edu (Nathan Hearn)
Date: Tue, 5 Jun 2007 20:04:17 -0500
Subject: [FLASH-USERS] Details of speedup after restart
In-Reply-To: <4665F877.2030402@lanl.gov>
References: <4665F877.2030402@lanl.gov>
Message-ID: <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com>

Hi Sanjib,

This is very curious. How do the output data files that come from
restarts compare with those from non-restarts? Are they
binary-equivalent? My concern is that the speed-up is coming from the
code processing the input data differently each time.

Alternatively, is it possible that there are background operations
on the compute nodes (e.g., cluster node monitors, file system checks,
network loads, etc.) that are interfering with your benchmarking runs?
(Can you use top on any of the processing nodes while the simulation
is running?)
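
To make the "binary-equivalent" question concrete, here is a minimal
sketch in Python of a byte-for-byte comparison of two output files using
SHA-256 digests. The function names and the file names in the usage note
are hypothetical, not part of FLASH. It also rests on an assumption: files
holding the same simulation data may still differ at the byte level (for
example through embedded metadata), so a mismatch here only says that the
bytes differ, and a dataset-level comparison such as the HDF5 h5diff tool
would be the natural follow-up.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_equivalent(path_a, path_b):
    """True when the two files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)
```

A call might look like binary_equivalent("first_run/plt_0010",
"restart/plt_0010"), where both paths are placeholders for the same
plot file written by the original run and by the restarted run.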

- Nathan

On 6/5/07, sanjib gupta wrote:
> Hi,
>
> I am attaching 2 log files - the initial run on 128 processors, then
> immediately killing the job and restarting from the first checkpoint
> file "hc-rt-hdf5_chk_0000"
> notice about 4 timesteps per second initially, then ~30 timesteps/sec
> after restart.
>
> On 64 processors I noticed the gain was higher , but my resolution was
> lower (half the number of nblocky, same nblockx, this is a 2D run)-
> sorry did not keep the logfiles.
>
> However this "gain" cannot be predicted......sometimes I don't get it on
> the first restart, so I restart a couple of times!
> As you'all can guess, this plays havoc with any benchmarking efforts
> .......and we do intend to showcase our results from FLASH soon ... :-)
>
> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster
> ....and hdf5 version 1.6.5 ......Makefile.h is attached.
> Architecture - 64 bit AMD Opteron
> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14
>
> Thanks much for your help/insight/suggestions,
> Sanjib.

From guptasanjib at lanl.gov Tue Jun 5 20:44:50 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Tue, 05 Jun 2007 19:44:50 -0600
Subject: speedups output files Re: [FLASH-USERS] Details of speedup after
 restart
In-Reply-To: <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com>
References: <4665F877.2030402@lanl.gov>
 <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com>
Message-ID: <46661192.9000204@lanl.gov>

Hi Nathan,

The output files (the HDF5 check and plot files) are the same size as
before restarts and seem fine. They pick up exactly where the run left
off, whether it is the thermodynamic conditions or mass fractions, and the
resulting burning looks perfectly reasonable.
I am not sure what you mean by binary-equivalent.
Background operations are usually carried out on a different queue on the
cluster. The cluster top command, btop, just lets me know which nodes are
free and which I am using, and it shows usage at 99% or so, meaning I have
all CPU usage on the dual-processor nodes. This however could change during
maintenance hours like 1-3 am when I am not around to monitor usage.
Sometimes the number of total timesteps at the end of the night does not
make sense (too few), but this has only been a couple of times, and I have
not combed the usually voluminous logfiles unless something really goes
wrong.

However, I don't think background processes are responsible for the slow
first run. It is too consistent, and we get maintenance messages from the
cluster operators, so I would be aware of them. Also the CPU usage:
initially when this happened I remember doing a lot of checks on the
allotted processors, and I was getting near 100% before and after the
restart.

The restart is from "flash.dat"? How does it work? I was curious anyway,
since it would be nice to do things like change the resolution at a
restart. If only the thermo conditions etc. are transported to the
restarted run, and properly rezoned, one would not have to worry? Unless of
course dynamic structures have developed based on the earlier zoning, but
for relatively quiescent scenarios that are being restarted this should
approximately work?

Thanks
Sanjib

Nathan Hearn wrote:
> Hi Sanjib,
>
> This is very curious. How do the output data files that come from
> restarts compare with those from non-restarts? Are they
> binary-equivalent? My concern is that the speed-up is coming from the
> code processing the input data differently each time.
>
> Alternatively, is it possible that there are background operations
> on the compute nodes (e.g., cluster node monitors, file system checks,
> network loads, etc.) that are interfering with your benchmarking runs?
> (Can you use top on any of the processing nodes while the simulation
> is running?)
>
>
> - Nathan
>
>
> On 6/5/07, sanjib gupta wrote:
>> Hi,
>>
>> I am attaching 2 log files - the initial run on 128 processors, then
>> immediately killing the job and restarting from the first checkpoint
>> file "hc-rt-hdf5_chk_0000"
>> notice about 4 timesteps per second initially, then ~30 timesteps/sec
>> after restart.
>>
>> On 64 processors I noticed the gain was higher , but my resolution was
>> lower (half the number of nblocky, same nblockx, this is a 2D run)-
>> sorry did not keep the logfiles.
>>
>> However this "gain" cannot be predicted......sometimes I don't get it on
>> the first restart, so I restart a couple of times!
>> As you'all can guess, this plays havoc with any benchmarking efforts
>> .......and we do intend to showcase our results from FLASH soon ...
>> :-)
>>
>> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster
>> ....and hdf5 version 1.6.5 ......Makefile.h is attached.
>> Architecture - 64 bit AMD Opteron
>> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14
>>
>> Thanks much for your help/insight/suggestions,
>> Sanjib.

From sheeler at flash.uchicago.edu Tue Jun 5 20:54:46 2007
From: sheeler at flash.uchicago.edu (Dan Sheeler)
Date: Tue, 5 Jun 2007 20:54:46 -0500 (CDT)
Subject: [FLASH-USERS] Details of speedup after restart
In-Reply-To: <4665F877.2030402@lanl.gov>
References: <4665F877.2030402@lanl.gov>
Message-ID:

Does this run have just 64 blocks (4 x 16)? If this is a standard flash
setup with local physics, having fewer than one block per process probably
will produce weird performance numbers. In a standard run, 2D 8x8 blocks
require very little RAM or work per process. Single processes are happy
working on thousands of blocks. Furthermore, work is distributed to the
processors in nothing smaller than block-sized chunks.
If your run is a
typical setup, then half of the processes have more-or-less nothing to do
but add communication, and I can imagine that would cause runtime to
fluctuate non-deterministically.

Dan

--
Dan Sheeler
ASC Flash Center
sheeler at flash.uchicago.edu
(773) 834-3236

On Tue, 5 Jun 2007, sanjib gupta wrote:

> Hi,
>
> I am attaching 2 log files - the initial run on 128 processors, then
> immediately killing the job and restarting from the first checkpoint file
> "hc-rt-hdf5_chk_0000"
> notice about 4 timesteps per second initially, then ~30 timesteps/sec after
> restart.
>
> On 64 processors I noticed the gain was higher , but my resolution was lower
> (half the number of nblocky, same nblockx, this is a 2D run)- sorry did not
> keep the logfiles.
>
> However this "gain" cannot be predicted......sometimes I don't get it on the
> first restart, so I restart a couple of times!
> As you'all can guess, this plays havoc with any benchmarking efforts
> .......and we do intend to showcase our results from FLASH soon ... :-)
>
> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster
> ....and hdf5 version 1.6.5 ......Makefile.h is attached.
> Architecture - 64 bit AMD Opteron
> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14
>
> Thanks much for your help/insight/suggestions,
> Sanjib.
>

From guptasanjib at lanl.gov Tue Jun 5 21:17:03 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Tue, 05 Jun 2007 20:17:03 -0600
Subject: [FLASH-USERS] Details of speedup after restart
In-Reply-To:
References: <4665F877.2030402@lanl.gov>
Message-ID: <4666191F.70301@lanl.gov>

For instance, when I ran using 64 processors I had
nblockx=4
nblocky=16
in "flash.par", so that makes sense.

So that means that if all allotted processors are at 100% usage by my
job when I am at 1 block/proc, then I will get no enhancement from
doubling to 128 processors, so it makes sense to increase the resolution
instead?
That explains some tests I did to see how the runs scale with
#processors, but I'm not sure it explains the drastic difference between
initial run vs. all subsequent (restarted) runs.

Thanks,
Sanjib.

Dan Sheeler wrote:
> Does this run have just 64 blocks (4 x 16)? If this is a standard
> flash setup with local physics, having less than one block per process
> probably will produce weird performance numbers. In a standard run,
> 2d 8x8 blocks require very little ram or work per process. Single
> processes are happy working on thousands of blocks. Furthermore, work
> is distributed to the processors in nothing smaller than block-sized
> chunks. If your run is a typical setup, then half of the processes
> have more-or-less nothing to do but add communication, and I can
> imagine that would cause runtime to fluctuate non-deterministically.
>
> Dan
>
> --
> Dan Sheeler
> ASC Flash Center
> sheeler at flash.uchicago.edu
> (773) 834-3236
>
> On Tue, 5 Jun 2007, sanjib gupta wrote:
>
>> Hi,
>>
>> I am attaching 2 log files - the initial run on 128 processors, then
>> immediately killing the job and restarting from the first checkpoint
>> file "hc-rt-hdf5_chk_0000"
>> notice about 4 timesteps per second initially, then ~30 timesteps/sec
>> after restart.
>>
>> On 64 processors I noticed the gain was higher , but my resolution
>> was lower (half the number of nblocky, same nblockx, this is a 2D
>> run)- sorry did not keep the logfiles.
>>
>> However this "gain" cannot be predicted......sometimes I don't get it
>> on the first restart, so I restart a couple of times!
>> As you'all can guess, this plays havoc with any benchmarking efforts
>> .......and we do intend to showcase our results from FLASH soon ...
>> :-)
>>
>> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux
>> cluster ....and hdf5 version 1.6.5 ......Makefile.h is attached.
>> Architecture - 64 bit AMD Opteron >> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 >> >> Thanks much for your help/insight/suggestions, >> Sanjib. >> From gawrysz at camk.edu.pl Wed Jun 6 00:11:56 2007 From: gawrysz at camk.edu.pl (Artur Gawryszczak) Date: Wed, 6 Jun 2007 07:11:56 +0200 Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <4665F877.2030402@lanl.gov> References: <4665F877.2030402@lanl.gov> Message-ID: <200706060711.57227@phoenix.camk.edu.pl> Hi, On Wednesday, 6 June 2007, sanjib gupta wrote: > I am attaching 2 log files - the initial run on 128 processors, then > immediately killing the job and restarting from the first checkpoint > file "hc-rt-hdf5_chk_0000" > notice about 4 timesteps per second initially, then ~30 timesteps/sec > after restart. You are using only the base level of refinement (lrefine_max=1), which has a non-obvious side effect: when you start from scratch, only the master processor gets the work and the others just wait. After a restart the blocks are distributed and the run becomes truly parallel. I'd suggest decreasing nblock[xy] and using lrefine_min=2 and lrefine_max=2 instead; then, after refining to the second level, the blocks will be distributed. If you don't require extremely flexible AMR, you may also increase nxb and nyb from 8 to 16 or 32 (at compile time), which will reduce the communication overhead. Your setup is also relatively small, just 32x128 cells, so it probably makes little sense to use more than 4 or 8 CPUs for it.
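The block and cell counts quoted in this thread can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a 2D base-level grid with no refinement and that a block is the smallest unit of work handed to a process (as stated earlier in the thread). The function name and the dictionary keys are hypothetical and are not part of FLASH.

```python
import math


def base_grid_summary(nblockx, nblocky, nxb, nyb, nprocs):
    """Summarise a 2D base-level grid (illustrative helper, not a FLASH routine)."""
    nblocks = nblockx * nblocky                  # total base-level blocks
    cells = (nblockx * nxb, nblocky * nyb)       # total cells in x and y
    busy = min(nblocks, nprocs)                  # a process needs at least one block to do work
    return {
        "blocks": nblocks,
        "cells": cells,
        "max_blocks_per_proc": math.ceil(nblocks / nprocs),
        "idle_procs": nprocs - busy,
    }


# Values taken from the thread: nblockx=4, nblocky=16, 8x8 cells per block.
summary = base_grid_summary(nblockx=4, nblocky=16, nxb=8, nyb=8, nprocs=128)
print(summary)
```

With these inputs the grid has 64 blocks covering 32x128 cells, so on 128 processes at least half of them hold no block even when the blocks are spread as evenly as possible.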
-- Cheers, Artur From nhearn at uchicago.edu Wed Jun 6 04:37:07 2007 From: nhearn at uchicago.edu (Nathan Hearn) Date: Wed, 6 Jun 2007 04:37:07 -0500 Subject: speedups output files Re: [FLASH-USERS] Details of speedup after restart In-Reply-To: <46661192.9000204@lanl.gov> References: <4665F877.2030402@lanl.gov> <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com> <46661192.9000204@lanl.gov> Message-ID: <2467fdc0706060237o1008f4d2nb7743b9cab37fd53@mail.gmail.com> Hi Sanjib, From the main email thread, it looks like you are on your way to a solution to the speedup issue. (I like Artur's explanation, as blocks are probably redistributed only during AMR refine/derefine steps. However, it is still unclear to me why you only occasionally see the speedup during restart.) Regarding the nature of the restarts, I believe that flash.dat merely stores the output stream for diagnostic data generated during the run. Unless there are some specific input files required by the Flash modules in use, the only files needed for restart are flash.par and the checkpoint file. The checkpoint file contains all the information necessary to reconstruct the mesh. Changing the resolution during a restart is a somewhat complicated issue. By design, Flash uses the structural information stored in the checkpoint file to build the mesh in memory, and there is no re-meshing capability included. However, it should be possible to alter the lrefine settings in flash.par to force Flash to change the minimum and maximum levels of refinement after the checkpoint data is loaded. (This has been a recent topic of discussion here.) Right now, more significant changes to the mesh -- such as altering the physical size of the domain, changing the arrangement of base blocks, or changing the number of zones per block -- are not permitted. (I have been working on routines for resampling Flash data files during the init_block stage, but they are still in development.)
- Nathan On 6/5/07, sanjib gupta wrote: > Hi Nathan, > > The output files - the HDf5 check and plot files are the same size as > before restarts..... > and seem fine - they pickup exactly where the run left off, whether it is > the thermodynamic conditions or mass fractions, and the resulting > burning looks perfectly reasonable... > I am not sure what you mean by binary-equivalent..... > > background operations are usually carried out on a different queue on > the cluster......the cluster top command > btop, just lets me know which nodes are free and which I am using.... > and it shows usage at 99 % or so, meaning I have all > CPU usage on the dual-processor nodes......this however could change > during maintenance hours like 1-3 am when I am not around to > monitor usage......sometimes the number of total timesteps at the end of > the night does not make sense (too few), but this has only been a couple > of times.... > and I have not combed the usually voluminous logfiles unless something > really goes wrong. > > However, I don't think background processes are responsible for the slow > first run - it is too consistent , and we get maintenance messages from > the Cluster operators..... > I would be aware of them. > Also the CPU usage - initially when this happened I remember doing a lot > of checks on the allotted processors, I was getting near 100%, before > and after the restart. > > The restart is from "flash.dat" ? How does it work? I was curious > anyway, since it would be nice to do things like.....change the > resolution at a restart. If only the thermo conditions etc. are > transported to the > restarted run, and properly rezoned, one would not have to worry? Unless > of course dynamic structures have developed based on the earlier zoning > ...but for relatively quiescent scenarios that are being restarted this > should approximately work? > > Thanks > Sanjib -- Nathan C. 
Hearn nhearn at uchicago.edu ASC Flash Center Computational Physics Group University of Chicago From dubey at flash.uchicago.edu Wed Jun 6 04:40:43 2007 From: dubey at flash.uchicago.edu (Anshu Dubey) Date: Wed, 6 Jun 2007 04:40:43 -0500 (CDT) Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <200706060711.57227@phoenix.camk.edu.pl> References: <4665F877.2030402@lanl.gov> <200706060711.57227@phoenix.camk.edu.pl> Message-ID: <1362.129.215.49.19.1181122843.squirrel@flash.uchicago.edu> If you don't really need AMR, you could use FLASH3 instead. It has a true Uniform grid with significantly less housekeeping overhead. It also initializes the domain in parallel, so the blocks will be distributed over as many processors as there are blocks. Otherwise, as Artur pointed out, set lrefine_min and lrefine_max greater than 1: FLASH 2 puts all initial blocks on the master processor, and only the process of refinement distributes them. > > You're using only base level of refinement (lrefine_max=1) which has a non > obvious side effect: when you start from scratch, only master processor > gets > the work and the other are just waiting. After a restart the blocks are > distributed and then the run becomes truly parallel. I'd suggest you to > decrease nblock[xy] and use lrefine_min=2 and lrefine_max=2 instead, then > after refining to second level the blocks will be distributed. > > If you don't require extremely flexible AMR then you may also increase nxb > and nyb from 8 to 16 or 32 (at compile time) which will reduce overhead > due > to communication. > > Your setup is also relatively small - it's just 32x128 cells, so probably > it > makes little sense to use more than 4 or 8 CPU for it.
> > -- > Cheers, > Artur > Anshu Dubey Code Group Leader phone : 773.834.2999 fax: 773.834.3230 From gawrysz at camk.edu.pl Wed Jun 6 05:04:00 2007 From: gawrysz at camk.edu.pl (Artur Gawryszczak) Date: Wed, 6 Jun 2007 12:04:00 +0200 Subject: speedups output files Re: [FLASH-USERS] Details of speedup after restart In-Reply-To: <2467fdc0706060237o1008f4d2nb7743b9cab37fd53@mail.gmail.com> References: <4665F877.2030402@lanl.gov> <46661192.9000204@lanl.gov> <2467fdc0706060237o1008f4d2nb7743b9cab37fd53@mail.gmail.com> Message-ID: <200706061204.00623@phoenix.camk.edu.pl> Hi, On Wednesday, 6 June 2007, Nathan Hearn wrote: > From the main email thread, it looks like you are on your way to a > solution towards the speedup issue. (I like Artur's explanation, as > blocks are probably redistributed only during AMR refine/derefine > steps. However, it is still unclear to me why you only occasionally > see the speedup during restart.) That is because redistribution also takes place after initialization from a checkpoint. > Unless there are some specific input files required by the > Flash modules in use, the only files needed for restart are flash.par > and the checkpoint file. The checkpoint file contains all the > information necessary to reconstruct the mesh. It is also possible to restart from a checkpoint dump alone (flash2 -chk_file ). IIRC flash.par needs to be present, but it can be empty. > However, it should be possible to alter the lrefine settings in flash.par > to force Flash to change the minimum and maximum levels of refinement after > the checkpoint data is loaded. If the restart is done via the flash.par file (restart=1, cpnumber=), then all parameters from flash.par override both the defaults and the values stored in the checkpoint.
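The precedence described here can be pictured as a layered lookup. This is only a sketch of the idea, assuming three layers where flash.par wins over the checkpoint and the checkpoint wins over the defaults (the relative order of the last two is an assumption). The dictionaries and their values are made up for illustration and do not reflect FLASH's internal data structures or its real default values.

```python
def effective_parameters(defaults, checkpoint, flash_par):
    """Merge three parameter layers; later layers override earlier ones.

    Illustrative only: FLASH does not expose a function like this.
    """
    merged = dict(defaults)      # lowest priority
    merged.update(checkpoint)    # values stored in the checkpoint file
    merged.update(flash_par)     # highest priority: whatever flash.par sets
    return merged


# Hypothetical values chosen only to show the override order.
defaults = {"lrefine_min": 1, "lrefine_max": 1, "nrefs": 2}
checkpoint = {"lrefine_max": 4}
flash_par = {"restart": 1, "lrefine_max": 3}

params = effective_parameters(defaults, checkpoint, flash_par)
print(params["lrefine_max"])
```

In this toy setup the run would proceed with lrefine_max=3 from flash.par, even though the checkpoint was written with lrefine_max=4.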
-- Cheers, Artur From gawrysz at camk.edu.pl Wed Jun 6 07:09:28 2007 From: gawrysz at camk.edu.pl (Artur Gawryszczak) Date: Wed, 6 Jun 2007 14:09:28 +0200 Subject: speedups output files Re: [FLASH-USERS] Details of speedup after restart In-Reply-To: <56713.129.215.48.16.1181125898.squirrel@flash.uchicago.edu> References: <4665F877.2030402@lanl.gov> <200706061204.00623@phoenix.camk.edu.pl> <56713.129.215.48.16.1181125898.squirrel@flash.uchicago.edu> Message-ID: <200706061409.28720@phoenix.camk.edu.pl> Hi Anshu, On Wednesday, 6 June 2007, Anshu Dubey wrote: > In theory, you could change the refinement in flash.par, but it is not > always as straightforward as that. For instance, if you reduce > lrefine_max, and in the checkpoint you have blocks that are at the old > lrefine_max level, you could be in trouble. I don't believe this was ever > addressed in FLASH2, but there is implementation to restrict the offending > blocks in FLASH3. Even that I don't think is too well tested. I did this many times in FLASH2.x and there were no problems. Such situations are handled safely by default in mark_grid_refinement, via:

    if ((lrefine(i) > lrefine_max) .and. (nodetype(i) == 1)) then
       refine(i) = .FALSE.
       derefine(i) = .TRUE.
    endif

The derefinement takes place when (mod(nstep, nrefs) == 0) and strips at most one level at a time. The only place where decreasing lrefine_max may lead to confusion is init_mpole in the multipole Poisson solver, where lrefine_max is used directly in some calculations.
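To see what "at most one level at a time" means for the number of steps needed, here is a toy model in Python. It is a sketch under stated assumptions: it tracks a single leaf block, ignores the parent/child and neighbour constraints a real AMR mesh imposes, assumes nrefs >= 1, and assumes step counting starts at 1. The function name is hypothetical.

```python
def derefinement_schedule(level, lrefine_max, nrefs, first_step=1):
    """List (step, new_level) pairs until the block's level is within lrefine_max.

    Toy model of the rule quoted above: a block above lrefine_max is marked for
    derefinement, and the mesh is only adjusted when nstep is a multiple of nrefs.
    """
    events = []
    nstep = first_step
    while level > lrefine_max:
        if nstep % nrefs == 0:      # refinement pass happens on this step
            level -= 1              # strip exactly one level per pass
            events.append((nstep, level))
        nstep += 1
    return events


# A block left at level 5 by the checkpoint, restarted with lrefine_max=3.
schedule = derefinement_schedule(level=5, lrefine_max=3, nrefs=2)
print(schedule)  # [(2, 4), (4, 3)]
```

In this model a block that sits two levels above the new lrefine_max needs two refinement passes, so with nrefs=2 it only complies from step 4 onward.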
-- Cheers, Artur From guptasanjib at lanl.gov Wed Jun 6 14:14:58 2007 From: guptasanjib at lanl.gov (Sanjib Gupta) Date: Wed, 6 Jun 2007 13:14:58 -0600 (MDT) Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <200706060711.57227@phoenix.camk.edu.pl> References: <4665F877.2030402@lanl.gov> <200706060711.57227@phoenix.camk.edu.pl> Message-ID: <47247.128.165.0.81.1181157298.squirrel@webmail.lanl.gov> Hi Artur, Anshu, Nathan, Many thanks for the helpful suggestions. I will implement them later this week when I get back to coding, and I will be sure to share any interesting observations with the mailing list. More than that, thanks enormously for taking the time to educate me about some of the under-the-hood nitty-gritty of FLASH; it helps me understand how the code is structured and make intelligent decisions in future runs. Regards, Sanjib. > Hi, > > On Wednesday, 6 June 2007, sanjib gupta wrote: >> I am attaching 2 log files - the initial run on 128 processors, then >> immediately killing the job and restarting from the first checkpoint >> file "hc-rt-hdf5_chk_0000" >> notice about 4 timesteps per second initially, then ~30 timesteps/sec >> after restart. > > You're using only base level of refinement (lrefine_max=1) which has a non > obvious side effect: when you start from scratch, only master processor > gets > the work and the other are just waiting. After a restart the blocks are > distributed and then the run becomes truly parallel. I'd suggest you to > decrease nblock[xy] and use lrefine_min=2 and lrefine_max=2 instead, then > after refining to second level the blocks will be distributed. > > If you don't require extremely flexible AMR then you may also increase nxb > and nyb from 8 to 16 or 32 (at compile time) which will reduce overhead > due > to communication. > > Your setup is also relatively small - it's just 32x128 cells, so probably > it > makes little sense to use more than 4 or 8 CPU for it.
> > -- > Cheers, > Artur > From ipatov at dtm.ciw.edu Wed Jun 6 14:43:48 2007 From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu) Date: Wed, 6 Jun 2007 15:43:48 -0400 (EDT) Subject: [FLASH-USERS] XFLASH, h5_wrappers.so In-Reply-To: <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu> <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> Message-ID: <33832.10.101.1.203.1181159028.squirrel@www.dtm.ciw.edu> Dear FLASH users, xflash ran normally on our old computer, but we have not been able to run it on another computer. After starting idl and then xflash, the window appears, but after clicking the name of a file we got: Loaded DLM: HDF XMANAGER: Caught unexpected error from client application. Message follows... CALL_EXTERNAL: Error loading sharable executable. Symbol: hdf5_check_file_exist, File= /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so: cannot open shared object file: No such file or directory Execution halted at: DETERMINE_FILE_TYPE 29 /data/flash/FLASH2.5/tools/fildr2/determine_file_type.pro XFLASH_EVENT 77 //data/flash/FLASH2.5/tools/fildr2/xflash.pro XMANAGER_EVLOOP_STANDARD 478 /usr/local/rsi/idl_6.1/lib/xmanager.pro XMANAGER 708 /usr/local/rsi/idl_6.1/lib/xmanager.pro XFLASH 1782 /data/flash/FLASH2.5/tools/fildr2/xflash.pro $MAIN$ The file /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so exists and was corrected today. We used the file tools/fildr2/Makefile.linux from our previous successful installation on another computer as an example.
Sergei From ipatov at dtm.ciw.edu Wed Jun 6 15:19:25 2007 From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu) Date: Wed, 6 Jun 2007 16:19:25 -0400 (EDT) Subject: [FLASH-USERS] Re: XFLASH, h5_wrappers.so In-Reply-To: <33832.10.101.1.203.1181159028.squirrel@www.dtm.ciw.edu> References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu> <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> <33832.10.101.1.203.1181159028.squirrel@www.dtm.ciw.edu> Message-ID: <33855.10.101.1.203.1181161165.squirrel@www.dtm.ciw.edu> We suspect the problem is caused by the h5_wrappers.so file being 64-bit while IDL on that computer is 32-bit, but we do not understand how to solve it. Sergei > Dear FLASH users, > We used normally xflash on our old computer, but > we were not able to run it on another computer. > > after > idl > xflash > window appears, but after pressing the name of a file we got: > > Loaded DLM: HDF > XMANAGER: Caught unexpected error from client application. Message > follows... > CALL_EXTERNAL: Error loading sharable executable. > Symbol: hdf5_check_file_exist, File= > /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so > /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so: cannot open shared > object file: No such file or directory > Execution halted at: DETERMINE_FILE_TYPE 29 > /data/flash/FLASH2.5/tools/fildr2/determine_file_type.pro > XFLASH_EVENT 77 //data/flash/FLASH2.5/tools/fildr2/xflash.pro > XMANAGER_EVLOOP_STANDARD 478 /usr/local/rsi/idl_6.1/lib/xmanager.pro > XMANAGER 708 /usr/local/rsi/idl_6.1/lib/xmanager.pro > XFLASH 1782 /data/flash/FLASH2.5/tools/fildr2/xflash.pro > $MAIN$ > > The file /data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so exists and was > corrected today. > We used the file tools/fildr2/Makefile.linux from our previous successful > installation on another computer as an example.
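The 32-bit versus 64-bit suspicion raised in this message can be checked directly, because every ELF file records its word size in the fifth byte of its header (EI_CLASS: 1 for 32-bit, 2 for 64-bit). Below is a minimal sketch in Python; the helper name is hypothetical, and the path in the usage note is simply the one from the error message above.

```python
def elf_class(path):
    """Return '32-bit' or '64-bit' for an ELF file (hypothetical helper)."""
    with open(path, "rb") as f:
        header = f.read(5)                      # magic (4 bytes) + EI_CLASS (1 byte)
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    classes = {1: "32-bit", 2: "64-bit"}        # ELFCLASS32, ELFCLASS64
    if header[4] not in classes:
        raise ValueError(f"{path} has unknown ELF class {header[4]}")
    return classes[header[4]]
```

Calling elf_class("/data/flash/FLASH2.5/tools/fildr2/h5_wrappers.so") and then calling it again on the IDL executable (wherever it is installed on that machine) would show whether the two actually differ. If they do, the wrapper library would presumably need to be rebuilt for the same word size as IDL, together with a matching HDF5 library.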
> > > Sergei > From vitello at llnl.gov Mon Jun 11 13:53:40 2007 From: vitello at llnl.gov (Peter Vitello) Date: Mon, 11 Jun 2007 11:53:40 -0700 Subject: [FLASH-USERS] MPI failure Message-ID: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> I have a 2D hydrodynamic calculation using FLASH2.5 which runs for a while and then fails with a number of MPI error messages. I am using the pgF90 compiler with -fast settings and MPICH2-1.0.4p1 on a linux cluster. Any suggestions? I would appreciate any help. Peter Vitello LLNL [cli_20]: aborting job: Fatal error in MPI_Irecv: Other MPI error, error stack: MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed MPID_Irecv(74): Out of memory [cli_0]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=23) [cli_4]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=10) [cli_12]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: 
MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=8) [cli_16]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=3) [cli_24]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=6) [cli_19]: aborting job: Fatal error in MPI_Waitall: Other MPI error, error stack: MPI_Waitall(242)..........................: MPI_Waitall(count=528, req_array=0x137da480, status_array=0x13653a80) failed MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=2) [cli_8]: 
aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=17) [cli_28]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(286).......................: MPIC_Sendrecv(161)........................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=5) [cli_21]: aborting job: Fatal error in MPI_Ssend: Other MPI error, error stack: MPI_Ssend(167)............................: MPI_Ssend(buf=0x12c41820, count=12, MPI_DOUBLE_PRECISION, dest=20, tag=291, MPI_COMM_WORLD) failed MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(670)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer) [cli_18]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed 
MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=4) [cli_22]: abort[cli_6]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=8) ing job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(464).......................: MPIC_Recv(98).............................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=1) [cli_17]: aborting job: Fatal error in MPI_Allreduce: Other MPI error, error stack: MPI_Allreduce(701)........................: MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed MPIR_Allreduce(286).......................: MPIC_Sendrecv(161)........................: MPIC_Wait(324)............................: MPIDI_CH3I_Progress(158)..................: handle_sock_op failed MPIDI_CH3I_Progress_handle_sock_event(175): MPIDU_Socki_handle_read(644)..............: connection closed by peer (set=0,sock=12) rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks From calder at flash.uchicago.edu Mon Jun 11 14:41:49 2007 From: calder at flash.uchicago.edu (Alan Calder) Date: Mon, 11 
Jun 2007 14:41:49 -0500 (CDT) Subject: [FLASH-USERS] MPI failure In-Reply-To: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> References: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> Message-ID: Peter- Perhaps you have checked, but the "Out of memory" suggests exactly that. Was the block count on each processor approaching MAXBLOCKS? I have seen crashes with non-intuitive mpi errors when the number of blocks on the processors gets close to the maximum and the code runs out of memory. You might try the run again on more processors. Hope this helps, Alan On Mon, 11 Jun 2007, Peter Vitello wrote: > > I have a 2D hydrodynamic calculation using FLASH2.5 which runs for a while > and then fails with a number of MPI error messages. > > I am using the pgF90 compiler with -fast settings and MPICH2-1.0.4p1 on a > linux cluster. > > Any suggestions? I would appreciate any help. > > Peter Vitello > LLNL > > [cli_20]: aborting job: > Fatal error in MPI_Irecv: Other MPI error, error stack: > MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION, > src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed > MPID_Irecv(74): Out of memory From vitello at llnl.gov Mon Jun 11 16:03:14 2007 From: vitello at llnl.gov (Peter Vitello) Date: Mon, 11 Jun 2007 14:03:14 -0700 Subject: [FLASH-USERS] MPI failure In-Reply-To: References: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> Message-ID: <7.0.1.0.2.20070611135900.0542c690@llnl.gov> Alan, Thanks for the response. I don't think that this is a MAXBLOCKS problem. I am running with 45 nodes and MAXBLOCKS=8000 to avoid this problem. My actual min_blocks and max_blocks just before an abort are 564 and 651 which are well below MAXBLOCKS. Peter At 12:41 PM 6/11/2007, you wrote: >Peter- > >Perhaps you have checked, but the "Out of memory" suggests exactly >that. Was the block count on each processor approaching MAXBLOCKS? >I have seen crashes with non-intuitive mpi errors when the number of >blocks on the processors gets close to the maximum and the code >runs out of memory. You might try the run again on more processors. > >Hope this helps, > >Alan > >On Mon, 11 Jun 2007, Peter Vitello wrote: > >> >>I have a 2D hydrodynamic calculation using FLASH2.5 which runs for >>a while and then fails with a number of MPI error messages. >> >>I am using the pgF90 compiler with -fast settings and >>MPICH2-1.0.4p1 on a linux cluster. >> >>Any suggestions? I would appreciate any help.
>> >>Peter Vitello >>LLNL >> >>[cli_20]: aborting job: >>Fatal error in MPI_Irecv: Other MPI error, error stack: >>MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, >>MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD, >>request=0x137dab7c) failed >>MPID_Irecv(74): Out of memory >>[cli_0]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=23) >>[cli_4]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=10) >>[cli_12]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection 
closed by >>peer (set=0,sock=8) >>[cli_16]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=3) >>[cli_24]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=6) >>[cli_19]: aborting job: >>Fatal error in MPI_Waitall: Other MPI error, error stack: >>MPI_Waitall(242)..........................: MPI_Waitall(count=528, >>req_array=0x137da480, status_array=0x13653a80) failed >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=2) >>[cli_8]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: 
>>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=17) >>[cli_28]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(286).......................: >>MPIC_Sendrecv(161)........................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=5) >>[cli_21]: aborting job: >>Fatal error in MPI_Ssend: Other MPI error, error stack: >>MPI_Ssend(167)............................: >>MPI_Ssend(buf=0x12c41820, count=12, MPI_DOUBLE_PRECISION, dest=20, >>tag=291, MPI_COMM_WORLD) failed >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(670)..............: connection failure >>(set=0,sock=1,errno=104:Connection reset by peer) >>[cli_18]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=4) >>[cli_22]: abort[cli_6]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: 
>>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=8) >>ing job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(464).......................: >>MPIC_Recv(98).............................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=1) >>[cli_17]: aborting job: >>Fatal error in MPI_Allreduce: Other MPI error, error stack: >>MPI_Allreduce(701)........................: >>MPI_Allreduce(sbuf=0x1441f510, rbuf=0x1440f500, count=30, >>MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD) failed >>MPIR_Allreduce(286).......................: >>MPIC_Sendrecv(161)........................: >>MPIC_Wait(324)............................: >>MPIDI_CH3I_Progress(158)..................: handle_sock_op failed >>MPIDI_CH3I_Progress_handle_sock_event(175): >>MPIDU_Socki_handle_read(644)..............: connection closed by >>peer (set=0,sock=12) >>rank 20 in job 7 n06.llnl.gov_14457 caused collective abort of all ranks From vitello at llnl.gov Mon Jun 11 17:54:01 2007 From: vitello at llnl.gov (Peter Vitello) Date: Mon, 11 Jun 2007 15:54:01 -0700 Subject: [FLASH-USERS] MPI failure In-Reply-To: References: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> Message-ID: <7.0.1.0.2.20070611154811.0545f2c8@llnl.gov> Thanks for the suggestion, but it 
doesn't look like my MPI out-of-memory failure is due to the stack size being limited. The results from ulimit are as follows, and the stack size is unlimited. Although it is unlimited, I don't know how much memory is actually available. Does anyone know where else to check for what would cause FLASH 2.5 to generate:

Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION, src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed
MPID_Irecv(74): Out of memory

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 37376
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 37376
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Thanks for any help.

Peter Vitello
LLNL

At 02:46 PM 6/11/2007, you wrote:

>>Perhaps you have checked, but the "Out of memory" suggests exactly
>>that. Was the block count on each processor approaching MAXBLOCKS?
>>I have seen crashes with non-intuitive mpi errors when the number
>>of blocks on the processors gets close to the maximum and the code
>>runs out of memory. You might try the run again on more processors.
>
>I have fouled up the usage of more codes than most
>people have ever heard of: if it was an out-of-memory problem and
>maxblocks was not being reached, check to make sure your stacksize is
>not limited.
You should be able to run the Unix command "limit" and
>see something like
>
>% limit
>cputime         unlimited
>filesize        unlimited
>datasize        unlimited
>stacksize       unlimited
>coredumpsize    0 kbytes
>memoryuse       unlimited
>vmemoryuse      unlimited
>descriptors     1024
>memorylocked    32 kbytes
>maxproc         98304
>
>If that shows some smaller limit, in your .cshrc or .bashrc file, enter
>the line
>    unlimit stacksize
>so that all new processes (esp the MPI ones) are started without the
>stacksize limited.
>
>This will only cause you trouble if your code is starting a lot of
>Java virtual machines, or you are directly using pthreads, which are
>both unusual for most HPC MPI codes.
>
>This has been an irritating bug encountered so many times that I've
>started having our sysadmin apply it for the default start up file
>for every new student I get. It always causes troubles that seem far
>removed from the root cause. I strongly suspect that by now LLNL has
>already done this by default for most users, but if you copied over a
>.cshrc file from another machine it may have been overwritten or
>overridden.
>
>And if this was the problem, feel free to post this to the rest of the
>flash-user list. I have no sense of shame anymore, and don't mind
>everyone knowing about how many times I've made this same mistake!

From sheeler at flash.uchicago.edu Mon Jun 11 18:41:58 2007
From: sheeler at flash.uchicago.edu (Dan Sheeler)
Date: Mon, 11 Jun 2007 18:41:58 -0500 (CDT)
Subject: [FLASH-USERS] MPI failure
In-Reply-To: <7.0.1.0.2.20070611154811.0545f2c8@llnl.gov>
References: <7.0.1.0.2.20070611115136.054459e8@llnl.gov> <7.0.1.0.2.20070611154811.0545f2c8@llnl.gov>
Message-ID:

On 32-bit Linux machines (and I think others), there's an insidious stack limit of 2 MB if your executable was compiled statically and linked against the pthreads library (which FLASH usually does because MPICH usually links to pthreads).
Setting the limit in the shell doesn't affect it because the statically linked pthreads library somehow overrides the shell setting. So we've added C code in FLASH3 that sets the rlimit inside FLASH. Download FLASH3 and look for the file dr_set_rlimits.c. Basically, the code calls setrlimit to unlimit the stacksize like this:

  lim.rlim_cur = RLIM_INFINITY;
  lim.rlim_max = RLIM_INFINITY;
  retval = setrlimit(RLIMIT_STACK, &lim);

This still might not be your problem, though. If it's not, you might have to see whether there are some MPICH2 buffer limits that you can adjust through environment variables or something similar, but I'm not too familiar with that.

Dan

--
Dan Sheeler
ASC Flash Center
sheeler at flash.uchicago.edu
(773) 834-3236

On Mon, 11 Jun 2007, Peter Vitello wrote:

> Thanks for the suggestion, but it doesn't look like my MPI out of memory
> failure is due to the stack size being limited.
> The results from ulimit are as follows, and stack size is unlimited. While
> unlimited, I don't know what memory is actually
> available. Does anyone know where else to check for what would FLASH 2.5 to
> generate:
>
> Fatal error in MPI_Irecv: Other MPI error, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION,
> src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed
> MPID_Irecv(74): Out of memory
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 37376
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 37376
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Thanks for any help.
> > Peter Vitello > LLNL > > At 02:46 PM 6/11/2007, you wrote: > >>> Perhaps you have checked, but the "Out of memory" suggests exactly >>> that. Was the block count on each processor approaching MAXBLOCKS? >>> I have seen crashes with non-intuitive mpi errors when the number of >>> blocks on the processors gets close to the maximum and the code >>> runs out of memory. You might try the run again on more processors. >> >> I have fouled up the usage of more codes than most >> people have ever heard of: if it was an out-of-memory problem and >> maxblocks was not being reached, check to make sure your stacksize is >> not limited. You should be able to run the Unix command "limit" and >> see something like >> >> % limit >> cputime unlimited >> filesize unlimited >> datasize unlimited >> stacksize unlimited >> coredumpsize 0 kbytes >> memoryuse unlimited >> vmemoryuse unlimited >> descriptors 1024 >> memorylocked 32 kbytes >> maxproc 98304 >> >> If that shows some smaller limit, in your .cshrc or .bashrc file, enter >> the line >> unlimit stacksize >> so that all new processes (esp the MPI ones) are started without the >> stacksize limited. >> >> This will only cause you trouble if your code is starting a lot of Java >> virtual machines, or you are directly using pthreads, which are >> both unusual for most HPC MPI codes. >> >> This has been an irritating bug encountered so many times that I've started >> having our sysadmin apply it for the default start up file >> for every new student I get. It always causes troubles that seem far >> removed from the root cause. I strongly suspect that by now LLNL has >> already done this by default for most users, but if you copied over a >> .cshrc file from another machine it may have been overwritten or >> overrided. >> >> And if this was the problem, feel free to post this to the rest of the >> flash-user list. I have no sense of shame anymore, and don't mind >> everyone knowing about how many times I've made this same mistake! 
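
The setrlimit approach described in this thread can be sketched as a small standalone C helper. This is only a minimal sketch under stated assumptions: it is not the contents of FLASH3's dr_set_rlimits.c, and the names raise_stack_limit and print_stack_limit are hypothetical. Unlike the three-line excerpt quoted above, it raises only the soft limit, up to whatever the hard limit already is, because an unprivileged process generally cannot raise its own hard limit.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not FLASH's dr_set_rlimits.c): raise the soft
 * stack limit as far as the hard limit allows.
 * Returns 0 on success, -1 if getrlimit or setrlimit fails. */
int raise_stack_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* The soft limit may be raised up to the hard limit without privileges. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_STACK, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}

/* Print the current soft stack limit, e.g. for a startup log. */
void print_stack_limit(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        perror("getrlimit");
        return;
    }
    if (lim.rlim_cur == RLIM_INFINITY)
        printf("stack limit: unlimited\n");
    else
        printf("stack limit: %llu kbytes\n",
               (unsigned long long)lim.rlim_cur / 1024ULL);
}
```

A program would presumably call raise_stack_limit() once, early in startup, and log the result with print_stack_limit(). From the shell, the corresponding commands are "unlimit stacksize" in csh/tcsh and "ulimit -s unlimited" in bash.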

From bernalcg at astroscu.unam.mx Wed Jun 13 11:43:33 2007
From: bernalcg at astroscu.unam.mx (Cristian Giovanny Bernal)
Date: Wed, 13 Jun 2007 10:43:33 -0600
Subject: [FLASH-USERS] Visit... spherical geometry?
Message-ID: <20070613163348.M25603@astroscu.unam.mx>

Hello people,

I am trying to visualize the results of a simulation in spherical and Cartesian coordinates using the VisIt program. The results of the simulation in Cartesian coordinates are fine, but those in spherical coordinates are not. Does VisIt plot only Cartesian meshes? Is it possible to obtain 1D slices in *.dat format, as in XFLASH?

Thanks

--
Instituto de Astronomia
Universidad Nacional Autonoma de Mexico (UNAM)
http://www.astroscu.unam.mx

From hudson at mcs.anl.gov Wed Jun 13 12:57:28 2007
From: hudson at mcs.anl.gov (Randy Hudson)
Date: Wed, 13 Jun 2007 12:57:28 -0500
Subject: [FLASH-USERS] Visit... spherical geometry?
In-Reply-To: <20070613163348.M25603@astroscu.unam.mx>
References: <20070613163348.M25603@astroscu.unam.mx>
Message-ID: <46703008.6010104@mcs.anl.gov>

We've managed to visualize some 2d data in spherical coordinates.

Here are the steps we followed to do it one way:
    read file
    create a Pseudocolor plot
    add a Transform operator
    display the Transform operator attributes window
    select the Coordinate tab of that window
    select the Spherical radio button of the Input coordinates region
    select the Cartesian radio button of the Output coordinates region
    click the window's Apply button

Let me know if this helps or is wide of the mark.

Cristian Giovanny Bernal wrote:
> Hello people,
>
> I am trying to visualize the results of a simulation in spherical and
> cartesian coordinates using the VISIT program. The results of the simulation
> in cartesian coordinates are fine, but in spherical coordinates not. Visit
> plot cartesian meshes only? It is possible to obtain 1D-slides, in format
> *.dat like in XFLASH?
>
> thanks
>
>

--
Randy.
From hudson at mcs.anl.gov Thu Jun 14 13:56:46 2007
From: hudson at mcs.anl.gov (Randy Hudson)
Date: Thu, 14 Jun 2007 13:56:46 -0500
Subject: [Fwd: Re: [FLASH-USERS] Visit... spherical geometry?]
Message-ID: <46718F6E.6010802@mcs.anl.gov>

FYI...

-------- Original Message --------
Date: Thu, 14 Jun 2007 12:08:16 -0600
From: Cristian Giovanny Bernal
To: Randy Hudson

Yes! Now it is working... thank you very much. One more question: I do not know whether you have used the Lineout mode tool. If so, is it possible to make log-log plots? I appreciate your help very much. Thank you again...

On Wed, 13 Jun 2007 12:57:28 -0500, Randy Hudson wrote
> We've managed to visualize some 2d data in spherical coordinates.
>
> Here are the steps we followed to do it one way:
>     read file
>     create a Pseudocolor plot
>     add a Transform operator
>     display the Transform operator attributes window
>     select the Coordinate tab of that window
>     select the Spherical radio button of the Input coordinates region
>     select the Cartesian radio button of the Output coordinates region
>     click the window's Apply button
>
> Let me know if this helps or is wide of the mark.
>
> Cristian Giovanny Bernal wrote:
> > Hello people,
> >
> > I am trying to visualize the results of a simulation in spherical and
> > cartesian coordinates using the VISIT program. The results of the simulation
> > in cartesian coordinates are fine, but in spherical coordinates not. Visit
> > plot cartesian meshes only? It is possible to obtain 1D-slides, in format
> > *.dat like in XFLASH?
> >
> > thanks
> >
> >
>
> --
> Randy.

--
Instituto de Astronomia
Universidad Nacional Autonoma de Mexico (UNAM)
http://www.astroscu.unam.mx

--
Randy.
From marcus at MPA-Garching.MPG.DE Tue Jun 19 09:33:05 2007
From: marcus at MPA-Garching.MPG.DE (Marcus Brueggen)
Date: Tue, 19 Jun 2007 16:33:05 +0200 (CEST)
Subject: [FLASH-USERS] FLASH workshop in Bremen
Message-ID:

Dear FLASH users,

I wish to advertise a

Workshop on Adaptive-mesh simulations with FLASH

to be held from 15.-18. October 2007 at the Jacobs University Bremen

The aim of this workshop is to bring together scientists to

1. exchange experience and expertise with the code
2. define projects that compare and validate FLASH against other codes
3. foster collaborations to develop the code for future astrophysical applications

Topics discussed may include:

- Magnetohydrodynamics
- Gravity solvers and Cosmology with FLASH
- Radiative transport
- Data analysis and visualisation
- Migrating to FLASH3.0
- Scaling issues

Some members of the FLASH team will join the workshop. We particularly welcome the application of PhD students and postdocs with some prior experience with the FLASH code. The workshop is sponsored by the European Science Foundation and we can offer some support for travel within Europe.

The number of participants is limited to 30. Please register early under: http://www.jacobs-university.de/schools/ses/mbrueggen/

REGISTRATION DEADLINE: 15 AUGUST 2007

There is no registration fee.

From guptasanjib at lanl.gov Mon Jun 4 18:23:07 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Mon, 04 Jun 2007 17:23:07 -0600
Subject: [FLASH-USERS] maximum number of plot files per run
Message-ID: <46649EDB.8080901@lanl.gov>

Hello,

Hopefully this has a very simple resolution - the maximum number of plot files I get per run is 9999, so all the plot files are named with zero paddings for file # < 9999, e.g. "hc-rt_hdf5_plt_cnt_0057" upto "hc-rt_hdf5_plt_cnt_9999" anything higher comes out as "hc-rt_hdf5_plt_cnt_****" which I am sure has to do with an allowed length of string for the filename........

Do I have to change code in a driver module if I want it to print , say 1000000 or (many) more plot files, if so, which source file do I modify, and what is the safest (minimalist) way of doing this without messing up the rest of the code?

Is there a way of getting this done by changing something in "flash.par" ? I see in "flash.par" the variables
basenm="hc-rt_"
which no doubt is used to construct the plot and check file names. Also there is
run_number="001"
is that used to set the length of the file name string somehow?

Thanks much,
Sanjib Gupta

From gjordan at flash.uchicago.edu Mon Jun 4 19:15:25 2007
From: gjordan at flash.uchicago.edu (George Jordan)
Date: Mon, 4 Jun 2007 19:15:25 -0500 (CDT)
Subject: [FLASH-USERS] maximum number of plot files per run
In-Reply-To: <46649EDB.8080901@lanl.gov>
References: <46649EDB.8080901@lanl.gov>
Message-ID:

Hi Sanjib,

How is LANL going? It looks like the place to change the filename to include more digits is in the fortran file io_getOutputName.F90. The two places that need to be changed are lines #54 and #75. Keep in mind that the filename string is passed to several .c files that call the hdf5 routines and that strings, FORTRAN, and C when combined can lead to funny results.
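
The "****" in the file name is what Fortran prints when an integer does not fit into its fixed-width output field. A minimal C sketch of that behaviour, and of the effect of widening the field, follows. It is not FLASH's io_getOutputName.F90: the four-digit width is only inferred from the file names quoted in this thread, and plot_file_name is a hypothetical name introduced for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<basenm>hdf5_plt_cnt_<NNNN>" with a
 * zero-padded counter of `width` digits. It imitates Fortran's
 * fixed-width integer output: if the counter needs more than `width`
 * digits, the field is filled with '*'.
 * Returns 0 on success, -1 on invalid arguments or if buf is too small. */
int plot_file_name(char *buf, size_t buflen, const char *basenm,
                   int count, int width)
{
    char digits[32];
    int ndigits, n;

    if (count < 0 || width < 1 || width >= (int)sizeof digits)
        return -1;

    /* snprintf returns the number of characters the counter needs. */
    ndigits = snprintf(digits, sizeof digits, "%0*d", width, count);
    if (ndigits > width) {
        /* Counter overflows the field: emulate Fortran's asterisk fill. */
        memset(digits, '*', (size_t)width);
        digits[width] = '\0';
    }

    n = snprintf(buf, buflen, "%shdf5_plt_cnt_%s", basenm, digits);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

With width 4, counter 57 gives "hc-rt_hdf5_plt_cnt_0057" and counter 10000 gives "hc-rt_hdf5_plt_cnt_****"; with width 7, counter 10000 gives "hc-rt_hdf5_plt_cnt_0010000". Any tool that parses the old four-digit names would need the same change.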
If possible I would suggest making several directories and just resetting the plot file number with each restart (plotFileNumber in flash.par). This assumes that you would have to restart the simulation before the 10,000th plot file is created and that you would restart in a new directory. Best, Cal On Mon, 4 Jun 2007, sanjib gupta wrote: > Hello, > > Hopefully this has a very simple resolution - the maximum number of plot > files I get per run is > 9999, so all the plot files are named with zero paddings for file # < 9999, > e.g. > "hc-rt_hdf5_plt_cnt_0057" > upto "hc-rt_hdf5_plt_cnt_9999" anything higher comes out as > "hc-rt_hdf5_plt_cnt_****" > which I am sure has to do with an allowed length of string for the > filename........ > Do I have to change code in a driver module if I want it to print , say > 1000000 or (many) more plot files, if so, > which source file do I modify, and what is the safest (minimalist) way of > doing this without messing up > the rest of the code? > Is there a way of getting this done by changing something in "flash.par" ? I > see in "flash.par" > the variables > basenm="hc-rt_" > which no doubt is used to construct the plot and check file names. Also there > is > run_number="001" > is that used to set the length of the file name string somehow? > Thanks much, > Sanjib Gupta > From mzingale at scotty.ess.sunysb.edu Mon Jun 4 19:22:41 2007 From: mzingale at scotty.ess.sunysb.edu (Mike Zingale) Date: Mon, 4 Jun 2007 20:22:41 -0400 (EDT) Subject: [FLASH-USERS] xflash is crashing In-Reply-To: <200706041645.54335.van@astro.ox.ac.uk> References: <200706041645.54335.van@astro.ox.ac.uk> Message-ID: This is a likely bug in IDL that is interacting with a recently patched buffer overflow in the xorg server, see: http://www.ittvis.com/services/techtip.asp?ttid=4177 They recommend you roll back your libx11 or wait for IDL 6.4 Mike On Mon, 4 Jun 2007, Vincenzo Antonuccio wrote: > I have a problem using xflash. Here is my configuration: > > FLASH v. 
2.5, using fidlr3 > IDL: version 6.3, on > HDF5 v. 1.6.2 >> uname -a: Linux uist 2.6.8-3-686 #1 Tue Dec 5 21:26:38 UTC 2006 i686 > GNU/Linux > > I can successfully launch xflash: I get in fact the widget. Then I try to > upload one of the HDF5, 2D sedov outputs. The file is correctly read. But > when I try to plot density or pressure or something else, IDL goes into > segmentation fault. > > Now: I know that at page 231 of the FLASH v. 2.5 UG is written that, for IDL > v. 6.0 and higher, there is a problem with HDF5, so one should revert to > fidlr2. AND THIS IS WHAT I DID, but the result is the same: as soon as I try > to plot something using xflash, I get segmentation fault. > > Now, I have tried to debug the problem, and I think that it arises when > TVIMAGE calls CMCONGRID, around this point: > ............. > 2: BEGIN ; *** TWO DIMENSIONAL ARRAY > IF int THEN BEGIN > srx = float(s(1) - m1) / (x-m1) * findgen(x) + halfx > sry = float(s(2) - m1) / (y-m1) * findgen(y) + halfy > ;print, 'check #2', 'ok 0a' > RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub) > ;print, 'check #2', 'ok 0b' > ........... > > (The commented "print" are mine). What I can see is that IDL crashes at > "RETURN, INTERPOLATE(arr, srx, sry, /GRID, CUBIC=cub)". > > Any idea/suggestion? > > Many thanks in advance > > ****************************************** > Vincenzo ANTONUCCIO-DELOGU > > Astrophysics, University of Oxford > Denys Wilkinson Building > Keble Road, Oxford OX1 3RH > United Kingdom > > and > > INAF - Catania Astrophysical Observatory, > Catania, ITALY > > Room 555A, Beecroft Inst. for Particle Astrophysics > Tlf.: +44-(0)1865 283019 > Fax: +44-(0)1865-273390 > > e-mail: van at astro.ox.ac.uk > skype: eurocosmo > ----------------------------------------------------------------------------- Michael Zingale (mzingale at mail.astro.sunysb.edu) Assistant Professor Dept. 
of Physics and Astronomy          office: ESS 440
SUNY Stony Brook                  phone:  631-632-8225
Stony Brook, NY 11794-3800        web:    http://www.astro.sunysb.edu/mzingale
-----------------------------------------------------------------------------

From guptasanjib at lanl.gov Mon Jun 4 20:21:05 2007
From: guptasanjib at lanl.gov (sanjib gupta)
Date: Mon, 04 Jun 2007 19:21:05 -0600
Subject: [FLASH-USERS] Benchmarking FLASH/Restart at different speed issue
Message-ID: <4664BA81.8050502@lanl.gov>

Hello,

I've noticed a peculiarity regarding the speed at which FLASH runs. Say I set up a job on 64 processors and run it for a while. Then I kill the job and restart it from the last checkpoint file written in the first run, with no changes in "flash.par" except that restart=true and cpnumber is specified. The same job now runs 20-30 times faster, without any change in the number of processors.

Any hints? It would be good to know why this is happening, since we plan to benchmark FLASH runs vs. the number of processors used and get a sense of how fine a resolution we can afford for our turbulent burning scenarios.

Thanks much,
Sanjib.

From ipatov at dtm.ciw.edu Mon Jun 4 23:20:26 2007
From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu)
Date: Tue, 5 Jun 2007 00:20:26 -0400 (EDT)
Subject: [FLASH-USERS] h5_write.c
Message-ID: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu>

Dear FLASH users.
I have successfully installed and used FLASH2.5 on one computer.
Recently I made the same installation [with the same tar files] on another computer (of a different type), but was not able to make test runs on the second computer.
Now after ./setup sedov -auto [same for sod] gmake I got: h5_write.c: in function 'h5_write_tree_info_' h5_write.c:1004: 'status' undeclared (first use in this function) h5_write.c:1005: 'H5P_DEFAULT' undeclared (first use in this function) h5_write.c:1005: 'refine_level' undeclared (first use in this function) h5_write.c: At top level: h5_write.c:1897: parse error before '*' token h5_write.c: in function 'h5_write_particles_' And a great number of similar messages. At the end: gmake[1]: *** [h5_write.o] Error 1 Then I tried to use hdf5-1.6.4 instead of the previous hdf5-1.6.2.tar, but got the same errors. Does somebody know what must be done to solve the problem? Best regards Sergei From nhearn at uchicago.edu Tue Jun 5 09:13:44 2007 From: nhearn at uchicago.edu (Nathan Hearn) Date: Tue, 5 Jun 2007 09:13:44 -0500 Subject: [FLASH-USERS] h5_write.c In-Reply-To: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu> References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu> Message-ID: <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> Hi Sergei, These are very curious errors, as they would suggest that the HDF5 header file is not being included. However, I would have expected the compile to fail at the line #include <hdf5.h> and not get any further. (Also, it is somewhat odd that there seem to be no errors when the HDF5-related variables are declared, but only when they are used.) I would suggest that you inspect the Makefile.h for this build. (Did you create a new Makefile.h for this machine?) In particular, make sure that all of the HDF5 settings are correct -- this may simply involve setting HDF5_PATH to the correct directory, but you will want to make sure that this is reflected in the CFLAGS_HDF5 and the LIB_HDF5 statements. Also, if you haven't already, be sure to turn on all warning messages for your compilers by adding the appropriate flags to CFLAGS_OPT and FFLAGS_OPT (or, if appropriate, CFLAGS_DEBUG, CFLAGS_TEST, etc.).
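The HDF5 settings mentioned above can be sanity-checked before rebuilding. What follows is only a minimal sketch in Python and is not part of FLASH: the function name check_hdf5_prefix is hypothetical, and it assumes the conventional HDF5 install layout (include/hdf5.h and lib/libhdf5.* under the directory that HDF5_PATH points to), which may differ on a given machine.

```python
from pathlib import Path


def check_hdf5_prefix(prefix):
    """Return a list of problems found under a candidate HDF5_PATH.

    Assumes the conventional layout <prefix>/include/hdf5.h and
    <prefix>/lib/libhdf5.* (static or shared). That layout is an
    assumption of this sketch, not something FLASH enforces.
    """
    root = Path(prefix)
    problems = []

    # The C sources need the main HDF5 header on the include path.
    if not (root / "include" / "hdf5.h").is_file():
        problems.append("missing header: include/hdf5.h")

    # The link step and the runtime loader need libhdf5.a / libhdf5.so*.
    lib_dir = root / "lib"
    if not lib_dir.is_dir() or not any(lib_dir.glob("libhdf5.*")):
        problems.append("missing library: lib/libhdf5.*")

    return problems
```

Calling check_hdf5_prefix with the directory used for HDF5_PATH returns an empty list when both pieces are found; any entry in the list points at the include or library setting to revisit in Makefile.h.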
If you are still having problems, could you email the compiler command that gmake issues for h5_write.c, along with the first few warning and/or error messages that it generates? - Nathan On 6/4/07, ipatov at dtm.ciw.edu wrote: > Dear FLASH users. > I have succesfully installed and used FLASH2.5 on one computer. > Recently I made the same instalation [with the same tar files] > on another computer (of different type), but > was not able to make test runs on the second computer. > Now after > ./setup sedov -auto [same for sod] > gmake > I got: > h5_write.c: in function 'h5_write_tree_info_' > h5_write.c:1004: 'status' undecleared (first use in this function) > h5_write.c:1005: 'H5P_DEFAULT' undecleared (first use in this function) > h5_write.c:1005: 'refine_level' undecleared (first use in this function) > h5_write.c: At top level: > h5_write.c: 1897: parse error before '*' token > h5_write.c: in function 'h5_write_particles_' > And a great number of similar messages. > At the end: > gmake[1]: *** [h5_write.o] Error 1 > > Then I tried to use hdf5-1.6.4 instead of the previous > hdf5-1.6.2.tar, but got the same errors. > > Does somebody know what must be done to solve the problem? > > Best regards > Sergei -- Nathan C. Hearn nhearn at uchicago.edu ASC Flash Center Computational Physics Group University of Chicago From van at astro.ox.ac.uk Tue Jun 5 10:32:42 2007 From: van at astro.ox.ac.uk (Vincenzo Antonuccio) Date: Tue, 5 Jun 2007 16:32:42 +0100 Subject: [FLASH-USERS] xflash is crashing In-Reply-To: References: <200706041645.54335.van@astro.ox.ac.uk> Message-ID: <200706051632.42589.van@astro.ox.ac.uk> On Tuesday 05 June 2007 01:22, Mike Zingale wrote: > This is a likely bug in IDL that is interacting with a recently patched > buffer overflow in the xorg server, > ................. 
> They recommend you roll back your libx11 or wait for IDL 6.4 > > Mike I can confirm your suspicion, Mike: here in Oxford they have just updated IDL to 6.4, and the system does not crash anymore. The reason it recently started to crash is that they had applied a security patch to X11, and the latter was "colliding" with IDL. Many thanks to everybody for the suggestions. I will check VisIt as soon as I can. ****************************************** Vincenzo ANTONUCCIO-DELOGU Marie Curie fellow, until June 30th, 2007 at: Astrophysics, University of Oxford Denys Wilkinson Building Keble Road, Oxford OX1 3RH United Kingdom (Home Institution: INAF - Catania Astrophysical Observatory, Catania, ITALY) Room 555A, Beecroft Inst. for Particle Astrophysics Tlf.: +44-(0)1865 283019 Fax: +44-(0)1865-273390 e-mail: van at astro.ox.ac.uk skype: eurocosmo 'Malheur a l'homme d'etude qui n'est d'aucune coterie, on lui reprochera jusqu'a de petits succes fort incertains, et la haute vertu triomphera en le volant.' Guai all'intellettuale che non appartiene a nessuna consorteria: gli sara' rimproverato ogni successo, anche il piu' incerto, e la virtu' trionfera' derubandolo. (Stendhal, Le Rouge et le Noir) From richp at flash.uchicago.edu Tue Jun 5 12:44:55 2007 From: richp at flash.uchicago.edu (Paul M. Rich) Date: Tue, 05 Jun 2007 12:44:55 -0500 Subject: [FLASH-USERS] Benchmarking FLASH/Restart at different speed issue In-Reply-To: <4664BA81.8050502@lanl.gov> References: <4664BA81.8050502@lanl.gov> Message-ID: <4665A117.7010606@flash.uchicago.edu> Sanjib, That is a very interesting problem. Could you provide some more information about it, such as the architecture you are running on, some of the output and/or the logfile? --Paul Rich sanjib gupta wrote: > Hello, > > I've noticed a peculiarity regarding the speed at which FLASH runs .... > Say I set up a job on 64 processors and run it for a while.
> Then I kill the job , restart the job from the last checkpoint file > outputted in the first run, no changes in > "flash.par" except restart=true and cpnumber is specified.... > The same job now runs 20-30 times faster, without any change of # of > processors. > Any hints ? > Would be good to know why this is happening since we plan to > benchmark FLASH runs vs. # of processors used and get a sense of > how fine a resolution we can afford for our turbulent burning > scenarios... > > Thanks much, > Sanjib. > > > From ipatov at dtm.ciw.edu Tue Jun 5 16:10:15 2007 From: ipatov at dtm.ciw.edu (ipatov at dtm.ciw.edu) Date: Tue, 5 Jun 2007 17:10:15 -0400 (EDT) Subject: [FLASH-USERS] h5_write.c In-Reply-To: <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> References: <33024.10.101.1.203.1181017226.squirrel@www.dtm.ciw.edu> <2467fdc0706050713h7b957f4ancb5f43382d7f1a8d@mail.gmail.com> Message-ID: <33207.10.101.1.203.1181077815.squirrel@www.dtm.ciw.edu> Dear Nathan Hearn, Thank you for your e-mail. The problem below was solved after we changed the file FLASH2.5/source/sites/Prototypes/Linux/Makefile.h; e.g., instead of CCOMP = mpicc we use CPPCOMP=g++ CCOMP=gcc. But after mpirun -np 1 flash2 we get: flash2: error while loading shared libraries: libhdf5.so.: cannot open shared object file: No such file or directory Sergei > Hi Sergei, > > These are very curious errors, as they would suggest that the HDF5 > header file is not being included. However, I would have expected the > compile to fail at the line > > #include <hdf5.h> > > and not get any further. (Also, it is somewhat odd that there seem to > be no errors when the HDF5-related variables are declared, but only > when they are used.) > > I would suggest that you inspect the Makefile.h for this build. > (Did you create a new Makefile.h for this machine?)
In particular, > make sure that all of the HDF5 settings are correct -- this may simply > involve setting HDF5_PATH to the correct directory, but you will want > to make sure that this is reflected in the CFLAGS_HDF5 and the > LIB_HDF5 statements. Also, if you haven't already, be sure to turn on > all warning messages for your compilers by adding the appropriate > flags to CFLAGS_OPT and FFLAGS_OPT (or, if appropriate, CFLAGS_DEBUG, > CFLAGS_TEST, etc.). > > If you are still having problems, could you email the compiler > command that gmake issues for h5_write.c, along with the first few > warning and/or error messages that it generates? > > > - Nathan > > > On 6/4/07, ipatov at dtm.ciw.edu wrote: >> Dear FLASH users. >> I have succesfully installed and used FLASH2.5 on one computer. >> Recently I made the same instalation [with the same tar files] >> on another computer (of different type), but >> was not able to make test runs on the second computer. >> Now after >> ./setup sedov -auto [same for sod] >> gmake >> I got: >> h5_write.c: in function 'h5_write_tree_info_' >> h5_write.c:1004: 'status' undecleared (first use in this function) >> h5_write.c:1005: 'H5P_DEFAULT' undecleared (first use in this function) >> h5_write.c:1005: 'refine_level' undecleared (first use in this >> function) >> h5_write.c: At top level: >> h5_write.c: 1897: parse error before '*' token >> h5_write.c: in function 'h5_write_particles_' >> And a great number of similar messages. >> At the end: >> gmake[1]: *** [h5_write.o] Error 1 >> >> Then I tried to use hdf5-1.6.4 instead of the previous >> hdf5-1.6.2.tar, but got the same errors. >> >> Does somebody know what must be done to solve the problem? >> >> Best regards >> Sergei > > -- > Nathan C. 
Hearn > nhearn at uchicago.edu > > ASC Flash Center > Computational Physics Group > University of Chicago > From guptasanjib at lanl.gov Tue Jun 5 18:57:43 2007 From: guptasanjib at lanl.gov (sanjib gupta) Date: Tue, 05 Jun 2007 17:57:43 -0600 Subject: [FLASH-USERS] Details of speedup after restart Message-ID: <4665F877.2030402@lanl.gov> Hi, I am attaching 2 log files - the initial run on 128 processors, then immediately killing the job and restarting from the first checkpoint file "hc-rt-hdf5_chk_0000". Notice about 4 timesteps per second initially, then ~30 timesteps/sec after restart. On 64 processors I noticed the gain was higher, but my resolution was lower (half the number of nblocky, same nblockx, this is a 2D run) - sorry, I did not keep the logfiles. However this "gain" cannot be predicted......sometimes I don't get it on the first restart, so I restart a couple of times! As you all can guess, this plays havoc with any benchmarking efforts .......and we do intend to showcase our results from FLASH soon ... :-) We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster ....and hdf5 version 1.6.5 ......Makefile.h is attached. Architecture - 64 bit AMD Opteron running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 Thanks much for your help/insight/suggestions, Sanjib. -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: hc-rt-firstrun_06_05_07.log Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment-0002.pl -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: hc-rt-restart_06_05_07.log Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment-0003.pl -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: Makefile.h Url: http://flash.uchicago.edu/pipermail/flash-users/attachments/20070605/172918eb/attachment-0001.h From nhearn at uchicago.edu Tue Jun 5 20:04:17 2007 From: nhearn at uchicago.edu (Nathan Hearn) Date: Tue, 5 Jun 2007 20:04:17 -0500 Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <4665F877.2030402@lanl.gov> References: <4665F877.2030402@lanl.gov> Message-ID: <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com> Hi Sanjib, This is very curious. How do the output data files that come from restarts compare with those from non-restarts? Are they binary-equivalent? My concern is that the speed-up is coming from the code processing the input data differently each time. Alternatively, is it possible that there are background operations on the compute nodes (e.g., cluster node monitors, file system checks, network loads, etc.) that are interfering with your benchmarking runs? (Can you use top on any of the processing nodes while the simulation is running?) - Nathan On 6/5/07, sanjib gupta wrote: > Hi, > > I am attaching 2 log files - the initial run on 128 processors, then > immediately killing the job and restarting from the first checkpoint > file "hc-rt-hdf5_chk_0000" > notice about 4 timesteps per second initially, then ~30 timesteps/sec > after restart. > > On 64 processors I noticed the gain was higher , but my resolution was > lower (half the number of nblocky, same nblockx, this is a 2D run)- > sorry did not keep the logfiles. > > However this "gain" cannot be predicted......sometimes I don't get it on > the first restart, so I restart a couple of times! > As you'all can guess, this plays havoc with any benchmarking efforts > .......and we do intend to showcase our results from FLASH soon ... :-) > > We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster > ....and hdf5 version 1.6.5 ......Makefile.h is attached. 
> Architecture - 64 bit AMD Opteron > running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 > > Thanks much for your help/insight/suggestions, > Sanjib. From guptasanjib at lanl.gov Tue Jun 5 20:44:50 2007 From: guptasanjib at lanl.gov (sanjib gupta) Date: Tue, 05 Jun 2007 19:44:50 -0600 Subject: speedups output files Re: [FLASH-USERS] Details of speedup after restart In-Reply-To: <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com> References: <4665F877.2030402@lanl.gov> <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com> Message-ID: <46661192.9000204@lanl.gov> Hi Nathan, The output files - the HDF5 check and plot files - are the same size as before restarts..... and seem fine - they pick up exactly where the run left off, whether it is the thermodynamic conditions or mass fractions, and the resulting burning looks perfectly reasonable... I am not sure what you mean by binary-equivalent..... background operations are usually carried out on a different queue on the cluster......the cluster top command, btop, just lets me know which nodes are free and which I am using.... and it shows usage at 99 % or so, meaning I have all CPU usage on the dual-processor nodes......this however could change during maintenance hours like 1-3 am when I am not around to monitor usage......sometimes the number of total timesteps at the end of the night does not make sense (too few), but this has only been a couple of times.... and I have not combed the usually voluminous logfiles unless something really goes wrong. However, I don't think background processes are responsible for the slow first run - it is too consistent, and we get maintenance messages from the Cluster operators..... I would be aware of them. Also the CPU usage - initially when this happened I remember doing a lot of checks on the allotted processors, I was getting near 100%, before and after the restart. The restart is from "flash.dat"? How does it work?
I was curious anyway, since it would be nice to do things like.....change the resolution at a restart. If only the thermo conditions etc. are transported to the restarted run, and properly rezoned, one would not have to worry? Unless of course dynamic structures have developed based on the earlier zoning ...but for relatively quiescent scenarios that are being restarted this should approximately work? Thanks Sanjib Nathan Hearn wrote: > Hi Sanjib, > > This is very curious. How do the output data files that come from > restarts compare with those from non-restarts? Are they > binary-equivalent? My concern is that the speed-up is coming from the > code processing the input data differently each time. > > Alternatively, is it possible that there are background operations > on the compute nodes (e.g., cluster node monitors, file system checks, > network loads, etc.) that are interfering with your benchmarking runs? > (Can you use top on any of the processing nodes while the simulation > is running?) > > > - Nathan > > > On 6/5/07, sanjib gupta wrote: >> Hi, >> >> I am attaching 2 log files - the initial run on 128 processors, then >> immediately killing the job and restarting from the first checkpoint >> file "hc-rt-hdf5_chk_0000" >> notice about 4 timesteps per second initially, then ~30 timesteps/sec >> after restart. >> >> On 64 processors I noticed the gain was higher , but my resolution was >> lower (half the number of nblocky, same nblockx, this is a 2D run)- >> sorry did not keep the logfiles. >> >> However this "gain" cannot be predicted......sometimes I don't get it on >> the first restart, so I restart a couple of times! >> As you'all can guess, this plays havoc with any benchmarking efforts >> .......and we do intend to showcase our results from FLASH soon ... >> :-) >> >> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster >> ....and hdf5 version 1.6.5 ......Makefile.h is attached. 
>> Architecture - 64 bit AMD Opteron >> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 >> >> Thanks much for your help/insight/suggestions, >> Sanjib. From sheeler at flash.uchicago.edu Tue Jun 5 20:54:46 2007 From: sheeler at flash.uchicago.edu (Dan Sheeler) Date: Tue, 5 Jun 2007 20:54:46 -0500 (CDT) Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <4665F877.2030402@lanl.gov> References: <4665F877.2030402@lanl.gov> Message-ID: Does this run have just 64 blocks (4 x 16)? If this is a standard flash setup with local physics, having less than one block per process probably will produce weird performance numbers. In a standard run, 2d 8x8 blocks require very little ram or work per process. Single processes are happy working on thousands of blocks. Furthermore, work is distributed to the processors in nothing smaller than block-sized chunks. If your run is a typical setup, then half of the processes have more-or-less nothing to do but add communication, and I can imagine that would cause runtime to fluctuate non-deterministically. Dan -- Dan Sheeler ASC Flash Center sheeler at flash.uchicago.edu (773) 834-3236 On Tue, 5 Jun 2007, sanjib gupta wrote: > Hi, > > I am attaching 2 log files - the initial run on 128 processors, then > immediately killing the job and restarting from the first checkpoint file > "hc-rt-hdf5_chk_0000" > notice about 4 timesteps per second initially, then ~30 timesteps/sec after > restart. > > On 64 processors I noticed the gain was higher , but my resolution was lower > (half the number of nblocky, same nblockx, this is a 2D run)- sorry did not > keep the logfiles. > > However this "gain" cannot be predicted......sometimes I don't get it on the > first restart, so I restart a couple of times! > As you'all can guess, this plays havoc with any benchmarking efforts > .......and we do intend to showcase our results from FLASH soon ... 
:-) > > We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux cluster > ....and hdf5 version 1.6.5 ......Makefile.h is attached. > Architecture - 64 bit AMD Opteron > running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 > > Thanks much for your help/insight/suggestions, > Sanjib. > From guptasanjib at lanl.gov Tue Jun 5 21:17:03 2007 From: guptasanjib at lanl.gov (sanjib gupta) Date: Tue, 05 Jun 2007 20:17:03 -0600 Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: References: <4665F877.2030402@lanl.gov> Message-ID: <4666191F.70301@lanl.gov> For instance, when I ran using 64 processors I had nblockx=4 nblocky=16 in "flash.par"................so that makes sense. So that means that if all allotted processors are at 100% usage by my job when I am at 1block/proc, then I will get no enhancements from doubling to 128 processors, so it makes sense to increase the resolution instead? That explains some tests I did to see how the runs scale with #processors, but I'm not sure it explains the drastic difference between initial run vs. all subsequent (restarted) runs. Thanks, Sanjib. Dan Sheeler wrote: > Does this run have just 64 blocks (4 x 16)? If this is a standard > flash setup with local physics, having less than one block per process > probably will produce weird performance numbers. In a standard run, > 2d 8x8 blocks require very little ram or work per process. Single > processes are happy working on thousands of blocks. Furthermore, work > is distributed to the processors in nothing smaller than block-sized > chunks. If your run is a typical setup, then half of the processes > have more-or-less nothing to do but add communication, and I can > imagine that would cause runtime to fluctuate non-deterministically. 
> > Dan > > -- > Dan Sheeler > ASC Flash Center > sheeler at flash.uchicago.edu > (773) 834-3236 > > On Tue, 5 Jun 2007, sanjib gupta wrote: > >> Hi, >> >> I am attaching 2 log files - the initial run on 128 processors, then >> immediately killing the job and restarting from the first checkpoint >> file "hc-rt-hdf5_chk_0000" >> notice about 4 timesteps per second initially, then ~30 timesteps/sec >> after restart. >> >> On 64 processors I noticed the gain was higher , but my resolution >> was lower (half the number of nblocky, same nblockx, this is a 2D >> run)- sorry did not keep the logfiles. >> >> However this "gain" cannot be predicted......sometimes I don't get it >> on the first restart, so I restart a couple of times! >> As you'all can guess, this plays havoc with any benchmarking efforts >> .......and we do intend to showcase our results from FLASH soon ... >> :-) >> >> We compile with intel fortran 9.1.033 and openmpi 1.1 on a linux >> cluster ....and hdf5 version 1.6.5 ......Makefile.h is attached. >> Architecture - 64 bit AMD Opteron >> running FC3 linux + BProcV4 (cluster OS) with kernel = 2.6.14 >> >> Thanks much for your help/insight/suggestions, >> Sanjib. >> From gawrysz at camk.edu.pl Wed Jun 6 00:11:56 2007 From: gawrysz at camk.edu.pl (Artur Gawryszczak) Date: Wed, 6 Jun 2007 07:11:56 +0200 Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <4665F877.2030402@lanl.gov> References: <4665F877.2030402@lanl.gov> Message-ID: <200706060711.57227@phoenix.camk.edu.pl> Hi, On Wednesday, 6 June 2007, sanjib gupta wrote: > I am attaching 2 log files - the initial run on 128 processors, then > immediately killing the job and restarting from the first checkpoint > file "hc-rt-hdf5_chk_0000" > notice about 4 timesteps per second initially, then ~30 timesteps/sec > after restart.
You're using only the base level of refinement (lrefine_max=1), which has a non-obvious side effect: when you start from scratch, only the master processor gets the work and the others are just waiting. After a restart the blocks are distributed and the run becomes truly parallel. I'd suggest you decrease nblock[xy] and use lrefine_min=2 and lrefine_max=2 instead; then, after refining to the second level, the blocks will be distributed. If you don't require extremely flexible AMR, you may also increase nxb and nyb from 8 to 16 or 32 (at compile time), which will reduce the overhead due to communication. Your setup is also relatively small - it's just 32x128 cells - so it probably makes little sense to use more than 4 or 8 CPUs for it. -- Cheers, Artur From nhearn at uchicago.edu Wed Jun 6 04:37:07 2007 From: nhearn at uchicago.edu (Nathan Hearn) Date: Wed, 6 Jun 2007 04:37:07 -0500 Subject: speedups output files Re: [FLASH-USERS] Details of speedup after restart In-Reply-To: <46661192.9000204@lanl.gov> References: <4665F877.2030402@lanl.gov> <2467fdc0706051804j536cec5budcd2c76f55d5839@mail.gmail.com> <46661192.9000204@lanl.gov> Message-ID: <2467fdc0706060237o1008f4d2nb7743b9cab37fd53@mail.gmail.com> Hi Sanjib, From the main email thread, it looks like you are on your way to a solution to the speedup issue. (I like Artur's explanation, as blocks are probably redistributed only during AMR refine/derefine steps. However, it is still unclear to me why you only occasionally see the speedup during restart.) Regarding the nature of the restarts, I believe that flash.dat merely stores the output stream for diagnostic data generated during the run. Unless there are some specific input files required by the Flash modules in use, the only files needed for restart are flash.par and the checkpoint file. The checkpoint file contains all the information necessary to reconstruct the mesh. Changing the resolution during a restart is a somewhat complicated issue.
By design, Flash uses the structural information stored in the checkpoint file to build the mesh in memory, and there is no re-meshing capability included. However, it should be possible to alter the lrefine settings in flash.par to force Flash to change the minimum and maximum levels of refinement after the checkpoint data is loaded. (This has been a recent topic of discussion here.) Right now, more significant changes to the mesh -- such as altering the physical size of the domain, changing the arrangement of base blocks, or changing the number of zones per block -- are not permitted. (I have been working on routines for resampling Flash data files during the init_block stage, but they are still in development.) - Nathan On 6/5/07, sanjib gupta wrote: > Hi Nathan, > > The output files - the HDF5 check and plot files are the same size as > before restarts..... > and seem fine - they pick up exactly where the run left off, whether it is > the thermodynamic conditions or mass fractions, and the resulting > burning looks perfectly reasonable... > I am not sure what you mean by binary-equivalent..... > > background operations are usually carried out on a different queue on > the cluster......the cluster top command > btop, just lets me know which nodes are free and which I am using.... > and it shows usage at 99 % or so, meaning I have all > CPU usage on the dual-processor nodes......this however could change > during maintenance hours like 1-3 am when I am not around to > monitor usage......sometimes the number of total timesteps at the end of > the night does not make sense (too few), but this has only been a couple > of times.... > and I have not combed the usually voluminous logfiles unless something > really goes wrong. > > However, I don't think background processes are responsible for the slow > first run - it is too consistent , and we get maintenance messages from > the Cluster operators..... > I would be aware of them.
> Also the CPU usage - initially when this happened I remember doing a lot > of checks on the allotted processors, I was getting near 100%, before > and after the restart. > > The restart is from "flash.dat" ? How does it work? I was curious > anyway, since it would be nice to do things like.....change the > resolution at a restart. If only the thermo conditions etc. are > transported to the > restarted run, and properly rezoned, one would not have to worry? Unless > of course dynamic structures have developed based on the earlier zoning > ...but for relatively quiescent scenarios that are being restarted this > should approximately work? > > Thanks > Sanjib -- Nathan C. Hearn nhearn at uchicago.edu ASC Flash Center Computational Physics Group University of Chicago From dubey at flash.uchicago.edu Wed Jun 6 04:40:43 2007 From: dubey at flash.uchicago.edu (Anshu Dubey) Date: Wed, 6 Jun 2007 04:40:43 -0500 (CDT) Subject: [FLASH-USERS] Details of speedup after restart In-Reply-To: <200706060711.57227@ph