speedups output files Re: [FLASH-USERS] Details of speedup after restart
sanjib gupta
guptasanjib at lanl.gov
Tue Jun 5 20:44:50 CDT 2007
Hi Nathan,
The output files (the HDF5 checkpoint and plot files) are the same size
as before the restarts and seem fine. They pick up exactly where the run
left off, in both the thermodynamic conditions and the mass fractions,
and the resulting burning looks perfectly reasonable.
I am not sure what you mean by binary-equivalent.
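For reference, "binary-equivalent" is usually taken to mean that two files are byte-for-byte identical. A minimal sketch of such a check in Python follows; the function names are hypothetical, and the paths passed in would be whichever pair of output files is being compared. Note that this is a stricter test than comparing the stored values: files that differ at the byte level (for example, because they embed timestamps) may still hold identical datasets.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large checkpoint files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def binary_equivalent(path_a, path_b):
    """True if both files have identical contents, byte for byte."""
    return file_digest(path_a) == file_digest(path_b)
```

Calling binary_equivalent on a file from the original run and the file with the same name from the restarted run would answer the question directly.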
Background operations are usually carried out on a different queue on
the cluster. The cluster top command, btop, just lets me know which
nodes are free and which I am using. It shows usage at 99% or so,
meaning I have all the CPU usage on the dual-processor nodes. This
could change, however, during maintenance hours (around 1-3 am), when I
am not around to monitor usage. Sometimes the total number of timesteps
at the end of the night does not make sense (too few), but this has only
happened a couple of times, and I do not comb the usually voluminous
logfiles unless something really goes wrong.
However, I don't think background processes are responsible for the slow
first run: it is too consistent, and we get maintenance messages from
the cluster operators, so I would be aware of them.
As for the CPU usage, when this first happened I remember doing a lot
of checks on the allotted processors, and I was getting near 100% both
before and after the restart.
The restart is from "flash.dat"? How does it work? I was curious
anyway, since it would be nice to do things like change the resolution
at a restart. If only the thermodynamic conditions etc. are transported
to the restarted run and properly rezoned, one would not have to worry?
Unless, of course, dynamic structures have developed based on the
earlier zoning; but for relatively quiescent scenarios that are being
restarted, this should approximately work?
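To make the rezoning idea concrete, here is a minimal 1D sketch in Python. It is only an illustration under stated assumptions, not how FLASH handles restarts or refinement: it doubles the resolution by copying each coarse cell value into two fine cells (piecewise-constant injection), which keeps the integrated quantity unchanged. The names refine_by_two and total are hypothetical.

```python
def refine_by_two(coarse):
    """Split every coarse cell into two fine cells carrying the same value."""
    fine = []
    for value in coarse:
        fine.extend([value, value])
    return fine


def total(values, dx):
    """Integrated quantity for cell-averaged values on a uniform grid of width dx."""
    return sum(v * dx for v in values)


# Example: a density profile on 4 cells of width 1.0, rezoned to 8 cells of width 0.5.
coarse_density = [1.0, 1.0, 2.0, 4.0]
fine_density = refine_by_two(coarse_density)
```

In this example total(coarse_density, 1.0) and total(fine_density, 0.5) agree, so the rezoned state carries the same mass. Steep features would stay blocky at the old resolution, which is why this kind of transfer is most plausible for the relatively quiescent scenarios mentioned above.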
Thanks
Sanjib
Nathan Hearn wrote:
> Hi Sanjib,
>
> This is very curious. How do the output data files that come from
> restarts compare with those from non-restarts? Are they
> binary-equivalent? My concern is that the speed-up is coming from the
> code processing the input data differently each time.
>
> Alternatively, is it possible that there are background operations
> on the compute nodes (e.g., cluster node monitors, file system checks,
> network loads, etc.) that are interfering with your benchmarking runs?
> (Can you use top on any of the processing nodes while the simulation
> is running?)
>
>
> - Nathan
>
>
> On 6/5/07, sanjib gupta <guptasanjib at lanl.gov> wrote:
>> Hi,
>>
>> I am attaching 2 log files: the initial run on 128 processors, and the
>> run obtained by immediately killing the job and restarting from the
>> first checkpoint file "hc-rt-hdf5_chk_0000".
>> Notice about 4 timesteps per second initially, then ~30 timesteps/sec
>> after the restart.
>>
>> On 64 processors I noticed the gain was higher, but my resolution was
>> lower (half the nblocky, same nblockx; this is a 2D run).
>> Sorry, I did not keep the logfiles.
>>
>> However, this "gain" cannot be predicted. Sometimes I don't get it on
>> the first restart, so I restart a couple of times!
>> As you all can guess, this plays havoc with any benchmarking efforts,
>> and we do intend to showcase our results from FLASH soon ...
>> :-)
>>
>> We compile with Intel Fortran 9.1.033 and Open MPI 1.1 on a Linux
>> cluster, with HDF5 version 1.6.5. Makefile.h is attached.
>> Architecture: 64-bit AMD Opteron
>> running FC3 Linux + BProcV4 (cluster OS) with kernel = 2.6.14
>>
>> Thanks much for your help/insight/suggestions,
>> Sanjib.