speedups output files Re: [FLASH-USERS] Details of speedup after restart

sanjib gupta guptasanjib at lanl.gov
Tue Jun 5 20:44:50 CDT 2007


Hi Nathan,

The output files (the HDF5 checkpoint and plot files) are the same size as
before the restarts and seem fine. They pick up exactly where the run left
off, in both the thermodynamic conditions and the mass fractions, and the
resulting burning looks perfectly reasonable.
I am not sure what you mean by binary-equivalent.
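
For reference, "binary-equivalent" in the question below is usually read as
"identical byte for byte". A minimal sketch of such a check, assuming two
output files are available locally, is given here; the function names and
the file names in the usage note are hypothetical and are not part of FLASH.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large checkpoint files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def binary_equivalent(path_a, path_b):
    """True if the two files have matching SHA-256 digests (same bytes)."""
    return file_digest(path_a) == file_digest(path_b)
```

Usage would look like binary_equivalent("run_a_chk_0001", "run_b_chk_0001"),
with both names made up for illustration. A byte-level mismatch does not by
itself mean the physics differs, since a file may also carry metadata that
changes from run to run; the h5diff tool shipped with HDF5 compares the
datasets themselves and may be the more useful check in that case.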

Background operations are usually carried out on a different queue on
the cluster. The cluster's top command,
btop, just lets me know which nodes are free and which I am using.
It shows usage at 99% or so, meaning I have all of the
CPU on the dual-processor nodes. This could, however, change
during maintenance hours (around 1-3 am), when I am not around to
monitor usage. Sometimes the total number of timesteps at the end of
the night does not make sense (too few), but this has only happened a
couple of times,
and I do not comb the usually voluminous logfiles unless something
really goes wrong.

However, I don't think background processes are responsible for the slow
first run. It is too consistent, and we get maintenance messages from
the cluster operators, so
I would be aware of them.
As for the CPU usage, when this first happened I remember doing a lot
of checks on the allotted processors, and I was getting near 100% both
before and after the restart.

The restart is from "flash.dat"? How does it work? I was curious
anyway, since it would be nice to do things like change the
resolution at a restart. If only the thermodynamic conditions etc. are
carried over to the
restarted run, and properly rezoned, one would not have to worry? Unless,
of course, dynamic structures have developed based on the earlier zoning,
but for relatively quiescent scenarios that are being restarted this
should approximately work?

Thanks
Sanjib



Nathan Hearn wrote:
> Hi Sanjib,
>
>    This is very curious.  How do the output data files that come from
> restarts compare with those from non-restarts?  Are they
> binary-equivalent?  My concern is that the speed-up is coming from the
> code processing the input data differently each time.
>
>    Alternatively, is it possible that there are background operations
> on the compute nodes (e.g., cluster node monitors, file system checks,
> network loads, etc.) that are interfering with your benchmarking runs?
> (Can you use top on any of the processing nodes while the simulation
> is running?)
>
>
> - Nathan
>
>
> On 6/5/07, sanjib gupta <guptasanjib at lanl.gov> wrote:
>> Hi,
>>
>> I am attaching 2 log files: the initial run on 128 processors, and the
>> run after immediately killing the job and restarting from the first
>> checkpoint file "hc-rt-hdf5_chk_0000".
>> Notice about 4 timesteps per second initially, then ~30 timesteps/sec
>> after restart.
>>
>> On 64 processors I noticed the gain was higher, but my resolution was
>> lower (half the number of nblocky, same nblockx; this is a 2D run).
>> Sorry, I did not keep the logfiles.
>>
>> However, this "gain" cannot be predicted. Sometimes I don't get it on
>> the first restart, so I restart a couple of times!
>> As you all can guess, this plays havoc with any benchmarking efforts,
>> and we do intend to showcase our results from FLASH soon ...
>> :-)
>>
>> We compile with Intel Fortran 9.1.033 and Open MPI 1.1 on a Linux
>> cluster, with HDF5 version 1.6.5. Makefile.h is attached.
>> Architecture: 64-bit AMD Opteron
>> running FC3 Linux + BProcV4 (cluster OS) with kernel 2.6.14
>>
>> Thanks much for your help/insight/suggestions,
>> Sanjib.



