[FLASH-USERS] MPI failure
Dan Sheeler
sheeler at flash.uchicago.edu
Mon Jun 11 18:41:58 CDT 2007
On 32-bit Linux machines (and I think others), there's an insidious stack
limit of 2 MB if your executable was compiled statically and linked against
the pthreads library (which FLASH usually is, because MPICH usually
links to pthreads). Setting the limit in the shell doesn't affect it
because the statically linked pthreads library somehow overrides the shell
setting. So we've added C code in FLASH3 that sets the rlimit inside
FLASH. Download FLASH3 and look for the file dr_set_rlimits.c.
Basically, the code calls setrlimit to unlimit the stack size like this:
lim.rlim_cur = RLIM_INFINITY;
lim.rlim_max = RLIM_INFINITY;
retval = setrlimit(RLIMIT_STACK, &lim);
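For anyone who wants to try this outside of FLASH, here is a minimal, self-contained sketch of the same idea. It is not the contents of dr_set_rlimits.c; the function name raise_stack_limit is made up for illustration. It also differs from the snippet above in one respect: it raises the soft limit only as far as the existing hard limit, on the assumption that an unprivileged process may not be allowed to raise the hard limit to RLIM_INFINITY.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not FLASH's dr_set_rlimits.c).
 * Raises the soft stack limit to the current hard limit.
 * Returns 0 on success, -1 on failure. */
int raise_stack_limit(void)
{
    struct rlimit lim;

    /* Read the current soft and hard stack limits. */
    if (getrlimit(RLIMIT_STACK, &lim) != 0) {
        perror("getrlimit(RLIMIT_STACK)");
        return -1;
    }

    /* Raising the soft limit up to the hard limit needs no privileges. */
    lim.rlim_cur = lim.rlim_max;

    if (setrlimit(RLIMIT_STACK, &lim) != 0) {
        perror("setrlimit(RLIMIT_STACK)");
        return -1;
    }
    return 0;
}
```

You would call raise_stack_limit() as early as possible in the program, before the work that needs the deep stack, and check the return value. Whether the new limit takes effect for an already-running main thread can depend on the platform, so treat this as a starting point rather than a guaranteed fix.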
This still might not be your problem, though. If it's not, you might
have to check whether there are some mpich2 buffer limits you need
to adjust through environment variables or something, but I'm not too
familiar with that.
Dan
--
Dan Sheeler
ASC Flash Center
sheeler at flash.uchicago.edu
(773) 834-3236
On Mon, 11 Jun 2007, Peter Vitello wrote:
> Thanks for the suggestion, but it doesn't look like my MPI out of memory
> failure is due to the stack size being limited.
> The results from ulimit are as follows, and stack size is unlimited. While
> unlimited, I don't know what memory is actually
> available. Does anyone know where else to check for what would cause FLASH 2.5 to
> generate:
>
> Fatal error in MPI_Irecv: Other MPI error, error stack:
> MPI_Irecv(144): MPI_Irecv(buf=0x124f8cc0, count=12, MPI_DOUBLE_PRECISION,
> src=21, tag=440, MPI_COMM_WORLD, request=0x137dab7c) failed
> MPID_Irecv(74): Out of memory
>
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> file size (blocks, -f) unlimited
> pending signals (-i) 37376
> max locked memory (kbytes, -l) 32
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> stack size (kbytes, -s) unlimited
> cpu time (seconds, -t) unlimited
> max user processes (-u) 37376
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
>
> Thanks for any help.
>
> Peter Vitello
> LLNL
>
> At 02:46 PM 6/11/2007, you wrote:
>
>>> Perhaps you have checked, but the "Out of memory" suggests exactly
>>> that. Was the block count on each processor approaching MAXBLOCKS?
>>> I have seen crashes with non-intuitive mpi errors when the number of
>>> blocks on the processors gets close to the maximum and the code
>>> runs out of memory. You might try the run again on more processors.
>>
>> I have fouled up the usage of more codes than most
>> people have ever heard of: if it was an out-of-memory problem and
>> maxblocks was not being reached, check to make sure your stacksize is
>> not limited. You should be able to run the Unix command "limit" and
>> see something like
>>
>> % limit
>> cputime unlimited
>> filesize unlimited
>> datasize unlimited
>> stacksize unlimited
>> coredumpsize 0 kbytes
>> memoryuse unlimited
>> vmemoryuse unlimited
>> descriptors 1024
>> memorylocked 32 kbytes
>> maxproc 98304
>>
>> If that shows some smaller limit, in your .cshrc file, enter
>> the line
>> unlimit stacksize
>> (or, in a .bashrc file, the bash equivalent "ulimit -s unlimited")
>> so that all new processes (esp the MPI ones) are started without the
>> stacksize limited.
>>
>> This will only cause you trouble if your code is starting a lot of Java
>> virtual machines, or you are directly using pthreads, which are
>> both unusual for most HPC MPI codes.
>>
>> This has been an irritating bug encountered so many times that I've started
>> having our sysadmin apply it for the default start up file
>> for every new student I get. It always causes troubles that seem far
>> removed from the root cause. I strongly suspect that by now LLNL has
>> already done this by default for most users, but if you copied over a
>> .cshrc file from another machine it may have been overwritten or
>> overridden.
>>
>> And if this was the problem, feel free to post this to the rest of the
>> flash-user list. I have no sense of shame anymore, and don't mind
>> everyone knowing about how many times I've made this same mistake!
>