Optimizer stops writing to disk after a couple of configurations
I'm having problems with the configurations in this repo, in the directory optim-quads.
Steps to reproduce
- Run the batch job provided in the repo (change the path to OPAL to your path)
- The job runs normally, but after 2 min or so, nothing is written to the disk anymore, even if you wait 20 min
- Problems do not arise if --ntasks=8 is set in the Slurm file
- The same job runs fine with OPAL 2.0, so it is not a bad configuration
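To confirm the "nothing is written anymore" symptom without watching the directory by hand, a small check like the following might help. It is only a sketch: the function names are made up for illustration, and it assumes the optimizer writes its results somewhere under a single directory that you pass in.

```python
import os
import time


def newest_mtime(directory):
    """Return the most recent modification time of any file under directory.

    Returns 0.0 if the directory contains no files.
    """
    newest = 0.0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return newest


def is_stalled(directory, max_idle_seconds, now=None):
    """Return True if no file under directory changed within max_idle_seconds.

    An empty directory counts as stalled, since nothing was ever written.
    """
    if now is None:
        now = time.time()
    return now - newest_mtime(directory) > max_idle_seconds
```

For example, is_stalled("optim-quads", 20 * 60) would report whether anything under optim-quads changed in the last 20 minutes, assuming that is where the output ends up.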
- Author Developer
Yes, but also with the latest Master (commit 03bf3d6b).
@adelmann As @bellotti_r mentioned, this issue has nothing to do with my changes.
I applied the changes also to another branch derived from the OPAL-2.0 branch such that @bellotti_r is able to work until this issue is figured out.
Edited by frey_m
- Developer
Just a small observation. When running with master, KEEP is not an attribute of OPTIMIZE and the line needs to be commented out.
Edited by snuverink_j
@snuverink_j Yes, that's right. The KEEP flag I added for @bellotti_r in a branch.
There are "just" 56 commits that changed the optimizer. Here's the list of all commits.
- Author Developer
Thanks for your effort. Which versions of the libraries are you using? My setup is as follows:
gcc/7.3.0 cmake/3.9.6 gsl/2.5 boost/1.68.0 openmpi/3.1.3 hdf5/1.10.4 H5hut/2.0.0rc5
- Author Developer
I don't think that changes much.
I suspect the number of cores is the problem. When I run the configuration with 8 cores, it works. When I change to 128, it doesn't.
How many cores have you been using?
- Author Developer
I don't have sufficient permissions to read your binary.
- Author Developer
Your binary seems to work. I'll try to recompile OPAL with the module you're using.
- Developer
I can't seem to reproduce this either on master with either 128 or 172 cores.
- Author Developer
It is working now. So it's really just the version of cmake that made the difference.
Thanks for all your help! :)
- bellotti_r closed
- Author Developer
I don't know. My first thought was that I might've made a mistake during my first compilation, but you've experienced the same issue, right? Therefore the only difference is the cmake version...
- Author Developer
Yes, I've used the same modules for both.
- Author Developer
Here are the cmake logs. I couldn't find any difference apart from the cmake version. Very strange...
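In case someone wants to repeat the comparison, here is a rough sketch of how two logs can be diffed with Python's standard difflib module. The function name is hypothetical, and it assumes both logs are plain text that has already been read into strings.

```python
import difflib


def differing_lines(text_a, text_b):
    """Return only the lines that differ between two log texts.

    Lines present only in text_a are prefixed with "- ",
    lines present only in text_b are prefixed with "+ ".
    """
    diff = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

ndiff is used here instead of a unified diff because cmake log lines often start with "-- ", which is easy to confuse with unified-diff headers when filtering by prefix.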
- Author Developer
@frey_m The problem persists in your branch (OPAL 2.1)