@ext-neveu_n reported crashes with the optimizer. It turned out that the crashes were coming from the HDF5 output (with ENABLEHDF5=FALSE there were no more crashes).

    void H5PartWrapper::open(h5_int32_t flags) {
        close();

        h5_prop_t props = H5CreateFileProp();
        MPI_Comm comm = Ippl::getComm();
        h5_err_t h5err = H5SetPropFileMPIOCollective(props, &comm);
    #if defined (NDEBUG)
        (void)h5err;
    #endif
        assert(h5err != H5_ERR);
        file_m = H5OpenFile(fileName_m.c_str(), flags, props);
        assert(file_m != (h5_file_t)H5_ERR);
        H5CloseProp(props);
    }

So opening the file failed for some reason (perhaps the optimiser had deleted the directory?).
I had run a memory checker (valgrind) on a small job, but initially didn't notice anything in particular. After the new error file I had another look (signal 7 clearly indicates a memory problem).
In my small job there was an invalid read in one place only:

    ==24893== Invalid read of size 2
    ==24893==    at 0x4A0B6F6: memcpy (vg_replace_strmem.c:1023)
    ==24893==    by 0xC96E0E: copy (char_traits.h:350)
    ==24893==    by 0xC96E0E: _S_copy (basic_string.h:340)
    ==24893==    by 0xC96E0E: _S_copy_chars (basic_string.h:387)
    ==24893==    by 0xC96E0E: _M_construct<char const*> (basic_string.tcc:225)
    ==24893==    by 0xC96E0E: _M_construct_aux<char const*> (basic_string.h:236)
    ==24893==    by 0xC96E0E: _M_construct<char const*> (basic_string.h:255)
    ==24893==    by 0xC96E0E: basic_string (basic_string.h:511)
    ==24893==    by 0xC96E0E: deserialize(char*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > >&) (MPIHelper.cpp:26)
    ==24893==    by 0xC97A31: MPI_Recv_params(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > >&, unsigned long, ompi_communicator_t*) (MPIHelper.cpp:119)
    ==24893==    by 0x6E6AA9: Worker<OpalSimulation>::onMessage(ompi_status_public_t, unsigned long) (Worker.h:187)
    ==24893==    by 0x6EF0F9: run (Poller.h:94)
    ==24893==    by 0x6EF0F9: Worker (Worker.h:69)
    ==24893==    by 0x6EF0F9: Pilot<OpalInputFileParser, FixedPisaNsga2<BlendCrossover, IndependentBitMutation>, OpalSimulation, SocialNetworkGraph<NoCommTopology>, CommSplitter<ManyMasterSplit<NoCommTopology> > >::startWorker() (Pilot.h:273)
    ==24893==  Address 0x6d372ec is 220 bytes inside a block of size 221 alloc'd
    ==24893==    at 0x4A07879: operator new[](unsigned long) (vg_replace_malloc.c:423)
    ==24893==    by 0xC979FF: MPI_Recv_params(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, double, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, double> > >&, unsigned long, ompi_communicator_t*) (MPIHelper.cpp:114)
    ==24893==    by 0x6E6AA9: Worker<OpalSimulation>::onMessage(ompi_status_public_t, unsigned long) (Worker.h:187)
    ==24893==    by 0x6EF0F9: run (Poller.h:94)
    ==24893==    by 0x6EF0F9: Worker (Worker.h:69)
    ==24893==    by 0x6EF0F9: Pilot<OpalInputFileParser, FixedPisaNsga2<BlendCrossover, IndependentBitMutation>, OpalSimulation, SocialNetworkGraph<NoCommTopology>, CommSplitter<ManyMasterSplit<NoCommTopology> > >::startWorker() (Pilot.h:273)
    ==24893==    by 0x6D769E: setup (Pilot.h:203)
    ==24893==    by 0x6D769E: Pilot (Pilot.h:124)
    ==24893==    by 0x6D769E: OptimizeCmd::execute() (OptimizeCmd.cpp:361)

This is the creation of the std::istringstream in MPIHelper.cpp:
This makes sure that buffer is null-terminated. Note that the character array returned by str.c_str() is one element longer than str.length(), because it includes the terminating null.
Since the received string was not null-terminated, the initialisation std::istringstream is(buffer); has undefined behaviour, which caused the invalid read. This probably went unnoticed because the very next byte is very likely to be null.
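An alternative that does not depend on the terminator at all is to pass the received byte count to the std::string constructor. The following is only a minimal sketch, not the code in MPIHelper.cpp: the helper name `parse_params` and the whitespace-separated "name value" format are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: parse "name value" pairs from a receive buffer that is
// NOT null-terminated. The text format is an assumption of this sketch.
std::map<std::string, double> parse_params(const char* buffer, std::size_t len) {
    assert(buffer != nullptr);
    // The explicit length means no terminating '\0' is needed;
    // std::string(buffer) alone would scan until it finds a null byte.
    const std::string text(buffer, len);
    std::istringstream is(text);

    std::map<std::string, double> params;
    std::string name;
    double value = 0.0;
    while (is >> name >> value) {
        params[name] = value;
    }
    return params;
}
```

With this form, the number of bytes read is bounded by `len`, so the read stays inside the allocated block whatever follows it in memory.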
A possible safeguard (which in this case wouldn't have helped) would be to initialise buffer on the receiving end:
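Such a safeguard might look like the sketch below. It is an illustration under assumptions, not the change that was pushed: the names `make_receive_buffer` and `fake_receive` are hypothetical, and `fake_receive` merely stands in for the MPI receive call. Zeroing only helps if the buffer is larger than the incoming message, so the sketch reserves one extra byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical receive-side helper: allocate one byte more than the expected
// message and initialise every byte to '\0'.
std::vector<char> make_receive_buffer(std::size_t message_len) {
    return std::vector<char>(message_len + 1, '\0');
}

// Stand-in for the actual receive: copies `len` bytes into the buffer and
// leaves the remaining (zeroed) bytes untouched.
void fake_receive(std::vector<char>& buffer, const char* data, std::size_t len) {
    assert(len < buffer.size());
    std::memcpy(buffer.data(), data, len);
}
```

Because the final byte is never overwritten, `buffer.data()` can be treated as a C string even when the sender omits the terminator.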
I have pushed the fix to master. This actually occurred in several places (as shown by a new memory check run). There should now be no more invalid reads. There are still a few "conditional jumps on uninitialised values" in the optimiser; however, these will not cause crashes. I will look at those next week.
I had a look at the conditional jumps. These all come from the hypervolume calculation, in particular from the reference point that is used. It seems that in the original code there was the possibility to give a reference point. However, this is all commented out:

    if (argc == 2) {
        printf("No reference point provided: using the origin\n");
        for (int i = 0; i < maxn; i++)
            ref.objectives[i] = 0;
    } else if (argc - 2 != maxn) {
        printf("Your reference point should have %d values\n", maxn);
        return 0;
    } else
        for (int i = 2; i < argc; i++)
            ref.objectives[i - 2] = atof(argv[i]);

As a result, the reference point is never initialised. I will simply set it to 0, so that the origin is used (0 is the most likely uninitialised value anyway).
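The intended behaviour can be sketched as follows. This is a simplified illustration, not the hypervolume code itself: the `Point` type and the function `make_reference_point` are hypothetical stand-ins, and the optional user-supplied values mirror the commented-out branch only loosely.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the point type used in the hypervolume code.
struct Point {
    std::vector<double> objectives;
};

// Return a fully initialised reference point with maxn objectives.
// With no user values the origin is used; otherwise exactly maxn values
// must be supplied.
Point make_reference_point(std::size_t maxn,
                           const std::vector<double>& user_values = {}) {
    Point ref;
    if (user_values.empty()) {
        ref.objectives.assign(maxn, 0.0);  // default: the origin
    } else if (user_values.size() != maxn) {
        throw std::invalid_argument("reference point has the wrong number of values");
    } else {
        ref.objectives = user_values;
    }
    return ref;
}
```

Every path either throws or returns a point whose objectives are all set, so no uninitialised value can reach the comparison that valgrind flagged.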