OPAL / src / Issues / #310

Created Jun 13, 2019 by bellotti_r (@bellotti_r), Developer

OPAL job failed due to attempt to free memory that is still in use by MPI

I tried to take 10k samples and obtained the following error message:

Ippl{0}> CommMPI: Initialization complete.
Ippl{0}> CommMPI: Parent process waiting for children ...
Ippl{0}> CommMPI: Child 1 ready.
Ippl{0}> CommMPI: Child 2 ready.
Ippl{0}> CommMPI: Child 3 ready.
Ippl{0}> CommMPI: Child 4 ready.
Ippl{0}> CommMPI: Child 5 ready.
Ippl{0}> CommMPI: Child 6 ready.
Ippl{0}> CommMPI: Child 7 ready.
Ippl{0}> CommMPI: Initialization complete.
[merlin-c-002:10963] Attempt to free memory that is still in use by an ongoing MPI communication (buffer 0x7f6a000, size 6098944).  MPI job will now abort.
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63717,1],42]
  Exit code:    1
--------------------------------------------------------------------------

I think this might be a bug. The following configuration file was used:

OPTION, INFO=TRUE;

REAL Nsamples = 10000;

// Design variables
p1:  DVAR, VARIABLE="p1", LOWERBOUND=-30., UPPERBOUND=0;

// Sampling methods
sp1: SAMPLING, VARIABLE="p1", RANDOM=true, TYPE="UNIFORM", SEED=122, N = Nsamples;

tdc1: DVAR, VARIABLE="PHASE", LOWERBOUND=0.4856, UPPERBOUND=0.7284;

 // ---- OPTIMIZER SECTION -------
dv0: DVAR, VARIABLE="IBF", LOWERBOUND=400, UPPERBOUND=500;
dv1: DVAR, VARIABLE="IM", LOWERBOUND=250, UPPERBOUND=440;
dv2: DVAR, VARIABLE="GPHASE", LOWERBOUND=-30.0, UPPERBOUND=0.0;
dv3: DVAR, VARIABLE="FWHM", LOWERBOUND=1.5e-12, UPPERBOUND=10.0e-12;

 
//Quad values
dv4: DVAR, VARIABLE="KQ1", LOWERBOUND=-8.0, UPPERBOUND=8.0; 
dv5: DVAR, VARIABLE="KQ2", LOWERBOUND=-8.0, UPPERBOUND=8.0;
dv6: DVAR, VARIABLE="KQ3", LOWERBOUND=-8.0, UPPERBOUND=8.0;
dv7: DVAR, VARIABLE="KQ4", LOWERBOUND=-8.0, UPPERBOUND=8.0;

stdc1: SAMPLING, VARIABLE="PHASE", TYPE="UNIFORM", SEED=329, N = Nsamples;
sdv0: SAMPLING, VARIABLE="IBF", RANDOM=true, TYPE="UNIFORM", SEED=5979, N = Nsamples;
sdv1: SAMPLING, VARIABLE="IM", RANDOM=true, TYPE="UNIFORM", SEED=2840, N = Nsamples;
sdv2: SAMPLING, VARIABLE="GPHASE", RANDOM=true, TYPE="UNIFORM", SEED=68921, N = Nsamples;
sdv3: SAMPLING, VARIABLE="FWHM", RANDOM=true, TYPE="UNIFORM", SEED=580972, N = Nsamples;
sdv4: SAMPLING, VARIABLE="KQ1", RANDOM=true, TYPE="UNIFORM", SEED=1169, N = Nsamples;
sdv5: SAMPLING, VARIABLE="KQ2", RANDOM=true, TYPE="UNIFORM", SEED=435831, N = Nsamples;
sdv6: SAMPLING, VARIABLE="KQ3", RANDOM=true, TYPE="UNIFORM", SEED=183246, N = Nsamples;
sdv7: SAMPLING, VARIABLE="KQ4", RANDOM=true, TYPE="UNIFORM", SEED=12548, N = Nsamples;



SAMPLE,
    RASTER          = false,
    DVARS           = {p1, tdc1, dv0, dv1, dv2, dv3, dv4, dv5, dv6, dv7},
    SAMPLINGS       = {sp1, stdc1, sdv0, sdv1, sdv2, sdv3, sdv4, sdv5, sdv6, sdv7},
    INPUT           = "awa.tmpl",
    OUTPUT          = "awa",
    OUTDIR          = "output_5k",
    TEMPLATEDIR     = "tmpl",
    FIELDMAPDIR     = "fieldmaps",
    NUM_MASTERS     = 1,
    NUM_COWORKERS   = 1;
QUIT;

If anyone is interested and needs more information, just send me an email.
