
Commit 03d631b9 authored by ext-neveu_n

adding function to read mldb databases

parent 8f21ac20
 # Author: Nicole Neveu
 # Date: May 2018
-from opal.datasets.filetype import FileType
-from opal.statistics import statistics as stat
-from opal.datasets.DatasetBase import DatasetBase
-from opal.analysis import impl_beam
-import numpy as np
-import json
-import pylab as pl
-import pandas as pd
-import matplotlib.pyplot as plt
-from matplotlib.widgets import Slider, Button, RadioButtons
-from collections import OrderedDict
-from optPilot.Annotate import AnnoteFinder
-import pyOPALTools.optPilot.OptPilotJsonReader as jsonreader
-
-def scaleData(vals):
-    """
-    Scale 1D data array from 0 to 1.
-    Used to compare objectives with different units.
-
-    Parameters
-    ----------
-    vals (numpy array) 1D array that holds any opal data
-
-    Returns
-    -------
-    scaled_vals (numpy array) 1D array scaled from 0 to 1
-    """
-    smax = np.max(vals)
-    smin = np.min(vals)
-    scaled_vals = (vals - smin)/smax
-    return (scaled_vals)
+import numpy as np
+import pandas as pd
+from opal.datasets.filetype import FileType
+from db import mldb

 def pareto(x, y, dvars=0):
     """
-    Find Pareto points for 2 objectives given
+    Find Pareto points for 2 objectives, given
     all data recorded by optimization run.
     These points are calculated independent
     of generation, i.e. best points from all
@@ -46,13 +17,28 @@ def pareto(x, y, dvars=0):
     Parameters
     ----------
-    x (numpy array) array of first objective values
-    y (numpy array) array of second objective values
+    x (numpy array) 1D array of first objective values
+    y (numpy array) 1D array of second objective values

     Optionals
     ---------
-    dvars
+    dvars (numpy array) ND array of design variables
+
+    Returns
+    -------
+    pfdict (dictionary) Dictionary that holds pareto front
+                        values and corresponding design values
     """
+    #Check data is correct length
+    lx = len(x)
+    ly = len(y)
+    ld = len(dvars[:,0])
+    if lx==ly==ld:
+        pass
+    else:
+        print('Input data sizes do not match\n')
+        print('Please check input arrays')

     #Making holders for my pareto fronts
     pareto_y = []
     pareto_x = []
@@ -60,6 +46,7 @@ def pareto(x, y, dvars=0):
     w = np.arange(0, 1.001, 0.001)
     sx = scaleData(x)
     sy = scaleData(y)
+
     #Finding best point with respect to all weights (w)
     for i in range(0, len(w)):
         fobj = sy * w[i] + sx * (1 - w[i])
@@ -69,11 +56,84 @@ def pareto(x, y, dvars=0):
     pareto_pts = delete_repeats(pareto_x, pareto_y)
     ind = np.array(pareto_pts.index.tolist())
-    pdvar = dvars[ind, :]
+    #Check dvars is correct length
+    if dvars!=0:
+        pdvar = dvars[ind, :]
     return(pareto_pts.ix[:,0], pareto_pts.ix[:,1], pdvar) #pareto_x, pareto_y, pdvar

-def delete_repeats(x, y): #, z):
-    df = pd.DataFrame({'x':x, 'y':y}) #, 'z':z})
+def get_all_data_db(dbpath):
+    """
+    Get all objectives and design variables
+    from every generation in an optimization
+    database. Databases are made using OPAL
+    output from json files or stat files.
+    Functions to make databases can be found
+    in mldb.py.
+
+    Parameters
+    ----------
+    dbpath (str) path to pickle file containing
+                 database made with mldb.py
+
+    Returns
+    -------
+    data (dict) Dictionary containing all
+                objectives and design values
+                in optimization database.
+    """
+    data = {}
+    dbr = mldb.mldb()
+    dbr.load(dbpath)
+    #dvars = dbr.getXNames()
+    #obj = dbr.getYNames()
+    gens = dbr.getNumberOfSamples()
+    return(data)
+
+def scaleData(vals):
+    """
+    Scale 1D data array from 0 to 1.
+    Used to compare objectives with different units.
+
+    Parameters
+    ----------
+    vals (numpy array) 1D array that holds any opal data
+
+    Returns
+    -------
+    scaled_vals (numpy array) 1D array scaled from 0 to 1
+    """
+    smax = np.max(vals)
+    smin = np.min(vals)
+    scaled_vals = (vals - smin)/smax
+    return (scaled_vals)
+
+def delete_repeats(x, y, z=0):
+    """
+    Delete repeated pareto front values, if any.
+
+    Parameters
+    ----------
+    x (numpy array) 1D array of first objective values
+    y (numpy array) 1D array of second objective values
+
+    Optionals
+    ---------
+    z (numpy array) ND array of design variables
+    """
+    if z==0:
+        df = pd.DataFrame({'x':x, 'y':y})
+    else:
+        df = pd.DataFrame({'x':x, 'y':y, 'z':z})
     return df.drop_duplicates(subset=['x', 'y'], keep='first')
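
The two @@ hunks above elide the step that actually picks each Pareto point: for every weight w in [0, 1] the code minimizes the scalarized objective sy*w + sx*(1-w). Below is a self-contained sketch of that weighted-sum scan; it is not the committed code, and it uses standard min-max scaling, whereas the committed scaleData divides by the maximum alone.

    import numpy as np
    import pandas as pd

    def pareto_sketch(x, y):
        # Min-max scale both objectives so different units are comparable.
        sx = (x - x.min()) / (x.max() - x.min())
        sy = (y - y.min()) / (y.max() - y.min())
        pareto_x, pareto_y = [], []
        # For each weight w in [0, 1], the minimizer of the weighted sum
        # w*sy + (1 - w)*sx is a Pareto point (on the convex hull).
        for w in np.arange(0.0, 1.001, 0.001):
            best = np.argmin(w * sy + (1.0 - w) * sx)
            pareto_x.append(x[best])
            pareto_y.append(y[best])
        pts = pd.DataFrame({'x': pareto_x, 'y': pareto_y})
        return pts.drop_duplicates(subset=['x', 'y'], keep='first')
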
  • frey_m @frey_m ·
    Maintainer

    @ext-neveu_n It might be better to add this mldb as a dataset class here. The dataset classes are all derived from DatasetBase. If you add the FileType and make all the connections right, it should then be loadable by load_dataset. We might need to use the astype argument of load_dataset.
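
    A purely illustrative skeleton of that suggestion follows; DatasetBase's actual constructor and required overrides are not shown in this thread, so everything here beyond the names mldb, DatasetBase, and getData is a guess.

        from opal.datasets.DatasetBase import DatasetBase
        from db import mldb

        class MlDataset(DatasetBase):
            # Hypothetical wrapper exposing an mldb pickle through the
            # common dataset interface; the real base-class contract
            # may differ.
            def __init__(self, filename):
                self._db = mldb.mldb()
                self._db.load(filename)   # load the pickled database

            def getData(self, var, **kwargs):
                # Delegate to the underlying mldb object; the exact mldb
                # accessor to call here is not specified in the thread.
                raise NotImplementedError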

  • adelmann @adelmann ·
    Owner

    @snuverink_j please comment

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    @frey_m I completely agree! I only noticed the datasets after this commit.

    I started a new branch to do exactly what you mentioned (use load_dataset). I had to change a few other files slightly, so I made the branch instead of pushing directly. I made a new MlDataset.py file, and I'm now able to load the data. I will also move this function out of the pareto_fronts.py calc. There is some other cleaning up I want to do before requesting a merge.

    I was going to suggest, if @adelmann agrees, we can remove /db if MlDataset.py maintains the same functionality.

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    I basically copied db/mldb.py to opal/MlDataset.py and will remove/add some things.

  • adelmann @adelmann ·
    Owner

    fine with me

  • snuverink_j @snuverink_j ·
    Developer

    fine with me. It should stay independent of OPAL, though, as mldb is also meant for non-OPAL (archiver) data.

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    What are the other use cases? Experimental datasets?
    Maybe something general like pandas would help for heavy db management.
    I don't know the format, so that might be a bad suggestion.

    As it's written now, two of the three file types in mldb.py are OPAL-specific.
    Another option is moving the sdds and json parts to /opal/MlDataset.py and leaving the ASCII portion where it is in mldb.py.

  • snuverink_j @snuverink_j ·
    Developer

    The other use case is indeed data sets from (archived) accelerator data. Right now there is only the ASCII option, but this could change, or there could be multiple formats. We would like to have the same tool for this data and OPAL data to produce pandas DataFrames on which additional analysis is done (like in the picture).

    [image: genMLDb]

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    Ok, sounds good.
    So are there plans to give mldb.py a pandas makeover?

  • snuverink_j @snuverink_j ·
    Developer

    Right now mldb.py outputs a pickled file, and there is readMlDb.py (in another repo; I invited you) that reads the output of mldb.py and produces a pandas.DataFrame. I stopped working on this for the moment. This could possibly be merged/put into this repo and mldb.py. I don't know exactly what would be best for your case and am open to suggestions.

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    I can see the repo, but I'm not able to see readMLDb.py.
    Can you please update permissions?

    @snuverink_j I think merging is a good idea. readMLDb.py and mldb.py could be combined into a more general read/write class.
    I'm also not sure what the best implementation is.

    For the ml work, I would suggest a rewrite that directly loads files into a pandas dataframe.
    Then save those dataframes to pickle files using pandas when needed.
    This way, when a pickle is loaded later, it's already in the structure of a pandas dataframe.
    Note, I don't use pandas heavily, so I don't know if implementing that is practical.

    In my case, I'm realizing I am not tied to the mldb structure.
    I made ml pickles for another project and thought saving one file was convenient.
    @frey_m I could also update /opal/dataset/OptimizerDataset.py to do what I want,
    which (in the narrowest description) is to read and return all data from every optimization generation. A sketch of the pickle workflow is below.
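
    A minimal sketch of that pandas-pickle workflow, with made-up column names and file name:

        import pandas as pd

        # Made-up optimizer history: one row per individual.
        df = pd.DataFrame({'gen':   [1, 1, 2, 2],
                           'rms_x': [0.20, 0.18, 0.15, 0.16]})
        df.to_pickle('opt_history.pkl')               # hypothetical file name
        restored = pd.read_pickle('opt_history.pkl')  # loads back as a DataFrame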

  • frey_m @frey_m ·
    Maintainer

    @ext-neveu_n This should already be possible. You only need to loop over all generations and call getData.
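
    A sketch of that loop, assuming ds is the OptimizerDataset obtained from the load_dataset snippet further down, that generations are 1-indexed as in that snippet, and that 'energy' stands in for a real quantity name:

        # ds = load_dataset(...)[0], as shown in the snippet below.
        all_gens = {}
        ngens = getattr(ds, 'num_generations')  # property name quoted later in this thread
        for gen in range(1, ngens + 1):
            all_gens[gen] = ds.getData('energy', gen=gen)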

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    Yes, I am trying this now. I should use OptimizerDataset() directly and not load_dataset(), correct? The latter returns a list.

  • frey_m @frey_m ·
    Maintainer

    The function load_dataset instantiates an object of type OptimizerDataset. It does return a list, that's true; however, in the case of the OPTIMIZER file type you can use the first entry to access everything.

    from opal import load_dataset
    from opal.datasets.filetype import FileType

    try:
        dsets = load_dataset('/path/to/JSON/directory/', ftype=FileType.OPTIMIZER)

        ds = dsets[0]

        gen = 1 # generation
        ind = 1 # individual
        ds.getData('', gen=gen, ind=ind)

    except Exception as e:
        print(e)
  • frey_m @frey_m ·
    Maintainer

    You can also access all values of an objective or design variable. You shouldn't use the ind argument in this case.

    ds.getData('objective or design variable', gen=gen)
    Edited by frey_m
  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    Ok, that's working, thanks!
    Is this the recommended syntax for retrieving properties:
    ngens = getattr(ds, "num_generations")

    Is there an example of this in another dataset test I could look at?
    I probably missed it because I was only looking at the optimizer.

    I could add to the python notebook example too.

  • frey_m @frey_m ·
    Maintainer

    The goal of this module is to use the same interface independent of the underlying dataset. Data is retrieved by getData; labels and units via getLabel and getUnit, respectively. I have a whole bunch of examples in the test directory.
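
    A hedged illustration of that interface, continuing from the load_dataset snippet above; 'rms_x' is a made-up quantity name:

        import matplotlib.pyplot as plt

        quantity = 'rms_x'                   # made-up name
        vals  = ds.getData(quantity, gen=1)  # all values of one quantity
        label = ds.getLabel(quantity)        # display name
        unit  = ds.getUnit(quantity)         # physical unit

        plt.plot(vals)
        plt.ylabel('%s [%s]' % (label, unit))
        plt.show()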

  • snuverink_j @snuverink_j ·
    Developer

    @ext-neveu_n permissions changed (hopefully correctly). Agreed on the suggestions. The idea behind the two-stage approach was to do some cleaning in between, but this can also be done in a single class.

  • ext-neveu_n @ext-neveu_n ·
    Author Developer

    @frey_m Ok, sorry for being slow. I see the point of the interface, and I think I am using getData() correctly now.
    Do you have an example of getLabel()? I may have missed it in the test dir files.
    I will try to clean up what I am doing and add an example to the optimization test for your approval.

    @snuverink_j Yes I can see the repo now.
    There is a lot there, so maybe keeping gen/read separate is ok too.
    Good luck incorporating pandas.

  • frey_m @frey_m ·
    Maintainer

    @ext-neveu_n getLabel() just returns the name of the quantity, though probably in a nicer format. I think StatDataset.py is a nice example since it also uses label mappers. You can use it the same way as getData(). An example of this is plot_profile1D.

    Sure you can add to the optimization test. It's always good to have examples. :smiley:

  • snuverink_j @snuverink_j ·
    Developer

    @ext-neveu_n : Thanks, I will keep an eye on what is being done here :smile:

  • ext-neveu_n @ext-neveu_n

    mentioned in issue #30 (closed)