Adding a function to read mldb databases
-
@ext-neveu_n It might be better to add this mldb as a dataset class in here. The dataset classes are all derived from DatasetBase. If you add the FileType and do all the connections right, it should then be loadable by load_dataset. We might need to use the argument astype of load_dataset.
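Roughly what I have in mind (a minimal sketch only; the class name, the import path and the DatasetBase constructor arguments are assumptions, not the actual code in this repository):

from opal.datasets.DatasetBase import DatasetBase

class MlDataset(DatasetBase):
    # hypothetical dataset class wrapping the mldb files

    def __init__(self, directory, fname):
        # hand the file over to the common dataset machinery;
        # the exact base-class constructor signature is an assumption
        super(MlDataset, self).__init__(directory, fname)

    def getData(self, var, **kwargs):
        # read the requested quantity from the mldb file and return it
        raise NotImplementedError('sketch only')
-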
@snuverink_j please comment
-
@frey_m I completely agree! I noticed the datasets after this commit.
I started a new branch to do exactly what you mentioned (use load_dataset). I had to change a few other files slightly, so I made the branch instead of pushing directly. I made a new MlDataset.py file, and I'm now able to load the data. I will also move this function out of the pareto_fronts.py calculation. There is some other cleaning up I want to do before requesting a merge.
I was going to suggest, if @adelmann agrees, we can remove /db if MlDataset.py maintains the same functionality.
-
What are the other use cases? Experimental datasets?
Maybe something general like pandas would help for heavy db management.
I don't know the format, so that might be a bad suggestion. As it's written now, 2/3 of the file types in mldb.py are OPAL specific.
Another option is moving the sdds and json parts to /opal/MlDataset.py and leaving the ASCII portion where it is in mldb.py.
-
The other use case is indeed datasets from (archived) accelerator data. Right now there is only the ASCII option, but this could change, or there could be multiple formats. We would like to have the same tool for this data and for OPAL data to produce pandas DataFrames on which additional analysis is done (like in the picture).
-
Right now mldb.py outputs a pickled file, and there is readMlDb.py (in the other repo, I invited you) that reads the output of mldb.py and outputs a pandas.DataFrame. I stopped working on this for the moment. This could possibly be merged/put into this repo and mldb.py. I don't know exactly what would be best for your case and am open to suggestions.
-
I can see the repo, but I'm not able to see readMLDb.py. Can you please update permissions?
@snuverink_j I think merging is a good idea. readMLDb.py and mldb.py could be combined into a more general read/write class. I'm also not sure what the best implementation is.
For the ml work, I would suggest a rewrite that directly loads files into a pandas DataFrame. Then save those DataFrames to pickle files using pandas when needed. This way, when a pickle is loaded later, it's already in the structure of a pandas DataFrame. Note, I don't use pandas heavily, so I don't know if implementing that is practical.
In my case, I'm realizing I am not tied to the mldb structure. I made ml pickles for another project and thought saving one file was convenient.
@frey_m I could also update /opal/dataset/OptimizerDataset.py to do what I want, which (in the narrowest description) is to read/return all data from every optimization generation.
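Something like this is what I have in mind (a minimal sketch with made-up column names, plain pandas only, nothing mldb specific):

import pandas as pd

# build the DataFrame directly while parsing the files
# (the columns here are placeholders, not real mldb quantities)
df = pd.DataFrame({'gen': [1, 1, 2], 'energy': [0.5, 0.6, 0.7]})

# persist it with pandas itself ...
df.to_pickle('mldb_data.pkl')

# ... so that loading the pickle later already gives back a DataFrame
df2 = pd.read_pickle('mldb_data.pkl')
print(df2.head())
-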
@ext-neveu_n This should already be possible. You only need to loop over all generations and call getData.
-
The function load_dataset instantiates an object of type OptimizerDataset. The function returns a list, that's true. However, in case of the OPTIMIZER you can use the first entry to access all.

from opal import load_dataset
from opal.datasets.filetype import FileType

try:
    dsets = load_dataset('/path/to/JSON/directory/', ftype=FileType.OPTIMIZER)
    ds = dsets[0]
    gen = 1  # generation
    ind = 1  # individual
    ds.getData('', gen=gen, ind=ind)
except Exception as e:
    print(e)
-
You can also access all values of an objective or design variable. You shouldn't use the ind argument in this case.

ds.getData('objective or design variable', gen=gen)
-
Ok, that's working, thanks!
Is this the recommended syntax for retrieving properties:
ngens = getattr(ds, "num_generations")
Is there an example of this in another dataset test I could look at?
I probably missed it because I was only looking at the optimizer. I could add to the Python notebook example too.
-
The goal of this module is to use the same interface independent of the underlying dataset. Data is retrieved by getData. Labels and units via getLabel and getUnit, respectively. I have a whole bunch of examples in the test directory.
-
@ext-neveu_n permissions changed (hopefully correctly). Agreed on the suggestions. The idea for the two-stage approach was to do some cleaning in between, but this can also be done in a single class.
-
@frey_m Ok, sorry for being slow. I see the point of the interface, and I think I am using getData() correctly now.
Do you have an example of getLabel()? I may have missed it in the test dir files.
I will try to clean up what I am doing and add an example to the optimization test for your approval.
@snuverink_j Yes, I can see the repo now. There is a lot there, so maybe keeping gen/read separate is ok too.
Good luck incorporating pandas.
-
@ext-neveu_n getLabel() just returns the name of the quantity -- however probably in a nicer format. I think StatDataset.py is a nice example since it also uses label mappers. You can then use it in the same way as getData(). An example for this could be plot_profile1D.
Sure you can add to the optimization test. It's always good to have examples.
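In short the pattern is (a sketch only; 'rms_x' is a placeholder quantity and the calls assume a stat-type dataset, i.e. no gen argument needed):

# same accessor pattern as getData, here used to annotate a plot
data = ds.getData('rms_x')
label = ds.getLabel('rms_x')   # nicely formatted name of the quantity
unit = ds.getUnit('rms_x')     # corresponding unit
print(label + ' [' + unit + ']', len(data), 'values')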
-
@ext-neveu_n: Thanks, I will keep an eye on what is being done here
-
mentioned in issue #30 (closed)