The stat files should be stored in a single file such that all data from all runs can be evaluated in post-processing. The data should be supplemented with the values of the design variables. File formats that come into consideration are HDF5, SDDS, a NoSQL database, zip/tar with gzip, etc.
I guess that your estimates are the sizes of the raw, uncompressed data? I just checked with a stat file of 100 MB raw: compressed it is 20 MB (gzip) and 16 MB (bzip2), respectively. I guess this is still too big.
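As a side note, HDF5 can apply the same zlib compression internally on chunked datasets, so moving to a single file would not cost us the gzip-level sizes. A minimal sketch, assuming the plain HDF5 C API (the dataset name and chunk size are placeholders):

```cpp
#include <hdf5.h>
#include <vector>

// Store one stat column as a chunked, deflate-compressed 1-D dataset.
// "rms_x" and the chunk size are placeholders; level 6 is roughly what
// gzip uses by default. Assumes a non-empty column.
void writeCompressedColumn(hid_t file, const std::vector<double>& column) {
    hsize_t dims[1]  = { column.size() };
    hsize_t chunk[1] = { dims[0] < 4096 ? dims[0] : hsize_t(4096) };

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);   // chunking is required for filters
    H5Pset_deflate(dcpl, 6);        // zlib compression inside the file

    hid_t dset = H5Dcreate2(file, "rms_x", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
             column.data());

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
}
```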
@gsell: can each core write independently of all others? And can they write to separate data structures?
@adelmann: I've already implemented the functionality to extract columns from the stat file of each run. Now it remains to write the data to a file.
I do not like the idea of introducing another file format or database or whatever. If HDF5 can be used, then we should use it. @kraus: yes, each core can write independently. @adelmann: performance should be sufficient with the right chunking and alignment.
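To make the chunking/alignment point concrete, a minimal sketch of the relevant property lists, assuming an MPI-enabled HDF5 build (the alignment values are placeholders to be tuned; dataset creation stays collective, only the writes themselves are independent):

```cpp
#include <hdf5.h>
#include <mpi.h>

// Open one shared result file for parallel access with tuned alignment.
hid_t openAlignedFile(const char* name, MPI_Comm comm) {
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    // Align objects >= 64 kB on 1 MB boundaries (values to be tuned).
    H5Pset_alignment(fapl, 64 * 1024, 1024 * 1024);
    hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

// Transfer property list for independent (per-core) H5Dwrite calls.
hid_t independentTransfer() {
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);  // no collective sync on write
    return dxpl;
}
```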
Not sure whether H5hut fulfills the requirements. It might require some changes. But maybe this is a good opportunity to fix some design flaws in the H5hut format and design/develop an H5hut V2 format.
Basically I agree with @gsell; however, if there is a real need to add a new package, then we have to do so. I think we can use HDF5 (H5hut with maybe small additional features).
Before fixing H5hut we should look into the new standard being developed by LBL and DESY. At ICAP, it seemed that this project is alive. @gsell, you are monitoring this.
I propose that we first find out whether we can use what we already have.
We need to store the following information: m samples, where each sample has k vectors of length n.
When the sampler starts, q samples are written, each sample coming from a p-way parallel OPAL simulation.
@adelmann we need to store more information: the design variables and their values with each sample. Furthermore, the lengths of the arrays don't have to be exactly the same in all runs (e.g. due to early termination, differences in path length (chicane), or whatever other reason).
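To make this concrete, one possible layout would be a group per run, the design variables stored as attributes of that group, and one 1-D dataset per stat column, so the lengths can differ between runs. A rough sketch, assuming the plain HDF5 C API (group and dataset names are placeholders, not a decided format):

```cpp
#include <hdf5.h>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Write one run: design variables as scalar attributes on the run group,
// each stat column as its own 1-D dataset of arbitrary length.
void writeRun(hid_t file, int runId,
              const std::map<std::string, double>& designVariables,
              const std::map<std::string, std::vector<double>>& columns) {
    char groupName[32];
    std::snprintf(groupName, sizeof(groupName), "/Run_%05d", runId);
    hid_t group = H5Gcreate2(file, groupName, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // design variables and their values attached to the run group
    hid_t scalar = H5Screate(H5S_SCALAR);
    for (const auto& dvar : designVariables) {
        hid_t attr = H5Acreate2(group, dvar.first.c_str(), H5T_NATIVE_DOUBLE,
                                scalar, H5P_DEFAULT, H5P_DEFAULT);
        H5Awrite(attr, H5T_NATIVE_DOUBLE, &dvar.second);
        H5Aclose(attr);
    }
    H5Sclose(scalar);

    // one dataset per stat column; length may differ from run to run
    for (const auto& col : columns) {
        hsize_t dims[1] = { col.second.size() };
        hid_t space = H5Screate_simple(1, dims, NULL);
        hid_t dset  = H5Dcreate2(group, col.first.c_str(), H5T_NATIVE_DOUBLE,
                                 space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
                 col.second.data());
        H5Dclose(dset);
        H5Sclose(space);
    }
    H5Gclose(group);
}
```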
i is the sum of the number of design variables and the number of objectives, and j is the number of columns from the stat file to be stored. I have updated the page accordingly.
I guess (@adelmann knows the realistic scenarios better than I do) at most 15 design variables and as many objectives, in general far fewer.
Interesting columns from the stat file will be s, time, rms_x, rms_y, rms_s, emit_x, emit_y, emit_z, rms_px, rms_py, rms_pz, energy, ...? The length of the arrays depends on the path length, the size of the time step, and the option STATDUMPFREQ.
The maximum number of objectives can be much bigger than this: an expression can be a combination of any two or more columns and can be evaluated at any position. So the theoretical maximum number of objectives is infinite.
Yes, I agree, but for the foreseeable future and the real projects at hand my estimate is somewhat bounded by reality.
For the AWA optimisation, we currently run the OPTIMIZER for one day and the post-processing for half a day, i.e. using Linux tools to concatenate the results. So I would love to have a prototype, and why not in H5hut; from this we can learn about performance etc.
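For the post-processing side, reading everything back from a single file would then replace the concatenation step. A rough sketch, assuming the per-run layout proposed above (file, group and column names are placeholders):

```cpp
#include <hdf5.h>
#include <string>
#include <vector>

// Pull one stat column of one run out of the combined result file.
std::vector<double> readColumn(const char* fileName, const std::string& run,
                               const std::string& column) {
    hid_t file = H5Fopen(fileName, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, (run + "/" + column).c_str(), H5P_DEFAULT);

    hid_t space = H5Dget_space(dset);
    hssize_t n = H5Sget_simple_extent_npoints(space);
    std::vector<double> data(static_cast<size_t>(n));
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return data;
}
// e.g. readColumn("optimizer-runs.h5", "/Run_00001", "emit_x")
```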
First of all, it was you, @adelmann, who requested this feature. And second, I have already implemented a lot towards this. I am still waiting for @gsell to provide me with a proper file format in HDF5.