Log a hash upon vp-setupfit of the fk tables#2476
Conversation
|
I have one question and one issue: It is somewhat non-trivial to get the metadata from pineappl. I tried: Hashing the full fk table does work, but then I have no idea to access this |
| # To change the configuration of the NNPDF code framework | ||
|
|
||
| nnpdf_share: ~/.local/share/NNPDF | ||
| nnpdf_share: /data/theorie/jkoorn/hep/share/NNPDF |
There was a problem hiding this comment.
You can have your own nnprofile.yaml in ~/.config/NNPDF/nnprofile.yaml with only some of the keys and those will overwrite the NNPDF standard profile.
| @property | ||
| def theories_path(self): | ||
| return self._theories_path | ||
|
|
There was a problem hiding this comment.
When you load a theory you have access to the path, you don't need that here.
|
|
||
| SETUPFIT_FIXED_CONFIG = dict( | ||
| actions_=['datacuts check_t0pdfset', 'theory evolven3fit_checks_action'] | ||
| ) |
There was a problem hiding this comment.
The elegant way of doing this is creating a validphys action that takes the theory and the data and then you add them here.
I think it makes sense to make it part of the checks of n3fit (n3fit/checks.py) since you will be checking that you can load all the theories. For instance look at check_consistent_basis. You can have something like
def fktable_hasher(theoryid, dataset_inputs):The variable theoryid here will already be parsed by validphys and will contain the path to the theory if you need to (theoryid.path).
But in this case, you don't even need that, because the data already contains a reference to the fktables if you let validphys do its thing:
def fktable_hasher(data):
for dataset in data.datasets:
fkspecs = dataset.fkspecs
for fk in fkspecs:
print(fk.fkpath)You can test this with the validphys API:
from validphys.api import API
dataset_inputs = [{"dataset": "ATLAS_Z0_8TEV_HIMASS_M-Y"}] # you can put as many as you want, I'm just testing with one
# The API basically receives the same thing that the runcard will give and it is very useful for testing
data = API.data(dataset_inputs=dataset_inputs, theoryid = 40_000_000, use_cuts="internal")
for dataset in data.datasets:
fkspecs = dataset.fkspecs
for fk in fkspecs:
print(fk.fkpath)Note that fk.fkpath is still a tuple of paths since many fktables are separated into different bins.
I think what we want to have at the end of the day is a log file like
theory_40000000/fastkernel/ATLASDY2D8TEV-aMCfast_obs_0.pineappl.lz4 - hash from the metadata
There was a problem hiding this comment.
So this means we write the hash in this n3fit/checks.py? That feels like a weird place to write it from. I don't fully understand why we would want to make it a check as opposed to something we do in vp-setupfit. I understand the argument for 'outsourcing' the work of loading the dataset names and fk table to validphys, but why in checks?
There was a problem hiding this comment.
You can also create a new module, for instance under io, that contains the function that does the hash.
It's not outsourcing, vp-setupfit calls validphys actions that do a few dufferent things, among which the hash
| # To change the configuration of the NNPDF code framework | ||
|
|
||
| nnpdf_share: ~/.local/share/NNPDF | ||
| nnpdf_share: ~/.local/share/NNPDF/theories |
There was a problem hiding this comment.
| nnpdf_share: ~/.local/share/NNPDF/theories | |
| nnpdf_share: ~/.local/share/NNPDF |
|
|
||
|
|
||
| @n3fit.checks.check_consistent_basis | ||
| @n3fit.checks.fktable_hasher |
There was a problem hiding this comment.
I didn't mean to add it as a check but rather to add it in that file (because in my mind, for organizational purposes it made sense)
but given the confusion perhaps it was a bad idea, and it is better to add a new module to n3fit/src/n3fit/io/ something like hash_creator.py or whatever.
| for dataset in data.datasets: | ||
| fkspecs = dataset.fkspecs | ||
| for fk in fkspecs: | ||
| fkpath = fk.fkpath[0] |
There was a problem hiding this comment.
Why fk.fkpath[0]?
With respect to the comment below, note that you can get, as argument to this function, also output_path which is a path to the runfolder, so your file can be
mdf5fk_path = output_path / "md5fk"
There was a problem hiding this comment.
becuase fk.fkpath is a tuple
if the goal is to hash all fk tables where there are multiple for 1 dataset (e.g. WP/WM, or multiple bins), I will have to change this logic. But this is a shortcut. My logic was that if an fk table is recomputed, all are recomputed for that dataset. Afaik this is how pineko works, or is it not?
There was a problem hiding this comment.
never mind, i am working around it now and hashing each fk table individually.
There was a problem hiding this comment.
It is how pineko works (by default) but I think we want this feature precisely for when weird things happen so better to have all of them!
There was a problem hiding this comment.
exaclty! Fully agreed
|
My most recent push works and provides a nice file with hashes of all fk tables, bin by bin: |
scarlehoff
left a comment
There was a problem hiding this comment.
Just a few more organizational points about actions, checks and so.
|
|
||
| # and save the fk hashes | ||
| # fk_hashes = self.config_class["_fk_hashes"] | ||
| # self.environment.save_fk_md5(fk_hashes) |
There was a problem hiding this comment.
Please dont' forget to remove the comments and the dead code (the old save_fk_md5)
There was a problem hiding this comment.
You can add fktable_hasher as an action here
| log.error(f"No eko found for {theoryid}") | ||
|
|
||
|
|
||
| @make_argcheck |
There was a problem hiding this comment.
You can remove make_argcheck from here and add this function to the n3fit_checks_providers. Then it will be understood as an action by reportengine and will be run after the other vp-setupfit actions.
There was a problem hiding this comment.
If I do this, it tells me that fktable_hasher is missing output_path as a positional argument. I have removed the make_argcheck and added it to the n3fit_checks_providers. So concretely, what do you mean by "add this function to the n3fit_checks_providers"?
There was a problem hiding this comment.
Sure, don't add it as a check anywhere. The function fktable_hasher is an action by itself. You can put this function in n3fit_checks_providers.
Then, since you added it as an action to vp_setupfit (btw, you need to tell it about the theory, just like with evolven3fit_checks_action next to it) reportengine will call fktable_hasher and will find out which arguments does it need.
Basically, remove all @n3fit.checks.fktable_hasher
There was a problem hiding this comment.
I am now telling it about the theory in vp_setupfit, and I put the function in n3fit_checks_provider, and it doesn't error anymore like it did earlier, but nothing happens, i.e. it doesn't run the function for some reason. I genuinely have no idea what I am then missing.
There was a problem hiding this comment.
The function that you left in n3fit_checks_provider is not the function that was in checks.py. It is a function that does nothing.
There was a problem hiding this comment.
ahh, so the idea was to put the whole thing in n3fit_checks_provider? I don't fully understand this.
Anyway, works now.
There was a problem hiding this comment.
The file n3fit_checks_provider is registered as a provider of actions in vp-setupfit. Therefore, any function there can be called as an action. The file n3fit_checks_provider is nothing magic, it just happen to be already registered as a provider. You could also put it in a separate module and register that.
When you try to call a "reportengine action" what reportengine does is it tries to build a graph from your desired action (and its dependences) to the input (your runcard). If it manages to (for instance, you ask for the covmat, the runcard contains datasets, but from that reportengine knows how to do the covmat) then all ok and continues running. Otherwise it fails and complains.
scarlehoff
left a comment
There was a problem hiding this comment.
I don't think this version works (the failing tests give some hints why).
Btw, make sure to run pre-commit on the files you changed:
git add <file>
pre-commit
(also, please rebase on top of master, being a rebase much preferred with respect to a merge)
| theory and dataset are numerically identical. | ||
| """ | ||
| MD5FK_FILENAME = "md5fk" | ||
| md5fk_path = output_path / MD5FK_FILENAME |
There was a problem hiding this comment.
I guess it makes sense to define this constant as a module level variable instead of here.
| for fkpath, table_name in zip(fk.fkpath, table_names): | ||
| for fkpath, table_name in zip(fk.fkpath, table_names): | ||
| fkhash = hashlib.md5(fkpath.read_bytes()).hexdigest() | ||
| f.write(f"{table_name} {fkhash}\n") |
There was a problem hiding this comment.
I don't think these loops are correct.
First there's the double loop at the end that looks repeated to me.but most importantly, the loop above,
table_names = [name for group in fk.metadata.FK_tables for name in group]
is using the metadata for the names but I think this is "rewinding" part of the work you are already doing by doing fk?
I think you can simplify this a lost by doing something like:
for fk in fkspecs:
# Make a list of the FK tables
for fkpath in fk.fkpath:
fkhash = hashlib.md5(fkpath.read_bytes()).hexdigest()
f.write(f"{fkpath.name} {fkhash}\n")
True, the name will include ".pineappl.lz4" but that's fine. You can remove that if you want. This version should be correct (but please check) for both compound fktables and also things with operations in the middle (I don't think your loop would deal with them correctly?)
The only thing I'm not sure is old fktables because those were slightly different, to be safe you can do:
for fk in fkspecs:
# Make a list of the FK tables
fkpaths = fk.fkpath
if not isinstance(fkpaths, tuple):
fkpaths = (fk.fkpath,)
for fkpath in fk.fkpath:
fkhash = hashlib.md5(fkpath.read_bytes()).hexdigest()
f.write(f"{fkpath.name} {fkhash}\n")
I think with this it will work also for the tests that are now failing (the old fktables), but please check.
|
I rebased, used pre-commit, and implemented these changes. Then I checked that for both 4.1 and 4.0 theories, all fk tables are hashed. Let's see what the tests do. |
|
Thanks! I think you rebased perhaps on an old copy of master? Somehow you took many changes from other branches that should already be merged. With you can select which commits you actually want (other wise you can always cherry-pick) |
|
Yeah I am a github noob, sorry. I'll try to tidy it up |
a789325 to
0a515c9
Compare
|
fixed now |
|
Why is it failing the test? |
|
In both cases due to a certain randomness on where the fits are running. There are PRs and issues deañing with the issue already but I guess it might still take somr weeks. |
No description provided.