Don't merge deduped entries into existing_entries. #159
Conversation
Merging pollutes the existing_entries for hooks that get run later. Instead, we track them separately but consider union(existing_entries, new_entries) when running the deduplication logic, since we also want to remove duplicates produced by other importers in the same run.
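A minimal sketch of the approach the description outlines, assuming a simplified exact-match comparison (beangulp's real deduplication is fuzzier, and the function names below are illustrative, not beangulp's API):

```python
# Hypothetical sketch: deduplicate each importer's new entries against
# the union of the existing entries and the entries produced so far,
# WITHOUT merging anything into existing_entries itself.

def drop_duplicates(new_entries, seen_entries):
    """Return new_entries with duplicates of seen_entries removed."""
    seen = set(seen_entries)
    return [e for e in new_entries if e not in seen]

def run_importers(existing_entries, importer_outputs):
    extracted = []  # entries produced by earlier importers this run
    results = []
    for new_entries in importer_outputs:
        # Compare against union(existing, extracted), but never mutate
        # existing_entries -- hooks still see the unmodified ledger.
        deduped = drop_duplicates(new_entries, existing_entries + extracted)
        extracted.extend(deduped)
        results.append(deduped)
    return results

existing = ["2024-01-01 txn A", "2024-01-02 txn B"]
outputs = [["2024-01-02 txn B", "2024-01-03 txn C"],   # importer 1
           ["2024-01-03 txn C", "2024-01-04 txn D"]]   # importer 2
print(run_importers(existing, outputs))
# → [['2024-01-03 txn C'], ['2024-01-04 txn D']]
```

Duplicates of existing entries and of earlier importers' output are both removed, yet `existing_entries` is never extended.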
Agreed -- I came to report this issue as well because of an effect I saw. When new entries are merged with existing ones, the combined set is passed to the hooks.
I see the problem. I'm not 100% sure that this should be the desired behavior, though; the prior behavior seems reasonable to me. But there's an order-dependence issue this creates if you wanted it that way: depending on the order in which the importers are processed, you would get different results.
I'm not sure what you mean exactly, but entries are sorted before being deduplicated (see lines 64 to 114 at 35fea18).
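To illustrate why sorting first removes the order dependence, here is a toy sketch; it is not beangulp's actual code, which uses a more elaborate sorting key and a similarity comparison rather than exact equality:

```python
# Sorting into a canonical order before comparing means the result no
# longer depends on which importer's entries arrived first.

def dedup_sorted(entries, key=lambda e: e):
    ordered = sorted(entries, key=key)
    out = []
    for e in ordered:
        # Adjacent entries with the same key are duplicates.
        if not out or key(out[-1]) != key(e):
            out.append(e)
    return out

a = [("2024-01-02", "B"), ("2024-01-01", "A")]
b = [("2024-01-01", "A"), ("2024-01-03", "C")]
# Same result regardless of the order the importers ran in:
assert dedup_sorted(a + b) == dedup_sorted(b + a)
```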
I guess if the same input files are kept but the order of
I would definitely defer to your experience (and thank you soooo much for all your time and effort in creating beancount!) but would like to understand more. It looks like the original idea for feeding the already-evaluated entries into the

There are two areas that this code impacts: importers and hooks. For hooks, it does seem like adding the new

For

The current diff proposes keeping a record of all observed entries during deduplication for each importer's

The original intention of multi-file import duplicate detection still works with this proposed change, and the arguments passed to hooks are correctly named.
I guess you didn't read the large comment in the code that I linked that describes the sorting key used. |
I dunno, it means that hooks are being passed an invalid beancount file as "existing entries", which is at least slightly surprising. I don't think it is unreasonable for hooks to expect the existing entries to be something that would pass bean-check. If hooks are expected to deal with malformed beancount data, then we should probably clarify what variants of well-formed beancount syntax they might be expected to deal with in the "existing entries", so they can write appropriately defensive code.

I'm not exactly averse to any of that -- importing stuff is kind of a weird liminal state instead of fully baked -- but, as someone who wrote a hook, I was caught out by beangulp breaking the (admittedly somewhat implicit) contract that "existing entries" are well-formed from a bean-check point of view. I expected malformed data in the imported entries but not in the existing ones.
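As an illustration of the defensive style such a clarified contract would demand, here is a toy hook; the `Txn` type and the `(filename, entries)` shape of `extracted` are simplified stand-ins for beancount's real data model, not its actual API:

```python
from collections import namedtuple

# Hypothetical transaction type standing in for beancount's
# data.Transaction; real hooks receive beancount.core.data entries.
Txn = namedtuple("Txn", "date payee")

def payee_hook(extracted, existing_entries):
    """Keep only extracted entries whose payee is not already known.

    Written defensively: it does not assume existing_entries would
    pass bean-check, only that entries *may* carry a payee.
    """
    known = {getattr(e, "payee", None) for e in existing_entries}
    known.discard(None)  # ignore entries without a usable payee
    result = []
    for filename, entries in extracted:
        new = [e for e in entries if getattr(e, "payee", None) not in known]
        result.append((filename, new))
    return result

# A deliberately malformed item sits among the existing entries:
existing = [Txn("2024-01-01", "Grocer"), "not-a-transaction"]
extracted = [("bank.csv", [Txn("2024-01-05", "Grocer"),
                           Txn("2024-01-06", "Cafe")])]
print(payee_hook(extracted, existing))
# → [('bank.csv', [Txn(date='2024-01-06', payee='Cafe')])]
```

The `getattr` guards are exactly the kind of boilerplate hooks could drop if "existing entries" were guaranteed to be bean-check clean.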
Thanks for spotting this issue, and sorry for not replying on the PR earlier: I had to find a bit of time to check what the code does against what it is supposed to be doing and what it is sensible to do. I agree that the current behavior is a bug: newly extracted entries should not be appended to the existing entries before these are passed to the hook functions. I have only one issue with the proposed fix: it repeatedly concatenates possibly very long lists. I need to think a bit more about it, but I think that the deduplication process should be additive, namely
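The additive scheme hinted at here might look roughly like the following sketch, where `make_key` is a placeholder for beangulp's real similarity comparison and all names are illustrative:

```python
# Instead of rebuilding existing_entries + extracted for every
# importer, seed one growing set of comparison keys and add each
# importer's surviving entries to it as they are accepted.

def make_key(entry):
    # Placeholder: real code would derive a key from date/amount/etc.
    return entry

class Deduplicator:
    def __init__(self, existing_entries):
        # Seed once with the existing ledger: O(len(existing)) total,
        # with no per-importer list concatenation afterwards.
        self.seen = {make_key(e) for e in existing_entries}

    def dedup(self, new_entries):
        kept = []
        for e in new_entries:
            k = make_key(e)
            if k not in self.seen:
                self.seen.add(k)  # additive: the record only grows
                kept.append(e)
        return kept

d = Deduplicator(["A", "B"])
assert d.dedup(["B", "C"]) == ["C"]
assert d.dedup(["C", "D"]) == ["D"]  # C was recorded by the prior call
```

Note that beangulp's actual duplicate detection compares entries for similarity within a date window rather than by exact key, so an additive version of it would accumulate candidate entries, not hashable keys; the set here only conveys the shape of the idea.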
This really needs to be fixed. I had to independently realize the same thing: the current behavior makes smart_importer completely useless. See beancount/smart_importer#149.