@hoostus commented Mar 6, 2025

Merging pollutes the existing_entries for hooks that get run later. Instead we track them separately but consider the union(existing_entries, new_entries) when running the deduplication logic, since we also want to remove duplicates from other importers in the same run.
@hunner commented Jun 12, 2025

Agreed -- I came to report this issue as well because of an effect I saw with smart_importer.

When new entries are merged with existing ones, the combined set is passed to smart_importer as training data. This includes single-leg entries, which shouldn’t be considered valid training examples. As a result, smart_importer is producing single-leg entries in my journal, since it learns from these incomplete examples.

@blais (Member) commented Jun 12, 2025

I see the problem. I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me. But there is an order-dependence issue if you want it that way: depending on the order in which you process the importers, you would get different results.

@dnicolodi (Collaborator)

> But there is an order-dependence issue if you want it that way: depending on the order in which you process the importers, you would get different results.

I'm not sure what you mean exactly, but entries are sorted before being deduplicated:

```python
def sort_extracted_entries(extracted: List[ExtractedEntry]) -> None:
    """Sort the extracted entries.

    Sort extracted entries, grouped by source document, in the order
    in which they will be used in deduplication and in which they will
    be serialized to file.

    Args:
      extracted: List of (filepath, entries, account, importer) tuples
        where entries is the list of entries extracted from the
        document at filepath by importer.
    """
    # The entries are sorted on a key composed by (max-date, account,
    # min-date, filename) where max-date and min-date are the latest
    # and earliest date appearing in the entries list. This should
    # place entries from documents produced earlier in time before
    # ones coming from documents produced later.
    #
    # Most imports have a balance statement at the end with a date
    # that is one day later than the reporting period (balance
    # statements are effective at the beginning of the day). Thus
    # using the end date should be more predictable than sorting on
    # the earliest entry.
    #
    # This diagram, where the bars represent the time span covered by
    # the contained entries, represents the sort order we want to obtain:
    #
    #   Assets:Ali  (-----)
    #   Assets:Ali  (=====--------)
    #   Assets:Bob  (------------)
    #   Assets:Bob  (===----)
    #   Assets:Ali  (--------------)
    #   Assets:Bob  (====-----------)
    #
    # The sections marked with = represent the time spans in which
    # duplicated entries could be present. We want entries from
    # documents produced earlier in time to take precedence over
    # entries from documents produced later in time.
    def key(element: ExtractedEntry):
        filename, entries, account, importer = element
        dates = [entry.date for entry in entries]
        # Sort documents that do not contain any entry last.
        max_date = min_date = datetime.date(9999, 1, 1)
        if dates:
            max_date, min_date = max(dates), min(dates)
        return max_date, account, min_date, filename

    extracted.sort(key=key)
```
So, as long as you process the same files, the output is deterministic.
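
For illustration, feeding a few hypothetical documents through sort_extracted_entries shows the ordering (the stub objects below only carry a date, since that is all the key inspects; the real tuples hold actual beancount directives):

```python
import datetime
from types import SimpleNamespace

def stub(y, m, d):
    # Minimal stand-in for a directive: only .date matters to the sort key.
    return SimpleNamespace(date=datetime.date(y, m, d))

extracted = [
    ("bob-feb.csv", [stub(2025, 1, 20), stub(2025, 2, 15)], "Assets:Bob", None),
    ("ali-jan.csv", [stub(2025, 1, 5), stub(2025, 1, 30)], "Assets:Ali", None),
    ("empty.csv", [], "Assets:Ali", None),
]
sort_extracted_entries(extracted)
# ali-jan.csv sorts first (earliest max-date, 2025-01-30), then
# bob-feb.csv (max-date 2025-02-15); the empty document sorts last
# because its max-date is the 9999-01-01 sentinel.
print([filename for filename, *_ in extracted])
```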

@hunner commented Jun 12, 2025

> Depending on the order in which you process the importers, you would get different results.
>
> I'm not sure what you mean exactly, but entries are sorted before being deduplicated. So, as long as you process the same files, the output is deterministic.

I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate call in the deduplication for-loop sees.

> I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me.

I would definitely defer to your experience (and thank you soooo much for all your time and effort in creating beancount!) but would like to understand more. It looks like the original idea for feeding the already-evaluated entries into the existing_entries came from #9 to allow bank transfer transactions to only have one side marked as a duplicate during multi-file imports. It was fixed in #64 and updated to deduplicate in-place in #114.

There are two areas that this code impacts: importers and hooks.

For hooks, it does seem like adding the new entries to the existing_entries list before calling the hooks misrepresents what the user has already touched vs. what the hooks should be processing. It would be nice to keep the distinction clear.

For importer.deduplicate(), the distinction between existing entries and new entries doesn't matter so much, since it is only concerned with deduplicating its own entries against all other entries, new or not.

The current diff proposes keeping a record of all observed entries during deduplication for each importer's deduplicate method to evaluate, then throwing that record away instead of reusing it as existing_entries for hooks. Because extract.mark_duplicate_entries() now marks the entries as duplicates in-place, the hooks will receive a list of new entries with duplicates already recorded. I don't think any info is lost, and if a hook wants the union of existing and new entries, it can iterate over the new entries, extend existing_entries, and be back to the current functionality. (See the sketch below.)
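
In code, the proposed flow would look roughly like this (a sketch, assuming the importer.deduplicate(entries, existing) signature and the (extracted, existing_entries) hook convention discussed in this thread; names are illustrative, not beangulp's exact code):

```python
def process(extracted, existing_entries, hooks):
    # Running record of everything seen so far, used only for
    # deduplication; existing_entries itself is never extended.
    seen = list(existing_entries)
    for filename, entries, account, importer in extracted:
        # Marks duplicates among `entries` in-place, comparing against
        # existing entries plus entries from importers already processed.
        importer.deduplicate(entries, seen)
        seen.extend(entries)
    # The record is thrown away; hooks receive the true existing
    # entries, with duplicates already flagged on the new entries.
    for hook in hooks:
        extracted = hook(extracted, existing_entries)
    return extracted
```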

The original intention of multi-file import duplicate detection still works with this proposed change, and the arguments passed to hooks are correctly named.

@dnicolodi (Collaborator)

> I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate call in the deduplication for-loop sees.

I guess you didn't read the large comment in the code that I linked that describes the sorting key used.

@hoostus (Author) commented Jun 13, 2025

> the prior behavior seems reasonable to me

I dunno, it means that hooks are being passed an invalid beancount file as "existing entries", which is at least slightly surprising. I don't think it is unreasonable for hooks to expect the existing entries to be something that would pass bean-check. If hooks are expected to deal with malformed beancount data, then we should probably clarify what deviations from well-formed beancount syntax they might encounter in the "existing entries", so they can write appropriately defensive code.

I'm not exactly averse to any of that -- importing stuff is kind of a weird liminal state, not fully baked -- but, as someone who wrote a hook, I was caught out by beangulp breaking the (admittedly somewhat implicit) contract that "existing entries" are well-formed from a bean-check point of view. I expected malformed data in the imported entries, but not in the existing ones.
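
Until that contract is pinned down, a hook author has to defend against incomplete data along these lines (a sketch; the single-posting check is just one example of what might appear, and the hypothetical hook follows the (extracted, existing_entries) convention above):

```python
from beancount.core import data

def my_hook(extracted, existing_entries):
    # Drop entries that would not pass bean-check, e.g. transactions
    # with fewer than two postings left over from duplicate detection.
    wellformed = [
        entry for entry in existing_entries
        if not isinstance(entry, data.Transaction) or len(entry.postings) >= 2
    ]
    # ... use `wellformed` wherever well-formed input is assumed ...
    return extracted
```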

@dnicolodi (Collaborator)

Thanks for spotting this issue and sorry for not replying on the PR earlier: I had to find a bit of time to check what the code does against what it is supposed to be doing and what it is sensible to do. I agree that the current behavior is a bug. Newly extracted entries should not be appended to the existing entries before these are passed to the hook functions.

I have only one issue with the proposed fix: it repeatedly concatenates possibly very long lists. I need to think a bit more about it, but I think that the deduplication process should be additive, namely deduplicate(A, B + C) should give the exact same result as deduplicate(A, B) followed by deduplicate(A, C), where the first argument is the list of new entries and the second argument is the list of existing entries. Thus it may be more efficient to run consecutive dedup() calls than to concatenate and sort lists.
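
As a sketch, the additivity property can be stated as a check, where dedup stands for any in-place duplicate-marking function with a (new_entries, existing_entries) signature; this describes the property to aim for, not beangulp's current behavior:

```python
import copy

def check_additive(dedup, new, existing_b, existing_c):
    # One pass against the concatenated list...
    once = copy.deepcopy(new)
    dedup(once, existing_b + existing_c)

    # ...should give the exact same result as two consecutive passes
    # against each part of the existing entries.
    twice = copy.deepcopy(new)
    dedup(twice, existing_b)
    dedup(twice, existing_c)

    assert once == twice
```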

@sim642 commented Jun 21, 2025

This really needs to be fixed. I had to independently realize the same thing: the current behavior makes smart_importer completely useless (beancount/smart_importer#149).
