@hoostus commented Mar 6, 2025

Merging pollutes the existing_entries for hooks that get run later. Instead we track them separately but consider the union(existing_entries, new_entries) when running the deduplication logic, since we also want to remove duplicates from other importers in the same run.
@hunner commented Jun 12, 2025

Agreed -- I came to report this issue as well because of an effect I saw with smart_importer.

When new entries are merged with existing ones, the combined set is passed to smart_importer as training data. This includes single-leg entries, which shouldn’t be considered valid training examples. As a result, smart_importer is producing single-leg entries in my journal, since it learns from these incomplete examples.

@blais (Member) commented Jun 12, 2025

I see the problem. I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me. But there is an order-dependence issue if you want it that way: depending on the order in which you process the importers, you would get different results.

@dnicolodi (Collaborator)

> But there is an order-dependence issue if you want it that way: depending on the order in which you process the importers, you would get different results.

I'm not sure what you mean exactly, but entries are sorted before being deduplicated:

```python
def sort_extracted_entries(extracted: List[ExtractedEntry]) -> None:
    """Sort the extracted entries.

    Sort extracted entries, grouped by source document, in the order
    in which they will be used in deduplication and in which they will
    be serialized to file.

    Args:
      extracted: List of (filepath, entries, account, importer) tuples
        where entries is the list of entries extracted from the
        document at filepath by importer.
    """
    # The entries are sorted on a key composed by (max-date, account,
    # min-date, filename) where max-date and min-date are the latest
    # and earliest date appearing in the entries list. This should
    # place entries from documents produced earlier in time before
    # ones coming from documents produced later.
    #
    # Most imports have a balance statement at the end with a date
    # that is one day later than the reporting period (balance
    # statements are effective at the beginning of the day). Thus
    # using the end date should be more predictable than sorting on
    # the earliest entry.
    #
    # This diagram, where the bars represent the time span covered by
    # the contained entries, represents the sort order we want to obtain:
    #
    #   Assets:Ali  (-----)
    #   Assets:Ali  (=====--------)
    #   Assets:Bob  (------------)
    #   Assets:Bob  (===----)
    #   Assets:Ali  (--------------)
    #   Assets:Bob  (====-----------)
    #
    # The sections marked with = represent the time spans in which
    # duplicated entries could be present. We want entries from
    # documents produced earlier in time to take precedence over
    # entries from documents produced later in time.
    def key(element: ExtractedEntry):
        filename, entries, account, importer = element
        dates = [entry.date for entry in entries]
        # Sort documents that do not contain any entry last.
        max_date = min_date = datetime.date(9999, 1, 1)
        if dates:
            max_date, min_date = max(dates), min(dates)
        return max_date, account, min_date, filename

    extracted.sort(key=key)
```
So, as long as you process the same files, the output is deterministic.
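
For illustration, feeding a few hypothetical documents through sort_extracted_entries shows the ordering (the stub objects below only carry a date, since that is all the key inspects; the real tuples hold actual beancount directives):

```python
import datetime
from types import SimpleNamespace

def stub(y, m, d):
    # Minimal stand-in for a directive: only .date matters to the sort key.
    return SimpleNamespace(date=datetime.date(y, m, d))

extracted = [
    ("bob-feb.csv", [stub(2025, 1, 20), stub(2025, 2, 15)], "Assets:Bob", None),
    ("ali-jan.csv", [stub(2025, 1, 5), stub(2025, 1, 30)], "Assets:Ali", None),
    ("empty.csv", [], "Assets:Ali", None),
]
sort_extracted_entries(extracted)
# ali-jan.csv sorts first (earliest max-date, 2025-01-30), then
# bob-feb.csv (max-date 2025-02-15); the empty document sorts last
# because its max-date is the 9999-01-01 sentinel.
print([filename for filename, *_ in extracted])
```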

@hunner commented Jun 12, 2025

> Depending on the order in which you process the importers, you would get different results.
>
> I'm not sure what you mean exactly, but entries are sorted before being deduplicated. So, as long as you process the same files, the output is deterministic.

I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate call in the deduplication for-loop sees.

> I'm not 100% sure this should be the desired behavior, though; the prior behavior seems reasonable to me.

I would definitely defer to your experience (and thank you soooo much for all your time and effort in creating beancount!) but would like to understand more. It looks like the original idea for feeding the already-evaluated entries into the existing_entries came from #9 to allow bank transfer transactions to only have one side marked as a duplicate during multi-file imports. It was fixed in #64 and updated to deduplicate in-place in #114.

There are two areas that this code impacts: importers and hooks.

For hooks, it does seem like adding the new entries to the existing_entries list before calling the hooks misrepresents what the user has already touched vs. what the hooks should be processing. It would be nice to keep the distinction clear.

For importer.deduplicate(), the distinction between existing entries and new entries doesn't matter so much, since it is only concerned with deduplicating its own entries against all other entries, new or not.

The current diff proposes keeping a record of all observed entries during deduplication for each importer's deduplicate method to evaluate, then throwing that record away instead of reusing it as existing_entries for hooks. Because extract.mark_duplicate_entries() now marks the entries as duplicates in-place, the hooks will receive a list of new entries with duplicates already recorded. I don't think any info is lost, and if a hook wants the union of existing and new entries, it can iterate over the new entries, extend existing_entries, and be back to the current functionality. (See the sketch below.)
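
In code, the proposed flow would look roughly like this (a sketch, assuming the importer.deduplicate(entries, existing) signature and the (extracted, existing_entries) hook convention discussed in this thread; names are illustrative, not beangulp's exact code):

```python
def process(extracted, existing_entries, hooks):
    # Running record of everything seen so far, used only for
    # deduplication; existing_entries itself is never extended.
    seen = list(existing_entries)
    for filename, entries, account, importer in extracted:
        # Marks duplicates among `entries` in-place, comparing against
        # existing entries plus entries from importers already processed.
        importer.deduplicate(entries, seen)
        seen.extend(entries)
    # The record is thrown away; hooks receive the true existing
    # entries, with duplicates already flagged on the new entries.
    for hook in hooks:
        extracted = hook(extracted, existing_entries)
    return extracted
```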

The original intention of multi-file import duplicate detection still works with this proposed change, and the arguments passed to hooks are correctly named.

@dnicolodi (Collaborator)

> I guess if the same input files are kept but the order of importers is changed, then that affects the existing_entries that each successive importer.deduplicate call in the deduplication for-loop sees.

I guess you didn't read the large comment in the code that I linked that describes the sorting key used.

@hoostus (Author) commented Jun 13, 2025

> the prior behavior seems reasonable to me

I dunno, it means that hooks are being passed an invalid beancount file as "existing entries", which is at least slightly surprising. I don't think it is unreasonable for hooks to expect the existing entries to be something that would pass bean-check. If hooks are expected to deal with malformed beancount data, then we should probably clarify what deviations from well-formed beancount syntax they might encounter in the "existing entries", so they can write appropriately defensive code.

I'm not exactly averse to any of that -- importing stuff is kind of a weird liminal state, not fully baked -- but, as someone who wrote a hook, I was caught out by beangulp breaking the (admittedly somewhat implicit) contract that "existing entries" are well-formed from a bean-check point of view. I expected malformed data in the imported entries, but not in the existing ones.
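
Until that contract is pinned down, a hook author has to defend against incomplete data along these lines (a sketch; the single-posting check is just one example of what might appear, and the hypothetical hook follows the (extracted, existing_entries) convention above):

```python
from beancount.core import data

def my_hook(extracted, existing_entries):
    # Drop entries that would not pass bean-check, e.g. transactions
    # with fewer than two postings left over from duplicate detection.
    wellformed = [
        entry for entry in existing_entries
        if not isinstance(entry, data.Transaction) or len(entry.postings) >= 2
    ]
    # ... use `wellformed` wherever well-formed input is assumed ...
    return extracted
```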

@dnicolodi (Collaborator)

Thanks for spotting this issue and sorry for not replying on the PR earlier: I had to find a bit of time to check what the code does against what it is supposed to be doing and what it is sensible to do. I agree that the current behavior is a bug. Newly extracted entries should not be appended to the existing entries before these are passed to the hook functions.

I have only one issue with the proposed fix: it repeatedly concatenates possibly very long lists. I need to think a bit more about it, but I think that the deduplication process should be additive, namely deduplicate(A, B + C) should give the exact same result as deduplicate(A, B) followed by deduplicate(A, C), where the first argument is the list of new entries and the second argument is the list of existing entries. Thus it may be more efficient to run consecutive dedup() calls than to concatenate and sort lists.
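
As a sketch, the additivity property can be stated as a check, where dedup stands for any in-place duplicate-marking function with a (new_entries, existing_entries) signature; this describes the property to aim for, not beangulp's current behavior:

```python
import copy

def check_additive(dedup, new, existing_b, existing_c):
    # One pass against the concatenated list...
    once = copy.deepcopy(new)
    dedup(once, existing_b + existing_c)

    # ...should give the exact same result as two consecutive passes
    # against each part of the existing entries.
    twice = copy.deepcopy(new)
    dedup(twice, existing_b)
    dedup(twice, existing_c)

    assert once == twice
```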

@sim642 commented Jun 21, 2025

This really needs to be fixed. I had to independently realize the same thing: the current behavior makes smart_importer completely useless (beancount/smart_importer#149).
