Skip to content

800: Allow declaring that multiple Entities should be created/updated#818

Open
lindsay-stevens wants to merge 46 commits intoXLSForm:masterfrom
lindsay-stevens:pyxform-800
Open

800: Allow declaring that multiple Entities should be created/updated#818
lindsay-stevens wants to merge 46 commits intoXLSForm:masterfrom
lindsay-stevens:pyxform-800

Conversation

@lindsay-stevens
Copy link
Contributor

@lindsay-stevens lindsay-stevens commented Feb 11, 2026

Closes #800
Closes #819

Why is this the best possible solution? Were any other approaches considered?

Initial implementation that passes existing relevant tests and a handful of new ones but still needs:

  • validations to be added
  • error messages tidied up
  • much more + thorough tests
    • mainly solver regression suite remaining but current tests cover many common patterns
    • remaining existing tests in test_create_repeat.py may be better covered by solver suite
  • more docstrings / comments
    • could be more thorough still but is reasonably clear at the moment
  • resolve todo's
  • probably some restructuring to improve readability / maintainability e.g.
    • replace mysterious dict/tuple/etc with types (with class methods etc)
    • redundant data processing / call paths

As noted in below comments, the index to this PR is the docstring at the top of test_entities.py which assigns a short code to each requirement from the XForm spec, for validations, and for general behaviour. These short codes are then used to tag the relevant test cases.

What are the regression risks?

New feature but main regression risk is breaking forms for the old entity specs

Does this change require updates to documentation? If so, please file an issue here and include the link below.

Yes, there's a spec doc update already but XLSForm side not documented yet

Before submitting this PR, please make sure you have:

  • included test cases for core behavior and edge cases in tests
  • run python -m unittest and verified all tests pass
  • run ruff format pyxform tests and ruff check pyxform tests to lint code
  • verified that any code or assets from external sources are properly credited in comments

- survey_element.py: add "type" (if available) to the repr for debugging
- entity_declaration.py: add missing "type" field which appears in the
  dict prepared by entity_parsing.py, and in general set that type
  argument by default
- survey.py: similar to the above add type field to init signature, and
  remove unicode/repr in favour of survey_element implementation
- works the same but could allow builder.py to be compatible with
  Survey/SurveyElement tree as well as the workbook_to_json output dict
- allowed to put entities in repeats (v1)
- allowed to declare more than one entity (v2)
- pyxform places the entity instead of user-specified (v2)
@lognaturel lognaturel moved this to ✏️ in progress in ODK Central Feb 11, 2026
- previous equality check would only emit for a direct repeat parent
- also nest under `if` to avoid potentially adding `None` to children
- test plan not fully implemented yet but the idea is to do a chunk of
  it to check that the approach seems sensible and usable in practice
- old: test_entities_columns__multiple_unexpected
- new: test_unexpected_column__multiple__error
- old: test_entities_columns__one_unexpected
- new: test_unexpected_column__single__error
- old: test_dataset_with_reserved_prefix__errors
- new: test_dataset_name__reserved_prefix__error
- old: test_dataset_with_period_in_name__errors
- new: test_dataset_name__period__error
- old: test_dataset_with_invalid_xml_name__errors
- new: test_dataset_name__xml_identifier__error
- old: test_worksheet_name_close_to_entities__produces_warning
- new: test_sheet_name_misspelling__warning
- old: test_saveto_without_entities_sheet__errors
- new: test_no_entity_declarations__error
- old: test_reference_name_not_found__error
- new: test_unresolved_variable_reference__error
- needed to refactor/move workbook_to_json survey type parsing because
  previous saveto check would raise a false positive error if the type
  contained "group" or "repeat" but that can occur for selects etc, so
  control type parsing is moved earlier so that result can be used, and
  so it's possible to catch a save_to in an end_group / end_repeat.
- converted `{dataset_name: list[ReferenceSource]}` to an object:
  - used in many functions so it's now clearer what that object is
  - allows it can carry the row_number for error messages
  - may make sense to move some function(s) to it as class methods
- test cases are somewhat light since the intention is to cover the
  solver behaviour more thoroughly in the topology test suite
- would be caught by names validation later anyway, but that names error
  is not particularly clear for this problem (which seems plausibly
  common) because the name check happens after the delimiter split.
- not explicitly tested before but existing tests alternate usages
- swapped tests in test_entities.py over to list_name since that's
  the current/documented alias
- also add some requirements tags to match the xpath assertions
- the docstring usage of google-style "attributes" is required in this
  case since a namedtuple defines properties at class level not init.
- linting
- variable references can float between containers in a scope boundary
  but save_tos must be in the nearest container
- combined get_container_scopes and get_allocation_requests into
  EntityReferences.get_allocation_request since these are two parts of
  the same process and operate on single EntityReferences instance.
  - also makes it a bit clearer what allocate_entities_to_containers is
    doing to group the allocation requests by scope boundary
  - updated test assertions using ENTITY_011/012 since this new func
    will flag the first/highest reference as the error rather than the
    lowest. The higher node is perhaps more incorrect since looking
    down the tree results in multiple nodes. But then again users tend
    to write forms top to bottom so either could be preferable.
- changed ReferenceSource.get_scope_boundary default to the /survey
  since that is always the default scope boundary (rather than empty).
- deleted ENTITY_010 error since it's now covered by ENTITY_011 and
  the latter more clearly shows both scopes.
- previously it was constrained to the nearest container, but the spec
  says "nearest ancestor container that has an Entity declaration"
- added tests for topology error scenarios involving save_to and/or vars
- changed requests.sort to sorted() since the sort result is only needed
  for the allocation loop.
- fold allocate_entities_to_paths into allocate_entities_to_containers
  since it doesn't get used elsewhere and relies on the external
  all_allocations dict anyway
  - also replace somewhat confusing `while` loop that just is based on
    ints with hopefully clearer range loop over requested_path
- replace get_allocation_request repeated max/min calls with equivalent
  single loop over the references, clearer and avoids repeated loops
@lognaturel
Copy link
Contributor

This is feeling really magical, I must say! I think users will find it "just works." I'm starting to get myself familiar with the code and coming up with more interesting forms to see how they match up with your tests. Let me know if there's anything I should pay particular attention to.

@lindsay-stevens
Copy link
Contributor Author

@ln thanks! I think something that would be good to review initially is the traceability test suite plan at the top of test_entities.py. I'm still refining it, but the gist of it is there. What I was going for is a way to organise the test suite for entities, in a lightweight way that can map each test to the relevant requirement(s) and vice versa. There are fancy ways to do this sort of thing with method decorators or gherkin libraries etc. - but as it is currently, at least it allows (in Pycharm or similar) to double-click or search each requirement tag (e.g ES001) to see where the relevant tests are.

Another thing that would be good to review early is the ENTITY_009, ENTITY_011, and ENTITY_012 errors for topology problems. With those cases, it seemed like the sort of thing where it may be quite confusing for users to resolve for non-trivial forms - even if the message includes every bit of context available (which may be too verbose) where the error is raised - because ultimately pyxform can't know exactly what the user intended to do. Maybe there is a good docs URL that can do the heavy lifting on explaining, rather than a paragraph-sized error? etc.

- as described in XLSForm#819 if an entity_id is defined then the default
  setvalue node which targets the entity_id should not be emitted.
  - included cases that show behaviour when a dynamic default for the
    referenced entity_id is included i.e. the setvalue for the default
    targets the question rather than the entity_id attribute
  - as discussed on forum thread 55163, in Collect this duplicate
    setvalue/bind sees the bind "win" whereas in Webforms the setvalue
    "wins" so it is a functionality problem as well as being redundant.
- added / moved test cases for implicit update/upsert modes
  - many of the tests that are still in the old modules include
    assertions for save_to which is not part of test_entities yet
  - added requirement and assertion for the instance csv since while
    that instance is essential for updating, it's not currently intended
    to be emitted by default
  - added requirements for not emitting namespace/version since these
    were covered in previous test cases
- L560: the allocations are derived from the entity declarations so
  the dataset_name should always be found
- L567: it seems like the has_repeat_ancestor flag is sufficient
- L568: the id_attr is added by default on L286 and the actions attr
  defaults to an empty list so it's safe to assume both exist
  - to help keep the default function in sync though, the action value
    for the repeat setvalue is copied from the top-level action
  - the top-level action is only added if there is no entity_id L234.
- L424: the container path was being calculated from the stack for each
  row, but it is a property of each stack entry which can be calculated
  in advance.
  - added a ContainerPath class to represent the collection of
    container nodes used in many places, and used it to calculate
    the container path for each stack entry when adding to the stack
  - also the path can assume the previous stack entry has a container
    path as well, so that value is used instead of re-iterating
- L696: deleted branch since it's not reachable - if there are multiple
  entity declarations, then the code would go through L690 -> L691
| | end_group | | | |
| | end_repeat | | | |

| entities |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯 You really went for the most diabolical cases!

- dataset_2: xpath - where the meta/entity block should be for requestor_2


*Stress test*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, nice one. That's a very big survey, if things are working for that, we're in great shape.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I think) I'm most of the way through the traceability suite, but this stress test is part of the solver suite, so I'm not sure it passes yet, but then again the expectation for the stress test is pretty loose.

@lognaturel
Copy link
Contributor

lognaturel commented Feb 27, 2026

Thanks for the guidance on how to approach early review. I've looked at tests and messages and one thing that's jumping out to me is that there's one common pattern that's not yet supported: savetos for the same list across inner groups within top level or repeats. Sample forms:

Edit: these do work!! I've updated them with some minor unrelated fixes that made conversion failed. I must have forgotten to install.

@lindsay-stevens
Copy link
Contributor Author

savetos for the same list across inner groups within top level or repeats

It may not have worked earlier but using the latest commit 6b525bb, conceptually this is OK. There's a test test_save_to_scope_breach__depth_1_group__save_to_only__ok which shows saveto inside and outside a group. But I'll add some more cases that follow the patterns in these XLSX files to lock them in. I was able to convert all 3 of the files, except that two of them had a missing referenced name error - member_hh perhaps should be named hhid, because there was no other question named hhid and the entity labels referenced ${hhid} - after fixing that they converted.

- add a test for the error that odk_validate raises when the instance
  referenced in update/upsert mode isn't there (EB019)
- update other such tests to use csv-external so that the instance is
  present - not a core part of the test as such, but required for
  update/upsert to work without an odk_validate error
- depth 1: survey/group
- depth 2: repeat/group
- depth 3: repeat/group + var outside repeat
@lognaturel
Copy link
Contributor

That’s amazing to read, thank you! I wonder whether I forgot to rebuild? I’ll try to get to the bottom of it and keep working through the test cases. 🚀

- previously, not adding any references to or from an entity would
  result in them being silently ignored, but now such entities are
  considered as targeting the survey
  - allocation currently only considers ancestors so only one such
    unreferenced entity can exist in the form, i.e. even if there are
    spare valid groups the entities won't be pushed down into them
- removed EB011 since it is covered by EB010 in that the row number is
  what makes it stable by breaking ties using a distinct value
- locally the @baseVersion seemed to reliably be the one in the error
  message, but on GH Actions it seems to prefer @branchId, so the
  test assertion uses generic message fragments, which should still
  be selective enough for this error and not match others
@lognaturel
Copy link
Contributor

Confirmed those three work and updated the source above to fix the issues you mentioned. So good!

I'm continuing my review looking specifically at risk to existing functionality we may not have explicitly tested and common patterns we expect. I think we should merge a version sooner than later so QA and the broader community can start pushing on it.

- combinations of:
  - create, update, multiple properties, multiple entities
  - survey, group, repeat, repeat/group, group/repeat
- delete tests covered by these newer ones
- previously did not handle a blank list_name / dataset where other
  columns had a value, but now there is an error for that
- the dataset_name checks assumed only one entity is possible so had
  hardcoded row=2 in all errors, but now it uses the actual row number
- should help keep what is parsed in sync with current entity cols
- still need outer get_entity_variable_references to index on
  question_name to avoid having to do nested iteration on every row
@lindsay-stevens lindsay-stevens marked this pull request as ready for review March 3, 2026 15:30
@lindsay-stevens
Copy link
Contributor Author

Maybe it's from looking at it for so long, but this PR seems to be in decent enough shape for merging. The remaining work is mainly the solver suite which is searching for edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Don't output Entity id setvalue when there's an entity id expression specified Allow declaring that multiple Entities should be created/updated

2 participants