
Conversation

@glatterf42 (Member) commented Dec 17, 2024

This PR is a follow-up to #122: moving on to the next optimization item, it redesigns the data model to normalize it and, hopefully, make it more performant.

For Tables, this is a bit trickier than for Indexsets. Tables have multiple columns, each of which needs to be linked to an Indexset but can have a name other than the Indexset's name. Each Table also needs to store data in tabular form, and different Tables have different numbers of columns.
Previously, we solved this by dumping the data into a JSON field, but that did not achieve the desired performance, nor did it follow DB normalization rules.

In preparation for this PR, I first thought our ideal model would look like this:

  • Table and Indexset should get a many-to-many relationship. SQLAlchemy's association proxy extension is an elegant solution for this, as it also allows saving optional fields (like distinct dimension names) along the relationship (see the sketch after this list).
  • Table.data consists only of values that are already stored in the data of the Indexsets the Table is linked to, so ideally, we would save just references to those values in a new table TableData and have a one-to-many relationship between Table and TableData. A relationship between TableData and IndexSetData seems like the ideal way to model this.
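
A minimal sketch of that first point, assuming illustrative class and table names rather than the actual ixmp4 models:

```python
# Sketch only: names are illustrative, not the actual ixmp4 schema.
from sqlalchemy import ForeignKey
from sqlalchemy.ext.associationproxy import AssociationProxy, association_proxy
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase): ...


class IndexSet(Base):
    __tablename__ = "optimization_indexset"
    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]


class Column(Base):
    """Association object linking Table and IndexSet."""

    __tablename__ = "optimization_column"
    table_id: Mapped[int] = mapped_column(
        ForeignKey("optimization_table.id"), primary_key=True
    )
    indexset_id: Mapped[int] = mapped_column(
        ForeignKey("optimization_indexset.id"), primary_key=True
    )
    # The optional field stored along the relationship: a per-Table dimension
    # name that may differ from the IndexSet's own name.
    name: Mapped[str | None]
    indexset: Mapped[IndexSet] = relationship()


class Table(Base):
    __tablename__ = "optimization_table"
    id: Mapped[int] = mapped_column(primary_key=True)
    columns: Mapped[list[Column]] = relationship()
    # The proxy exposes the linked IndexSets directly as Table.indexsets.
    indexsets: AssociationProxy[list[IndexSet]] = association_proxy(
        "columns", "indexset"
    )
```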

Unfortunately, only the first point worked out. SQLAlchemy can't handle a relationship as part of a UniqueConstraint, which prevents the INSERT workflow this whole refactoring aims to achieve. Instead, TableData now stores the actual str form of the values, increasing storage requirements compared to the ideal solution. This might be an area for future improvement.
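
The core of the problem, sketched with illustrative names (not the actual ixmp4 schema): a UniqueConstraint can only reference mapped columns, not an ORM relationship, so the uniqueness check has to run against duplicated string values.

```python
# Sketch only: names are illustrative, not the actual ixmp4 models.
from sqlalchemy import ForeignKey, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase): ...


class TableData(Base):
    __tablename__ = "optimization_tabledata"

    id: Mapped[int] = mapped_column(primary_key=True)
    table_id: Mapped[int] = mapped_column(ForeignKey("optimization_table.id"))

    # The values are duplicated as plain strings instead of being references
    # to IndexSetData rows.
    value_0: Mapped[str]
    value_1: Mapped[str | None]
    # ... continuing up to value_14

    __table_args__ = (
        # Works because it names actual columns; an ORM relationship could
        # not appear here.
        UniqueConstraint("table_id", "value_0", "value_1"),
    )
```
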
To tackle the problem of varying numbers of columns, TableData is now hardcoded to store up to 15 values per row, most of which will always be NULL. This requires some cleanup before returning the actual DB object, such as dropping all-NULL columns and renaming the remaining ones (to the name of either their Indexset or their specific column_name); this is likely the messiest part of this PR. Please let me know how that could be improved.
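
The cleanup could look roughly like this hypothetical helper (the actual repository code may differ):

```python
# Hypothetical helper, not the actual ixmp4 repository code; assumes df
# contains only the generic value_N columns for one Table.
import pandas as pd


def clean_table_data(df: pd.DataFrame, column_names: list[str]) -> pd.DataFrame:
    # Columns beyond this Table's dimensionality are entirely NULL, so drop them.
    df = df.dropna(axis="columns", how="all")
    # Rename the remaining value_N columns to the names taken from either the
    # linked Indexsets or the Table-specific column_names; column_names must
    # match the number of remaining columns.
    return df.set_axis(column_names, axis="columns")
```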

Locally, the tests take just as long as they do on main, but that is not proper benchmarking, which would be the actual test for this PR: it should handle a million values faster than main (and hopefully faster than old ixmp). I previously tested this for Parameters (which use upserting rather than just inserting), though, so I might have to draft another PR very similar to this one for them and run the manual benchmark again. A proper benchmark setup for optimization items (one that runs automatically) would be great (depending on how long we need to reach ixmp's performance).

@glatterf42 added the enhancement label Dec 17, 2024
@glatterf42 self-assigned this Dec 17, 2024
@codecov (bot) commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 97.27273% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.5%. Comparing base (1885f0c) to head (4af4ffb).

Files with missing lines Patch % Lines
ixmp4/data/db/optimization/table/repository.py 85.7% 3 Missing ⚠️
Additional details and impacted files
@@                       Coverage Diff                       @@
##           enh/remove-superfluous-ellipses    #143   +/-   ##
===============================================================
  Coverage                             88.4%   88.5%           
===============================================================
  Files                                  231     231           
  Lines                                 8093    8155   +62     
===============================================================
+ Hits                                  7160    7219   +59     
- Misses                                 933     936    +3     
Files with missing lines Coverage Δ
ixmp4/core/optimization/table.py 92.4% <100.0%> (-0.2%) ⬇️
ixmp4/data/abstract/optimization/table.py 96.9% <100.0%> (+<0.1%) ⬆️
ixmp4/data/api/optimization/table.py 94.0% <100.0%> (ø)
ixmp4/data/db/base.py 92.3% <100.0%> (+<0.1%) ⬆️
ixmp4/data/db/optimization/__init__.py 100.0% <100.0%> (ø)
ixmp4/data/db/optimization/base.py 100.0% <100.0%> (ø)
ixmp4/data/db/optimization/indexset/__init__.py 100.0% <100.0%> (ø)
ixmp4/data/db/optimization/indexset/repository.py 98.1% <100.0%> (ø)
ixmp4/data/db/optimization/table/model.py 100.0% <100.0%> (ø)
ixmp4/data/db/optimization/utils.py 100.0% <100.0%> (ø)
... and 4 more

@glatterf42 requested a review from meksor December 20, 2024 07:40
@glatterf42 force-pushed the enh/remove-superfluous-ellipses branch from f996fb5 to 64d7787 December 20, 2024 13:24
@glatterf42 force-pushed the enh/normalize-table-DB-storage branch from 7a65d41 to afd1630 December 20, 2024 13:26
@glatterf42 force-pushed the enh/remove-superfluous-ellipses branch from 64d7787 to 1885f0c January 7, 2025 12:53
@meksor (Contributor) commented Jan 7, 2025

Unfortunately, only the first point worked out. SQLAlchemy can't handle a relationship as part of a UniqueConstraint, which prevents the INSERT workflow this whole refactoring aims to achieve. Instead, TableData now stores the actual str form of the values, increasing storage requirements compared to the ideal solution. This might be an area for future improvement.

Yes, the relationship is an ORM concept; the database does not know about it. Your current UniqueConstraint should not work either, though, unless all columns are NOT NULL. From the Postgres docs:

In general, a unique constraint is violated if there is more than one row in the table where the values of all of the columns included in the constraint are equal. By default, two null values are not considered equal in this comparison. That means even in the presence of a unique constraint it is possible to store duplicate rows that contain a null value in at least one of the constrained columns. This behavior can be changed by adding the clause NULLS NOT DISTINCT [...]

https://www.postgresql.org/docs/15/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS

You could add NULLS NOT DISTINCT, but I think that will break compatibility with SQLite.
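
For reference, SQLAlchemy 2.0 exposes the clause as a PostgreSQL-only parameter, something like this sketch (table and column names are illustrative):

```python
# Sketch: PostgreSQL-only; SQLite has no equivalent clause, so behaviour
# would differ across backends.
from sqlalchemy import Column, Integer, MetaData, String, Table, UniqueConstraint

metadata = MetaData()

table_data = Table(
    "optimization_tabledata",  # illustrative, not the actual schema
    metadata,
    Column("table_id", Integer, nullable=False),
    Column("value_0", String, nullable=False),
    Column("value_1", String, nullable=True),
    UniqueConstraint(
        "table_id",
        "value_0",
        "value_1",
        # Emits NULLS NOT DISTINCT on PostgreSQL: two NULLs now count as
        # equal for the uniqueness check.
        postgresql_nulls_not_distinct=True,
    ),
)
```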

Am I wrong, or are the tests spotty? Or does it not matter whether that constraint works at all?

I'm having a hard time gauging where the potential performance bottlenecks are without any benchmarking and profiling tests, and since we don't have data on either approach (is that right?), I feel like I'm choosing between:

a. a massive hack, potentially better performance, but who knows?
b. a less massive hack, potentially worse performance, but also: who knows?

both without any explanation as to why one would perform better or worse...

I know this is becoming an annoyingly common occurrence, but I'm leaning towards a "No" on this one again without any further info...

@glatterf42 force-pushed the enh/normalize-table-DB-storage branch from c842910 to 4d44102 January 13, 2025 13:29
@glatterf42 force-pushed the enh/remove-superfluous-ellipses branch from 1885f0c to 3cc17d2 January 30, 2025 10:54
@glatterf42 force-pushed the enh/remove-superfluous-ellipses branch from 3cc17d2 to 8fe4cac April 29, 2025 08:23
Base automatically changed from enh/remove-superfluous-ellipses to main April 29, 2025 09:23
@glatterf42 (Member, Author) commented
Just summarizing the status here: I still think we would benefit from normalizing the data-storage part of the optimization tables, possibly using the approach contained in this PR.
However, especially after #219, this branch will not be mergeable without much refactoring, so if the approach seems useful, it might be best to start a new branch and cherry-pick the changes (via git or manually) onto it.
To fully understand whether this is a good idea, we'd need the spare time to come up with performance tests, though. Ideally, these would cover:

  • tables.add_data()
  • tables.remove_data()
  • and both operations in ixmp (on JDBC) for comparison

Then we could compare the JSON and normalized DB models, and also give the MESSAGE team numbers on whether ixmp4 improves performance in this regard (noting that Parameter.remove_data() on JDBC is a very common pain point).
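
A minimal sketch of what such a test could look like with pytest-benchmark; the platform fixture and the exact ixmp4 calls here are assumptions, not existing test code:

```python
# Hypothetical benchmark sketch; "platform" is an assumed ixmp4.Platform
# test fixture, and the optimization API calls are assumptions.
import pytest


@pytest.fixture
def big_table(platform):
    run = platform.runs.create(model="model", scenario="scenario")
    indexset = run.optimization.indexsets.create("indexset")
    indexset.add([str(i) for i in range(1_000_000)])
    return run.optimization.tables.create(
        "table", constrained_to_indexsets=["indexset"]
    )


def test_add_data(benchmark, big_table):
    data = {"indexset": [str(i) for i in range(1_000_000)]}
    # add_data() is not idempotent, so run exactly one round instead of
    # letting pytest-benchmark repeat the call.
    benchmark.pedantic(big_table.add_data, args=(data,), rounds=1, iterations=1)
```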
