Normalize optimization.Table DB storage
#143
base: main
Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@                  Coverage Diff                   @@
##  enh/remove-superfluous-ellipses    #143  +/-  ##
====================================================
  Coverage      88.4%    88.5%
====================================================
  Files           231      231
  Lines          8093     8155   +62
====================================================
+ Hits           7160     7219   +59
- Misses          933      936    +3
Yes, the relationship is an ORM concept; the database does not know about it. Your current UniqueConstraint should not work either, though, unless all columns are NOT NULL. From the postgres docs:
https://www.postgresql.org/docs/15/ddl-constraints.html#DDL-CONSTRAINTS-UNIQUE-CONSTRAINTS

You could add NULLS NOT DISTINCT, but I think that will break compatibility with sqlite. Am I wrong, or are the tests spotty? Or does it not matter whether that constraint works at all?

I'm having a hard time gauging where the potential performance bottlenecks are without any benchmarking and profiling tests, and since we don't have any data on both approaches (is that right?), I feel like I'm choosing between: a. a massive hack with potentially better performance, but who knows? ... both without any explanation as to why they would perform better or worse. I know this is becoming an annoyingly common occurrence, but I'm leaning towards a "No" on this one again without any further info.
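To make the NULL semantics concrete, here is a minimal, hypothetical sketch (model and column names are my assumptions, not the PR's actual code) of a unique constraint over nullable columns, and the PostgreSQL-15-only `NULLS NOT DISTINCT` option that SQLAlchemy 2.0 exposes as a dialect-specific flag:

```python
# Hypothetical sketch: a unique constraint over nullable columns.
# Under the default NULLS DISTINCT semantics, two rows that differ
# only in their NULLs do NOT conflict, so duplicates can slip in.
from sqlalchemy import UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class TableData(Base):  # illustrative, not the PR's exact model
    __tablename__ = "table_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    table__id: Mapped[int]
    value_0: Mapped[str]
    value_1: Mapped[str | None]  # NULL whenever a Table has one column

    __table_args__ = (
        # With a plain UniqueConstraint, ("x", NULL) and ("x", NULL)
        # are both accepted. SQLAlchemy 2.0 can emit NULLS NOT DISTINCT
        # for PostgreSQL 15+; other dialects (e.g. SQLite) ignore the
        # flag and keep the default semantics, so behavior diverges.
        UniqueConstraint(
            "table__id",
            "value_0",
            "value_1",
            postgresql_nulls_not_distinct=True,
        ),
    )
```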
Just summarizing the status here: I still think we would benefit from normalizing the data-storage part of the optimization tables, possibly using the approach contained here.
Then we could compare the JSON and normalized DB models, and also provide numbers to the MESSAGE team on whether ixmp4 improves performance in this regard (noting that …)
This PR is a follow-up to #122: moving to the next optimization item, it redesigns the data model to normalize it and hopefully also make it more performant.
For `Table`s, this is a bit more tricky than for `Indexset`s. `Table`s have multiple columns, each of which needs to be linked to an `Indexset` but can have a name other than the `Indexset`'s name. Each `Table` also needs to store data in table form, where different `Table`s will have a different number of columns.

Previously, we solved this by dumping the data into a JSON field, but that did not achieve the desired performance, nor did it follow DB model normalization rules.
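For illustration, the previous design looked roughly like the following (a minimal sketch with assumed names, not the actual ixmp4 model): all of a `Table`'s rows end up in one opaque JSON document that the database can neither index nor constrain.

```python
# Rough sketch of the JSON-dump approach (all names are assumptions).
from sqlalchemy import JSON
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Table(Base):
    __tablename__ = "optimization_table"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]
    # The whole dataset lives in one JSON blob, e.g.
    # {"technology": ["coal", "wind"], "year": [2020, 2030]},
    # so every read or write (de)serializes everything at once.
    data: Mapped[dict] = mapped_column(JSON, default=dict)
```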
In preparation for this PR, I first thought that our ideal model would look like this:

- `Table` and `Indexset` should get a many-to-many relationship. Sqlalchemy's association proxy extension is an elegant solution for this, as it also offers to save optional fields (like distinct dimension names) along the relationship (see the sketch after this list).
- `Table.data` only consists of values that are stored inside `Indexset.data` (of the `Indexset`s the `Table` is linked to), so ideally, we save just references to these values in a new table `TableData` and have a one-to-many relationship between `Table` and that. A `relationship` between that and `IndexSetData` seems like the ideal object for this.

Unfortunately, only the first point worked out. Sqlalchemy can't handle a `relationship` as part of a `UniqueConstraint`, preventing the `INSERT` workflow that this whole refactoring is aimed at achieving. Instead, `TableData` now stores the actual `str` form of the values, increasing storage requirements (compared to the ideal solution). This might be a source of future improvement.
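To show what the first bullet looks like in practice, here is a hedged sketch of the many-to-many link through an association object carrying the optional dimension name (class and column names are my assumptions, not the PR's exact code):

```python
# Sketch: Table <-> IndexSet many-to-many via an association object
# that also stores an optional per-dimension column name.
from sqlalchemy import ForeignKey
from sqlalchemy.ext.associationproxy import AssociationProxy, association_proxy
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship


class Base(DeclarativeBase):
    pass


class IndexSet(Base):
    __tablename__ = "indexset"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]


class TableIndexSetLink(Base):
    __tablename__ = "table_indexset_link"

    table__id: Mapped[int] = mapped_column(ForeignKey("table.id"), primary_key=True)
    indexset__id: Mapped[int] = mapped_column(
        ForeignKey("indexset.id"), primary_key=True
    )
    # Optional name for this dimension; falls back to IndexSet.name.
    column_name: Mapped[str | None]

    indexset: Mapped[IndexSet] = relationship()


class Table(Base):
    __tablename__ = "table"

    id: Mapped[int] = mapped_column(primary_key=True)
    name: Mapped[str]

    _links: Mapped[list[TableIndexSetLink]] = relationship()
    # table.indexsets reads the IndexSet objects through the links.
    indexsets: AssociationProxy[list[IndexSet]] = association_proxy(
        "_links", "indexset"
    )
```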
To tackle the problem of varying numbers of columns, `TableData` is now hardcoded to store up to 15 values per row, most of which will always be `NULL`. This requires some cleanup before returning the actual DB object, like dropping `NULL` columns and renaming columns (to the name from either their `Indexset` or their specific `column_name`), which is likely the messiest part of this PR. Please let me know how that could be improved (a sketch of the idea follows below).

Locally, the tests run just as long as they do on `main`, but this is not proper benchmarking, which would be the actual test for this PR: it should handle a million values faster than `main` (and hopefully old ixmp). I previously tested this for `Parameter`s (which utilize upserting rather than just inserting), though, so I might have to draft another PR very similar to this one for them and run the manual benchmark again. A proper benchmark setup for optimization items (to run automatically) would be great (depending on how long we need to reach ixmp's performance).
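For reference, a minimal sketch of the wide-table workaround plus the cleanup step described above (names like `value_0` and `clean_data` are hypothetical, and only 3 of the 15 columns are shown):

```python
# Sketch of the wide-row storage and the cleanup before returning data.
import pandas as pd
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class TableData(Base):
    __tablename__ = "table_data"

    id: Mapped[int] = mapped_column(primary_key=True)
    table__id: Mapped[int]
    value_0: Mapped[str]
    value_1: Mapped[str | None]
    value_2: Mapped[str | None]
    # ... continues up to value_14, all nullable and mostly NULL.


def clean_data(df: pd.DataFrame, column_names: list[str]) -> pd.DataFrame:
    """Drop unused value_* columns and restore the real dimension names.

    `column_names` holds the per-dimension names, taken from either the
    linked Indexsets or explicit column_name overrides, in order.
    """
    value_columns = [c for c in df.columns if c.startswith("value_")]
    used = value_columns[: len(column_names)]
    return df[used].rename(columns=dict(zip(used, column_names)))
```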