Oasislmf ci testing #28

Draft
sambles wants to merge 32 commits into develop from oasislmf-ci-testing
sambles commented May 19, 2026

No description provided.

sambles marked this pull request as draft May 19, 2026 09:15
sambles commented May 21, 2026

PiWindComplex model ~ Claude

Changes and rationale:

1. get_event_ids: events_pd.size → len(events_pd)
Bug fix. DataFrame.size returns rows × cols (the total element count), not the row count, so batching was silently wrong.
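
A minimal sketch of the difference, for illustration only. The column names and the `batch_bounds` helper are hypothetical and are not taken from the PiWindComplex code; they only show how an inflated count can push a batch range past the end of the table.

```python
import pandas as pd

# Hypothetical events table; the real columns in the model may differ.
events_pd = pd.DataFrame(
    {"event_id": [1, 2, 3, 4], "occurrence": [10, 20, 30, 40]}
)

element_count = events_pd.size  # rows * columns
row_count = len(events_pd)      # rows only


def batch_bounds(total, n_batches, batch_index):
    """Return the [start, stop) range for a 0-based batch (illustrative helper)."""
    per_batch = -(-total // n_batches)  # ceiling division
    start = batch_index * per_batch
    return start, min(start + per_batch, total)


# With the row count, the second of two batches covers rows 2..3.
good_range = batch_bounds(row_count, 2, 1)
# With the element count, the same batch starts at row 4, past the last row.
bad_range = batch_bounds(element_count, 2, 1)
```

With a two-column table the miscount is exactly a factor of two; the error grows with the number of columns.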

2. get_model — parse model_data once with ast.literal_eval
Previously eval was called once per field (twice total per row). Now model_data is parsed once and both fields are extracted together. Also safer: ast.literal_eval accepts only Python literals, so it cannot execute arbitrary code.
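
The pattern could look roughly like the sketch below. The field names `area_peril_id` and `vulnerability_id` and the string layout of `model_data` are assumptions made for the example; the PR text does not say which two fields are extracted.

```python
import ast

# Hypothetical payload: model_data stored as the repr of a dict.
raw_model_data = "{'area_peril_id': 12, 'vulnerability_id': 3}"


def parse_model_data(text):
    """Parse the string once and return both fields together."""
    parsed = ast.literal_eval(text)
    return parsed["area_peril_id"], parsed["vulnerability_id"]


area_peril_id, vulnerability_id = parse_model_data(raw_model_data)

# literal_eval refuses anything that is not a plain literal,
# such as a function call, where eval would have run it.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```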

3. gul_calc — vectorize bin_height
apply(lambda x: x.prob_to - x.prob_from, axis=1) replaced with df['prob_to'] - df['prob_from']. Eliminates Python row iteration; 10–100× faster on large DataFrames.
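
For reference, both forms side by side on a small made-up frame. Only the `prob_from`, `prob_to` and `bin_height` names come from the description above; the data is invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"prob_from": [0.0, 0.25, 0.75], "prob_to": [0.25, 0.75, 1.0]}
)

# Old form: one Python-level lambda call per row.
row_wise = df.apply(lambda x: x.prob_to - x.prob_from, axis=1)

# New form: a single column subtraction executed inside pandas/NumPy.
df["bin_height"] = df["prob_to"] - df["prob_from"]

# Both forms give the same values.
same_result = bool((row_wise == df["bin_height"]).all())
```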

4. gul_calc — replace iterrows() random number loop
Was doing an O(n) full-DataFrame boolean scan per item. Now generates randoms per unique (event_id, group_id) pair, builds a lookup table, and uses a single merge.
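
One way the lookup-and-merge approach might be written is sketched here. The frame contents, `n_samples`, the seed and the `rand_*` column names are assumptions for illustration; only the (event_id, group_id) keying comes from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical items table: items 10 and 11 share group 100 within event 1.
items = pd.DataFrame(
    {
        "event_id": [1, 1, 1, 2, 2],
        "item_id": [10, 11, 12, 10, 11],
        "group_id": [100, 100, 101, 100, 101],
    }
)
n_samples = 3  # assumed sample count
rng = np.random.default_rng(seed=42)

# One row per unique (event_id, group_id) pair.
pairs = items[["event_id", "group_id"]].drop_duplicates().reset_index(drop=True)

# Draw all random numbers in one call and attach them to the pairs.
draws = pd.DataFrame(
    rng.random((len(pairs), n_samples)),
    columns=[f"rand_{i}" for i in range(n_samples)],
)
lookup = pd.concat([pairs, draws], axis=1)

# A single merge replaces the per-item boolean scan.
with_randoms = items.merge(lookup, on=["event_id", "group_id"], how="left")
```

Items in the same group for the same event end up with identical random numbers, which is the point of keying the draws on the pair rather than on the item.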

5. calculate_guls — vectorize with np.where
Replaced row-by-row apply(calculate_guls, axis=1) with a vectorized np.where. Avoids per-row Python function call overhead; 50–200× faster at large sample counts.
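
The PR text does not show what calculate_guls computes, so the sketch below uses an assumed rule (linear interpolation inside a damage bin, with zero-width bins falling back to the bin's lower bound) purely to illustrate the apply-to-np.where conversion. All column names other than `bin_height` and `prob_from` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "tiv": [1000.0, 1000.0, 500.0],
        "rand": [0.5, 0.25, 0.75],
        "prob_from": [0.0, 0.25, 0.5],
        "bin_height": [1.0, 0.0, 0.5],
        "bin_from": [0.0, 0.5, 0.0],
        "bin_to": [1.0, 0.5, 0.5],
    }
)


def calculate_guls_rowwise(row):
    """Assumed per-row rule, standing in for the original function."""
    if row.bin_height > 0:
        frac = (row.rand - row.prob_from) / row.bin_height
        return row.tiv * (row.bin_from + frac * (row.bin_to - row.bin_from))
    return row.tiv * row.bin_from


row_wise = df.apply(calculate_guls_rowwise, axis=1)

# Vectorized form. np.where evaluates both branches for every row,
# so the denominator is made safe before dividing.
has_width = df["bin_height"] > 0
safe_height = df["bin_height"].where(has_width, 1.0)
frac = (df["rand"] - df["prob_from"]) / safe_height
interpolated = df["tiv"] * (df["bin_from"] + frac * (df["bin_to"] - df["bin_from"]))
df["loss"] = np.where(has_width, interpolated, df["tiv"] * df["bin_from"])
```

Because np.where computes both branches before selecting, any branch that can divide by zero or index out of range needs guarding, as done with `safe_height` here.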

6. write_loss_stream — replace inner boolean filter with groupby
Was re-scanning the entire DataFrame per (event_id, item_id) pair — O(n²). Now O(n log n) sort + O(n) grouped iteration.
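
The two access patterns can be compared on a toy frame. The `sidx` and `loss` columns and both helper functions are hypothetical; they only contrast a per-pair boolean filter with a single sorted groupby pass.

```python
import pandas as pd

losses = pd.DataFrame(
    {
        "event_id": [2, 1, 1, 2, 1],
        "item_id": [10, 11, 10, 10, 10],
        "sidx": [1, 1, 1, 2, 2],
        "loss": [5.0, 7.0, 3.0, 6.0, 4.0],
    }
)


def collect_by_filter(df):
    """Old pattern: one full-frame boolean scan per (event_id, item_id) pair."""
    out = {}
    pairs = df[["event_id", "item_id"]].drop_duplicates()
    for event_id, item_id in pairs.itertuples(index=False):
        mask = (df["event_id"] == event_id) & (df["item_id"] == item_id)
        out[(event_id, item_id)] = df.loc[mask, "loss"].tolist()
    return out


def collect_by_groupby(df):
    """New pattern: sort once, then walk the groups in a single pass."""
    ordered = df.sort_values(["event_id", "item_id", "sidx"])
    grouped = ordered.groupby(["event_id", "item_id"], sort=False)
    return {key: group["loss"].tolist() for key, group in grouped}


by_filter = collect_by_filter(losses)
by_groupby = collect_by_groupby(losses)
```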

7. write_loss_stream — batch binary writes with numpy structured arrays
Was calling struct.pack once per field per row. Now packs each item's rows into a numpy structured array and writes in one .tobytes() call, reducing syscall and Python overhead.
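
A small sketch of the batching idea follows. The record layout (a little-endian int32 sample index followed by a float32 loss) is an assumption for the example and is not confirmed by the PR text; the check at the end only shows that the batched bytes match the per-field struct.pack output for that layout.

```python
import io
import struct

import numpy as np

# Assumed record layout: int32 sample index + float32 loss, little-endian.
record_dtype = np.dtype([("sidx", "<i4"), ("loss", "<f4")])

sidx_values = [1, 2, 3]
loss_values = [10.5, 0.0, 2.25]

# Old pattern: one struct.pack call and one write per field per row.
per_field = io.BytesIO()
for sidx, loss in zip(sidx_values, loss_values):
    per_field.write(struct.pack("<i", sidx))
    per_field.write(struct.pack("<f", loss))

# New pattern: fill a structured array, then write all rows at once.
records = np.empty(len(sidx_values), dtype=record_dtype)
records["sidx"] = sidx_values
records["loss"] = loss_values
batched = io.BytesIO()
batched.write(records.tobytes())

identical = per_field.getvalue() == batched.getvalue()
```

A dtype built from a list of (name, format) tuples is packed without padding by default, which is why the bytes line up with the struct.pack output here.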

sambles added 22 commits May 21, 2026 09:27
  1. Bug: events_pd.size returns rows × cols, not rows — events are silently miscounted
  2. eval called twice per row in model_data parsing (once per field)
  3. Row-by-row apply(lambda) for bin_height — vectorizable in one line
  4. iterrows() loop with boolean masking for random numbers — O(n) full-DataFrame scan per item
  5. apply(calculate_guls, axis=1) — row-by-row Python apply, should be np.where
  6. O(n²) write_loss_stream — inner boolean filter re-scans the entire DataFrame per (event_id, item_id) pair
  7. struct.pack one field at a time — many small writes; batch them with numpy structured arrays