[feature] Non-zero gamma support in ChoppedTransferCompound #764
Zhaoxian-Wu wants to merge 5 commits into IBM:master
Conversation
Signed-off-by: Zhaoxian Wu <wuzhaoxian97@gmail.com>
Force-pushed from 4f9519d to 29c7e11
Hey @Zhaoxian-Wu, can you please address the lint errors and update the PR with a new commit, so we can check everything is right? Thanks! Check out the errors here: https://github.com/IBM/aihwkit/actions/runs/23523015289/job/69153947890?pr=764

Hello @maljoras @maljoras-sony, can you take a look and help us here?
Bug fixes:
- CPU TransferRPUDevice::getPulseCountLearningRate now honours
scale_fast_lr (was always returning raw fast_lr)
- CPU and CUDA ChoppedTransferRPUDevice::getPulseCountLearningRate
now applies scale_fast_lr in the auto_scale branch (was missing)
- Remove duplicate auto_momentum line in printToStream
Feature: reduceToWeights for non-zero gamma
- Add ChoppedTransferRPUDevice::reduceToWeights (CPU) and
ChoppedTransferRPUDeviceCuda::reduceToWeights (CUDA) that apply
per-element chopper correction when gamma != 0:
W[i,j] += gamma * (c_d[i]*c_x[j] - 1) * A_stored[i,j]
Enables residual-learning configurations with ChoppedTransfer.
- New CUDA kernel: kernelApplyChopperCorrectionToWeights
Cleanup:
- Remove partial buffer_as_momentum field and its CUDA kernel
- Expose scale_fast_lr to Python TransferCompound (default True;
ChoppedTransferCompound keeps its existing default False)
Docs:
- Rewrite ChoppedTransferCompound docstring: corrected
base_buffer_granularity / final_fast_lr / final_transfer_lr
formulas and full numbered recursion pseudocode
- analog_update.rst: add TTv2 / TTv3 / TTv4 / RL-v2 sections,
residual-learning and bit-slicing discussion
- using_simulator.rst: add per-algorithm subsections (TTv2, TTv3,
TTv4) and gamma residual-learning explanation with code examples
- paper_references.rst: add references [15]-[19]
Signed-off-by: Zhaoxian Wu <wuzhaoxian97@gmail.com>
Force-pushed from 29c7e11 to b64a075
Thanks for the reminder! I have fixed the lint and style errors in both PRs. Feel free to let me know of any other improvements.

@Zhaoxian-Wu looks like a great addition, many thanks. @PabloCarmona, let me find some time over the weekend to take a closer look.
Hi @PabloCarmona, I noticed that the CI lint check is failing with the following mypy errors in `src/aihwkit/simulator/tiles/periphery.py`:

```
src/aihwkit/simulator/tiles/periphery.py:983: error: Expected iterable as variadic argument [misc]
src/aihwkit/simulator/tiles/periphery.py:1009: error: Expected iterable as variadic argument [misc]
src/aihwkit/simulator/tiles/periphery.py:1011: error: Expected iterable as variadic argument [misc]
```

Oddly, these errors seem to exist on master as well: they are not introduced by this PR. Could you run the following on your end to confirm?

```
mypy --show-error-codes src/
```
Thanks @Zhaoxian-Wu, I will take a closer look and fix it in master. I'll let you know when I finish. In the meantime, let's also give @maljoras-sony time to look at and review the PR. Thanks again to both!
Non-zero `gamma` support, `scale_fast_lr`, and documentation

## Overview

This PR makes two improvements to `ChoppedTransferRPUDevice` / `ChoppedTransferCompound`:

- **`gamma` support**: `ChoppedTransferCompound` can now be used as a residual-learning device where the fast array A contributes directly to the effective weight, with correct chopper de-correlation applied during weight reduction.
- **`scale_fast_lr` parameter**: a new parameter, analogous to the existing `scale_transfer_lr`, that controls whether the fast-device LR tracks the current optimizer LR.
- **Documentation**: to provide sufficient context for `gamma`, this PR also expands the algorithm documentation to cover TTv1 through TTv4.

## 1. Non-zero `gamma` support in `ChoppedTransferRPUDevice`

Previously, `checkSupported()` enforced `fullyHidden()`, which hard-blocked any configuration where the fast array A contributes to the visible weight (`gamma != 0`). This restriction is lifted, and correct behaviour is implemented via a `reduceToWeights` override (CPU + CUDA).

Background: A is updated with per-element chopper sign flips and is therefore stored in "chopped" form: `A_stored[i,j] ≈ c_d[i]·c_x[j]·A_true[i,j]`. The base-class weight-reduction GEMV computes `W = gamma·A_stored + C`, which is incorrect when `gamma != 0` because the chopper factors are not cancelled. The new override applies a correction after the GEMV:

`W[i,j] += gamma * (c_d[i]*c_x[j] - 1) * A_stored[i,j]`

The `- 1` term accounts for the fact that the base GEMV already contributed `gamma * A_stored`; the correction adds only the remaining `gamma * (c_d[i]*c_x[j] - 1) * A_stored` to reach the correct `gamma * c_d[i]*c_x[j] * A_stored`. On CUDA this is implemented as the new `kernelApplyChopperCorrectionToWeights` kernel; on CPU it is a simple loop. Both paths are no-ops when `gamma == 0` (the default).
gammaimplements the residual learning mechanism described in Wu et al. (2025) [15] and Li et al. [19]: A acts as a residual correction on top of C, compensating for C's quantisation errors and device non-idealities cycle-by-cycle, while C accumulates the long-term gradient signal via discrete transfer pulses. The two-array decompositionW = gamma·A + Calso enables bit-slicing (precision enhancement): A can represent finer-grained updates than C's native conductance step, reducing the effective weight granularity without modifying the underlying analog device.Files:
src/rpucuda/rpu_chopped_transfer_device.{cpp,h},src/rpucuda/cuda/rpucuda_chopped_transfer_device.{cu,h}2.
scale_fast_lrparameterscale_fast_lris introduced as the analogue of the existingscale_transfer_lr: just asscale_transfer_lrcontrols whethertransfer_lris multiplied by the current optimizer LR,scale_fast_lrcontrols the same behaviour forfast_lr.The parameter is added to
TransferRPUDeviceMetaParameter(C++ base struct, defaultTrue) and exposed to the Python bindings and to theTransferCompounddataclass.ChoppedTransferCompoundoverrides the default toFalse, consistent with the existing convention for that device class.The corresponding logic is implemented in:
TransferRPUDevice<T>::getPulseCountLearningRate(CPU)TransferRPUDeviceCuda<T>::getPulseCountLearningRate(CUDA)ChoppedTransferRPUDevice[Cuda]<T>::getPulseCountLearningRate,auto_scalebranch (CPU + CUDA)Files:
src/rpucuda/rpu_transfer_device.{cpp,h},src/rpucuda/cuda/rpucuda_transfer_device.cu,src/rpucuda/rpu_chopped_transfer_device.cpp,src/rpucuda/cuda/rpucuda_chopped_transfer_device.cu,src/aihwkit/simulator/rpu_base_src/rpu_base_devices.cpp,src/aihwkit/simulator/configs/compounds.py3. Documentation
**`compounds.py`: `ChoppedTransferCompound` docstring**

A detailed pseudocode block is added to the `ChoppedTransferCompound` docstring to improve readability and serve as the authoritative reference for the internal LR-scaling logic. The block covers:

- `base_buffer_granularity`: threshold calculation from `buffer_granularity`, `dw_min_A`, and the optional `auto_granularity` period scaling
- `final_fast_lr`: derivation from `fast_lr`, `scale_fast_lr`, the `fast_lr=0` fallback, and the `auto_scale` formula (`base_fast_lr * desired_BL * dw_min_A / (x_max * d_max)`)
- `final_transfer_lr`: both the default and `correct_gradient_magnitudes` branches
- The full numbered transfer recursion: weight reduction (`W = gamma·A + C`), chopper application, H accumulation, threshold test, pulse dispatch (`C += n_steps·dw_min_C`), and the `forget_buffer`/`momentum` interaction

**`docs/source/analog_update.rst`**

The algorithm overview is extended from three methods (Plain SGD, Mixed Precision, Tiki-taka) to the full TTv1-TTv4 family. New sections cover TTv2, TTv3, TTv4, and residual learning (RL-v2, `gamma`, Wu et al. [15], Li et al. [18]).

**`docs/source/using_simulator.rst`**

Per-algorithm subsections (TTv2, TTv3, TTv4) are added for the `BufferedTransferCompound`, `ChoppedTransferCompound`, and `DynamicTransferCompound` entries, together with a residual-learning explanation and code examples for `gamma`.

**`docs/source/paper_references.rst`**

Five new references added: [15]-[19].
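To make the documented recursion concrete, here is a highly simplified pure-Python sketch of one buffered-transfer step. Choppers, `auto_granularity`, `momentum`, and `auto_scale` are omitted, and all names and the signature are illustrative rather than aihwkit's API:

```python
# Hypothetical model of one buffered-transfer step for a single read column;
# choppers, auto_granularity, momentum, etc. are deliberately left out.

def transfer_step(H, C, a_col, transfer_lr, buffer_granularity, dw_min_C,
                  forget_buffer=True):
    """Accumulate a fast-array readout a_col into buffer H, pulse into C."""
    for i, a in enumerate(a_col):
        H[i] += transfer_lr * a                   # H accumulation
        n_steps = int(H[i] / buffer_granularity)  # signed threshold test
        if n_steps != 0:
            if forget_buffer:
                H[i] = 0.0                        # drop the remainder
            else:
                H[i] -= n_steps * buffer_granularity  # keep the remainder
            C[i] += n_steps * dw_min_C            # pulse dispatch to C
    return H, C

# One step: only the first element crosses the granularity threshold.
H, C = transfer_step(H=[0.0, 0.0], C=[0.0, 0.0], a_col=[1.0, 0.3],
                     transfer_lr=0.5, buffer_granularity=0.4, dw_min_C=0.1)
```

The `forget_buffer` branch illustrates the trade-off the docstring describes: forgetting discards sub-threshold residue after each dispatch, while keeping the remainder carries it into the next accumulation.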
## Testing

- `gamma`: set `gamma=0.1` in a `ChoppedTransferCompound` config; verify that the tile's visible weight equals `gamma·chop_corrected_A + C` rather than `gamma·A_stored + C`.
- `scale_fast_lr`: train with `fast_lr > 0`, `scale_fast_lr=True`, and a LR scheduler; verify the effective pulse-count LR tracks the optimizer LR on both CPU and CUDA, including with `auto_scale=True`.