Hi,
As far as I can tell, your code only optimizes the intrinsic loss. Could you point me to the line where the intrinsic rewards are added to the actual environment (extrinsic) rewards?
In a few places in your code I can see comments like
# total reward = int reward
which, according to the original paper, would be wrong, wouldn't it?
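To make the question concrete, here is a minimal sketch of what I expected to find somewhere in the training loop. It assumes the total reward is the extrinsic reward plus a scaled intrinsic reward; the names `combine_rewards` and `int_coef` are my own placeholders and are not taken from your code or from the paper.

```python
def combine_rewards(ext_rewards, int_rewards, int_coef=1.0):
    """Hypothetical helper: total reward = extrinsic + int_coef * intrinsic.

    ext_rewards: per-step rewards returned by the environment.
    int_rewards: per-step intrinsic rewards (same length).
    int_coef:    assumed scaling factor for the intrinsic term.
    """
    if len(ext_rewards) != len(int_rewards):
        raise ValueError("reward sequences must have the same length")
    # Element-wise sum of the extrinsic and scaled intrinsic rewards.
    return [e + int_coef * i for e, i in zip(ext_rewards, int_rewards)]


# Example: two steps, intrinsic rewards weighted by 2.0.
total = combine_rewards([1.0, 0.0], [0.5, 0.25], int_coef=2.0)
```

If the code intentionally trains on the intrinsic reward alone, a pointer to where that choice is made would also answer my question.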
Thank you.