fix: Fix infinispan configuration, exception handling, and the whole cache operation #4408
base: v2.x.x
Conversation
…ection issue Signed-off-by: Pavel Jareš <[email protected]>
    return clm.get("zoweInvalidatedTokenLock");
} catch (AvailabilityException ae) {
    log.debug("Cannot obtain lock", ae);
    throw new StorageException(Messages.CACHE_NOT_AVAILABLE.getKey(), Messages.CACHE_NOT_AVAILABLE.getStatus());
Maybe it should include at least the exception message
The message is there. I didn't want to improve exception handling here, but I will create an issue about it.
…to reboot/caching-error-on-lock
...service/src/main/java/org/zowe/apiml/caching/service/infinispan/config/InfinispanConfig.java
type: WARNING
text: "Cache is not available: %s"
reason: "Cache is not ready to write at the moment."
action: "Verify the instance configuration or connectivity between multiple instances."
if this is expected to be resolved during runtime, should the message rather contain: "wait for initialization"?
The message was updated
…location (v3)" #3960 Signed-off-by: Pavel Jareš <[email protected]>
…oot/caching-error-on-lock
# Conflicts:
#	apiml-common/src/main/java/org/zowe/apiml/product/eureka/client/ApimlPeerEurekaNode.java
#	caching-service/src/main/java/org/zowe/apiml/caching/service/infinispan/config/InfinispanConfig.java
#	caching-service/src/main/resources/application.yml
…ervice: Temporary failure in name resolution) Signed-off-by: Pavel Jareš <[email protected]>
This reverts commit c746652.
Description
This PR fixes several issues:
The original configuration was for a replicated cache, even though the aim is to have a distributed one.
The caches could be split at the level of data and index configuration. This fix changes it to work the same way as v3, splitting by HA instance name even when there is no explicit configuration. It requires less configuration and work on each platform.
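The replicated-to-distributed change can be sketched with Infinispan's programmatic configuration API. This is a minimal illustration, not the actual builder settings used by the Caching Service, and the exact API may differ between the v2 and v3 Infinispan versions:

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

// Hypothetical sketch: a distributed cache (DIST_SYNC) instead of a
// replicated one (REPL_SYNC), so each entry lives on a subset of nodes
// rather than on every node.
ConfigurationBuilder builder = new ConfigurationBuilder();
builder.clustering()
       .cacheMode(CacheMode.DIST_SYNC)  // was effectively REPL_SYNC before the fix
       .hash().numOwners(2);            // illustrative: keep 2 copies of each entry
Configuration config = builder.build();
```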
In the original implementation, all exception handling works in only one way; there is no path for an unexpected exception to be thrown. As a result the responses are incorrect and misleading, for example:
If the service tried to create any record during startup (before the cluster was established), the startup failed. The Zowe Launcher then restarted the whole service and the failure happened again. The solution is to not create the lock bean immediately but on demand.
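The on-demand idea can be sketched library-free. The class and lock names below are illustrative stand-ins, not the actual Caching Service classes; the point is that a failing factory call surfaces as a runtime error the caller can map to a 503, instead of failing bean creation at startup:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch: the cluster lock is obtained lazily on first use
// instead of eagerly when the Spring context starts.
public class LazyLockHolder {
    private final Supplier<String> lockFactory;            // stands in for the cluster call
    private final AtomicReference<String> lock = new AtomicReference<>();

    public LazyLockHolder(Supplier<String> lockFactory) {
        this.lockFactory = lockFactory;
    }

    public String getLock() {
        String current = lock.get();
        if (current == null) {
            // If the cluster is not ready yet, the factory throws here and the
            // caller can respond 503 rather than crashing the whole service.
            lock.compareAndSet(null, lockFactory.get());
            current = lock.get();
        }
        return current;
    }
}
```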
Because there were many issues, we decided not to cache the salt and to always read it from Infinispan. This also required a small refactoring to use the same cache across multiple operations (see validation of PAT).
There are new log messages to help detect the reason for a failing service, and the debug profile was extended for Infinispan (until now it focused only on JGroups).
In the case of a connected cluster, the function for obtaining data fails because it requires serializing a lambda (see the lambda used while collecting keys and values from the cache).
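The lambda-serialization pitfall can be reproduced with plain Java: a lambda is only serializable if its target type is serializable, which is why distributed operations that ship code to other nodes need an intersection cast like the one below (or a dedicated serializable functional interface). This is a generic illustration, not the exact fix applied in the PR:

```java
import java.io.Serializable;
import java.util.function.Function;

public class SerializableLambdaDemo {
    // Plain target type: the generated lambda object is NOT serializable,
    // so sending it across the cluster fails.
    public static Function<String, Integer> plain() {
        return String::length;
    }

    // Intersection cast: the compiler emits a serializable lambda, which a
    // distributed collect/stream can safely send to other nodes.
    public static Function<String, Integer> serializable() {
        return (Serializable & Function<String, Integer>) String::length;
    }
}
```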
Other known errors (separate issues will be created):
It is solvable for v3, but not for v2 (a different version of Infinispan). The key-exchange implementation uses a private key from the keystore and needs to trust the opposite side. Therefore the public certificate must be in the keyring; otherwise the cluster cannot be established.
When the cache state is not distributed, obtaining the cache does not work. It tries to read from instance A and create in instance B. Instance B could already contain a salt, which would lead to a 409 key conflict.
If multiple instances try to read the salt at the same time, one should fail. It could be solved by a lock, callable from the Gateway.
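The salt race is the classic read-or-create problem. Against a single shared store, an atomic compute-if-absent (or an explicit lock) guarantees that only one creation wins and all callers observe the same value. A stdlib-only sketch with illustrative names, not the actual Caching Service code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: atomically read-or-create the salt so concurrent
// callers never race two different creations (and never see a 409).
public class SaltStore {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();

    public String getOrCreateSalt(String key) {
        // computeIfAbsent runs the factory at most once per key, atomically.
        return store.computeIfAbsent(key, k -> {
            creations.incrementAndGet();
            return "salt-" + k;          // stands in for random salt generation
        });
    }

    public int creationCount() {
        return creations.get();
    }
}
```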
Each call from the Gateway is routed, which means each request targets a different instance of the Caching Service (see the issue with obtaining the salt). The optimal solution would be to use the local instance of the Caching Service (deterministic routing cannot be used, because when the local instance is down no other instance is tried). Because the calls use methods like POST and DELETE, they are not retried. The best solution would be to use a load-balanced client and not call the Gateway itself.
When multiple services use the same client certificate, the caches are not split (not exclusive to a specific service).
When the certificate changes (in case of a modified DN), the Gateway (ZAAS) forgets all revoked tokens and they could be used again.
For each validation of any PAT, the Gateway asks for the whole cache. This means all records, including irrelevant ones, are transferred to the Gateway, so the original aim of saving calls works only with an empty (or almost empty) cache. As the data grow, the payload grows as well, and it could lead to service failure.
The optimal solution is to ask for concrete records (a batch call) or to include the cache inside the Gateway.
Because only the hash of the PAT is stored, with no other information, it is not possible to determine whether storing a token is still required (it is basically required until the token expires). This means the database grows without any limit, which is especially problematic combined with downloading the whole database.
The data should be marked with a flag that then allows old records to be removed, ideally by setting an expiration on each record.
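The expiration idea can be sketched without any library: each record carries an explicit deadline and is evicted lazily on read (a background sweep could do the same in bulk). The names and the injectable clock are illustrative, not the actual design; Infinispan itself also offers per-entry lifespans, which would be worth evaluating for this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch: store each token hash with an expiration so the
// database stops growing without bound.
public class ExpiringTokenStore {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;
        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final LongSupplier clock;   // injectable clock keeps the sketch testable

    public ExpiringTokenStore(LongSupplier clock) {
        this.clock = clock;
    }

    public void put(String key, String value, long ttlMillis) {
        entries.put(key, new Entry(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null and evicts the record once its deadline has passed.
    public String get(String key) {
        Entry e = entries.get(key);
        if (e == null) return null;
        if (clock.getAsLong() >= e.expiresAtMillis) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }
}
```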
It is not possible to define an IP address to bind the main listening port to.
The condition to detect the original cause is very fragile. Everything uses the same exception type; only its properties differ.
It would also be beneficial to start using an exception handler instead of handling exceptions in the controller.
The X-Certificate-DistinguishedName values are not checked for a match. A client could call the Caching Service with any request and provide any value. Their usage is also questionable.
Even when the service claims it is up, we have no information about the cluster. The state of JGroups communication does not affect the service state and we cannot determine it. The diag endpoint was disabled to mitigate a potential security issue.
If there is a lack of communication, the Caching Service could fail during startup. This was happening with a 3-instance setup.
This implementation creates the lock at the moment it is needed and simply responds with 503 when that fails, because the lock is no longer a Spring bean that must be created at startup.
Exception during getAll operation:
Linked to # (issue)
Part of the # (epic)
Type of change
Please delete options that are not relevant.
Checklist:
For more details about how the code should look, read the Contributing guideline.