3 changes: 2 additions & 1 deletion business_continuity/backup_restore/backup_arch.adoc
@@ -36,6 +36,7 @@ View the following list of the cluster backup and restore process, and how they
** `clusterclaim.hive.openshift.io`
** `clusterimageset.hive.openshift.io`
** `clustersync.hiveinternal.openshift.io`
//do we want to add the api for observability here?
Contributor: API doc is deprecated.

Contributor Author: Not the doc but the observability API name, "observability.open-cluster-management.io".

Contributor Author: @subbarao-meduri what do you think? This part of the doc explains what sources are backed up.

Reviewer: The Obs resources should not be added to this section @dockerymick because they are Secrets and ConfigMaps with a backup label annotation, and some resources which, as I understand, should be manually backed up.

Contributor Author: @birsanv thanks Valentina for the clarity! I will remove my hidden comment.


- Exclude all resources from the following API groups:
** `internal.open-cluster-management.io`
@@ -153,4 +154,4 @@ You can back up third-party resources with cluster backup and restore by adding t
== Additional resources

Learn more about the policies and capabilities of the backup and restore component by going to
xref:../backup_restore/backup_validate.adoc#backup-validation-using-a-policy[Validating your backup or restore configurations].
2 changes: 2 additions & 0 deletions business_continuity/backup_restore/backup_intro.adoc
@@ -32,3 +32,5 @@ Complete the following topics to learn more about the backup and restore operato
* xref:../backup_restore/backup_return_hub.adoc#return-initial-hub[Returning to the initial hub cluster after a restore]

* xref:../backup_restore/backup_hcp.adoc#config-hcp-backup[Backup and restore for hosted control planes and hosted clusters]

* xref:../backup_restore/backup_restore_config_obs.adoc#backup-restore-obs-config[Backup and restore configuration for Observability]
26 changes: 26 additions & 0 deletions business_continuity/backup_restore/backup_restore_config_obs.adoc
@@ -0,0 +1,26 @@
[#backup-restore-obs-config]
= Backup and restore configuration for Observability

The Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active and passive backup patterns. You must configure Observability to ensure data continuity and integrity during hub cluster migration or backup.
Collaborator: Similar to my other comment. I'm not used to seeing phrasing like "data continuity and integrity". What do you think of this rewrite?

Suggested change:
Suggested change
The Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active and passive backup patterns. You must configure Observability to ensure data continuity and integrity during hub cluster migration or backup.
The Observability service uses an S3-compatible object store to persist all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active and passive backup patterns. You must configure Observability to ensure that your data stays safe and keeps its continuity during the hub cluster migration or backup.


*Notes:*

- When a managed cluster is detached from the primary hub cluster and reattached to the backup hub cluster, metrics are not collected. To minimize gaps, consider scripting the cluster migration for large fleets.
- For product backup and restore, the Observability service automatically labels its resources with the `cluster.open-cluster-management.io/backup` label.
.Resources that are automatically backed up and restored for Observability
|====
| Resource type | Resource name

| ConfigMaps
| `observability-metrics-custom-allowlist`, `thanos-ruler-custom-rules`, `alertmanager-config`, `policy-acs-central-status`, and any ConfigMap labeled with `grafana-custom-dashboard`

| Secrets
| `thanos-object-storage`, `observability-server-ca-certs`, `observability-client-ca-certs`, `observability-server-certs`, `observability-grafana-certs`, `alertmanager-byo-ca`, `alertmanager-byo-cert`, `proxy-byo-ca`, `proxy-byo-cert`
|====
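As a quick check, you can list the resources that carry the backup label described earlier. The namespace and the bare-label selector shown here are assumptions based on the defaults in this document; adjust them for your environment:

[source,bash]
----
# List the Observability ConfigMaps and Secrets that carry the backup label.
# The namespace and label key are assumptions based on this document's defaults.
oc get configmaps,secrets \
  -n open-cluster-management-observability \
  -l cluster.open-cluster-management.io/backup
----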

== Additional resources

- For the steps to complete the backup and restore for Observability, see xref:../backup_restore/backup_restore_obs.adoc#backup-restore-obs[Backing up and restoring Observability service].

77 changes: 77 additions & 0 deletions business_continuity/backup_restore/backup_restore_obs.adoc
@@ -0,0 +1,77 @@
[#backup-restore-obs]
= Backing up and restoring Observability service

Back up and restore the Observability service to maintain data continuity and integrity during hub cluster migration or backup. This procedure assumes that the same S3-compatible object store is used across both the primary and backup hub clusters to minimize disruption in metric data collection.

.Prerequisites

* Ensure that you can run a restore operation for backup types. See xref:../backup_restore/backup_restore.adoc#restoring-backup-restore-operation[Using the restore operation for backup types].
[#backup-restore-obs-procedure]
== Backing up and restoring Observability procedure
Contributor: We don't need this header. You can just use the procedure tag here because it is the same thing as the title.

Contributor Author: Ok, thanks. I thought we were holding off on using tags, as in we would create a separate issue to handle the changes for the tool conversion. I am fine with using it now.


. To ensure the Observability service recognizes the hub cluster as the `local-cluster`, change the `spec.disableHubSelfManagement` parameter in the `MultiClusterHub` custom resource to `false`.
Collaborator: I know Brandi wants us to use "managed hub cluster" instead of "local hub cluster". However, "local-cluster" might be different, and there are times when I've had "local-cluster" be a value in a YAML file, so I couldn't change the name. Nonetheless, it might be worthwhile to confirm that it is best to use "local-cluster" here instead of "a managed cluster".

Contributor Author: @subbarao-meduri can you help confirm that my rewrite here makes sense and is accurate?

Reviewer: Hub cluster metrics are always collected and pushed into Thanos regardless of whether hub self-management is enabled. See: https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.14/html-single/observability/index#obs-config

These metrics show up under the name local-cluster in Grafana by default. However, if the local-cluster is renamed to something else, metrics show up under that name.

@dockerymick I don't believe there is an explicit requirement that hub self-management must be enabled for backup/restore to work. @birsanv can confirm this. My view is that line 13 should be removed.

cc: @coleenquadros to recheck my assertion here.

Contributor Author (@dockerymick, Oct 9, 2025): @subbarao-meduri Thanks Subbarao for your feedback. This line was an attempt to rewrite what you mentioned in the issue description:

"For ACM 2.14 and later: Use the local-cluster renaming feature to assign unique names to each hub. This helps disambiguate metrics for each hub cluster in Grafana."

This info is not listed in the Prerequisites portion. It is part of the "Backing up and restoring Observability procedure" section.

Contributor (@swopebe, Oct 16, 2025): I think there is some confusion (maybe). I never said not to use local cluster; it is better to use that if you are referring to a managed hub. I don't know what we mean by "local hub cluster". I may be missing something; I am just not aware of that term being part of the product lexicon, such as hub cluster, managed cluster, local cluster.

This discussion originally stemmed from topics incorrectly or vaguely referring to the local cluster, because I think the team didn't have the same understanding of what local cluster means or how to disable and enable it.

When we speak of local cluster, or enabling and disabling it, we just need to make sure we understand ourselves how to do it, then present it the same as in the advanced configuration, which is what this line is doing from what I can tell.

If you talk about a managed hub cluster, you should ask if they mean local-cluster and present it as such. That was my request for the team. I think this is fine, but I would have said:

To ensure the Observability service recognizes the hub cluster as a local-cluster, a managed hub cluster, set the spec.disableHubSelfManagement parameter in the MultiClusterHub custom resource to false.

@jc-berger @dockerymick @subbarao-meduri FYI. That is if we don't remove this line.

Contributor: We do also now talk about the ability to rename the local cluster, so if you keep that in there, you may want to say:

If you change the default name of local-cluster to another value, the results appear under the changed local cluster name.

Contributor (@coleenquadros, Oct 28, 2025): @dockerymick Subbarao's assertion that the metrics are collected under the new name for the managed hub or local cluster is correct. Also, yes, the hub cluster continues to send metrics regardless of whether HubSelfManagement is enabled.

. Manually back up and restore the `observatorium` resource and the `observability` deployment to ensure continuity across hub clusters. Complete the following steps:
Contributor: Is this a step with substeps under it, or can we merge this? Technically the first step here doesn't have an action. Can the step under this stand alone and this just be a paragraph?

Contributor Author (@dockerymick, Oct 27, 2025): Initially I went with the substep because the step was becoming lengthy. However, I did make an update to remove the substep (I am just going through the comments/suggestions and updating the branch locally). I changed it to read as the following statement:

"To preserve the tenant ID of the observatorium resource as you manually back up and restore the observatorium resource, run the following command:"

.. To preserve the tenant ID of the `observatorium` resource during the restore, run the following command:

+
[source,bash]
----
oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
----
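+
Optionally, confirm that the saved file captured the tenant definition before you continue. The `tenants` field name is an assumption about the `observatorium` resource layout, so verify it against your own backup file:
+
[source,bash]
----
# Sketch: inspect the saved backup for the tenant ID.
# The "tenants" field name is an assumption; adjust it to match your file.
grep -A 2 "tenants:" observatorium-backup.yaml
----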

. To back up the `observability` deployment, run the following command:

+
[source,bash]
----
oc get mco observability -o yaml > mco-cr-backup.yaml
----

. Shut down the Thanos compactor on your primary hub cluster. Complete the following steps:
Contributor: Also not a step because there is no action. Can this be just a line to introduce the process?

Contributor: Or add "...complete the following steps" and you can keep it as a step, because then there is an action.

.. To prevent write conflicts and deduplication issues while working on the same object storage, stop the Thanos compactor before starting the restore operation on the backup hub cluster. Run the following command:

+
[source,bash]
----
oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
----

.. Verify that the compactor is stopped by running the following command:

+
[source,bash]
----
oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
----

. To restore the backup resources, see xref:../backup_restore/backup_restore.adoc#restoring-backup-restore-operation[Using the restore operation for backup types]. You can restore the automatically backed-up ConfigMaps and Secrets listed in the backup and restore configuration for Observability.
Collaborator: I'm not used to seeing links to other procedures/docs in the middle of a procedure/task. Could this link be a prerequisite? I know we might only need to restore the backup resources here in step 5, but maybe we can have a prerequisite like this:

Prerequisites

* Ensure that you can run a restore operation for backup types by completing xref:../business_continuity/backup_restore/backup_restore.adoc#restoring-backup-restore-operation[Using the restore operation for backup types].

Then the step here can be simple, like:

. Restore the backup resources, including the automatically backed-up ConfigMaps and Secrets listed in the backup and restore configuration for Observability.

Contributor Author: Great point and callout for sure.


. To preserve the tenant ID and maintain continuity in metrics ingestion and querying, restore the `observatorium` resource to the backup hub cluster. Run the following command:

+
[source,bash]
----
oc apply -f observatorium-backup.yaml
----

. Apply the backed-up `MultiClusterObservability` custom resource to start the Observability service on the restored hub cluster. Run the following command:

+
[source,bash]
----
oc apply -f mco-cr-backup.yaml
----
+
The operator starts the Observability service and detects the existing `observatorium` resource, reusing the preserved tenant ID instead of creating a new one.

. Migrate managed clusters to the new hub cluster. Complete the following steps:
Contributor: Suggested change:

. Migrate managed clusters to the new hub cluster.
. Migrate managed clusters to the new hub cluster. Complete the following steps:

Contributor: Or let it stand alone without a step:

Migrate managed clusters to the new hub cluster. Complete the following steps:

.. Detach managed clusters from the primary hub cluster and reattach them to the new restored hub cluster. After the managed clusters are attached, they resume sending metrics to the Observability service.
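+
For large fleets, you can script this detach and reattach flow, as suggested in the configuration topic. The following is a minimal sketch only; the cluster names, kubeconfig contexts, and the detach/import mechanism are assumptions that you must adapt to your own migration tooling:
+
[source,bash]
----
# Minimal sketch: move a list of managed clusters between hub clusters.
# CLUSTERS, the kubeconfig contexts, and the saved ManagedCluster manifests
# are hypothetical; adapt them to your environment.
CLUSTERS="cluster-a cluster-b"
for cluster in ${CLUSTERS}; do
  # Detach the cluster from the primary hub.
  oc --context primary-hub delete managedcluster "${cluster}"
  # Reattach it to the restored hub, assuming a saved ManagedCluster manifest.
  oc --context restored-hub apply -f "manifests/${cluster}-managedcluster.yaml"
done
----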

. Shut down the Observability service on your primary hub cluster and flush in-memory metrics to the S3 object store after migrating all managed clusters. Run the following command:
Contributor: Suggested change:

. Shut down Observability on your primary hub cluster and clear in-memory metrics to S3 object after migrating all managed clusters. Run the following command:
. Shut down the Observability service on your primary hub cluster and clear in-memory metrics to S3 object after migrating all managed clusters. Run the following command:

+
[source,bash]
----
oc delete mco observability
----
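+
To confirm that the shutdown completed, you can check that the Observability pods terminate; the namespace name is taken from the earlier commands in this procedure:
+
[source,bash]
----
# Confirm shutdown: the Observability pods terminate after the
# MultiClusterObservability resource is deleted.
oc get pods -n open-cluster-management-observability
----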
2 changes: 2 additions & 0 deletions business_continuity/main.adoc
@@ -17,6 +17,8 @@ include::backup_restore/use_existing_hub_cluster.adoc[leveloffset=+3]
include::backup_restore/tag_resources.adoc[leveloffset=+3]
include::backup_restore/backup_return_hub.adoc[leveloffset=+3]
include::backup_restore/backup_hcp.adoc[leveloffset=+3]
include::backup_restore/backup_restore_obs.adoc[leveloffset=+3]
include::backup_restore/backup_restore_config_obs.adoc[leveloffset=+3]
include::volsync/volsync.adoc[leveloffset=+2]
include::volsync/volsync_replicate.adoc[leveloffset=+3]
include::volsync/volsync_convert_backup.adoc[leveloffset=+3]