https://issues.redhat.com/browse/ACM-22475 Backup and restore Observability (#8222)

dockerymick · jc-berger · web-flow · commit ee8130813911 · 2025-11-10T12:52:02.000-05:00
* https://issues.redhat.com/browse/ACM-22475
* More changes
* more updates, modular writing
* Removed hidden comments
* Apply suggestions from code review Moving a prereq to steps
* Reduced table syntax
* Update backup_restore_obs.adoc
* Updates
* Update backup_restore_obs.adoc
* More updates after reviewing local-cluster details
* Removing hidden comment Addressed by development
* Update business_continuity/backup_restore/backup_restore_config_obs.adoc Co-authored-by: jc-berger <70717303+jc-berger@users.noreply.github.com>
* Update business_continuity/backup_restore/backup_restore_config_obs.adoc Co-authored-by: jc-berger <70717303+jc-berger@users.noreply.github.com>
* Update business_continuity/backup_restore/backup_restore_obs.adoc Co-authored-by: jc-berger <70717303+jc-berger@users.noreply.github.com>
* Few more updates after initial peer review
* Update backup_restore_config_obs.adoc
* Updates after review from peer
* Update business_continuity/backup_restore/backup_restore_obs.adoc
* Updates after dev lead review
* Update backup_restore_obs.adoc
* Updates after second review from peer
---------
Co-authored-by: jc-berger <70717303+jc-berger@users.noreply.github.com>
diff --git a/business_continuity/backup_restore/backup_arch.adoc b/business_continuity/backup_restore/backup_arch.adoc
@@ -153,4 +153,4 @@ You can backup third-party resources with cluster backup and restore by adding t
 == Additional resources 
 
 Learn more about the policies and capabilities of the backup and restore component by going to 
-xref:../backup_restore/backup_validate.adoc#backup-validation-using-a-policy[Validating your backup or restore configurations]. 
+xref:../backup_restore/backup_validate.adoc#backup-validation-using-a-policy[Validating your backup or restore configurations]. 
diff --git a/business_continuity/backup_restore/backup_intro.adoc b/business_continuity/backup_restore/backup_intro.adoc
@@ -32,3 +32,5 @@ Complete the following topics to learn more about the backup and restore operato
 * xref:../backup_restore/backup_restore_hub.adoc#restore-data-initial-hub[Restoring data to the initial hub cluster]
 
 * xref:../backup_restore/backup_hcp.adoc#config-hcp-backup[Backup and restore for hosted control planes and hosted clusters]
+
+* xref:../backup_restore/backup_restore_config_obs.adoc#backup-restore-obs-config[Backup and restore configuration for Observability]
diff --git a/business_continuity/backup_restore/backup_restore_config_obs.adoc b/business_continuity/backup_restore/backup_restore_config_obs.adoc
@@ -0,0 +1,26 @@
+[#backup-restore-obs-config]
+= Backup and restore configuration for Observability
+
+The Observability service uses an S3-compatible object store to keep all time-series data collected from managed clusters. Because Observability is a stateful service, it is sensitive to active and passive backup patterns. You must configure Observability to ensure that your data stays safe and keeps its continuity during a hub cluster migration or backup.
+
+*Notes:* 
+
+- When a managed cluster is detached from the primary hub cluster and reattached to the backup hub cluster, metrics are not collected. To minimize the resulting gap in metrics, you can script the cluster migration for large fleets.
+
+- For product backup and restore, the Observability service automatically labels its resources with the `cluster.open-cluster-management.io/backup` label.
+
+.Resources that are automatically backed up and restored for Observability
+|====
+| Resource type | Resource name  
+
+| ConfigMaps 
+| `observability-metrics-custom-allowlist`, `thanos-ruler-custom-rules`, `alertmanager-config`, `policy-acs-central-status`, and any ConfigMap labeled with `grafana-custom-dashboard`
+
+| Secrets
+| `thanos-object-storage`, `observability-server-ca-certs`, `observability-client-ca-certs`, `observability-server-certs`, `observability-grafana-certs`, `alertmanager-byo-ca`, `alertmanager-byo-cert`, `proxy-byo-ca`, `proxy-byo-cert`
+|====
+
+== Additional resources
+
+- For the steps to complete the backup and restore for Observability, see xref:../backup_restore/backup_restore_obs.adoc#backup-restore-obs[Backing up and restoring Observability service].
+
diff --git a/business_continuity/backup_restore/backup_restore_obs.adoc b/business_continuity/backup_restore/backup_restore_obs.adoc
@@ -0,0 +1,98 @@
+[#backup-restore-obs]
+= Backing up and restoring Observability service
+
+Back up and restore the Observability service to keep data safe and to support continuity during a hub cluster migration or backup. To reduce disruption in metric data collection, use the same S3-compatible object store for both the primary and backup hub clusters.
+
+.Prerequisites
+
+- Ensure that you can run a restore operation for backup types by completing the xref:../business_continuity/backup_restore/backup_restore.adoc#restoring-backup-restore-operation[Using the restore operation for backup types] process.
+
+.Procedure
+
+Complete the following steps to back up and restore the Observability service:
+
+. To ensure that the Observability service recognizes the hub cluster as the `local-cluster`, which is a managed hub cluster, set the `spec.disableHubSelfManagement` parameter in the `MultiClusterHub` custom resource to `false`.
+
++
+*Note:* If you change the default name of your `local-cluster` to another value, the results appear under the changed local cluster name.
+
+. To preserve the tenant ID of the `observatorium` resource, manually back up the resource by running the following command:
+
++
+[source,bash]
+----
+oc get observatorium -n open-cluster-management-observability -o yaml > observatorium-backup.yaml
+----
+
+. To back up the `observability` deployment, run the following command:
+
++
+[source,bash]
+----
+oc get mco observability -o yaml > mco-cr-backup.yaml
+----
+
+. Shut down the Thanos compactor on your primary hub cluster by running the following command:
+
++
+[source,bash]
+----
+oc scale statefulset observability-thanos-compact -n open-cluster-management-observability --replicas=0
+----
+
+.. Verify the compactor is not active by running the following command:
+
++
+[source,bash]
+----
+oc get pods observability-thanos-compact-0 -n open-cluster-management-observability
+----
+
+. Restore the `backup` resources such as the automatically backed-up ConfigMaps and Secrets listed in the backup and restore configuration for Observability.
+
+. To preserve the tenant ID and maintain continuity in metrics ingestion and querying, restore the `observatorium` resource to the backup hub cluster. Run the following command:
+
++
+[source,bash]
+----
+oc apply -f observatorium-backup.yaml
+----
+
+. Apply the backed-up `MultiClusterObservability` custom resource to start the Observability service on the newly restored hub cluster. Run the following command:
+
++
+[source,bash]
+----
+oc apply -f mco-cr-backup.yaml
+----
++
+The operator starts the Observability service and detects the existing `observatorium` resource, reusing the preserved tenant ID instead of creating a new one.
+
+. Verify that the Observability service runs on your new hub cluster. Run the following command:
+
++
+[source,bash]
+----
+oc get pods -n open-cluster-management-observability
+----
+
+. Verify that the `observability-controller` `managedclusteraddon` does not have a status in the `DEGRADED` column, and that the `PROGRESSING` status is not set to `False`. Run the following command:
+
++
+[source,bash]
+----
+oc get managedclusteraddons -A | awk 'NR==1 || /observability-controller/'
+----
+
+. Verify metrics collection from your managed clusters by accessing Grafana.
+
+. Verify that your managed clusters are connected to your new hub cluster by checking for the `Available` status for each managed cluster.
+
+. Shut down the Observability service on your previous hub cluster by removing the resources. Run the following command:
+
++
+[source,bash]
+----
+oc delete mco observability
+----
+
diff --git a/business_continuity/main.adoc b/business_continuity/main.adoc
@@ -17,6 +17,8 @@ include::backup_restore/use_existing_hub_cluster.adoc[leveloffset=+3]
 include::backup_restore/tag_resources.adoc[leveloffset=+3]
 include::backup_restore/backup_restore_hub.adoc[leveloffset=+3]
 include::backup_restore/backup_hcp.adoc[leveloffset=+3]
+include::backup_restore/backup_restore_obs.adoc[leveloffset=+3]
+include::backup_restore/backup_restore_config_obs.adoc[leveloffset=+3]
 include::volsync/volsync.adoc[leveloffset=+2]
 include::volsync/volsync_replicate.adoc[leveloffset=+3]
 include::volsync/volsync_convert_backup.adoc[leveloffset=+3]
diff --git a/observability/observability_arch.adoc b/observability/observability_arch.adoc
@@ -128,10 +128,11 @@ When you install {acm-short} the following persistent volumes (PV) must be creat
 
 To learn more about observability and the integrated components, see the following topics:
 
-- See xref:../observability/observe_environments_intro.adoc#observing-environments-intro[Observability service]
-- See xref:../observability/obs_config.adoc#obs-config[Observability configuration]
-- See xref:../observability/observability_enable.adoc#enabling-observability-service[Enabling the observability service]
-- See xref:../observability/design_grafan.adoc#using-grafana-dashboards[Using Grafana dashboards]
-- See the link:https://thanos.io/v0.36/thanos/getting-started.md/[Thanos documentation]
-- See the link:https://prometheus.io/docs/introduction/overview/[Prometheus Overview]
-- See the link:https://prometheus.io/docs/alerting/latest/alertmanager/[Alertmanager documentation]
+- For an introduction to the service, see xref:../observability/observe_environments_intro.adoc#observing-environments-intro[Observability service].
+- To learn about configuring the service, metric types labeling, and pod capacity, see xref:../observability/obs_config.adoc#obs-config[Observability configuration].
+- To enable the Observability service, see xref:../observability/observability_enable.adoc#enabling-observability-service[Enabling the Observability service].
+- For more information about viewing hub cluster and managed cluster metrics from Grafana, see xref:../observability/design_grafan.adoc#using-grafana-dashboards[Using Grafana dashboards].
+- To learn how you can back up and restore the Observability service, see xref:../business_continuity/backup_restore/backup_restore_obs.adoc#backup-restore-obs[Backing up and restoring Observability service].
+- For more details about Thanos, see the link:https://thanos.io/v0.36/thanos/getting-started.md/[Thanos documentation].
+- For a brief overview of Prometheus, see the link:https://prometheus.io/docs/introduction/overview/[Prometheus Overview].
+- See the link:https://prometheus.io/docs/alerting/latest/alertmanager/[Alertmanager documentation] to understand how you can send and receive alerts by using Alertmanager.
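
As an illustration only, the commands in the new `backup_restore_obs.adoc` procedure could be grouped into a small shell helper along the following lines. The `oc` invocations are copied from the procedure. The helper names (`run`, `save`, `backup_primary_hub`, `restore_new_hub`) and the `DRY_RUN` and `NS` variables are assumptions of this sketch and do not appear in the commit. It also assumes that the restore operation for the ConfigMaps and Secrets has already completed before `restore_new_hub` is called.

```shell
#!/bin/sh
# Hypothetical helper: names below are illustrative, not part of the product.
# Namespace used by the documented commands.
NS="open-cluster-management-observability"
# DRY_RUN=1 (default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: execute CMD, or print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# save FILE CMD...: write the stdout of CMD to FILE, or print the plan when DRY_RUN=1.
save() {
  _out="$1"
  shift
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s > %s\n' "$*" "$_out"
  else
    "$@" > "$_out"
  fi
}

# Steps to run against the primary hub cluster.
backup_primary_hub() {
  # Keep the observatorium resource so that the tenant ID is preserved.
  save observatorium-backup.yaml oc get observatorium -n "$NS" -o yaml
  # Keep the MultiClusterObservability custom resource.
  save mco-cr-backup.yaml oc get mco observability -o yaml
  # Stop the Thanos compactor, then query its pod to check that it is no longer active.
  run oc scale statefulset observability-thanos-compact -n "$NS" --replicas=0
  run oc get pods observability-thanos-compact-0 -n "$NS"
}

# Steps to run against the new hub cluster, after the backed-up
# ConfigMaps and Secrets are restored (assumed to be done separately).
restore_new_hub() {
  run oc apply -f observatorium-backup.yaml
  run oc apply -f mco-cr-backup.yaml
  run oc get pods -n "$NS"
}
```

With the default `DRY_RUN=1`, calling `backup_primary_hub` or `restore_new_hub` only prints the planned commands, which makes it possible to review them before setting `DRY_RUN=0` while logged in to the intended hub cluster.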