Using the GitOps way to deal with the upgrade challenges of multi-cluster tool chains
Upgrade challenges of tool chains in multi-cluster environments
Open Cluster Management (OCM) is a community-driven project focused on multicluster and multicloud scenarios for Kubernetes applications. It provides capabilities such as cluster registration, application and workload distribution, and scheduling. Add-on is an extension mechanism built on the foundation components of OCM: it allows applications in the Kubernetes ecosystem to be easily migrated to the OCM platform and to be orchestrated and scheduled across multiple clusters and clouds. For example, Istio, Prometheus, and Submariner can be extended to multiple clusters through add-ons. In a multi-cluster environment, upgrading the entire tool chain (tools such as Istio and Prometheus) gracefully and smoothly is a challenge we encounter in multi-cluster management: a failed tool chain upgrade can render thousands of user workloads inaccessible. Finding an easy and safe upgrade solution across clusters is therefore important.
In this article, we will introduce how Open Cluster Management (OCM) treats tool chain upgrades as configuration file changes, allowing users to leverage Kustomize or GitOps to achieve seamless rolling/canary upgrades across clusters.
Before we begin, let us first introduce several concepts in OCM.
Add-on
On the OCM platform, add-ons can apply different configurations on different managed clusters, and can also implement functions such as passing data from the managed clusters back to the control plane (hub). For example, the managed-serviceaccount add-on returns the information of a specified `ServiceAccount` on the managed cluster to the hub cluster, and the cluster-proxy add-on establishes a reverse proxy channel from the spoke to the hub.
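Enabling an add-on for a specific cluster is itself declarative. As a minimal sketch (cluster1 is a placeholder cluster name), creating a `ManagedClusterAddOn` in the cluster's namespace on the hub turns the add-on on for that cluster:

```yaml
# Enable the managed-serviceaccount add-on on one managed cluster.
# The namespace of a ManagedClusterAddOn is the managed cluster's name.
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: managed-serviceaccount
  namespace: cluster1 # placeholder managed cluster name
spec:
  installNamespace: open-cluster-management-agent-addon
```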
At this stage, the OCM community already provides several add-ons:
- Multicluster Mesh Addon can be used to manage (discover, deploy and federate) service meshes across multiple clusters in OCM.
- Submariner Addon deploys the Submariner Broker on the hub cluster and the required Submariner components on the managed clusters.
- Open-telemetry add-on automates the installation of the otelCollector on both the hub cluster and the managed clusters, and of jaeger-all-in-one on the hub cluster, for processing and storing the traces.
- Application lifecycle management enables application lifecycle management in multi-cluster or multi-cloud environments.
- Policy framework and Policy controllers allow hub cluster administrators to easily deploy security-related policies for managed clusters.
- Managed service account enables a hub cluster admin to manage service accounts across multiple clusters with ease.
- Cluster proxy provides L4 network connectivity from the hub cluster to the managed clusters.
For more information about add-ons, please refer to the Add-on concept and the Add-on Developer Guide.
OCM provides two ways to help developers develop their own add-ons:
- Hard mode: Using the built-in mechanism of addon-framework, you can follow the Add-on Development Guide to develop the addon manager and addon agent.
- Easy mode: OCM provides a new development model that uses `AddOnTemplate` to build an add-on. In this model, developers do not need to develop the addon manager; they only need to prepare the addon agent's image and an `AddOnTemplate`. The `AddOnTemplate` describes how to deploy the addon agent and how to register the add-on.
Below are the `ClusterManagementAddOn` and `AddOnTemplate` of a sample add-on. The `AddOnTemplate` is treated as an add-on configuration file, declared in `supportedConfigs`. The `AddOnTemplate` resource contains the manifests required to deploy the add-on and the add-on registration method.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: hello-template
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  addOnMeta:
    description: hello-template is an addon built with addon template
    displayName: hello-template
  supportedConfigs: # declare it is a template type addon
    - group: addon.open-cluster-management.io
      resource: addontemplates
      defaultConfig:
        name: hello-template
```

```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: AddOnTemplate
metadata:
  name: hello-template
spec:
  addonName: hello-template
  agentSpec: # required
    workload:
      manifests:
        - kind: Deployment
          metadata:
            name: hello-template-agent
            namespace: open-cluster-management-agent-addon
          ...
        - kind: ServiceAccount
          metadata:
            name: hello-template-agent-sa
            namespace: open-cluster-management-agent-addon
        - kind: ClusterRoleBinding
          metadata:
            name: hello-template-agent
          ...
  registration: # optional
    ...
```
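Assuming both resources above have been applied on the hub, the add-on can then be enabled on a managed cluster, for example with the clusteradm CLI (cluster1 is a placeholder cluster name):

```shell
# Enable the sample add-on on one managed cluster; the addon manager
# renders the manifests from the AddOnTemplate into that cluster.
clusteradm addon enable --names hello-template --clusters cluster1
```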
Placement Decision Strategy
The Placement API is used to select a set of `ManagedClusters` in one or more `ManagedClusterSets` and to deploy workloads to these clusters. For more introduction to the Placement API, please refer to Placement concept.
The "input" and "output" of the Placement scheduling process are decoupled into two independent Kubernetes APIs: `Placement` and `PlacementDecision`.
- `Placement` provides filtering of clusters through the `labelSelector` or the `claimSelector`, and also provides some built-in `prioritizers`, which can score, sort and prioritize the filtered clusters.
- The scheduling results of `Placement` are placed in `PlacementDecision`: `status.decisions` lists the top N clusters with the highest scores, sorted by name, and the scheduling results change dynamically as clusters change. The `decisionStrategy` section in `Placement` can be used to divide the created `PlacementDecision` into multiple groups and to define the number of clusters in each decision group. `PlacementDecision` supports paging, and each resource can contain up to 100 cluster names.
Below is an example of `Placement` with a `decisionStrategy`. Assume that there are 300 `ManagedClusters` in the global `ManagedClusterSet`, 10 of which have the label canary. The following example puts the canary-labeled clusters into one group and splits the remaining clusters into groups of at most 150 clusters each.
```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: aws-placement
  namespace: default
spec:
  clusterSets:
    - global
  decisionStrategy:
    groupStrategy:
      clustersPerDecisionGroup: 150
      decisionGroups:
        - groupName: canary
          groupClusterSelector:
            labelSelector:
              matchExpressions:
                - key: canary
                  operator: Exists
```
The grouped results are displayed in the `status` of `Placement`. The canary group has 10 clusters, and its results are placed in `aws-placement-decision-1`. The other default groups have only a group index and contain 150 and 140 clusters respectively. Since one `PlacementDecision` supports at most 100 clusters, the results of each of these groups are split across two `PlacementDecisions`.
```yaml
status:
  ...
  decisionGroups:
    - clusterCount: 10
      decisionGroupIndex: 0
      decisionGroupName: canary
      decisions:
        - aws-placement-decision-1
    - clusterCount: 150
      decisionGroupIndex: 1
      decisionGroupName: ""
      decisions:
        - aws-placement-decision-2
        - aws-placement-decision-3
    - clusterCount: 140
      decisionGroupIndex: 2
      decisionGroupName: ""
      decisions:
        - aws-placement-decision-4
        - aws-placement-decision-5
  numberOfSelectedClusters: 300
```
Taking the canary group as an example, its `PlacementDecision` is shown below. The label `cluster.open-cluster-management.io/decision-group-index` represents the index of the group it belongs to, `cluster.open-cluster-management.io/decision-group-name` represents the name of that group, and `cluster.open-cluster-management.io/placement` represents the `Placement` it belongs to. Users can flexibly obtain the scheduling results through label selectors, as the command after the example shows.
```yaml
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  labels:
    cluster.open-cluster-management.io/decision-group-index: "0"
    cluster.open-cluster-management.io/decision-group-name: canary
    cluster.open-cluster-management.io/placement: aws-placement
  name: aws-placement-decision-1
  namespace: default
status:
  decisions:
    - clusterName: cluster1
      reason: ""
    ...
    - clusterName: cluster10
      reason: ""
```
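For example, a plain label selector is enough to fetch the decisions of the canary group (names match the example above):

```shell
# List the PlacementDecisions that belong to the canary decision group
# of the aws-placement Placement.
kubectl get placementdecisions -n default \
  -l cluster.open-cluster-management.io/placement=aws-placement,cluster.open-cluster-management.io/decision-group-name=canary
```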
Simplify upgrades the GitOps way
The above briefly introduces the concepts of add-on template and placement decision strategy.
In OCM, we regard an add-on upgrade as an upgrade of its configuration files. The configuration here can be the `AddOnTemplate` or another customized configuration file such as `AddOnDeploymentConfig`. An add-on upgrade is treated as a configuration file update, which enables users to leverage Kustomize or GitOps for seamless cross-cluster rolling/canary upgrades. `RolloutStrategy` defines the upgrade strategy: it supports upgrading all clusters at once, progressive upgrades per cluster and progressive upgrades per cluster group, and a set of `MandatoryDecisionGroups` can be defined to try new configurations first.
According to the four principles of GitOps, let’s take a look at how OCM supports the GitOps approach to address upgrade challenges in multi-cluster environments.
- Declarative
The configuration files used by an add-on can be declared in `ClusterManagementAddOn`. A configuration file can be declared in the global `supportedConfigs`, in which case it is applied to all `ManagedClusterAddOn` instances. It can also be declared under different placements in `installStrategy`; the `ManagedClusterAddOn` of every cluster selected by a `Placement` then shares the same configuration files, and configurations declared in placements override the global configuration.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
spec:
  supportedConfigs:
    - defaultConfig:
        name: managed-serviceaccount-0.4.0
      group: addon.open-cluster-management.io
      resource: addontemplates
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        configs:
          - group: addon.open-cluster-management.io
            resource: addondeploymentconfigs
            name: managed-serviceaccount-addon-deploy-config
        rolloutStrategy:
          type: Progressive
          progressive:
            mandatoryDecisionGroups:
              - groupName: "canary"
            maxConcurrency: 1
    type: Placements
```
- Version control
Changes to an add-on configuration file's name or spec content are considered a configuration change and trigger an upgrade of the add-on. Users can leverage Kustomize or GitOps to control configuration file upgrades; see the Kustomize sketch after the example below.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: AddOnTemplate
metadata:
  name: managed-serviceaccount-0.4.0
spec:
  agentSpec: # required
    workload:
      manifests:
        - kind: Deployment
          metadata:
            name: managed-serviceaccount-addon-agent
            namespace: open-cluster-management-agent-addon
          ...
        - kind: ServiceAccount
          metadata:
            name: managed-serviceaccount
            namespace: open-cluster-management-agent-addon
          ...
  registration: # optional
```
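As a sketch of the Git workflow (file names are hypothetical), releasing a new version can be modeled as committing a new `AddOnTemplate` and pointing the `ClusterManagementAddOn` at it with a Kustomize patch; the add-on manager then picks up the change and starts the rollout:

```yaml
# kustomization.yaml -- hypothetical layout; the version bump is a
# one-line, reviewable change in Git.
resources:
  - clustermanagementaddon.yaml # the ClusterManagementAddOn shown earlier
  - addontemplate-0.4.1.yaml    # new AddOnTemplate with the updated agent
patches:
  - target:
      kind: ClusterManagementAddOn
      name: managed-serviceaccount
    patch: |-
      - op: replace
        path: /spec/supportedConfigs/0/defaultConfig/name
        value: managed-serviceaccount-0.4.1
```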
- Automation
The OCM component addon-manager-controller in the open-cluster-management-hub namespace is a more general add-on manager. It watches the following two types of add-ons and is responsible for maintaining their lifecycle, including installation and upgrades. When the name or spec content of a configuration file changes, this component upgrades the add-on according to the upgrade strategy defined in `rolloutStrategy`.
- Hard mode: for an add-on developed with the latest addon-framework, you need to delete the `WithInstallStrategy()` method in the code and add the annotation `addon.open-cluster-management.io/lifecycle: "addon-manager"` to the `ClusterManagementAddOn`. For details, refer to the Add-on Development Guide.
- Easy mode: an add-on developed using the `AddOnTemplate` mode.
```shell
$ kubectl get deploy -n open-cluster-management-hub
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-manager-addon-manager-controller   1/1     1            1           10h
cluster-manager-placement-controller       1/1     1            1           10h
cluster-manager-registration-controller    1/1     1            1           10h
cluster-manager-registration-webhook       1/1     1            1           10h
cluster-manager-work-webhook               1/1     1            1           10h
```
- Coordination
The spec hash of the add-on configuration files is recorded in the status of `ClusterManagementAddOn` and `ManagedClusterAddOn`. When the spec hash changes, addon-manager-controller keeps updating the add-on according to the upgrade strategy defined in `rolloutStrategy` until `lastAppliedConfig` and `lastKnownGoodConfig` are consistent with `desiredConfig`. In the following example, because `lastAppliedConfig` does not match `desiredConfig`, the add-on status is displayed as "Upgrading"; a command to inspect this follows the two examples.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
...
status:
  installProgressions:
    - conditions:
        - lastTransitionTime: "2023-09-21T06:53:59Z"
          message: 1/3 upgrading, 0 timeout.
          reason: Upgrading
          status: "False"
          type: Progressing
      configReferences:
        - desiredConfig:
            name: managed-serviceaccount-0.4.1
            specHash: dcf88f5b11bd191ed2f886675f967684da8b5bcbe6902458f672277d469e2044
          group: addon.open-cluster-management.io
          lastAppliedConfig:
            name: managed-serviceaccount-0.4.0
            specHash: 1f7874ac272f3e4266f89a250d8a76f0ac1c6a4d63d18e7dcbad9068523cf187
          lastKnownGoodConfig:
            name: managed-serviceaccount-0.4.0
            specHash: 1f7874ac272f3e4266f89a250d8a76f0ac1c6a4d63d18e7dcbad9068523cf187
          resource: addontemplates
      name: aws-placement
      namespace: default
```
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ManagedClusterAddOn
metadata:
  name: managed-serviceaccount
  namespace: cluster1
...
status:
  conditions:
    - lastTransitionTime: "2023-09-21T06:53:42Z"
      message: upgrading.
      reason: Upgrading
      status: "False"
      type: Progressing
  configReferences:
    - desiredConfig:
        name: managed-serviceaccount-0.4.1
        specHash: dcf88f5b11bd191ed2f886675f967684da8b5bcbe6902458f672277d469e2044
      group: addon.open-cluster-management.io
      lastAppliedConfig:
        name: managed-serviceaccount-0.4.0
        specHash: 1f7874ac272f3e4266f89a250d8a76f0ac1c6a4d63d18e7dcbad9068523cf187
      lastObservedGeneration: 1
      name: managed-serviceaccount-0.4.1
      resource: addontemplates
```
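A quick way to observe this coordination from the hub is to read the Progressing condition (a sketch; the jsonpath indexes the first install progression):

```shell
# Prints the rollout progress message, e.g. "1/3 upgrading, 0 timeout."
kubectl get clustermanagementaddon managed-serviceaccount \
  -o jsonpath='{.status.installProgressions[0].conditions[?(@.type=="Progressing")].message}'
```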
Three upgrade strategies
The `rolloutStrategy` field of `ClusterManagementAddOn` defines the upgrade strategy. Currently, OCM supports three types of upgrade strategies.
- All
The default upgrade type is All, which means the new configuration file will be applied to all the clusters immediately.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: All
    type: Placements
```
- Progressive
Progressive means that the new configuration files will be deployed to the selected clusters progressively, cluster by cluster. The new configuration is not applied to the next cluster until one of the currently updating clusters reaches the successful state and `MaxFailures` has not been breached. We introduced the concept of "placement decision groups" earlier; one or more decision groups can be specified in `MandatoryDecisionGroups`. If `MandatoryDecisionGroups` are defined, the new configuration files are deployed to these cluster groups first. `MaxConcurrency` defines the maximum number of clusters being deployed to simultaneously.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: Progressive
          progressive:
            mandatoryDecisionGroups:
              - groupName: "canary"
            maxConcurrency: 1
    type: Placements
```
- ProgressivePerGroup
ProgressivePerGroup means that the new configuration files will be deployed to the clusters of each decision group progressively, group by group. The new configuration is not applied to the next cluster group until all clusters in the current group reach the successful state and `MaxFailures` has not been breached. If `MandatoryDecisionGroups` are defined, the new configuration files are deployed to these cluster groups first. If there are no `MandatoryDecisionGroups`, the cluster groups are upgraded in order of their index.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: ProgressivePerGroup
          progressivePerGroup:
            mandatoryDecisionGroups:
              - groupName: "canary"
    type: Placements
```
According to the four principles of GitOps and the three upgrade strategies of OCM, users can use Kustomize or GitOps to achieve seamless rolling/canary upgrades across clusters. It is worth noting that installStrategy supports multiple placement definitions, and users can implement more advanced upgrade strategies based on this.
As in the example below, you can define two placements at the same time to select clusters on AWS and GCP respectively, so that the same add-on can use different configuration files and upgrade strategies in different sets of clusters.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        configs:
          - group: addon.open-cluster-management.io
            resource: addondeploymentconfigs
            name: managed-serviceaccount-addon-deploy-config-aws
        rolloutStrategy:
          type: ProgressivePerGroup
          progressivePerGroup:
            mandatoryDecisionGroups:
              - groupName: "canary"
      - name: gcp-placement
        namespace: default
        configs:
          - group: addon.open-cluster-management.io
            resource: addondeploymentconfigs
            name: managed-serviceaccount-addon-deploy-config-gcp
        rolloutStrategy:
          type: ProgressivePerGroup
          progressivePerGroup:
            mandatoryDecisionGroups:
              - groupName: "canary"
    type: Placements
```
Three upgrade configurations
The `rolloutStrategy` can also define `MinSuccessTime`, `ProgressDeadline` and `MaxFailures` to achieve more fine-grained upgrade configuration.
- MinSuccessTime
`MinSuccessTime` defines how long the controller waits, after an add-on upgrade succeeds and while `MaxFailures` has not been reached, before upgrading the next cluster. The default value is 0, meaning the controller proceeds immediately after a successful state is reached.
In the following example, the add-on will be upgraded at a rate of one cluster every 5 minutes.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: Progressive
          progressive:
            mandatoryDecisionGroups:
              - groupName: "canary"
            maxConcurrency: 1
            minSuccessTime: "5m"
    type: Placements
```
- ProgressDeadline
`ProgressDeadline` defines the maximum time for the controller to wait for an add-on upgrade to succeed. If the add-on does not reach a successful state within `ProgressDeadline`, the controller stops waiting, and the cluster is treated as "timeout" and counted toward `MaxFailures`. Once `MaxFailures` is breached, the rollout stops. The default value is "None", which means the controller waits for a successful state indefinitely.
In the following example, the controller waits up to 10 minutes on each cluster for the add-on upgrade to succeed. If the upgrade has not succeeded after 10 minutes, the upgrade status of that cluster is marked as "timeout".
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: Progressive
          progressive:
            mandatoryDecisionGroups:
              - groupName: "canary"
            maxConcurrency: 1
            progressDeadline: "10m"
    type: Placements
```
- MaxFailures
`MaxFailures` defines how many clusters' upgrade failures can be tolerated, as either a numerical value or a percentage. If a cluster is in a failed or timeout state, it is regarded as an upgrade failure. If the number of failed clusters exceeds `MaxFailures`, the upgrade stops.
In the following example, when 3 add-ons fail to upgrade or do not reach a successful state within 10 minutes, the upgrade will stop.
```yaml
apiVersion: addon.open-cluster-management.io/v1alpha1
kind: ClusterManagementAddOn
metadata:
  name: managed-serviceaccount
  annotations:
    addon.open-cluster-management.io/lifecycle: "addon-manager"
spec:
  supportedConfigs:
    ...
  installStrategy:
    placements:
      - name: aws-placement
        namespace: default
        rolloutStrategy:
          type: Progressive
          progressive:
            mandatoryDecisionGroups:
              - groupName: "canary"
            maxConcurrency: 1
            maxFailures: 2
            progressDeadline: "10m"
    type: Placements
```
Summary
This article details how to use Open Cluster Management to address the tool chain upgrade challenges of multi-cluster environments in the GitOps way. OCM provides a Kubernetes-based management platform across multiple clusters and clouds. Through the Add-on and Placement APIs, users can upgrade an entire tool chain gracefully and smoothly. OCM treats add-on upgrades as configuration file changes, enabling users to leverage Kustomize or GitOps for seamless rolling/canary upgrades across clusters. In addition, OCM provides a variety of upgrade strategies, including upgrading everything at once (All), progressive upgrades per cluster (Progressive) and progressive upgrades per cluster group (ProgressivePerGroup), to meet different upgrade needs.