Placement
API-CHANGE NOTE: The Placement and PlacementDecision APIs are upgraded from v1alpha1 to v1beta1. v1alpha1 will be deprecated in OCM v0.7.0 and is planned to be removed in OCM v0.8.0. The field spec.prioritizerPolicy.configurations.name in the Placement API v1alpha1 is removed and replaced by spec.prioritizerPolicy.configurations.scoreCoordinate.builtIn in v1beta1.
Overall, the Placement concept is used to dynamically select a set of managed clusters in one or multiple ManagedClusterSets so that higher-level users can either replicate Kubernetes resources to the member clusters or run their advanced workloads, i.e. multi-cluster scheduling.

The “input” and “output” of the scheduling process are decoupled into two separate Kubernetes APIs: Placement and PlacementDecision. As shown in the following picture, we prescribe the scheduling policy in the spec of the Placement API, and the placement controller in the hub dynamically selects a slice of managed clusters from the given cluster sets. Note that the scheduling result in the PlacementDecision API is designed to be paginated with the page index as the name’s suffix, to avoid the “too large object” issue in the underlying Kubernetes API framework.

Following the architecture of Kubernetes’ original scheduling framework, the multi-cluster scheduling is logically divided into two phases internally:
- Predicate: Hard requirements for the selected clusters.
- Prioritize: Rank the clusters by the soft requirements and select a subset among them.
Select clusters in ManagedClusterSet
By following the previous section about ManagedClusterSet, we’re now supposed to have one or multiple valid cluster sets in the hub cluster. Then we can move on and create a placement in the “workspace namespace” by specifying predicates and prioritizers in the Placement API to define our own multi-cluster scheduling policy.
Predicates
Label/Claim selection
In the predicates section, you can select clusters by labels or clusterClaims. For instance, you can select 3 clusters with the label purpose=test and the clusterClaim platform.open-cluster-management.io=aws as seen in the following example:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement1
  namespace: default
spec:
  numberOfClusters: 3
  clusterSets:
    - prod
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            purpose: test
        claimSelector:
          matchExpressions:
            - key: platform.open-cluster-management.io
              operator: In
              values:
                - aws
Note that the distinction between label-selecting and claim-selecting is elaborated in this page about how to extend attributes for the managed clusters.
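For a quick contrast: labels are set in the ManagedCluster’s metadata, while claims are reported by the managed cluster itself in its status. A cluster matching the placement above might look like the following (the label and claim values are illustrative):

apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
  labels:
    # labels are attached on the hub and can be edited freely
    purpose: test
spec:
  hubAcceptsClient: true
status:
  clusterClaims:
    # claims are collected from the managed cluster itself
    - name: platform.open-cluster-management.io
      value: aws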
Taints/Tolerations
To support filtering unhealthy/not-reporting clusters and keep workloads from being placed in unhealthy or unreachable clusters, we introduce a concept similar to taint/toleration in Kubernetes. It also allows users to add a customized taint to deselect a cluster from placement. This is useful when a user wants to set a cluster to maintenance mode and evict workloads from this cluster.
In OCM, Taints and Tolerations work together to allow users to control the selection of managed clusters more flexibly.
Taints are properties of ManagedClusters; they allow a Placement to repel a set of ManagedClusters in the predicates stage.
Tolerations are applied to Placements, and allow Placements to select ManagedClusters with matching taints. A toleration in the tolerations section includes the following fields:
- Key (optional). Key is the taint key that the toleration applies to.
- Value (optional). Value is the taint value the toleration matches to.
- Operator (optional). Operator represents a key’s relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. A toleration “matches” a taint if the keys are the same and the effects are the same, and the operator is:
  - Equal. The operator is Equal and the values are equal.
  - Exists. Exists is equivalent to a wildcard for value, so that a placement can tolerate all taints of a particular category.
- Effect (optional). Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSelect, PreferNoSelect and NoSelectIfNew. (PreferNoSelect is not implemented yet; currently clusters with effect PreferNoSelect will always be selected.)
- TolerationSeconds (optional). TolerationSeconds represents the period of time the toleration (which must be of effect NoSelect/PreferNoSelect, otherwise this field is ignored) tolerates the taint. The default value is nil, which indicates tolerating the taint forever. The start time of counting the TolerationSeconds should be the TimeAdded in the Taint, not the cluster scheduled time or the time TolerationSeconds was added.
The following examples show how to tolerate clusters with taints.
- Tolerate clusters with taint
Suppose your managed cluster has a taint added as below:
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
spec:
  hubAcceptsClient: true
  taints:
    - effect: NoSelect
      key: gpu
      value: "true"
      timeAdded: '2022-02-21T08:11:06Z'
By default, the placement won’t select this cluster unless you define tolerations,
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement1
  namespace: default
spec:
  tolerations:
    - key: gpu
      value: "true"
      operator: Equal
With the above toleration defined, cluster1 could be selected by the placement because the key gpu and value "true" match.
- Tolerate clusters with taint for a period of time
TolerationSeconds represents the period of time the toleration tolerates the taint. It can be used for cases like the following: when a managed cluster goes offline, users can have the applications deployed on this cluster transferred to another available managed cluster after a tolerated time.
For example, suppose the managed cluster becomes unreachable,
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
spec:
  hubAcceptsClient: true
  taints:
    - effect: NoSelect
      key: cluster.open-cluster-management.io/unreachable
      timeAdded: '2022-02-21T08:11:06Z'
If you define a placement with TolerationSeconds as below, then the workload will be transferred to another available managed cluster after 5 minutes.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: demo4
  namespace: demo1
spec:
  tolerations:
    - key: cluster.open-cluster-management.io/unreachable
      operator: Exists
      tolerationSeconds: 300
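As mentioned earlier, a customized taint can be used the same way to put a cluster into maintenance mode and drain it. Below is a minimal sketch, assuming a user-defined taint key maintenance (an arbitrary name chosen for illustration, not a reserved OCM key):

apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  name: cluster1
spec:
  hubAcceptsClient: true
  taints:
    # "maintenance" is a user-defined key chosen for this example
    - effect: NoSelect
      key: maintenance
      value: "true"
      timeAdded: '2022-02-21T08:11:06Z'

Until this taint is removed, any placement without a matching toleration will deselect cluster1, so its workloads can be moved elsewhere.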
Prioritizers
Score-based prioritizer
In the prioritizerPolicy section, you can define the policy of prioritizers. For instance, you can select 2 clusters with the largest allocatable memory and the largest addon score cpuratio, and pin the placementdecisions as seen in the following example.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement1
  namespace: default
spec:
  numberOfClusters: 2
  prioritizerPolicy:
    mode: Exact
    configurations:
      - scoreCoordinate:
          builtIn: ResourceAllocatableMemory
      - scoreCoordinate:
          builtIn: Steady
        weight: 3
      - scoreCoordinate:
          type: AddOn
          addOn:
            resourceName: default
            scoreName: cpuratio
- mode is either Exact, Additive or "", where "" is Additive by default.
  - In Additive mode, any prioritizer not explicitly enumerated is enabled in its default configuration, in which the Steady and Balance prioritizers have a weight of 1 while other prioritizers have a weight of 0. Additive doesn’t require configuring all prioritizers. The default configurations may change in the future, and additional prioritization will happen.
  - In Exact mode, any prioritizer not explicitly enumerated is weighted as zero. Exact requires knowing the full set of prioritizers you want, but avoids behavior changes between releases.
- configurations represents the configuration of prioritizers.
  - scoreCoordinate represents the configuration of the prioritizer and score source.
    - type defines the type of the prioritizer score. Type is either "BuiltIn", "AddOn" or "", where "" is "BuiltIn" by default. When the type is "BuiltIn", a BuiltIn prioritizer name must be specified. When the type is "AddOn", the score source must be configured in addOn.
    - builtIn defines the name of a BuiltIn prioritizer. The valid BuiltIn prioritizer names are:
      - Balance: balance the decisions among the clusters.
      - Steady: ensure the existing decision is stabilized.
      - ResourceAllocatableCPU & ResourceAllocatableMemory: sort clusters based on the allocatable resources.
    - addOn defines the resource name and score name. AddOnPlacementScore is introduced to describe addon scores; see the Extensible scheduling section to learn more about it.
      - resourceName defines the resource name of the AddOnPlacementScore. The placement prioritizer selects the AddOnPlacementScore CR by this name.
      - scoreName defines the score name inside AddOnPlacementScore. AddOnPlacementScore contains a list of score names and score values; scoreName specifies the score to be used by the prioritizer.
  - weight defines the weight of the prioritizer. The value must be in the range [-10, 10]. Each prioritizer calculates an integer score for a cluster in the range [-100, 100]. The final score of a cluster is sum(weight * prioritizer_score). A higher weight indicates that the prioritizer weighs more in the cluster selection, while a weight of 0 indicates that the prioritizer is disabled. A negative weight indicates wanting to select the last ones.
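To make the scoring concrete, take the placement above with two hypothetical clusters (all scores below are made up for illustration). ResourceAllocatableMemory and the AddOn configuration keep the default weight of 1, and Steady has weight 3. If cluster1 scores 100 (memory), 100 (Steady) and 80 (cpuratio), its final score is 1*100 + 3*100 + 1*80 = 480; if cluster2 scores 90, 0 and 100 respectively, its final score is 1*90 + 3*0 + 1*100 = 190, so cluster1 ranks higher and is selected first.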
A slice of PlacementDecisions will be created by the placement controller in the same namespace, each with the label cluster.open-cluster-management.io/placement={placement name}. PlacementDecision contains the results of the cluster selection, as seen in the following example.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
  labels:
    cluster.open-cluster-management.io/placement: placement1
  name: placement1-decision-1
  namespace: default
status:
  decisions:
    - clusterName: cluster1
    - clusterName: cluster2
    - clusterName: cluster3
PlacementDecision can be consumed by another operand to decide how the workload should be placed in multiple clusters.
Extensible scheduling
In placement resource based scheduling, in some cases the prioritizer needs extra data (more than the default values provided by the ManagedCluster) to calculate the score of the managed cluster. For example, scheduling the clusters based on CPU or memory usage data fetched from a monitoring system.
So we provide a new API, AddOnPlacementScore, to support a more extensible way to schedule based on customized scores.
- As a user, as mentioned in the above section, you can specify the score in the placement yaml to select clusters.
- As a score provider, a 3rd party controller could run on either the hub or the managed cluster, to maintain the lifecycle of AddOnPlacementScore and update the score in it.
Refer to the enhancements to learn more.
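For reference, an AddOnPlacementScore CR lives in the cluster namespace on the hub and reports its scores in the status. Below is a minimal sketch matching the cpuratio score used earlier (the score values are illustrative):

apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
  name: default
  namespace: cluster1
status:
  scores:
    # each entry pairs a score name (referenced by scoreName in the placement)
    # with an integer value in the range [-100, 100]
    - name: cpuratio
      value: 88
    - name: memratio
      value: 77

The placement example in the Prioritizers section selects this CR by resourceName: default and reads the cpuratio entry from it.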
Future work
In addition to selecting clusters by predicates, we are still working on other advanced features, including: