To run software such as MySQL or Elasticsearch, it is desirable to use fast local storage and form a cluster that replicates data between servers.
TopoLVM provides a storage driver for such software running on Kubernetes.
- Use LVM for flexible volume capacity management.
- Enhance the scheduler to prefer nodes having a larger storage capacity.
- Support dynamic volume provisioning from PVC.
- Support volume resizing (resizing for CSI became beta in Kubernetes 1.16).
- `topolvm-controller`: CSI controller service.
- `topolvm-scheduler`: A scheduler extender for TopoLVM.
- `topolvm-node`: CSI node service.
- `LVMd`: gRPC service to manage LVM volumes.
Blue arrows in the diagram indicate communications over unix domain sockets. Red arrows indicate communications over TCP sockets.
TopoLVM is a storage plugin based on CSI. Therefore, the architecture basically follows the one described in https://kubernetes-csi.github.io/docs/ .
LVMd is responsible for managing LVM volumes.
It provides gRPC services over a UNIX domain socket to create/update/delete
LVM logical volumes and to watch volume group status.
It runs as a dedicated process or as an embedded function in topolvm-node.
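The actual gRPC API is defined in TopoLVM's proto files; as a rough sketch of the kind of volume-group operations LVMd provides, here is an in-memory stand-in (the interface and all names are illustrative assumptions, not the real API):

```go
package main

import (
	"errors"
	"fmt"
)

// fakeVG simulates a volume group with a fixed capacity, mimicking the
// create/resize/remove/free-space operations LVMd serves over gRPC.
// This is a sketch for illustration, not TopoLVM's implementation.
type fakeVG struct {
	free uint64
	lvs  map[string]uint64
}

func newFakeVG(capacity uint64) *fakeVG {
	return &fakeVG{free: capacity, lvs: map[string]uint64{}}
}

// CreateLV allocates a logical volume if the VG has enough free space.
func (vg *fakeVG) CreateLV(name string, size uint64) error {
	if size > vg.free {
		return errors.New("insufficient free space")
	}
	vg.lvs[name] = size
	vg.free -= size
	return nil
}

// ResizeLV grows an existing logical volume (shrinking is rejected,
// as LVM-backed CSI volumes are only ever expanded).
func (vg *fakeVG) ResizeLV(name string, size uint64) error {
	cur, ok := vg.lvs[name]
	if !ok {
		return errors.New("not found")
	}
	if size < cur {
		return errors.New("shrinking is not supported")
	}
	if size-cur > vg.free {
		return errors.New("insufficient free space")
	}
	vg.free -= size - cur
	vg.lvs[name] = size
	return nil
}

// RemoveLV deletes a logical volume and returns its space to the VG.
func (vg *fakeVG) RemoveLV(name string) error {
	size, ok := vg.lvs[name]
	if !ok {
		return errors.New("not found")
	}
	delete(vg.lvs, name)
	vg.free += size
	return nil
}

// FreeBytes is what topolvm-node watches to export node capacity.
func (vg *fakeVG) FreeBytes() uint64 { return vg.free }

func main() {
	vg := newFakeVG(100 << 30) // 100 GiB volume group
	_ = vg.CreateLV("pvc-1", 10<<30)
	fmt.Println(vg.FreeBytes() >> 30) // 90 GiB remain
}
```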
topolvm-node implements CSI node services as well as miscellaneous control
on each Node. It communicates with LVMd to watch changes in free space
of a volume group and exports the information by annotating the Kubernetes
Node resource of the running node. In addition, it adds a finalizer
to the Node to clean up PersistentVolumeClaims (PVCs) bound to the node. It also works as a custom Kubernetes controller to implement
dynamic volume provisioning. Details are described in the following sections.
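The capacity export described above boils down to turning per-device-class free bytes into Node annotations. A minimal sketch (the annotation key is from this document; the helper name is hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

// capacityAnnotations builds the capacity.topolvm.io/<device-class>
// annotations that topolvm-node attaches to its Node resource.
// The input maps a device-class name to the free bytes reported by LVMd.
func capacityAnnotations(free map[string]uint64) map[string]string {
	ann := make(map[string]string, len(free))
	for dc, bytes := range free {
		ann["capacity.topolvm.io/"+dc] = strconv.FormatUint(bytes, 10)
	}
	return ann
}

func main() {
	ann := capacityAnnotations(map[string]uint64{"ssd": 64 << 30})
	fmt.Println(ann["capacity.topolvm.io/ssd"]) // 64 GiB in bytes
}
```

In the real component this map is applied to the Node object through the Kubernetes API; the sketch only shows the key/value shape.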
topolvm-controller implements CSI controller services. It also works as
a custom Kubernetes controller to implement dynamic volume provisioning and
resource cleanups.
topolvm-scheduler is a scheduler extender to extend the
standard Kubernetes scheduler for TopoLVM.
To extend the standard scheduler, TopoLVM components work together as follows:
- `topolvm-node` exposes free storage capacity as a `capacity.topolvm.io/<device-class>` annotation on each Node.
- `topolvm-controller` works as a mutating webhook for new Pods:
    - It adds a `capacity.topolvm.io/<device-class>` annotation to a pod and a `topolvm.io/capacity` resource to the first container of the pod.
    - The value of the annotation is the sum of the storage capacity requests of unbound TopoLVM PVCs for each volume group referenced by the pod.
- `topolvm-scheduler` filters and scores Nodes for a new pod having a `topolvm.io/capacity` resource request:
    - Nodes having less capacity in the given volume group than requested are filtered out.
    - Nodes having larger capacity in the given volume group are scored higher.
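The filter and score steps above can be sketched as follows. The annotation key matches this document; the scoring formula here is a simple stand-in for illustration, not topolvm-scheduler's actual formula:

```go
package main

import (
	"fmt"
	"strconv"
)

// node holds the free capacity a Node advertises via its
// capacity.topolvm.io/<device-class> annotation (value in bytes).
type node struct {
	name        string
	annotations map[string]string
}

// filterAndScore mimics the extender's behavior for a pod requesting
// `requested` bytes from device class `dc`: nodes with less free
// capacity are filtered out, and survivors score higher the more free
// capacity they have (here simply free GiB; the real formula differs).
func filterAndScore(nodes []node, dc string, requested uint64) map[string]uint64 {
	scores := map[string]uint64{}
	for _, n := range nodes {
		v, ok := n.annotations["capacity.topolvm.io/"+dc]
		if !ok {
			continue // node does not manage this device class
		}
		free, err := strconv.ParseUint(v, 10, 64)
		if err != nil || free < requested {
			continue // filtered: not enough free capacity
		}
		scores[n.name] = free >> 30 // more free GiB, higher score
	}
	return scores
}

func main() {
	nodes := []node{
		{"node-a", map[string]string{"capacity.topolvm.io/ssd": "21474836480"}},  // 20 GiB
		{"node-b", map[string]string{"capacity.topolvm.io/ssd": "107374182400"}}, // 100 GiB
	}
	fmt.Println(filterAndScore(nodes, "ssd", 50<<30)) // only node-b survives
}
```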
Quick answer: Using extended resources prevents PVCs from being resized.
Extended resources are a Kubernetes feature to allow users to define arbitrary resources consumed by Pods.
The advantage of extended resources is that kube-scheduler takes them into account for Pod scheduling.
However, using extended resources to schedule pods onto nodes with sufficient capacity has several issues.
One problem is that the resource requests need to be copied from PVCs to Pods. For example, if a Pod has two PVCs requesting 10 GiB and 20 GiB of storage, the Pod should request 30 GiB of storage capacity.
The biggest problem appears when a PVC gets resized. Suppose that a node has 100 GiB of storage capacity as an extended resource, and a Pod with a PVC requesting 50 GiB of storage is scheduled to the node. If the PVC is resized to 80 GiB, the remaining storage becomes 20 GiB.
To keep track of the volume usage, the Pod should now request 80 GiB storage. But this is impossible because kube-apiserver does not allow editing Pod resource requests. As a consequence, kube-scheduler fails to notice the change in storage usage.
TopoLVM, on the other hand, keeps track of the volume free capacity through annotations of nodes.
TopoLVM's extended scheduler topolvm-scheduler ignores the current usage. It only cares if a node has sufficient free capacity for new Pods.
To support dynamic volume provisioning, the CSI controller service needs to create a
logical volume on remote target nodes. In general, the CSI controller runs on a
different node from the target node of the volume. To allow communication
between CSI controller and the target node, TopoLVM uses a custom resource
called LogicalVolume.
Dynamic provisioning depends on CSI external-provisioner sidecar container.
- `external-provisioner` finds a new unbound PersistentVolumeClaim (PVC) for TopoLVM.
- `external-provisioner` calls the CSI controller's `CreateVolume` with the topology key of the target node.
- `topolvm-controller` creates a `LogicalVolume` with the topology key and capacity of the volume.
- `topolvm-node` on the target node finds the `LogicalVolume`.
- `topolvm-node` sends a volume create request to `LVMd`.
- `LVMd` creates an LVM logical volume as requested.
- `topolvm-node` updates the status of the `LogicalVolume`.
- `topolvm-controller` finds the updated status of the `LogicalVolume`.
- `topolvm-controller` sends the success (or failure) to `external-provisioner`.
- `external-provisioner` creates a PersistentVolume (PV) and binds it to the PVC.
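The handshake above can be simulated with in-memory structs. The field names below only approximate the LogicalVolume custom resource (they are assumptions, not the exact CRD schema):

```go
package main

import "fmt"

// LogicalVolume approximates TopoLVM's custom resource: the controller
// fills Spec during CreateVolume, and the node-side agent fills Status
// once LVMd has done its work. Field names are illustrative.
type LogicalVolume struct {
	Name string
	Spec struct {
		NodeName string // topology key of the target node
		Size     uint64 // requested capacity in bytes
	}
	Status struct {
		VolumeID string // set by topolvm-node once the LV exists
		Code     int    // non-zero on failure
	}
}

func main() {
	// topolvm-controller creates the resource during CreateVolume.
	lv := &LogicalVolume{Name: "pvc-xyz"}
	lv.Spec.NodeName = "node-a"
	lv.Spec.Size = 10 << 30

	// topolvm-node on node-a notices it, asks LVMd to create the LV,
	// then reports back through the status.
	lv.Status.VolumeID = lv.Name

	// topolvm-controller sees the updated status and answers
	// external-provisioner, which then creates and binds the PV.
	fmt.Println(lv.Status.VolumeID != "", lv.Status.Code == 0)
}
```

The custom resource thus acts as a mailbox between the controller and the node agent, which never talk to each other directly.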
When the requested size of PVC is expanded, ControllerExpandVolume of topolvm-controller is called to
change the .spec.size of the corresponding LogicalVolume resource.
If there is a difference between logicalvolume.spec.size and logicalvolume.status.currentSize,
it means that the logical volume corresponding to the LogicalVolume resource should be expanded.
So in that case, topolvm-node sends a ResizeLV request to LVMd.
If it receives a successful response, topolvm-node updates logicalvolume.status.currentSize.
If it receives an erroneous response, it updates the .status.code and .status.message fields with the error.
Then, if the logical volume is not a block device, topolvm-node resizes the filesystem of the logical volume
via NodeExpandVolume or NodePublishVolume.
If the filesystem requires offline resizing, the administrator should take the logical volume offline beforehand.
The resizing is performed in NodePublishVolume in this case.
If the filesystem is resized online, the resizing is performed in NodeExpandVolume.
Currently, all supported filesystems can be resized online, so NodePublishVolume is not involved in resizing.
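The reconcile condition described above can be sketched in a few lines (field names follow the .spec.size and .status.currentSize fields mentioned in this document; the helper is hypothetical):

```go
package main

import "fmt"

// lvState carries the two size fields compared by topolvm-node.
type lvState struct {
	SpecSize    uint64 // .spec.size, updated by ControllerExpandVolume
	CurrentSize uint64 // .status.currentSize, updated after ResizeLV succeeds
}

// needsResize reproduces the reconcile condition: a difference between
// spec.size and status.currentSize means the underlying LVM volume
// must be expanded via LVMd's ResizeLV.
func needsResize(lv lvState) bool {
	return lv.SpecSize != lv.CurrentSize
}

func main() {
	lv := lvState{SpecSize: 80 << 30, CurrentSize: 50 << 30}
	fmt.Println(needsResize(lv)) // PVC grew from 50 GiB to 80 GiB: true

	// On a successful ResizeLV response, topolvm-node records the new
	// size, and the resource is considered reconciled.
	lv.CurrentSize = lv.SpecSize
	fmt.Println(needsResize(lv)) // false
}
```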
TopoLVM depends deeply on Kubernetes. Portability to other container orchestrators (CO) is not considered.