How to Add and Remove Nodes
OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided. You can add CPU workers, GPU workers, and other computing devices (e.g. TPU, NPU) into the cluster.
Preparation
Pre-checks on Nodes to Add
Note: If you are going to remove nodes, you can skip this section.
-
To add worker nodes, please check if the nodes meet The Worker Requirements.
-
If you have configured any PV/PVC storage, please confirm the nodes meet PV's requirements. See Confirm Worker Nodes Environment for details.
-
If you are going to add nodes that have been deleted before, you may need to reload the systemd manager configuration on those nodes:
bash
ssh <node> "sudo systemctl daemon-reload"
Pull & Modify Cluster Settings
- Log in to your dev box machine and go into your dev box docker container, change directory to
/pai. If you don't have a dev box docker container, launch one.
bash
sudo docker exec -it <your-dev-box> bash
cd /pai
-
Use
paictl.pyto pull config files to a certain folder.Note: Check if the files you pulled contain
config.yaml. Before v1.7.0,config.yamlis stored in~/pai-deploy/cluster-cfg/config.yamlon the dev box machine. If you have upgraded to v1.7.0, please copyconfig.yamlto the<config-folder>and push it to the cluster. If yourconfig.yamlis lost, you need to create a new one. Refer to config.yaml example.
bash
./paictl.py config pull -o <config-folder>
-
Modify
<config-folder>/layout.yaml. Add new nodes intomachine-list, create a newmachine-skuif necessary. Refer to layout.yaml format for schema requirements.Note: If you are going to remove nodes, you can skip this step.
```yaml machine-list: - hostname: new-worker-node--0 hostip: x.x.x.x machine-type: xxx-sku pai-worker: "true"
- hostname: new-worker-node-1
hostip: x.x.x.x
machine-type: xxx-sku
pai-worker: "true"
```
-
Make sure that you can access all nodes in the cluster using the settings in
<config-folder>/config.yaml. If you use SSH key pairs to log in to nodes, please mount the folder~/.sshon the dev box machine to/root/.sshon the dev box docker container。 -
Modify HiveD scheduler settings in
<config-folder>/services-configuration.yamlproperly. Please refer to How to Set up Virtual Clusters and the Hived Scheduler Doc for details.Note: If you are using Kubernetes default scheduler, you can skip this step.
Use Paictl to Add / Remove Nodes
Note: All the following operations should be performed in the dev box docker container on the dev box machine.
Note:When removing nodes, the layout.yaml saved in Kubernetes will be automatically modified after the deletion is successful. We recommend backing up the <config-folder> in the file system of your dev box machine in case your dev box docker container stops.
- Stop related services.
bash
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server job-exporter
- Push the latest configuration.
bash
./paictl.py config push -p <config-folder> -m service
-
Add nodes to and/or remove nodes from kubernetes.
-
To add nodes:
bash ./paictl.py node add -n <node1> <node2> ... -
To remove nodes:
bash ./paictl.py node remove -n <node1> <node2> ... -
Start related services.
bash
./paictl.py service start -n cluster-configuration hivedscheduler rest-server job-exporter