Proxmox VE 5.4 to 6.2 Live Upgrade: Part 1

Posted on 2020/08/12(Wed) 12:05 in technical

Introduction

I'm still running the 3-node cluster I built in an earlier post, "Proxmox VE 5.1 - 3 nodes cluster環境の構築" (building a 3-node Proxmox VE 5.1 cluster on my home server), and have kept it updated since then (it's currently on 5.4).
I used to think a live upgrade was out of the question, but it turns out that if you handle the Corosync migration carefully first, the cluster can be upgraded while the VMs keep running.
Since Proxmox VE 5.x reached EOL on 2020/07/31, I belatedly went ahead with the upgrade to 6.x.
Part 1 covers upgrading Proxmox VE 5.4 to 6.2 without taking down any VMs; Part 2 will cover upgrading the Ceph cluster from Luminous (12.x) to Nautilus (14.x).

Configuration

  • Proxmox VE 5.4 - 3 nodes cluster
  • Ceph cluster - 3 nodes cluster
  • Hardware: TX1320 M2
    • /dev/sda: system - 32GB SSD
    • /dev/sdb: ceph data (bluestore) - 1000GB SSD
    • eno1: bond0 - 1GbE
    • eno2: bond0 - 1GbE
    • bond0
      • Untagged: Admin and VM network
      • VLAN 200: WAN network (for BBR)
  • Roughly 15 VMs

Procedure

I followed the official procedure at https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0.

Corosync upgrade

First, check the cluster state with the pre-upgrade checker, pve5to6.
If it reports errors around VM configuration and the like, fix them beforehand; that depends entirely on your own setup, so I won't go into it here.

Detect "FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!"

Run this on every node.

# pve5to6 
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =

Checking for package updates..
PASS: all packages uptodate

Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2

Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.

Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.

= CHECKING CLUSTER HEALTH/SETTINGS =

PASS: systemd unit 'pve-cluster.service' is in state 'active'
PASS: systemd unit 'corosync.service' is in state 'active'
PASS: Cluster Filesystem is quorate.

Analzying quorum settings and state..
INFO: configured votes - nodes: 3
INFO: configured votes - qdevice: 0
INFO: current expected votes: 3
INFO: current total votes: 3

Checking nodelist entries..
PASS: pve03: ring0_addr is configured to use IP address '192.168.122.28'
WARN: pve01: ring0_addr 'pve01' resolves to '192.168.122.26'.
 Consider replacing it with the currently resolved IP address.
WARN: pve02: ring0_addr 'pve02' resolves to '192.168.122.27'.
 Consider replacing it with the currently resolved IP address.

Checking totem settings..
PASS: Corosync transport set to implicit default.
PASS: Corosync encryption and authentication enabled.

INFO: run 'pvecm status' to get detailed cluster status..

= CHECKING INSTALLED COROSYNC VERSION =

FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!

= CHECKING HYPER-CONVERGED CEPH STATUS =

INFO: hyper-converged ceph setup detected!
INFO: getting Ceph status/health information..
WARN: Ceph health reported as 'HEALTH_WARN'.
      Use the PVE dashboard or 'ceph -s' to determine the specific issues and try to resolve them.
INFO: getting Ceph OSD flags..
PASS: all PGs have been scrubbed at least once while running Ceph Luminous.
INFO: getting Ceph daemon versions..
PASS: single running version detected for daemon type monitor.
PASS: single running version detected for daemon type manager.
SKIP: no running instances detected for daemon type MDS.
PASS: single running version detected for daemon type OSD.
PASS: single running overall version detected for all Ceph daemon types.
WARN: 'noout' flag not set - recommended to prevent rebalancing during upgrades.
INFO: checking Ceph config..
WARN: No 'mon_host' entry found in ceph config.
  It's recommended to add mon_host with all monitor addresses (without ports) to the global section.
PASS: 'ms_bind_ipv6' not enabled
WARN: [global] config section contains 'keyring' option, which will prevent services from starting with Nautilus.
 Move 'keyring' option to [client] section instead.

= CHECKING CONFIGURED STORAGES =

storage 'www' is not online
PASS: storage 'rdb_vm' enabled and active.
PASS: storage 'local-lvm' enabled and active.
PASS: storage 'rdb_ct' enabled and active.
PASS: storage 'local' enabled and active.
WARN: storage 'www' enabled but not active!

= MISCELLANEOUS CHECKS =

INFO: Checking common daemon services..
PASS: systemd unit 'pveproxy.service' is in state 'active'
PASS: systemd unit 'pvedaemon.service' is in state 'active'
PASS: systemd unit 'pvestatd.service' is in state 'active'
INFO: Checking for running guests..
WARN: 7 running guest(s) detected - consider migrating or stopping them.
INFO: Checking if the local node's hostname 'pve01' is resolvable..
INFO: Checking if resolved IP is configured on local node..
PASS: Resolved node IP '192.168.122.26' configured and active on single interface.
INFO: Check node certificate's RSA key size
PASS: Certificate 'pve-root-ca.pem' passed Debian Busters security level for TLS connections (4096 >= 2048)
PASS: Certificate 'pve-ssl.pem' passed Debian Busters security level for TLS connections (2048 >= 2048)
PASS: Certificate 'pveproxy-ssl.pem' passed Debian Busters security level for TLS connections (2048 >= 2048)
INFO: Checking KVM nesting support, which breaks live migration for VMs using it..
PASS: KVM nested parameter set, but currently no VM with a 'vmx' or 'svm' flag is running.
INFO: Checking VMs with OVMF enabled and bad efidisk sizes...
PASS: No VMs with OVMF and problematic efidisk found.

= SUMMARY =

TOTAL:    39
PASSED:   29
SKIPPED:  1
WARNINGS: 8
FAILURES: 1

ATTENTION: Please check the output for detailed information!
Try to solve the problems one at a time and then run this checklist tool again.
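
Most of the WARN items above are specific to my setup and can be addressed as the messages themselves suggest (for example, moving the 'keyring' option into the [client] section of /etc/pve/ceph.conf and adding a 'mon_host' entry to [global]). The 'noout' warning, which pve5to6 raises to prevent rebalancing while nodes restart, can be cleared with a single command; as a minimal sketch, run this on any node with a working Ceph admin keyring, and unset it again once the upgrade work is done:

# ceph osd set noout
# ceph osd unset noout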

Fix "FAIL: corosync 2.x installed, cluster-wide upgrade to 3.x needed!"

Run the following on all nodes.

# echo "deb http://download.proxmox.com/debian/corosync-3/ stretch main" > /etc/apt/sources.list.d/corosync3.list
# apt update
Hit:1 http://security.debian.org stretch/updates InRelease                     
Ign:2 http://ftp.jp.debian.org/debian stretch InRelease                        
Hit:3 http://ftp.jp.debian.org/debian stretch Release                        
Hit:5 http://download.proxmox.com/debian/ceph-luminous stretch InRelease
Get:6 http://download.proxmox.com/debian/corosync-3 stretch InRelease [1,977 B]
Hit:7 http://download.proxmox.com/debian stretch InRelease
Get:8 http://download.proxmox.com/debian/corosync-3 stretch/main amd64 Packages [38.0 kB]
Fetched 40.0 kB in 3s (10.8 kB/s)  
Reading package lists... Done
Building dependency tree       
Reading state information... Done
7 packages can be upgraded. Run 'apt list --upgradable' to see them.
# apt list --upgradeable
Listing... Done
corosync/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcmap4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcorosync-common4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libcpg4/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libqb0/stable 1.0.5-1~bpo9+2 amd64 [upgradable from: 1.0.3-1~bpo9]
libquorum5/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
libvotequorum8/stable 3.0.4-pve1~bpo9 amd64 [upgradable from: 2.4.4-pve1]
# apt dist-upgrade --download-only -y
# pvecm status
Quorum information
------------------
Date:             Mon Aug 10 13:24:45 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1/796
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.122.26 (local)
0x00000002          1 192.168.122.27
0x00000003          1 192.168.122.28

Once every node is ready, stop the HA services, upgrade the packages, and restart the services.
This time I ran the commands on all nodes at the same time (what RLogin calls "simultaneous send", i.e. broadcasting input to every session).
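
If you don't have a terminal that can broadcast input, one alternative is to push the same sequence to each node over SSH, for example (a rough sketch assuming passwordless root SSH between the nodes; keep the window during which nodes run mixed corosync major versions as short as possible):

# for n in pve01 pve02 pve03; do ssh root@$n 'systemctl stop pve-ha-lrm pve-ha-crm'; done
# for n in pve01 pve02 pve03; do ssh root@$n 'apt dist-upgrade -y'; done
# for n in pve01 pve02 pve03; do ssh root@$n 'systemctl start pve-ha-lrm pve-ha-crm'; done

Either way, the commands and their output on each node looked like this: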

# systemctl stop pve-ha-lrm
# systemctl stop pve-ha-crm
# apt dist-upgrade -y
# pvecm status
Quorum information
------------------
Date:             Mon Aug 10 13:25:29 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.122.26 (local)
0x00000002          1 192.168.122.27
0x00000003          1 192.168.122.28
# systemctl start pve-ha-lrm
# systemctl start pve-ha-crm
# systemctl status pve-ha-lrm
● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-10 13:25:35 JST; 1min 3s ago
  Process: 277005 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 277019 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   Memory: 80.6M
      CPU: 458ms
   CGroup: /system.slice/pve-ha-lrm.service
           └─277019 pve-ha-lrm

Aug 10 13:25:35 pve01 systemd[1]: Starting PVE Local HA Ressource Manager Daemon...
Aug 10 13:25:35 pve01 pve-ha-lrm[277019]: starting server
Aug 10 13:25:35 pve01 pve-ha-lrm[277019]: status change startup => wait_for_agent_lock
Aug 10 13:25:35 pve01 systemd[1]: Started PVE Local HA Ressource Manager Daemon.
# systemctl status pve-ha-crm
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-10 13:25:35 JST; 1min 8s ago
  Process: 276958 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 277004 (pve-ha-crm)
    Tasks: 1 (limit: 4915)
   Memory: 80.8M
      CPU: 454ms
   CGroup: /system.slice/pve-ha-crm.service
           └─277004 pve-ha-crm

Aug 10 13:25:34 pve01 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Aug 10 13:25:35 pve01 systemd[1]: pve-ha-crm.service: PID file /var/run/pve-ha-crm.pid not readable (yet?) after start: No such file or directory
Aug 10 13:25:35 pve01 pve-ha-crm[277004]: starting server
Aug 10 13:25:35 pve01 pve-ha-crm[277004]: status change startup => wait_for_quorum
Aug 10 13:25:35 pve01 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
# pvecm status
Quorum information
------------------
Date:             Mon Aug 10 13:27:34 2020
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          0x00000001
Ring ID:          1.d
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.168.122.26 (local)
0x00000002          1 192.168.122.27
0x00000003          1 192.168.122.28

Corosync status

# systemctl status corosync
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-10 13:25:22 JST; 1h 0min ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 276851 (corosync)
    Tasks: 9 (limit: 4915)
   Memory: 149.1M
      CPU: 1min 629ms
   CGroup: /system.slice/corosync.service
           └─276851 /usr/sbin/corosync -f

Aug 10 13:25:24 pve01 corosync[276851]:   [TOTEM ] A new membership (1.9) was formed. Members joined: 2
Aug 10 13:25:24 pve01 corosync[276851]:   [QUORUM] This node is within the primary component and will provide service.
Aug 10 13:25:24 pve01 corosync[276851]:   [QUORUM] Members[2]: 1 2
Aug 10 13:25:24 pve01 corosync[276851]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 13:25:25 pve01 corosync[276851]:   [KNET  ] rx: host: 3 link: 0 is up
Aug 10 13:25:25 pve01 corosync[276851]:   [KNET  ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 10 13:25:25 pve01 corosync[276851]:   [TOTEM ] A new membership (1.d) was formed. Members joined: 3
Aug 10 13:25:25 pve01 corosync[276851]:   [QUORUM] Members[3]: 1 2 3
Aug 10 13:25:25 pve01 corosync[276851]:   [MAIN  ] Completed service synchronization, ready to provide service.
Aug 10 13:25:25 pve01 corosync[276851]:   [KNET  ] pmtud: PMTUD link change for host: 3 link: 0 from 469 to 8885
# corosync-cfgtool -s
Printing link status.
Local node ID 1
LINK ID 0
        addr    = 192.168.122.26
        status:
                nodeid  1:      localhost
                nodeid  2:      connected
                nodeid  3:      connected
# corosync-cpgtool
Group Name             PID         Node ID
pve_kvstore_v1\x00
                    276820               1 (192.168.122.26)
                    644968               3 (192.168.122.28)
                    620591               2 (192.168.122.27)
pve_dcdb_v1\x00
                    276820               1 (192.168.122.26)
                    644968               3 (192.168.122.28)
                    620591               2 (192.168.122.27)

Clear "CHECKING INSTALLED COROSYNC VERSION"

Run pve5to6 again and confirm that the = CHECKING INSTALLED COROSYNC VERSION = section now reports PASS: corosync 3.x installed.

# pve5to6 
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =

Checking for package updates..
PASS: all packages uptodate

Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2

Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.

Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.

= CHECKING CLUSTER HEALTH/SETTINGS =

PASS: systemd unit 'pve-cluster.service' is in state 'active'
PASS: systemd unit 'corosync.service' is in state 'active'
PASS: Cluster Filesystem is quorate.

Analzying quorum settings and state..
INFO: configured votes - nodes: 3
INFO: configured votes - qdevice: 0
INFO: current expected votes: 3
INFO: current total votes: 3

Checking nodelist entries..
PASS: pve03: ring0_addr is configured to use IP address '192.168.122.28'
WARN: pve01: ring0_addr 'pve01' resolves to '192.168.122.26'.
 Consider replacing it with the currently resolved IP address.
WARN: pve02: ring0_addr 'pve02' resolves to '192.168.122.27'.
 Consider replacing it with the currently resolved IP address.

Checking totem settings..
PASS: Corosync transport set to implicit default.
PASS: Corosync encryption and authentication enabled.

INFO: run 'pvecm status' to get detailed cluster status..

= CHECKING INSTALLED COROSYNC VERSION =

PASS: corosync 3.x installed.

= CHECKING HYPER-CONVERGED CEPH STATUS =

<snip>

Proxmox VE 6 packages upgrade

Once the Corosync upgrade has succeeded, you can upgrade to Proxmox VE 6 one node at a time by moving the VMs onto the nodes that are not currently being upgraded.
We have already confirmed with pve5to6 that the PVE 5.4 packages are fully up to date.

# pve5to6 
= CHECKING VERSION INFORMATION FOR PVE PACKAGES =

Checking for package updates..
PASS: all packages uptodate

Checking proxmox-ve package version..
PASS: proxmox-ve package has version >= 5.4-2

Checking running kernel version..
PASS: expected running kernel '4.15.18-30-pve'.

Checking for installed stock Debian Kernel..
PASS: Stock Debian kernel package not installed.

<snip>

Pre-upgrade preparation

Update the apt sources on all nodes to Buster and pre-download the packages.

  • I use the no-subscription repository, so update that as well
  • I use Ceph, so update its repository too (this is not the change for upgrading Luminous to Nautilus, only the switch from Debian stretch to buster)

# sed -i 's/stretch/buster/g' /etc/apt/sources.list
# sed -i -e 's/stretch/buster/g' /etc/apt/sources.list.d/pve-no-subscription.list
# echo "deb http://download.proxmox.com/debian/ceph-luminous buster main" > /etc/apt/sources.list.d/ceph.list
# apt update
# apt dist-upgrade --download-only -y

Upgrading the first node

Move all running VMs off the first node

This is done the usual way, roughly like this:

# qm migrate 1012 pve02 --online
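
If you want to move every running guest off the node in one go, a small loop also works. This is only a sketch (it assumes the migration target is pve02 and that every running guest is a QEMU VM that can be live-migrated):

# for vmid in $(qm list | awk '$3 == "running" {print $1}'); do qm migrate "$vmid" pve02 --online; done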

Upgrade the first node to PVE 6

Upgrade from PVE 5.4 to PVE 6.2. (Because of the date I did this, it jumps straight to 6.2 rather than 6.0.)

# apt dist-upgrade

A notice like the following appears; press Enter to continue.

W: (pve-apt-hook) !! ATTENTION !!
W: (pve-apt-hook) You are attempting to upgrade from proxmox-ve '5.4-2' to proxmox-ve '6.2-1'. Please make sure to read the Upgrade notes at
W: (pve-apt-hook)       https://pve.proxmox.com/wiki/Upgrade_from_5.x_to_6.0
W: (pve-apt-hook) before proceeding with this operation.
W: (pve-apt-hook) 
W: (pve-apt-hook) Press enter to continue, or C^c to abort.

For /etc/issue I answered Y to install the updated version.

Configuration file '/etc/issue'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** issue (Y/I/N/O/D/Z) [default=N] ? Y

When the libc6 upgrade asked whether to restart services automatically, I answered Yes.
For overwriting /etc/systemd/timesyncd.conf I answered N, because it holds the settings for my internal NTP server.

Configuration file '/etc/systemd/timesyncd.conf'
 ==> Modified (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** timesyncd.conf (Y/I/N/O/D/Z) [default=N] ? N

/etc/apt/sources.list.d/pve-enterprise.list is not needed here, so I answered N.

Configuration file '/etc/apt/sources.list.d/pve-enterprise.list'
 ==> Deleted (by you or by a script) since installation.
 ==> Package distributor has shipped an updated version.
   What would you like to do about it ?  Your options are:
    Y or I  : install the package maintainer's version
    N or O  : keep your currently-installed version
      D     : show the differences between the versions
      Z     : start a shell to examine the situation
 The default action is to keep your current version.
*** pve-enterprise.list (Y/I/N/O/D/Z) [default=N] ? N

Once the package upgrade has finished, reboot the node.

# reboot

A SEIL/x86 issue?

After upgrading the first node, when I live-migrated a SEIL/x86 VM from PVE 5.4 to 6.2, VRRP for some reason kept flapping master->backup and backup->master, which dropped the PPPoE connection registered in its watch-group.
Restarting the VRRP service fixed it, but I never figured out the root cause.

# vrrp lan1 disable
# vrrp lan1 enable

Upgrading the remaining nodes

Once the first node has been upgraded correctly, the cluster runs with PVE 5.4 and 6.2 mixed, and live migration between the versions becomes possible.
As with the first node, move the VMs off each remaining node and then upgrade it (the procedure is the same, so I'll only sketch it below).
Note that live migration from 6.2 -> 5.4 worked without problems in my case, but live migration from 5.4 -> 6.2 failed with a parameter mismatch error.
Whether live migration will work should normally be flagged by the = MISCELLANEOUS CHECKS = section of the pve5to6 script, so as long as you resolve any issues beforehand there shouldn't be any particular trouble.
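
For reference, the flow repeated on each remaining node is roughly the following (a sketch; <vmid> and <target-node> are placeholders, and the apt sources were already switched to Buster in the preparation step):

# qm migrate <vmid> <target-node> --online   # for each guest still running on this node
# apt dist-upgrade                           # upgrade this node to PVE 6.2
# reboot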

Post-upgrade cleanup

Remove the repository that was temporarily added to upgrade Corosync to the 3.x series.

# rm /etc/apt/sources.list.d/corosync3.list
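
After removing the file, a final apt update refreshes the package index; on PVE 6 / Debian Buster the corosync 3.x packages come from the regular Proxmox repositories, so no further source changes are needed.

# apt update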

Closing

Apart from the small SEIL/x86 hiccup, the upgrade from Proxmox VE 5.4 to 6.2 went through almost without incident.
The remaining work is upgrading the Ceph cluster from Luminous to Nautilus, but that's a story for the next post.