Decommissioned node cannot rejoin the Pulsar cluster

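The cluster has three nodes, and we plan to replace them one at a time with higher-performance machines.
1. On node 1, I ran the decommission command to decommission the bookie.
The steps were: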
bookkeeper shell listunderreplicated
systemctl stop bookkeeper
bookkeeper shell decommissionbookie
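After the last ledger finished migrating, I took node 1 offline.
2. I swapped in a new server that uses the same IP, and copied the Pulsar base directory over from the original server.
After the restart, bookkeeper failed to start.
The error is as follows: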
2022-10-20T17:31:05,490+0800 [main] ERROR org.apache.bookkeeper.bookie.Bookie - There are directories without a cookie, and this is neither a new environment, nor is storage expansion enabled. Empty directories are [/data1/pulsar/bookkeeper/journal/current, /data1/pulsar/bookkeeper/ledgers/current]
2022-10-20T17:31:05,490+0800 [main] INFO org.apache.bookkeeper.proto.BookieNettyServer - Shutting down BookieNettyServer
2022-10-20T17:31:05,512+0800 [main] ERROR org.apache.bookkeeper.server.Main - Failed to build bookie server
org.apache.bookkeeper.bookie.BookieException$InvalidCookieException:
at org.apache.bookkeeper.bookie.Bookie.checkEnvironmentWithStorageExpansion(Bookie.java:494) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.bookie.Bookie.checkEnvironment(Bookie.java:273) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.bookie.Bookie.<init>(Bookie.java:731) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.proto.BookieServer.newBookie(BookieServer.java:152) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.proto.BookieServer.<init>(BookieServer.java:120) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.server.service.BookieService.<init>(BookieService.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.server.Main.buildBookieServer(Main.java:304) ~[org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.server.Main.doMain(Main.java:226) [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]
at org.apache.bookkeeper.server.Main.main(Main.java:208) [org.apache.bookkeeper-bookkeeper-server-4.14.5.jar:4.14.5]

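After that, we generated a cookie with the following content:
4
bookieHost: "10.1.35.31:3181"
journalDir: "/data1/pulsar/bookkeeper/journal"
ledgerDirs: "1\t/data1/pulsar/bookkeeper/ledgers"
instanceId: "3d1b5365-e30a-4f57-91c6-5cc67282bbe5"
We placed a copy in each of the corresponding directories, but then ran into the following problem.
The newly added node reports the error below. With the functions feature disabled it can send and receive data normally, but connectors cannot be used.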
2022-10-21T17:46:43,713+0800 [cluster-service-coordinator-timer] ERROR org.apache.pulsar.common.util.Runnables - Unexpected throwable caught
java.lang.RuntimeException: org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:96) ~[org.apache.pulsar-pulsar-functions-worker-2.10.1.jar:2.10.1]
at org.apache.pulsar.functions.worker.WorkerUtils.lambda$getIsStillLeaderSupplier$4(WorkerUtils.java:443) ~[org.apache.pulsar-pulsar-functions-worker-2.10.1.jar:2.10.1]
at org.apache.pulsar.functions.worker.ClusterServiceCoordinator.lambda$start$0(ClusterServiceCoordinator.java:74) ~[org.apache.pulsar-pulsar-functions-worker-2.10.1.jar:2.10.1]
at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:54) [org.apache.pulsar-pulsar-common-2.10.1.jar:2.10.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_342]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_342]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_342]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_342]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
at org.apache.pulsar.client.admin.internal.BaseResource.sync(BaseResource.java:299) ~[org.apache.pulsar-pulsar-client-admin-original-2.10.1.jar:2.10.1]
at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:642) ~[org.apache.pulsar-pulsar-client-admin-original-2.10.1.jar:2.10.1]
at org.apache.pulsar.client.admin.Topics.getStats(Topics.java:1047) ~[org.apache.pulsar-pulsar-client-admin-api-2.10.1.jar:2.10.1]
at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:92) ~[org.apache.pulsar-pulsar-functions-worker-2.10.1.jar:2.10.1]
… 10 more
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784) ~[?:1.8.0_342]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928) ~[?:1.8.0_342]
at org.apache.pulsar.client.admin.internal.BaseResou

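There is a problem with the replacement procedure. The key factor is that the old and new machines use the same IP. While the metadata has not yet been updated, this causes the node to be misidentified later on.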

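Analysis of what happened:

  1. After the old bookie was taken offline normally, its metadata in ZooKeeper was not cleaned up promptly.
  2. The new bookie node was then added with the same IP, so the cluster treated it as the old node and validated it against the information it had already cached.
  3. It found that the ledger and journal directories on the bookie asking to join contained no metadata, concluded that this might be a new machine, and therefore refused to start the service. The implication is that we have to resolve the metadata synchronization problem by hand.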
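
The check described in step 3 can be approximated by hand before starting the bookie. The following is a minimal sketch, not BookKeeper's own logic: it assumes the cookie is the file current/VERSION inside each journal and ledger directory, and that every directory of one bookie holds an identical copy. The function names are made up for illustration.

```shell
#!/bin/sh
# Print the "current" subdirectory of every given bookie directory
# that has no cookie file.
# Assumption: the cookie is stored as <dir>/current/VERSION.
dirs_without_cookie() {
    for dir in "$@"; do
        if [ ! -f "$dir/current/VERSION" ]; then
            printf '%s\n' "$dir/current"
        fi
    done
}

# Succeed only if every given directory holds a byte-identical cookie.
# Assumption: one bookie writes the same cookie into all of its directories.
cookies_match() {
    first="$1/current/VERSION"
    [ -f "$first" ] || return 1
    shift
    for dir in "$@"; do
        # cmp -s is silent; it fails if the files differ or one is missing.
        cmp -s "$first" "$dir/current/VERSION" || return 1
    done
    return 0
}
```

With the paths from the error message, this would be called as `dirs_without_cookie /data1/pulsar/bookkeeper/journal /data1/pulsar/bookkeeper/ledgers`. Any path it prints corresponds to an entry in the "Empty directories are [...]" list of the startup error. Note that this only inspects the local disks; whether the local cookie also agrees with the copy kept in ZooKeeper is a separate question.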

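At this point you can normally follow the official "Missing disks or directories" documentation to resolve it.
It mainly comes down to running bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools <zkserver> <oldbookie> <newbookie> to sync the metadata in ZooKeeper to the new node. [Another approach is to directly delete the bookie's metadata node in ZooKeeper. This is not recommended for anyone unfamiliar with the metadata.]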

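Recommendation:
The other problems you are now seeing may be related to this bookie node failing to come online. I suggest redoing the steps above: take the node offline first, then bring it back online. Once you have confirmed that it came online cleanly, test the other features. The knock-on problems should then no longer appear.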
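
The offline-then-online sequence can be written down as a small script so that it is repeated the same way on each node. This is only a sketch under assumptions: the three offline commands and their order are taken from the original post, `bookkeeper` as the systemd service name is the poster's setup, and the `listbookies -rw` check is an assumption about what this BookKeeper version's shell offers. The script prints the commands by default and only executes them when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 in the environment to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Take the bookie offline, in the order used in the original post.
decommission_bookie() {
    run bookkeeper shell listunderreplicated   # look for under-replicated ledgers first
    run systemctl stop bookkeeper              # "bookkeeper" is the poster's service name
    run bookkeeper shell decommissionbookie    # move this bookie's ledgers elsewhere
}

# Start the replacement bookie and check that it registered.
verify_bookie_online() {
    run systemctl start bookkeeper
    run bookkeeper shell listbookies -rw       # assumption: lists read-write bookies
}
```

Running `decommission_bookie` in the default dry-run mode prints the three offline commands prefixed with `+`, which makes it easy to review the sequence before executing it on a production node.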