Puslar 集群 broker 和 bookkeeper 进程宕掉

集群配置:3节点 2核8G集群,每个节点都部署有bookeeper、broker、zookeeper,jvm内存为默认2G配置,pulsar版本2.9.0

故障前后puslar服务监控
发现其中一个节点 bookies的ledgers数量有突增,但是topic数量前后没有变化,producers数量有突增

ledgers数量突增bookie所在主机的broker报错日志

2022-06-15T17:31:56,542+0000 [main-SendThread(172.23.24.11:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 20419ms for session id 0x20030770a00001d
2022-06-15T17:31:56,542+0000 [main-SendThread(172.23.9.102:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 20939ms for session id 0x1003077081c0022
2022-06-15T17:31:56,542+0000 [main-SendThread(172.23.29.173:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 19882ms for session id 0x300307756940017
2022-06-15T17:32:42,380+0000 [pulsar-web-36-8] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:31:56 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 45153
2022-06-15T17:32:55,683+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:32:42 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:32:55,683+0000 [pulsar-web-36-5] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:32:42 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:32:55,683+0000 [pulsar-web-36-7] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:32:42 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 639
2022-06-15T17:32:56,235+0000 [main-SendThread(172.23.9.102:2181)] WARN  org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 43814ms for session id 0x1003077081c0021
2022-06-15T17:33:01,809+0000 [pulsar-web-36-6] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:32:57 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 4472
2022-06-15T17:33:04,570+0000 [pulsar-load-manager-1-1] INFO  org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Writing local data to metadata store because maximum change 10.228314995765686% exceeded threshold 10%; time since last report written is 825.846 seconds
2022-06-15T17:33:04,570+0000 [metadata-store-zk-session-watcher-7-1] WARN  org.apache.pulsar.metadata.impl.ZKSessionWatcher - ZooKeeper client is disconnected. Waiting to reconnect, time remaining = 25.0 seconds
2022-06-15T17:33:36,223+0000 [main-SendThread(172.23.24.11:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x20030770a00001d for sever ip-172-23-24-11.us-west-2.compute.internal/172.23.24.11:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20419ms for session id 0x20030770a00001d
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
2022-06-15T17:33:40,205+0000 [main-SendThread(172.23.9.102:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x1003077081c0021 for sever ip-172-23-9-102.us-west-2.compute.internal/172.23.9.102:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 43814ms for session id 0x1003077081c0021
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
2022-06-15T17:33:41,535+0000 [main-SendThread(172.23.29.173:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x300307756940017 for sever ip-172-23-29-173/172.23.29.173:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 19882ms for session id 0x300307756940017
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
2022-06-15T17:33:41,590+0000 [main-SendThread(172.23.9.102:2181)] WARN  org.apache.zookeeper.ClientCnxn - Session 0x1003077081c0022 for sever ip-172-23-9-102.us-west-2.compute.internal/172.23.9.102:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 20939ms for session id 0x1003077081c0022
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1258) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
2022-06-15T17:33:44,359+0000 [pulsar-web-36-8] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:33:41 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 2768
2022-06-15T17:33:47,177+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:33:41 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 5586
2022-06-15T17:33:49,564+0000 [pulsar-web-36-8] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:33:44 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 5164
2022-06-15T17:33:55,668+0000 [metadata-store-zk-session-watcher-7-1] INFO  org.apache.pulsar.broker.PulsarService - Received metadata service session event: ConnectionLost
2022-06-15T17:33:57,366+0000 [pulsar-web-36-4] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:33:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 7657
2022-06-15T17:33:57,377+0000 [metadata-store-zk-session-watcher-7-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl - Received MetadataStore session event: ConnectionLost
2022-06-15T17:33:59,710+0000 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
2022-06-15T17:33:59,710+0000 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
2022-06-15T17:33:59,682+0000 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
2022-06-15T17:34:58,245+0000 [pulsar-web-36-7] WARN  org.apache.pulsar.broker.web.ResponseHandlerFilter - Http request Request[GET //172.23.29.173:8080/metrics/]@6d7e0405 async context timeout.
2022-06-15T17:34:58,245+0000 [pulsar-web-36-5] WARN  org.apache.pulsar.broker.web.ResponseHandlerFilter - Http request Request[GET //172.23.29.173:8080/metrics/]@2e1d9ab9 async context timeout.
2022-06-15T17:34:58,245+0000 [pulsar-web-36-1] WARN  org.apache.pulsar.broker.web.ResponseHandlerFilter - Http request Request[GET //172.23.29.173:8080/metrics/]@637ade4e async context timeout.
2022-06-15T17:33:52,694+0000 [cluster-service-coordinator-timer] ERROR org.apache.pulsar.functions.worker.MembershipManager - Failed to get status of coordinate topic persistent://public/functions/coordinate
org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:713) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        at org.apache.pulsar.client.admin.Topics.getStats(Topics.java:867) ~[org.apache.pulsar-pulsar-client-admin-api-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:92) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.lambda$start$5(PulsarWorkerService.java:510) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.ClusterServiceCoordinator.lambda$start$0(ClusterServiceCoordinator.java:73) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_181]
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:706) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        ... 11 more
2022-06-15T17:35:49,662+0000 [pulsar-web-36-4] WARN  org.apache.pulsar.broker.web.ResponseHandlerFilter - Http request Request[GET //172.23.29.173:8080/metrics/]@337015d8 async context timeout.
2022-06-15T17:35:49,673+0000 [pulsar-acceptor-20-1] WARN  io.netty.channel.AbstractChannel - Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xf5b9d4b3, L:/172.23.29.173:6650 - R:/172.23.167.252:47044]
java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:815) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:483) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:87) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:81) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.MultithreadEventLoopGroup.register(MultithreadEventLoopGroup.java:86) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:215) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.epoll.AbstractEpollServerChannel$EpollServerSocketUnsafe.epollInReady(AbstractEpollServerChannel.java:120) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2022-06-15T17:35:49,674+0000 [pulsar-acceptor-20-1] ERROR io.netty.util.concurrent.DefaultPromise.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:343) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:825) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:815) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:841) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:499) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:184) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.bootstrap.ServerBootstrap$ServerBootstrapAcceptor.channelRead(ServerBootstrap.java:215) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [io.netty-netty-transport-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.channel.epoll.AbstractEpollServerChannel$EpollServerSocketUnsafe.epollInReady(AbstractEpollServerChannel.java:120) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [io.netty-netty-transport-native-epoll-4.1.68.Final-linux-x86_64.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
2022-06-15T17:35:49,684+0000 [main-EventThread] INFO  org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase - ZooKeeper client is disconnected from zookeeper now, but it is OK unless we received EXPIRED event.
2022-06-15T17:35:49,684+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-172-23-29-173/172.23.29.173:2181.
2022-06-15T17:35:49,703+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
2022-06-15T17:35:49,703+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.23.29.173:13657, server: ip-172-23-29-173/172.23.29.173:2181
2022-06-15T17:35:49,703+0000 [main-EventThread] INFO  org.apache.distributedlog.zk.LimitedPermitManager - EnablePermits = false, Epoch = 0.
2022-06-15T17:35:49,703+0000 [main-EventThread] INFO  org.apache.distributedlog.zk.LimitedPermitManager - EnablePermits = false, Epoch = 0.
2022-06-15T17:35:49,683+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-172-23-29-173/172.23.29.173:2181.
2022-06-15T17:35:49,704+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
2022-06-15T17:35:49,704+0000 [main-SendThread(172.23.29.173:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.23.29.173:13659, server: ip-172-23-29-173/172.23.29.173:2181
2022-06-15T17:35:49,683+0000 [pulsar-web-36-6] INFO  org.eclipse.jetty.server.RequestLog - 172.23.29.173 - - [15/Jun/2022:17:35:49 +0000] "GET /admin/v2/persistent/public/functions/coordinate/stats?getPreciseBacklog=false&subscriptionBacklogSize=false HTTP/1.1" 307 0 "-" "Pulsar-Java-v2.9.0" 1
2022-06-15T17:35:49,681+0000 [prometheus-stats-37-1] ERROR org.apache.pulsar.functions.worker.MembershipManager - Failed to get status of coordinate topic persistent://public/functions/coordinate
org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:713) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        at org.apache.pulsar.client.admin.Topics.getStats(Topics.java:867) ~[org.apache.pulsar-pulsar-client-admin-api-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:92) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.lambda$start$5(PulsarWorkerService.java:510) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.WorkerStatsManager.generateLeaderMetrics(WorkerStatsManager.java:311) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.WorkerStatsManager.getStatsAsString(WorkerStatsManager.java:306) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.FunctionsStatsGenerator.generate(FunctionsStatsGenerator.java:45) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.generateFunctionsStats(PulsarWorkerService.java:186) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsGenerator.generate(PrometheusMetricsGenerator.java:114) ~[org.apache.pulsar-pulsar-broker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet.lambda$doGet$0(PrometheusMetricsServlet.java:77) ~[org.apache.pulsar-pulsar-broker-2.9.0.jar:2.9.0]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.9.0.jar:2.9.0]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.2.jar:4.14.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_181]
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:706) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        ... 19 more
2022-06-15T17:35:49,705+0000 [prometheus-stats-37-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to generate prometheus stats
java.lang.RuntimeException: org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:96) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.lambda$start$5(PulsarWorkerService.java:510) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.WorkerStatsManager.generateLeaderMetrics(WorkerStatsManager.java:311) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.WorkerStatsManager.getStatsAsString(WorkerStatsManager.java:306) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.FunctionsStatsGenerator.generate(FunctionsStatsGenerator.java:45) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.PulsarWorkerService.generateFunctionsStats(PulsarWorkerService.java:186) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsGenerator.generate(PrometheusMetricsGenerator.java:114) ~[org.apache.pulsar-pulsar-broker-2.9.0.jar:2.9.0]
        at org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet.lambda$doGet$0(PrometheusMetricsServlet.java:77) ~[org.apache.pulsar-pulsar-broker-2.9.0.jar:2.9.0]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.9.0.jar:2.9.0]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.2.jar:4.14.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: org.apache.pulsar.client.admin.PulsarAdminException$TimeoutException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:713) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        at org.apache.pulsar.client.admin.Topics.getStats(Topics.java:867) ~[org.apache.pulsar-pulsar-client-admin-api-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:92) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        ... 17 more
Caused by: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771) ~[?:1.8.0_181]
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) ~[?:1.8.0_181]
        at org.apache.pulsar.client.admin.internal.TopicsImpl.getStats(TopicsImpl.java:706) ~[org.apache.pulsar-pulsar-client-admin-original-2.9.0.jar:2.9.0]
        at org.apache.pulsar.client.admin.Topics.getStats(Topics.java:867) ~[org.apache.pulsar-pulsar-client-admin-api-2.9.0.jar:2.9.0]
        at org.apache.pulsar.functions.worker.MembershipManager.getLeader(MembershipManager.java:92) ~[org.apache.pulsar-pulsar-functions-worker-2.9.0.jar:2.9.0]
        ... 17 more
2022-06-15T17:35:49,705+0000 [main-SendThread(172.23.24.11:2181)] INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-172-23-24-11.us-west-2.compute.internal/172.23.24.11:2181.
2022-06-15T17:35:49,705+0000 [main-SendThread(172.23.24.11:2181)] INFO  org.apache.zookeeper.ClientCnxn - SASL config status: Will not attempt to authenticate using SASL (unknown error)
2022-06-15T17:35:49,679+0000 [pulsar-client-io-47-1] INFO  org.apache.pulsar.client.impl.ClientCnx - [id: 0x30e91381, L:/172.23.29.173:38383 ! R:172.23.24.11/172.23.24.11:6650] Disconnected
2022-06-15T17:35:49,706+0000 [pulsar-client-io-47-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/functions/coordinate] [participants] Closed connection [id: 0x30e91381, L:/172.23.29.173:38383 ! R:172.23.24.11/172.23.24.11:6650] -- Will try again in 0.1 s
2022-06-15T17:35:49,706+0000 [pulsar-client-io-47-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/functions/metadata] [c-pulsar-us-fw-172.23.29.173-8080-function-metadata-tailer-reader-bcbff9dddf] Closed connection [id: 0x30e91381, L:/172.23.29.173:38383 ! R:172.23.24.11/172.23.24.11:6650] -- Will try again in 0.1 s
2022-06-15T17:35:49,706+0000 [pulsar-client-io-47-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/functions/assignments] [c-pulsar-us-fw-172.23.29.173-8080-function-assignment-tailer-reader-1f14466da5] Closed connection [id: 0x30e91381, L:/172.23.29.173:38383 ! R:172.23.24.11/172.23.24.11:6650] -- Will try again in 0.1 s
2022-06-15T17:35:49,678+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 1
2022-06-15T17:35:49,706+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,706+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,706+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,707+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,707+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,707+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.81.183 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,707+0000 [pulsar-web-36-3] INFO  org.eclipse.jetty.server.RequestLog - 172.23.80.218 - - [15/Jun/2022:17:35:49 +0000] "GET /metrics HTTP/1.1" 302 0 "-" "Prometheus/2.24.0" 0
2022-06-15T17:35:49,707+0000 [main-SendThread(172.23.24.11:2181)] INFO  org.apache.zookeeper.ClientCnxn - Socket connection established, initiating session, client: /172.23.29.173:45940, server: ip-172-23-24-11.us-west-2.compute.internal/172.23.24.11:2181
2022-06-15T17:35:49,708+0000 [pulsar-acceptor-20-1] WARN  io.netty.channel.AbstractChannel - Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xdd0885cd, L:/172.23.29.173:6650 - R:/172.23.175.135:57208]
java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:923) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:350) ~[io.netty-netty-common-4.1.68.Final.jar:4.1.68.Final]
        at 

故障前后主机监控

建议可以到 pulsar github 提issue, Issues · apache/pulsar · GitHub

可以看看zk集群的情况,如果zk出现问题会导致集群无法服务。