Add requests-in-flight metric#1539
Conversation
|
@slaunay this looks amazing! looking forward to see the final results! |
|
Looking into why the build failed, it seems that toxiproxy fails to establish connection to the target broker Here are the logs from the failed build ( The failed initial metadata request led to a connection to a different broker In that test case the metrics are validated against the cluster (i.e. all brokers) and a specific one Looking into that build, it seems that there are plenty of those errors: My guess is that broker number 1 stopped somehow (there is no Note that this failure is not related to the current PR but something that probably happens from time to time so I will submit a separate PR to address it. |
|
👋 I'm about to cut a new release when kafka 2.4.0 is released in a few days. Any further thoughts on this PR? |
|
You are probably asking the community but I would love to this integrated as I would then be able to propose some changes to fully honour The changes in I believe it will be a great addition to improve performance. |
|
Last build failed with configuration It seems that there is a race condition between closing the TCP listener and creating a new one on the same port: Not sure why though as there is some synchronization in place to ensure that It is possible but unlikely that another test or process has reused that port before it is used on the new I have not seen that failure before but I do not believe it is related to this PR. |
|
It's just a flaky test, I restarted the affected build. |
- add requests-in-flight and requests-in-flight-for-broker-<brokerID> metrics to the broker - move some b.updateOutgoingCommunicationMetrics(...) just after b.write(...) for consistency - add -max-open-requests flag to kafka-producer-performance - update kafka-producer-performance to monitor total requests in flight - update unit tests - document new metrics This is a squash of the official PR based on more recent Sarama master: IBM#1539
varun06
left a comment
There was a problem hiding this comment.
SGTM, there are some conflicts though.
- add requests-in-flight and requests-in-flight-for-broker-<brokerID> metrics to the broker - move some b.updateOutgoingCommunicationMetrics(...) just after b.write(...) for consistency - add -max-open-requests flag to kafka-producer-performance - update kafka-producer-performance to monitor total requests in flight - update unit tests - validate new metric in TestFuncProducing* - document new metrics
0acbf44 to
2db8d8a
Compare
|
^ I had to rebase as recent changes (#1573) was conflicting indeed and I took the opportunity to squash the original three commits into one. |
Feature
Add
requests-in-flightmetric similar to the Kafka Java client described as:This is an interesting metric to monitor for optimizing throughput when writing to a Kafka cluster on a high latency network link and something that is recommended when implementing the Kafka protocol as described in their network section:
The related configuration in Sarama is
config.MaxOpenRequestsand defaults to5like the Java Kafka client (max.in.flight.requests.per.connection).This configuration is actually not fully honoured when using the
AsyncProducer, more on that below.Changes
requests-in-flightandrequests-in-flight-for-broker-<brokerID>metrics to the brokerb.updateOutgoingCommunicationMetrics(...)just afterb.write(...)for consistency-max-open-requestsflag tokafka-producer-performancekafka-producer-performanceto monitor total requests in flightTesting done
Running the updated tests:
Using the updated
kafka-producer-performanceto see how many total requests are in flight when writing to 5 brokers (with a network latency of ~70ms):Note that the maximum number of requests in flight is
5which corresponds to the number of brokers (the topic has 64 partitions spread mostly evenly between those 5 brokers).And when writing to a single broker/topic partition:
As you can see the number of requests in flight seems to never go above
1per broker despite havingconfig.MaxOpenRequests = 5.This can be explained by a few things:
kafka-producer-performanceis captured every 5 seconds and there might not be much requests in flight at that time (could even be0when accumulating records or building aProduceRequestprior to sending it)-required-acks=-1)AsyncProduceronly allows a singleProduceRequestin flight at a time to a brokerThe last point can be seen when looking at the following section of the
brokerProducerwhere the "bridge" goroutine builds and forwardsProduceRequests one at a time:https://github.com/Shopify/sarama/blob/675b0b1ff204c259877004140a540d6adf38db17/async_producer.go#L660-L674
and the fact that the
Producereceiver on aBrokeris blocking till a response is received or an error occurs:https://github.com/Shopify/sarama/blob/675b0b1ff204c259877004140a540d6adf38db17/broker.go#L336-L355
I have another contribution pending to address that limitation without creating more go routines that results in better throughput in such scenario.
Nevertheless having such metric is one way to check how many requests are awaiting a response to a given broker or to the whole cluster.