Forum Discussion

andy_12_5042 (Nimbostratus)
Sep 04, 2013

LTM TCP Connection management

I am trying to understand why the F5 always shows 2-3 times more active connections for a pool member than are actually in the physical server's state table. In addition, I am seeing a problem with having Linux (Ubuntu) and Solaris servers in the same pool: the Solaris servers get almost all of the connections, while the Ubuntu servers, which are on better hardware, sit mostly idle. The distribution method we use is Least Connections (node), with either a Performance Layer 4 or a Standard TCP virtual server depending on location.

 

So I guess two questions come from this: 1) My understanding of LTM is that TCP connections which are closed normally via a 3-way/4-way close should be removed from the F5 immediately. The server always initiates the active close and hence goes into TIME_WAIT. Why does the pool member's active connection count always show so much more than the server really has active? (On the server side I can see this via netstat; on the F5 I can use b pool | grep cur.)
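
For reference, the two checks I am comparing look roughly like this (the pool name is a placeholder, and the exact bigpipe output fields may differ on other 9.x builds):

    # On a pool member: count sockets in ESTABLISHED state (filter by service port as needed)
    netstat -an | grep -c ESTABLISHED

    # On the BIG-IP (v9 bigpipe): current connection counts for the pool and its members
    b pool my_pool show | grep cur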

 

2) Ubuntu has a hard-coded 60-second TIME_WAIT in the kernel, but on Solaris it is a tunable parameter, which we have set to 10 seconds for performance reasons. (These connections are very short/fast, so there is no issue with the lower value.) Why would the F5 send almost everything to the Solaris servers on poorer hardware, which translates into slower response times? (We are not using OneConnect.)
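
For what it's worth, the Solaris tuning is just the standard ndd knob (the value is in milliseconds); on Linux the 60-second TIME_WAIT is the compile-time TCP_TIMEWAIT_LEN constant, so there is no equivalent runtime setting:

    # Solaris: read and set the TIME_WAIT interval (milliseconds)
    ndd /dev/tcp tcp_time_wait_interval
    ndd -set /dev/tcp tcp_time_wait_interval 10000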

 

I can't seem to find any data that would explain this behaviour, and it does not make any technical sense. We are on archaic code (9.25) which I have no control over, but I have not seen this issue with a mix of OSes before. I have also tried a Round Robin balancing method on the pool, which did not work either; same behaviour. Does anyone have any idea what the problem is here?

 

Thanks Andy

 

18 Replies

  • OK, so could you drop the idle timeout? I'm clutching at straws, but you could also enable loose close.

     

    OneConnect would also help reduce the number of server-side connections and reduce the load somewhat on the servers.
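
    From memory, the v9 bigpipe commands would look something like the lines below (the profile names are placeholders and the exact syntax may differ on your build, so check b profile help before applying anything):

        # Lower the idle timeout on the TCP profile used by the standard virtuals
        b profile tcp my_tcp idle timeout 60

        # Loose close is a FastL4 profile setting (for the Performance Layer 4 virtuals)
        b profile fastl4 my_fastl4 loose close enable

        # OneConnect profile; then add it to the standard (TCP) virtual's profile list
        b profile oneconnect my_oneconnect source mask 0.0.0.0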

     

  • I have also tried to use a Round Robin pool balance method which also did not work and same behaviour

     

    I think round robin should work. How did you test? Can you reproduce the issue?

     

  • I have tried Round Robin, and it works for a while, but then under heavy load we start to see the same issue, with much more traffic going to the Solaris servers.

     

    I have reduced the idle time to as low as 10 seconds, but that does not help since these connections are active. The F5 sees these connections as ESTABLISHED rather than persistent, but this is not reflected in the server's session state table. There appears to be a difference in how long it holds connections to each of these servers, and I just can't understand why.

     

  • I have tried Round Robin, and it works for a while, but then under heavy load we start to see the same issue, with much more traffic going to the Solaris servers.

     

    How did you measure the traffic to each server? Was it from the statistics on the BIG-IP?

     

    Are you using any settings which may affect load distribution?

     

    sol10430: Causes of uneven traffic distribution across BIG-IP pool members

     

    http://support.f5.com/kb/en-us/solutions/public/10000/400/sol10430.html

     

  • Traffic was measured from both the F5 pool member statistics and the server-side session table. The server will always reflect the most accurate count of sockets that are in ESTABLISHED or TIME_WAIT state, for example.

     

    None of the things mentioned in that article apply here. Since I have seen this with Round Robin, that rules out it being an issue only with Least Connections. This is one of those issues where I would need to get at the internals, which I can't do without support. For example, on some other vendors' devices I can turn on specific types of debugging and observe the decision logic for where a request is sent based on the current configuration, which is very helpful in these cases. It would at least provide some insight into why more traffic is getting sent to the same set of servers.

     

    • What_Lies_Bene1 (Cirrostratus)
      I doubt very much if we'll get to the root cause of this, particularly with such an old version of code. However (Nitass gave me this idea in response to another post) perhaps it can be overcome using a more 'intelligent' load balancing method. Candidates would be Weighted Least Connections, Dynamic Ratio, Observed or Predictive.
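
      If it helps, switching methods on a v9 pool should just be a one-liner, something along these lines (the pool name and method keyword are illustrative; b pool help will list the exact method names your build supports):

          # e.g. switch the pool to Observed (member)
          b pool my_pool lb method observed_member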
  • Yeah I agree and was starting to think that is the only possible solution at this point. I will have to test some different types of balancing methods and see what I can do.

     

    Thanks for the comments, guys! I don't know how I ended up with another gig that is using such old software with no support :)

     

    Andy

     

  • You're welcome, it's always the way. Please do post back if this does the trick. Here's a quick rundown of the methods I mentioned:

     

    Weighted Least Connections – Member & Node - This method load balances new connections to whichever Pool Member or Node has the least number of active connections; however, you define a Connection Limit (Weight) for each Pool Member or Node based on your knowledge of its abilities. The Connection Limits are used along with the active connection count to distribute connections unequally, in a Least Connections fashion.

     

    This method is suitable where the real servers have differing capabilities.

     

    As each connection can carry a different overhead (one could relate to a request for an HTML page, another to a 20MB PDF document that needs to be generated and downloaded), this is not a reliable way of distributing bandwidth and processing load between servers.

     

    Member method: The weights and connection count for each Pool Member is calculated only in relation to connections specific to the Pool in question.

     

    Node method: The weights and connection count for each Node is calculated in relation to all the Pools the Node is a Member of.

     

    If all Pool Members have the same Connection Limit then this method acts just like Least Connections.
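
    To get any benefit from Weighted Least Connections here, you would give each member a Connection Limit that reflects its capacity, e.g. higher limits on the Ubuntu members than the Solaris ones. Roughly like the below (the pool name, addresses and limits are made up, and I haven't checked this exact syntax on 9.x, so verify with b pool help):

        # Give the stronger Ubuntu members a higher connection limit than the Solaris members
        b pool my_pool member 10.0.0.10:80 limit 200
        b pool my_pool member 10.0.0.20:80 limit 100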

     

    Dynamic Ratio – Member & Node - Also known as Dynamic Round Robin, this method is similar to Ratio but dynamic; real-time server performance analysis (such as the current number of connections and response time) is used to distribute connections unequally in a circular (Round Robin) fashion. This may sound like Observed, but keep in mind that connections are still distributed in a circular way.

     

    This method is suitable where the real servers have differing capabilities.

     

    Member method: The performance of each Pool Member is calculated only in relation to the Pool in question.

     

    Node method: The performance of each Node is calculated in relation to all the Pools the Node is a Member of.

     

    Observed – Member & Node - This method load balances connections using a ranking derived from the number of Layer Four connections to each real server and each server’s response time to the last request. This is effectively a combination of the Least Connections and Fastest methods.

     

    Not recommended except in specific circumstances and not at all for large Pools. Connections to each Pool Member are only considered in relation to the specific Pool in question.

     

    Member method: The weights and connection count for each Pool Member is calculated only in relation to connections specific to the Pool in question.

     

    Node method: The weights and connection count for each Node is calculated in relation to all the Pools the Node is a Member of.

     

    Predictive – Member & Node - Similar to Observed but more aggressive, as the resulting Pool Member rankings are analysed over time; if a Pool Member's ranking is improving, it will receive a higher proportion of connections than one whose ranking is declining.

     

    Not recommended except in specific circumstances and not at all for large Pools.

     

    Member method: The ranking and analysis for each Pool Member is calculated only in relation to connections and response times specific to the Pool in question.

     

    Node method: The ranking and analysis for each Node is calculated in relation to connections and response times for all the Pools the Node is a Member of.