LDAP and timeLimit Value
Ryan.Tapp at csulb.edu
Tue Nov 20 13:07:35 EST 2018
First off, thank you to everyone who provided input regarding the load issues we experienced with the start of the semester. As expected, load testing has pointed pretty conclusively to the data connector being the weak link. Testing has shown that at any realistic load we experience at any given time, the IdP doesn't break a sweat. Over the last several weeks I've been focusing on our LDAP implementation and have identified some key areas that we can improve (all of which will be phased in over the next couple of months in preparation for the start of the Spring semester).
I've built some new LDAP servers that will be used exclusively for our Shib environment, but I'm getting some sporadic timeouts to these new servers that I can't explain (and that don't seem to be happening in the same fashion in current production). In troubleshooting, I've been able to replicate the issue by connecting a single IdP to a single new LDAP server (eliminating the network load balancer), then switch over to a current production LDAP server and confirm the problem doesn't happen there. I'm pretty certain it's my newly built LDAP servers (same LDAP we use in production: Microsoft's AD LDS server - not Active Directory) and not an IdP issue itself.

What's incredibly frustrating is that the timeout issue goes away after a number of attempts (a bunch of timeouts for 2-3 minutes, then it starts working flawlessly), and I have no issues until I restart the IdP and/or LDAP server. This "number of attempts" doesn't seem to be related to the initial startup of the IdP or LDAP; I've restarted both and let them sit for minutes and sometimes hours before testing. It's like I have to hammer the LDAP server for a period of time to get it to "wake up" for searches, although binding never has an issue. Although I can't completely eliminate the firewall from my testing, the network team assures me I'm not getting any disruption from that front. Like production, these are virtual machines, but beefed up significantly from current production.
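Since binding succeeds even while searches stall, one way to sanity-check the bind leg outside the IdP is a small stdlib-only probe like the sketch below. The host and port are placeholders (AD LDS instances often listen on non-standard ports), and this is just a timing aid, not anything the IdP itself does:

```python
# Times a TCP connect plus an anonymous LDAPv3 simple bind round trip,
# to separate "bind is fast" from "search is slow". Illustrative only.
import socket
import time

# BER-encoded LDAPv3 anonymous simple bind request (messageID=1):
# SEQUENCE { INTEGER 1, [APPLICATION 0] { INTEGER 3, name "", simple "" } }
ANON_BIND = bytes.fromhex("300c020101600702010304008000")

def time_anonymous_bind(host: str, port: int = 389,
                        timeout: float = 5.0) -> float:
    """Return seconds elapsed for connect + bind request/response."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        sock.sendall(ANON_BIND)
        response = sock.recv(1024)   # bindResponse PDU (or empty on close)
        if not response:
            raise ConnectionError("no bindResponse received")
    return time.monotonic() - start
```

Running that in a loop against the new servers (versus production) would show whether the "wake up" behavior is visible at the bind level at all, independent of the IdP.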
From a packet capture of the LDAP traffic between the IdP and LDAP, I see:
1. Production LDAP: typically just under 1 second between the searchRequest from the IdP and the searchResDone success back from LDAP.
2. New-build LDAP, same IdP, while the server is "asleep":
   a. With responseTimeout at the default 3 seconds, the IdP waits 3 seconds, then sends an abandonRequest to LDAP. The IdP logs a "timeout connecting to LDAP" type error and I get no attributes back. This all seems as expected.
   b. With responseTimeout changed to 30 seconds, the LDAP server waits only 4 seconds before sending searchResDone timeLimitExceeded back to the IdP. The IdP doesn't log any error; I just get no attributes back.
3. New-build LDAP, same IdP, once the server has decided to "wake up":
   a. The response time starts to improve as attempts progress (and starts to succeed once it drops under the 4-second limit), eventually reaching the same just-under-1-second elapsed time that I see with our production LDAP environment.
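For reference, the timeouts I've been adjusting live in conf/ldap.properties in a stock IdP v3 install; the values below are illustrative (the 30-second value is the one from the test above, not a recommendation):

```properties
# conf/ldap.properties (IdP v3; illustrative values)
idp.authn.LDAP.connectTimeout = PT3S
idp.authn.LDAP.responseTimeout = PT30S
```

Note these are ISO 8601 durations (PT3S = 3 seconds), and the attribute resolver's LDAP data connector has its own timeout settings separate from the authentication flow's.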
I'm still convinced the issue is ultimately with my new LDAP servers, but my question is about that 4 seconds... where is that coming from? The packet capture seems to clearly show (in production and newly built IdP) that the IdP is including that value regardless of any timeout value in the IdP (see example below). I'm at a loss as to why I'm seeing that and what I can do about it (at least for my testing and troubleshooting). Any ideas?
Of course, I've been staring at this stuff for weeks, so hopefully I'm not just looking past something simple (highly possible).
Example searchRequest from IdP to LDAP:
scope: wholeSubtree (2)
derefAliases: neverDerefAliases (0)
filter: equalityMatch (3)
attributes: 0 items
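For anyone following along in the capture: the timeLimit is a field the LDAP client library writes into every SearchRequest PDU, alongside scope and derefAliases, and it is separate from both the client-side responseTimeout and any server-side query limit. A minimal BER sketch of those fixed fields (this is illustrative, not the IdP's actual code path, and the helper names are mine):

```python
# Sketch of the fixed SearchRequest fields that follow baseObject:
# scope, derefAliases, sizeLimit, timeLimit (filter/attributes omitted).
# Shows where a "4" would sit on the wire as the timeLimit INTEGER.

def ber_int(value: int) -> bytes:
    """BER-encode a small (< 128) non-negative INTEGER (tag 0x02)."""
    body = value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")
    return bytes([0x02, len(body)]) + body

def search_request_limits(scope: int, deref: int, size_limit: int,
                          time_limit: int) -> bytes:
    """Encode scope, derefAliases, sizeLimit, and timeLimit."""
    return (bytes([0x0A, 0x01, scope])   # scope ENUMERATED
            + bytes([0x0A, 0x01, deref]) # derefAliases ENUMERATED
            + ber_int(size_limit)        # sizeLimit INTEGER (0 = none)
            + ber_int(time_limit))       # timeLimit INTEGER, in seconds

# wholeSubtree (2), neverDerefAliases (0), no size limit, 4-second time limit
pdu_fields = search_request_limits(2, 0, 0, 4)
# pdu_fields ends with b"\x02\x01\x04" - the 4-second timeLimit on the wire
```

So if a 4 shows up in the searchRequest regardless of the IdP's responseTimeout setting, it is being stamped in by the client library as a distinct per-request value, which is why the server can come back with timeLimitExceeded at 4 seconds even with responseTimeout at 30.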
California State University Long Beach