General Guidance on IdP Environment Sizing
cantor.2 at osu.edu
Fri Sep 28 13:37:23 EDT 2018
On 9/28/18, 1:09 PM, "users on behalf of Paul Fardy" <users-bounces at shibboleth.net on behalf of paul.fardy at utoronto.ca> wrote:
> Could you expand on this?
With the caveat that I consider every second spent thinking about this sort of thing to be wasted time, and that I know very little about it: the IdP scales linearly. So get a decent load balancer, add servers, declare victory, and move on. Time is expensive; servers aren't.
> This isn't our experience. Our IdP spends most of its time awaiting LDAP and Kerberos
> queries. CPU utilization is low: consistently less than 10% usage of a 4CPU VM, with a few peaks of 20%. We increased
> CPUs for the load. We can't determine if there was any benefit.
You can probably expand your pool sizes then, or your CPUs are incredibly powerful and you should see great performance. My VM is constantly pegged because it's so underpowered. A loaded IdP handling lots of traffic is going to be at 100% across any CPUs it can hit, and if it's not pegged, it's being throttled by some other limitation. My physical box is never loaded, but it spikes routinely to 500-600%.
> The time taken to serve the request, in seconds.
SSO requests should finish in the aggregate in under a second unless your back-end and overall architecture is too slow. There are noisy exceptions in the data but a monthly 95th percentile should certainly be under 1 sec.
If it's not busy with CPU and is not getting in and out then your pools are too small and it's stuck waiting for connections, or your back end is too slow.
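If you're already logging per-request durations, the monthly 95th percentile is easy to compute yourself; a minimal nearest-rank sketch (class name and sample data are made up for illustration):

```java
import java.util.Arrays;

public class Percentile {
    // Nearest-rank 95th percentile of request durations, in milliseconds.
    static long p95(long[] durationsMs) {
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.95 * sorted.length); // nearest-rank method
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up sample: mostly fast requests with one slow outlier.
        long[] samples = {120, 200, 340, 250, 180, 900, 210, 160, 300, 2500};
        System.out.println("p95 = " + p95(samples) + " ms"); // prints: p95 = 2500 ms
    }
}
```

A monthly p95 well over 1000 ms here is the signal that something behind the IdP is slow.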
> This has helped find SLOW responses, though it cannot help tune for fast responses. The granularity is seconds. It would
> be great if we could log LDAP latency to the IdP logs.
You can use the metrics support to take all sorts of timings that would be at least approximating specific LDAP behavior, at least for attributes. I have never had to bother but it's there.
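A cruder version, if you don't want to wire up the metrics registry, is just wrapping the resolver call and logging elapsed time yourself; a sketch (the wrapper class and label are hypothetical, not part of the IdP API):

```java
import java.util.function.Supplier;

public class TimedCall {
    // Times any call and prints the elapsed milliseconds; the real thing
    // would log at DEBUG via the IdP's logging config instead.
    static <T> T timed(String label, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args) {
        // Stand-in lambda where a real LDAP search would go.
        String result = timed("ldap-search", () -> "uid=jdoe");
        System.out.println(result);
    }
}
```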
> Initially, we weren't even re-using LDAP connections. So every authn and every attribute query included connect, bind,
> search, close.
That's certainly a killer. The complexity of pooling authn in LDAP is one big reason I use Kerberos protocol with AD whenever I can.
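If you do stay on LDAP, the pool sizes in V3 live in conf/ldap.properties; something like the following, though check the exact property names and value formats against your version of the IdP:

```properties
# Assumed values for illustration, not recommendations.
idp.pool.LDAP.minSize = 3
idp.pool.LDAP.maxSize = 10
idp.pool.LDAP.validatePeriodically = true
```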
> I think the IdP software's internal issue would be JVM multi-threading. Could context switching be slow?
If it saturates then it will thrash and that's easy to see most of the time if it happens. I haven't heard of that in a long time. Performance falls off a cliff.
> Can we determine if or when it's overloaded?
You said it's not overloaded.
> I increased CPUs. But that's really speculating that the JVM needs help. If
> it's mostly waiting, it doesn't need more hardware. Our system isn't using it. I just wonder if it might need more.
If it's waiting for anything, it needs a new architecture to get its work done, and no, hardware won't help.
> Sizing memory is also an issue. We increased our RAM before a student registration load day and we tuned tomcat: we
> increased the thread pool, ... which immediately increased memory usage, but was it usage or wastage?
Memory is virtually all for metadata until people stop using huge batches and then it's pretty irrelevant.
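For the record, the Tomcat thread pool you tuned is the maxThreads attribute on the Connector in server.xml; an illustrative fragment (the values are assumptions, not recommendations):

```xml
<!-- conf/server.xml: connector sized explicitly rather than by default -->
<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="200"
           SSLEnabled="true" scheme="https" secure="true" />
```

Raising maxThreads reserves more per-thread stack up front, which is the immediate memory jump you saw; whether it's usage or wastage depends on whether the extra threads ever run.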
> setting stack size to 384K
That seems incredibly oversized, but I haven't ever looked at it.
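For reference, stack size is a per-thread JVM setting, so total stack footprint is roughly the Tomcat thread count times -Xss; it's typically set in the container's startup options, e.g.:

```shell
# Fragment for the JVM startup options (e.g. JAVA_OPTS); value from the thread above.
export JAVA_OPTS="$JAVA_OPTS -Xss384k"
```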
> We found our bottlenecks in LDAP latency. One box runs well for us.
That sounds like an LDAP problem to me, which fits everything else you're saying.