Our clustering solution for IdP v3

Julian Williams julian.williams at it.ox.ac.uk
Sun Feb 28 17:44:05 EST 2016


Dear Shibboleth Users,

At Oxford we plan to replace our single node IdP v2 with an IdP v3
cluster in late March. We've been working on this for some time, on and
off (~6 months), and have arrived at a cluster solution that we think
will work for us but is a bit of a departure from anything I've seen
documented. I'd like your feedback on this and whether we're doing
anything crazy!

First off, a bit of relevant background to our environment:

* We run a busy IdP in the UK Federation which *only* consumes the
federation metadata.

* We still have more SPs than we'd like using the back-channel for SAML1
attribute requests, but we plan on working with people to cut this right
down over the next 6 months, to the point where we might be able to
switch it off (although that might be wishful thinking).

* We use a computed ID for the PersistentID / TargetedID, so we don't
need to store it in a persistent datastore (see the short sketch after
this list).

* We run Shibboleth 'on top of' a Stanford WebAuth SSO layer (using
RemoteUser to pass the username through to Shibboleth). This means that
even if Shibboleth SSO temporarily breaks for some reason and users are
forced by Shibboleth to re-authenticate, they won't particularly notice,
as their WebAuth session will get them straight back into Shibboleth.

* We don't plan on changing our IdP's metadata as part of this work, but
we may change it in the relatively near future, as we'd like to stop
advertising support for SAML2 artifact resolution and publish our SAML2
NameIDFormat as persistent (which we will start providing with v3)
rather than the current transient.

* We won't be using the uApprove feature, so we don't need persistent
storage for that.

* We are testing (including clustering) using a test entityID which is
also in the UK Federation. We have tested against as many SPs as can be
persuaded to use our test IdP in place of our production IdP.

* We will have some time to test against our pre-live cluster using the
production entityID/metadata, by making local '/etc/hosts' overrides
(although this won't work for the SAML1 SPs because of the back-channel).

* We will be making use of a Citrix Netscaler (HA pair) to do the
load-balancing. We already have some experience in using this for other
services.
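
On the computed ID point above: the persistent/targeted ID is derived
deterministically from the SP entityID, a stable user attribute and a
private salt, which is why nothing needs to be stored. A minimal Python
sketch of that style of derivation (the attribute value and salt below
are made up for illustration, and the authoritative algorithm is the one
the IdP ships, not this code):

    import base64
    import hashlib

    def computed_persistent_id(sp_entity_id, source_value, salt):
        """Salted hash of the SP entityID and a stable user attribute:
        the same inputs always give the same value, so no database is
        needed and every node in the cluster computes it identically."""
        data = "%s!%s!%s" % (sp_entity_id, source_value, salt)
        digest = hashlib.sha1(data.encode("utf-8")).digest()
        return base64.b64encode(digest).decode("ascii")

    # Made-up example values:
    print(computed_persistent_id("https://sp.example.ac.uk/shibboleth",
                                 "abc123", "long-random-salt"))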


Do we need persistent storage?

We had been thinking until relatively recently that we would have to use
a PostgreSQL backend (our favoured RDBMS here) to store the transient IDs
necessary for supporting back-channel connections across multiple nodes.
However, our (albeit limited) DR testing so far has picked up that the
IdP struggles to provide *any* service if the DBMS disappears off the
network. This is a bit of a worry, because this PostgreSQL solution would
be master-slave with manual failover, so we would be vulnerable to
significant downtime before the slave could be promoted to master.
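
To spell out why the transient ID drives the storage question: the ID
minted on the front channel has to be resolvable when the SP's SAML1
attribute query arrives on the back-channel, so with purely in-memory
storage both requests must hit the same node (or the store must be
shared). A rough Python sketch of the idea, with made-up names:

    import secrets

    # In-memory transient ID store, one per IdP node. Without a shared
    # backend, a back-channel query only succeeds on the node that
    # issued the ID.
    _transient_ids = {}

    def issue_transient_id(principal):
        """Front channel: mint an opaque, short-lived identifier."""
        tid = "_" + secrets.token_hex(16)
        _transient_ids[tid] = principal
        return tid

    def resolve_transient_id(tid):
        """Back channel: map the attribute query's NameIdentifier back
        to the user; returns None if this node never issued the ID."""
        return _transient_ids.get(tid)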

We haven't tried memcached, but we don't think it would work for us in
our environment; I think this is partly down to security concerns, as
our IdPs operate on fairly open public networks.

So we're now thinking that we may get away without needing any
persistent storage *if* we can do something clever with the Netscaler
and direct all SAML1-associated requests (including the back-channel) to
a single node (with failover to another node). The remaining,
SAML2-associated requests will go to a load-balanced cluster of 3 nodes.

We think we've got this working for the front-channel by using a
combination of Netscaler content switching, matching on paths beginning
'/idp/profile/Shibboleth/SSO' for SAML1 or '/idp/profile/SAML2' for
SAML2, to direct requests to different target 'clusters', together with
rewrite rules on the Netscaler that set a cookie to make sure subsequent
requests in the 'conversation' go to the same 'cluster'. The
back-channel port 8443 will simply be directed to the same single-node
cluster as the front-channel SAML1 policy.
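
Written out as plain Python, the decision we are expressing in the
Netscaler content-switching and rewrite policies looks roughly like this
(host names, pool membership and the cookie name are made up for
illustration; the real targets are Netscaler service groups):

    import random

    SAML1_NODE = "idp1.example.ox.ac.uk"     # takes all SAML1 traffic
    SAML2_POOL = ["idp1.example.ox.ac.uk",   # load-balanced SAML2 nodes
                  "idp2.example.ox.ac.uk",
                  "idp3.example.ox.ac.uk"]

    def choose_backend(path, port, cookies):
        """Mirror of the content-switching decision on the Netscaler."""
        # Back-channel SAML1 attribute queries arrive on 8443 and must
        # reach the node that issued the transient ID.
        if port == 8443:
            return SAML1_NODE
        # A stickiness cookie (set by a rewrite rule) keeps subsequent
        # requests in the 'conversation' on the cluster that started it.
        if cookies.get("idp_cluster") == "saml1":
            return SAML1_NODE
        if path.startswith("/idp/profile/Shibboleth/SSO"):  # SAML1
            return SAML1_NODE
        if path.startswith("/idp/profile/SAML2"):           # SAML2
            return random.choice(SAML2_POOL)
        # Anything else goes to the load-balanced pool.
        return random.choice(SAML2_POOL)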

If SAML1 use declines as hoped, we should eventually be able to remove
the policy and the single-node target.

This seems like a good solution for us if it does prove to work. The one
downside I can think of is operationally having to treat the nodes
differently, because one of them will be getting all the SAML1 traffic.
However, part of me is wondering where the gotcha is and whether we've
overlooked something silly. Have we?

Anyway, if you've read this then thanks for taking the time. Any
feedback much appreciated.

Cheers,

Julian



-- 
Julian Williams (Systems Developer, Identity and Access Management)
Systems Development and Support, IT Services, University of Oxford


