For each service, I have two clustered VMs: one in a colo and one local. I use one user account for the web service on each cluster. I have an SPN set up for HTTP that contains the name of each machine and the user. I access the service through the cluster name, and the request goes to one server or the other.
So for later reference, we have machine A, machine B (in the colo), cluster AB spanning the machines, and user C. I connect to the test endpoint, and everything is fine.
I replicate that setup: let's say machine X, machine Y (in the colo), cluster XY, and user Z. I connect to the test endpoint, and everything is fine.
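To make the SPN setup described above concrete, here is a minimal sketch of the SPN list it implies for cluster AB. The host names and the `example.local` domain are placeholders, not the real names, and listing the cluster name alongside the machine names is an assumption (the description above only mentions the machine names and the user).

```python
def expected_spns(hosts, domain="example.local", service="HTTP"):
    """Return the short and fully qualified SPN for each host name."""
    spns = []
    for host in hosts:
        spns.append(f"{service}/{host}")
        spns.append(f"{service}/{host}.{domain}")
    return spns

# Hypothetical names standing in for machine A, machine B, and cluster AB.
cluster_ab = expected_spns(["machine-a", "machine-b", "cluster-ab"])
for spn in cluster_ab:
    # Every SPN for this cluster should map to the same account (user C).
    print(f"{spn} -> user C")
```

The same list can be built for cluster XY and user Z, then compared against what is actually registered on each account.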
So I have a client that connects to cluster AB, which then connects to cluster XY to use its service.
I get 401s at irregular intervals. When it connects, it's fine: on the second hop, I can see that it's the client, and I have its credentials. But sometimes it just fails for no apparent reason. I look in the logs, and there's a 401 there. This never happens when I connect directly to the machines.
Our network guy said he thinks it's an interaction of VMware VMs with Hypervisor VMs, but I had to call BS on that. There's no pattern in which machines the failing requests go through (I added a key that gets passed through the machines so I can see the exact path each request takes, and there's just no pattern).
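The "no pattern" check can be made a bit more systematic by tallying the 401 rate per path from that pass-through key. This is only a sketch: the `(path, status)` record shape and the sample values are made up for illustration, not taken from the real logs.

```python
from collections import Counter

def failure_rate_by_path(records):
    """Return the fraction of 401 responses for each path.

    records: iterable of (path, status) pairs, e.g. ("A->X", 401),
    where path is built from the key passed through the machines.
    """
    totals = Counter()
    failures = Counter()
    for path, status in records:
        totals[path] += 1
        if status == 401:
            failures[path] += 1
    return {path: failures[path] / totals[path] for path in totals}

# Made-up sample data; real records would come from the service logs.
sample = [
    ("A->X", 200), ("A->Y", 401), ("B->X", 200),
    ("B->Y", 200), ("A->Y", 200), ("A->X", 401),
]
rates = failure_rate_by_path(sample)
for path, rate in sorted(rates.items()):
    print(f"{path}: {rate:.0%} of calls returned 401")
```

If one path shows a clearly higher rate over a large enough sample, that points at a specific machine pair; roughly equal rates across all paths would support the "no pattern" observation.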
I think the problem is in passing the Kerberos ticket through the NLB, but he said the NLB just passes along whatever it's sent. The other part of this is that we made another endpoint just to test authorization. It seemed that if we got a 401, we would always get a 401, but if that first call worked, it would work from then on (we have affinity set, so the path stays the same across multiple calls). That started failing today: the other dev got a 201, then got a 401 when he called the real endpoint. (All of this is on the second hop; the first hop always works.)
After looking at this for a couple of months, I now know more about Kerberos than I ever wanted to. And I think we finally have the code right. But netops says there is no way that NLB is causing this issue.
Have you ever seen NLB cause issues with Kerberos tickets?