Intro
Once again we got into this undesirable situation in which for no apparent reason Cognos logins simply stopped working. As administrator of the Cognos gateway piece and only that piece I was ready to swear up and down it cuoldn’t be my fault, but I took a closer look anyways, just in case. Here is what I found.
The details
We had been running Cognos v 10 without incident for over a year now when the word came from the application owner that logins stopped working in the morning. That is, through the gateway. If they tested through the application server directly that was still still working.
I knew I hadn’t changed anything so I assumed they did, or their LDAP back-end authentication server wasn’t working right.
Checking the apache web server logs showed nothing out of the ordinary. And my gateway had been running for a couple months straight (no reboot) when the problem occurred. I tested myself and saw for myself. You would get the initial pop-up log=in screen, then after submitting your authentication information it would just come back to you blank.
I managed to get a trace to the back-end dispatcher. Even that was pretty unspectacular. There was some HTTP communication back-and-forth in a couple of independent streams. The 2nd stream even contained something that looked promising:
CAMUsername=drjohn&CAMPazzword=NEwBCScGSIb3DQ... |
And the server response to that was already to return the user to the login page as though already something had gone wrong.
The app owner said she observed the same problem on the test system, but ignore that comment because that isn’t tested or used much. That’s another environment I hadn’t touched in a long while – 16 months.
I confirmed I could log on to the development dispatcher but not through my gateway. What the heck?
So I decided to look at my own blog posts for inspiration. It seems the thing to do in times like this is to save the configuration. So I tried that on the test system – that’s what it’s for, after all. I breathed a sigh of relief when in fact the save went through – you just never know. But, yes, I got the green checks. And I’ll be darned if that didn’t fix it! So with that small victory, I saved the configuration on my two gateway servers and, yup, logins started working there as well!
I am very annoyed at IBM for making such a faulty product. My private speculation as to what happened is that when you save the configuration you generate cryptographic information, which means PKI which includes certificates which have expiration dates. I suppose the certificates being used to exchange information securely between gateway and dispatcher simply expired and the software inconsiderately produced no errors about the matter other than to stop working. Even when I launched the cogconfig program no errors were displayed initially.
IBM’s role
I strongly suggested to the application owner to open a case with IBM about this ridiculous behaviour. But since it’s IBM I’m not too confident it will go anywhere.
cogconfig.sh
The details of launching the configuration tool in linux is described in the references. But note that unlike that article, this time I did not need to delete any key files or any other files at all. Just save the configuration.
August 16th, 2016 update
Well almost exactly two years later I stepped on that same rake again! Nothing changed yet authentications stopped working. I even took a a packet trace to prove that the gateway was sending packets to the app server, which it was. The app server only reported this very misleading message: unable to authenticate because credentials are invalid. Worse, I was vaguely aware there could be a problem with the cryptographic keys, but I assumed such a problem would scramble all communication to the app server. yet that same app server log file showed the userid of the user correctly! So I was really misled by that.
When they logged on directly to the app server it was fine. See that date? That’s almost exactly two years after my original posting of this article. So I guess the keys were good for two years this last time, then they expired without warning and without any proper logging. Thanks IBM! The solution was exactly the same: run cogconfig.sh and save. That’s it… I obviously forgot my own advice and I did not regenerate the config from time to time.
For the future
I think to avoid a repetition of this problem I may save the configuration every six months or so. 16 months is a strange time in the world of certificates for an expiration to occur. I don’t get that. So maybe my explanation as to what happened is bogus, but it’s all we have for now.
Conclusion
Another mysterious Cognos error. This one we resolved a tad faster than usual because prior experience told us there was one possible action we could reasonably take to help the situation out. There were no panicked reboots of any servers, by the way. Did you have this problem? Welcome to enterprise software!
References
The details on where the configuration program, cogconfig.sh is to be found and run is described in this article.
If you forgot which is your dispatcher, you can grep the file cogstartup.xml for 9300, which is the port it runs on, to give you some hints.