We had no reason to be anxious about this component. It had been running for about a year. It handled around 1000 messages per day and emailed out an automated report twice a day. The solution was based on robust integration tools and technologies, namely TIBCO EMS for delivering messages and Spring Integration for reading and handling them. Everything was predictable, boring and nice.
And one morning everything changed. This component froze with a null pointer exception. Nothing more, nothing less. There were no logs. There never are when you need them. Nothing had changed in the code or in the mode of delivery. There were no obvious miscreants. The business had found out about the break - as one of the automated reports had failed - and was demanding an estimated time for the fix. It was a picture-perfect start for the firefighters of the product team - and they poured out their first cup of coffee.
So, the team swung into action. Half a day later - after multiple calls with the business (not very pleasant, any one of them, mind you) - it was suggested that it might - just might - be that a couple of messages in the 1000 or so did not have a required field, which by the way was guaranteed to be there by the business processes. So we took these two messages off and switched on the component. Lo and behold, it crashed again. This time because there were many more messages than it could handle (remember, messages kept coming in while the team was troubleshooting the problem). I will not bore you with the multitude of calls that followed, or with how a fix was arrived at and delivered. Suffice it to say that too many man hours were spent on this for my comfort. And this led me to write down my thoughts on it.
I am all for communications, meetings, workshops, and the creation of all sorts of requirements and design documents. I see the value in all of them. I really do - although I have been accused many a time of not doing so. But, at the end of the day, there is no substitute for a minimal amount of street smarts. A healthy amount of cynicism goes a long way in designing a resilient system. In this particular case, a couple of things had gone wrong.
1. We trusted the data quality of the feed coming in from a different system. And we should not have. No, this is not going to be written down in any book discussing integration patterns. It is just something that a seasoned developer would not do, but a new one - although as sharp as a tack - would slip up on. Folks had trusted the requirements document that guaranteed that certain fields would be populated. But the fact is, when the fields were not populated, it was not OK for our component to go down. A seasoned developer would have consulted the requirements document and developed to it - but would not have trusted it. He would have been cynical.
2. We trusted the data volume of the feed. And we should not have. Again, this was something written down in the document, and hence the code was technically correct. But if only the developer had said, "Hang on, if you are saying 1000 is the most that you expect, fine, I will pull only 1000 at one go. If there are more, I will pull a second batch. And more batches if I need. But never more than 1000," we would have been fine. We should not have pulled all the data from the message queue on the assumption that it would be fewer than 1000 messages just because it was written down in the document. A seasoned developer would have been cynical of the document.
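To make the two points concrete, here is a minimal sketch of what a more cynical consumer might look like. It is not the fix we actually shipped. Everything in it is an assumption made for illustration: the class name `CynicalConsumer`, the field names, the representation of a message as a `Map<String, String>`, and the use of an in-memory `BlockingQueue` as a stand-in for the real EMS queue. The idea is simply that a message missing a "guaranteed" field is set aside instead of taking the component down, and that no more than a fixed number of messages is ever pulled in one go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: names and the Map-based message shape are assumptions.
// An in-memory BlockingQueue stands in for the real message queue.
final class CynicalConsumer {
    private final List<String> requiredFields;
    private final int maxBatchSize;
    private final List<Map<String, String>> processed = new ArrayList<>();
    private final List<Map<String, String>> quarantined = new ArrayList<>();

    CynicalConsumer(List<String> requiredFields, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.requiredFields = new ArrayList<>(requiredFields);
        this.maxBatchSize = maxBatchSize;
    }

    /** Empties the queue in batches of at most maxBatchSize; returns the number of batches pulled. */
    int drain(BlockingQueue<Map<String, String>> queue) {
        int batches = 0;
        List<Map<String, String>> batch = new ArrayList<>();
        // drainTo moves at most maxBatchSize messages per call, however many are waiting.
        while (queue.drainTo(batch, maxBatchSize) > 0) {
            for (Map<String, String> message : batch) {
                handle(message);
            }
            batch.clear();
            batches++;
        }
        return batches;
    }

    // A bad message is set aside for follow-up; it never stops the component.
    private void handle(Map<String, String> message) {
        if (isValid(message)) {
            processed.add(message);
        } else {
            quarantined.add(message);
        }
    }

    // The document says these fields are always there. Check anyway.
    private boolean isValid(Map<String, String> message) {
        for (String field : requiredFields) {
            String value = message.get(field);
            if (value == null || value.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    List<Map<String, String>> processed() { return processed; }

    List<Map<String, String>> quarantined() { return quarantined; }
}
```

With something along these lines, the two bad messages would have ended up in the quarantined list, to be raised with the upstream team, and a backlog of several thousand messages would simply have meant a few more batches.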
The component is fixed and everything is back in business. It is no biggie. This was not the first time something like this happened and I am willing to wager that it will not be the last. The point that I am trying to make is that the business of software production is not - and perhaps never will be - like the production line of a hardware commodity. It is most unlikely to enjoy the stability, predictability, and repeatability of the production line of, say, a car. So, the proliferation of processes, documents, and meetings is not going to be as successful in this business.
Processes are fine. Documents are fine. Productivity measuring tools and code quality metrics are great. Workshops are great. Peer reviews are a must. But they are quite unlikely to be a substitute for a person who loves coding, takes pride in it, and goes that extra mile to ensure that his code does not fail. These people will always be in short supply and in great demand. As an industry, sooner or later we will have to find a way to create, foster and retain these individuals.
That's it for today. Happy coding.
If you want to get in touch, you can look me up on LinkedIn or Google+.
Thanks for sharing your thoughts with us. Your last sentence hits me, as for sure we need some change here.
Yup agree 100%. I too believe in not trusting any external data - it's either the interfacing engine which checks it or the application. It can't just jump in.
Anyways I see the humour and point in your commentary - nicely put.
i say fire your business analysts.