July 3, 2018 // By Brent Edwards
Today we have a tale of good intentions and failed implementation. A tale of misleading results and frustrating debugging. A tale that has happened countless times in software systems around the world and will continue to happen countless more times in the years to come.
Our tale begins as our hero starts a new week. The previous week was a big one for our hero, who finished the week with the merge of a major update our hero’s team had been working very hard on. This merge was rather large, so our hero was paying very close attention for strange application behavior that could be attributed to the merge.
During the morning status meeting, the team begins discussing the fact that a team member has not been able to log into the app since they arrived earlier that morning. Our hero becomes concerned, pulls down the latest code, and runs the app. Sure enough, the app is preventing users from logging in.
Our hero’s quest has begun.
The application in question is a mobile application which connects to a back-end API. The code has been developed by a third party, which is no longer involved. The application was originally developed as a proof-of-concept. Alas, the proof-of-concept code became production code, bringing with it all the proof-of-concept “get it done quickly” decisions and technical debt. It is at this point that our hero’s team becomes involved.
There are many points at which a failure could be happening to prevent a user from logging in to the application. Our hero begins by adding breakpoints to the mobile application code in the hopes of finding an obvious cause of the problem.
The first pass through hit one of the expected breakpoints, then skipped the rest and instead hit the breakpoint in the exception handler block. The “exception handler” block does nothing more than log the exception and move on. In other words, it swallows the exception. Since our hero’s team is not responsible for this code, it is not their responsibility to fix it at this time. Our hero notes that the exception being caught is a NullReferenceException. Our hero finds this highly unhelpful and is forced to run the code a second time, this time stepping through each line of code after the first breakpoint. Lo and behold, our hero finds the offending null reference. It is a variable that was returned from a method as null and was subsequently used by the remaining code without a proper null check. Thus, the NullReferenceException.
To learn the true cause of the null reference, our hero must dig deeper.
Our hero runs the app a third time, eventually stepping into the null returning method to find yet another exception swallowing block catching yet another NullReferenceException. This exception swallowing block was slightly different, though: it returns null in the case of an exception. Our hero is looking at a method which catches a NullReferenceException and handles it by itself returning a null reference.
Our hero comes to the cold realization that this is going to be a very long quest.
Undaunted, our hero forges on. Method after method, our hero descends the call stack. Battling missing null checks and swallowed NullReferenceExceptions at every level, our hero is determined to find the source of the problem. Deep down, in the bowels of the application code, our hero scrolls and steps and scrolls and steps until… Finally, our hero sees something different! Through blurry eyes, our hero squints at the code on the screen.
What is this? our hero thinks. It looks like an HttpClient…
Wait… It’s receiving a 500 error from the server.
At that moment, after tracing the nulls through a labyrinth of unnecessarily complex code with extremely poor exception handling, our hero is staring at the abyss as the sad understanding washes over.
The original error isn’t coming from the mobile code at all. It’s coming from the API code on the server.
Our hero stares incredulously at the code for a moment longer. Then, unwilling to be bested by poorly written code, our hero takes a deep breath and leaps into the abyss.
It’s dark here. And cold. So cold.
Our hero takes a moment to survey the surroundings. This is the server, all right.
Our hero gets right to work looking for the logs. A few clicks later, the logs are on the screen and our hero begins inspecting them.
There’s the authentication request! And there’s the error! Hooray! The API issue is going to be easy to diagnose!
StringFormatException, our hero thinks. I can look at the API endpoint code to find where a string is being formatted incorrectly and claim victory!
Our hero does just that. There’s the API endpoint code. A quick scan of the method and… There! That’s where the code is incorrectly formatting the string! The IDE has even helpfully underlined the offending code with a green squiggly. Problem solved!
Wait a minute…
What is that code trying to format?
It looks like it’s trying to format an exception message…
Our hero quickly looks the code over. Sure enough, the code in question is the exception handling block. The exception handling block is catching an exception, then failing to format a message with it for the logs, and, therefore, causing the API to return a 500 error.
Our hero’s heart sinks.
More poorly written exception handling code. More line by line debugging and traversing the call stack.
Our hero rallies.
It can’t end like this. I must find the source of this exception!
Our hero sets up the server code for local debugging, sets some breakpoints, takes a deep breath, and sends a request.
Line by line, method by method, our hero continues the quest. This time, battling StringFormatExceptions.
There it is. The cause of all of this. After seeing several exception handlers catch exceptions and then throw their own due to poorly written string formatting code, our hero is staring at the real root cause. A null reference that is a true null reference. Not one caused by poorly written exception code. But an actual null reference caused by a member being used without being properly initialized.
Our hero looks at the commit history. Sure enough, there is code that was inadvertently removed by a team member.
Our hero re-implements the code and runs the server code locally again. This time the server code handles the request flawlessly. Our hero gets excited. The mobile application moves past the login screen.
This is real progress!
The mobile application continues to show the loading spinner. Our hero waits patiently. Seconds tick by. Our hero gets anxious.
Then, a flash. The mobile application changes screens. Success! The user can log into the app now!
Our hero sits back, exhausted, and looks at the page displayed by the mobile app.
Wait. Something is wrong. That’s not the correct page.
Our hero quickly brings up the code that navigates the user after logging in, sets some breakpoints, and runs the application again.
First breakpoint hits. Excellent. Next breakpoint… Wait. Exception handler breakpoint?
Our hero stares at the screen in complete disbelief.