The worst bug I’ve seen (in a while)

When interviewing developers, one of my favorite questions to ask is:

What is the worst bug you’ve ever encountered?

As an interviewer, I’m looking to get a glimpse into what types of issues this person finds challenging and what lessons they carry forward from them. It’s a good question because it forces the interviewee to tell a story (so you can evaluate their communication skills) and it shows you what their debugging process looks like. Developers usually spend the majority of their time problem solving, so it’s important to see that the candidate has a good process for troubleshooting.

If you pose this question to an experienced front-end developer, you’ll often get a story about some nasty Internet Explorer or Safari bug, or some quirky JavaScript thing. If you’re interviewing back-end developers, you’ll often get a story about file handles, some bizarre third-party dependency bug or a race condition that was difficult to track down. After hearing so many different stories, I had forgotten all of my own nasty bug stories. Then, last week, my team ran into an issue that was incredibly hard to reproduce and all of a sudden I found myself chasing after “the worst bug I’ve ever seen”.

Day 1: And so it begins

Let’s start with the bug report that was posted to our Slack #onfire channel from our Director of Operations, Georgina, who was working late into the night:

Before I go much further, I should quickly explain what my team works on. I work at Cumul8 where we build a platform that lets users connect, analyze, visualize and monitor their data in real time. A core part of this platform is a web application which allows users to build custom dashboards with configurable drag & drop tiles.

The Cumul8 Dashboard Module

Building dashboards is a core feature of our service, so this bug was correctly escalated to an “On Fire” status which we use as an “all hands on deck” broadcast where everyone jumps in to try and lend a helping hand. The good news is that the issue was reported to be isolated to Chrome which meant that we had a workaround for a critical demo in the morning. The bad news was that nobody on our dev team could reproduce it. The entire dev team was mystified.

In the morning I started investigating. So, what do we know so far?

  • The issue only happens on Google Chrome (Firefox and Safari were reported to be working fine)
  • The user who reported the issue was on a Mac OS laptop at the time
  • The behavior was that the browser completely freezes when a drag event occurs

Naturally, the first step for any developer is to try and reproduce the issue. I fired up a local development build of our React-based web application and started dragging tiles around. A browser freeze is almost always a sign of an infinite loop or some other CPU-pegging JavaScript process, so I monitored the developer tools to see if I could spot anything unusual (an error, a warning, or an indication of heavy CPU or memory usage).

I was able to drag tiles on my local environment without any issues

Just like the rest of my team, I had no luck reproducing the issue, even on the same production server where the bug was reported. The JavaScript console was clean, and the CPU usage on the Chrome tab was under control. Not a great start.

At this point, our lead QA Engineer notified me that he had actually been able to reproduce this issue the day before, but since the issue only occurred on Chrome on his Linux test box, and we had no known Linux Chrome users at the moment, it was marked as non-critical and the build was pushed out to production. Alright, so we now have at least 2 people who have reproduced the problem and we know it’s probably not an OS-specific issue. It can be reproduced on both Linux and Mac.

Before I started bugging the other engineers, I figured I’d run through a few more local tests to see if I can get it to occur on my machine. I tried some common troubleshooting steps:

  • I cleared all browser cache and did a full browser restart
  • I tried various local browsers: Chrome, Chrome Canary, Firefox, Firefox Nightly, and Safari
  • I tried running the application with and without the dev tools open
  • I tried different application configurations (such as different types of dashboard tiles and orientations)
  • I tried logging into our application as different users with various permission levels
  • I tried building a local dev build as well as a local production build of the application
  • I opened a TeamViewer session to my Windows box at home to see if I could reproduce it there

But alas — nothing I did showed any signs of the issue.

At this point, I was beginning to get frustrated. Just as I was about to start looking through the diffs of the most recent commits in Git, another engineer on the team notified everyone that he had a reproducible case of the issue on his Linux desktop using Chrome. We now had 3 users reporting the issue! I walked over to his desk to take a look.

“There it was. The bug!” (It’s a bit hard to tell, but I’m trying to drag a bottom tile, but it is stuck and the browser is completely frozen after the drag start event fired)

There it was. The bug! At last I could see with my own eyes that it was real. The CPU spikes to 100% and the browser tab is completely stuck. The 101.8% CPU usage in the Chrome task manager confirms the problem.

The highlighted (blue) Cumul8 tab showing 101.8% CPU usage during the frozen state

Prior experience tells me that this type of behavior is almost certainly an infinite loop of some kind. But something doesn’t add up.

  • Why is this only happening for 2 of my Linux developers and 1 Mac laptop user? Why isn’t this happening for any of my other developers, or my own Mac and Windows environments?
  • And why is the issue not happening for the first (top-left-most) tile? That is truly strange!

Now that I had finally gotten my hands on a computer where the issue was reproducible, I started going through a barrage of troubleshooting steps:

  • Is there something special about this particular dashboard or these particular tiles? No — the issue occurs for all dashboards and a variety of different users I tried.
  • Is there something special about this browser? No — it’s an official build of Chrome. Upgrading to the latest version doesn’t help. Running in Incognito mode with all extensions disabled doesn’t help. Clearing cache doesn’t help.
  • Is there something special about this computer in general? Maybe? The issue occurs on Chrome, but not on Firefox. And all the other reported cases have been on Chrome as well. Could it be some kind of Chrome bug?

I started downloading a few different versions of Chrome to test against. The last few stable versions all reproduce the bug in exactly the same way. Ok, so probably not a Chrome bug then.

At this point I knew it was time to look at some code. But how do you figure out where an infinite loop occurs if the entire browser crashes, along with the dev tools? I tried a few things:

  • Manually inserting some breakpoints in logical spots like onDrag handlers and mouse events in the relevant code where the issue might be happening. Nope— nothing was obviously out of place, and no signs of rogue loops.
  • Recording a performance profile in Chrome dev tools. Nope — the profiler would die as soon as the bug freezes the app.
  • Recording a memory allocation profile in Chrome dev tools. Nope— the button to start a profile cannot be pressed while the app is frozen.

Well, that sucks. Without knowing even approximately which files or components were involved, there was no way I could guess which part of our huge 10 MB minified codebase was causing the issue. I needed some way to reproduce the issue in a dev environment, where I could insert some console.log or debugger statements.

So I fired up a local dev build of the application on the machine where the issue was reproducible. But to my surprise —as soon as I tried dragging tiles in the dev build — the issue disappeared.

“…as soon as I tried dragging tiles in the dev build — the issue disappeared.”

It seems that the problem only manifests itself on the production server. Oh dear... Does this mean it’s time to SSH into the production server and inject debug code into a live running application? You know — the thing you’re never ever ever supposed to do? The thing people get fired for doing?

Before resorting to such drastic measures, I decided it was worth trying to simulate a production build locally just to see if there was a chance this was some kind of strange minification issue with the production build step (which mangles some of our code in an effort to reduce the bundle size). I’ve seen issues in the past where UglifyJS would corrupt certain statements and given that the issue only occurred on the production server — it was worth a shot. But unfortunately a fresh production build on the local machine did not reproduce the issue.

So let me get this right… I’m running the same code, on the same machine, connected to the same database, logged in with the same user, using the same browser, and the bug is gone. But if I switch over to a different tab where I have the official production site, everything explodes…

Just as my sanity was starting to fade, the original user who reported the bug on the Mac laptop using Chrome let us know that the bug suddenly started occurring on his Safari browser, even though earlier in the day Safari was working just fine for him…

“… the list of mysteries was only getting longer”

At this point, a full working day had passed and I had gotten nowhere. I had few ideas of what to try next, and the list of mysteries was only getting longer. At times like this, a good night’s sleep is often the best next step. As I packed my bags to head home, I asked our DevOps team to deploy a new release to production and I crossed my fingers that maybe a fresh build would magically make this problem go away. Perhaps I’d come back in tomorrow morning to find out it was just some weird build corruption issue that had manifested itself in an unusual way. I left the office knowing it was a shot in the dark. After all, if it really was a build issue, wouldn’t the bug be reproducible for all users, on all machines?

Day 2: Now it’s personal!

The next day, with a clear head, a fresh cup of coffee and the report from QA that the bug was still occurring on the fresh build, I started going through the Git commits of past releases looking for anything of value.

“I started going through the Git commits of past releases looking for anything of value.”

There was a handful of commits all related to dashboards and tiles over the last few days (it’s a very actively developed feature for our team this month). The commits were from various developers, and among them was an upgrade to a custom fork of the react-grid-layout library, which is the main component that manages the drag and drop behavior of our tiles. I figured this was probably a good place to start my investigation. If it was going to be a code issue, I would prefer to blame some other developer, rather than a member of my own team :)

While the git clone react-grid-layout command was doing its thing, I fired up Stack Overflow to see if anyone had any clever tips for debugging infinite loops in JavaScript. The highest rated answer suggested using the Pause button in the “Sources” tab of Chrome dev tools. From what I remembered of the day before, the entire dev tools panel had been frozen along with the browser tab thread, but I figured it was worth trying again. I found a machine which had the bug, and tried pressing the Pause button. Finally, a breakthrough!

“The dev tools paused the infinite loop and showed me exactly what code was currently being executed…”

The dev tools paused the infinite loop and showed me exactly what code was currently being executed — a for loop inside our grid library and it was about 6 million iterations in by the time it paused. It’s 100% an infinite loop in the code.

Finally, some progress. We now have confirmation that the issue is definitely a rogue loop. We have confirmation that the issue is caused by our code and not some OS or browser level bug. And, more importantly, we know exactly what line of code we’re stuck on.

So here I am, looking at minified production code where the variable names are all mangled and I notice something that catches my eye. Variable o, which represents n.margin, is defined as [16, 16] but variable n is undefined. How can n.margin have a value but n be undefined? Could this be some kind of strange scoping issue where n gets reassigned/corrupted mid-way through the loop?

Fast forward a few hours — it turned out to be a complete red herring. When pausing the debugger in the dev tools in the middle of an infinite loop, apparently variables will be shown as undefined in the dev tools. It doesn’t actually mean they’re undefined, just that the dev tools can’t show you their value. But I did waste a few hours trying to go down this path.

Looking at minified code started to drive me nuts, so I pulled up this function in our source code to try and understand what it’s doing.

/**
 * Translate x and y coordinates.
 * @param {Number} top Top position (relative to parent) in px
 * @param {Number} left Left position (relative to parent) in px
 * @return {Object} x and y in px.
 */
calcXY (top, left) {
  const {
    colWidth,
    containerWidth,
    margin,
    rowHeight,
    w
  } = this.props;
  let x = left > 0 ? left : 0;
  if (x + w > containerWidth) {
    x = containerWidth - w;
  }
  while (x % (colWidth + margin[0]) !== 0) {
    x--;
  }
  let y = top > 0 ? top : 0;
  while (y % (rowHeight + margin[1]) !== 0) {
    y--;
  }
  return { x, y };
}

We see two while loops, so we’re definitely in the right place for this type of bug to be occurring. Side note: If you compare this source code to the minified code above, you’ll actually see that the while loops get converted to for loops by either Babel or UglifyJS.

It’s a bit hard to understand what this function is doing, but it’s effectively taking the dragged tile’s pixel offset inside our containerWidth (aka. the width of the available drag area) and walking it backwards one pixel at a time, looking for a place where the dragged tile can be placed. It keeps decrementing until the offset divides evenly by the size of a grid cell (colWidth or rowHeight plus the margin), using the % operator as the loop’s break point. Immediately there are a few issues with this code.

  • It’s incredibly inefficient, as it steps through our drag canvas one pixel at a time, instead of using some other (more efficient) algorithm to find the closest available spot based on the tile’s current position.
  • It assumes that top, left, containerWidth, colWidth and rowHeight are always well-defined numbers. If colWidth or rowHeight ever comes through as undefined, the divisor becomes NaN, the % check can never equal 0, and we end up with an infinite loop. Luckily, this function was strongly typed via Flow in our source code, but still, it’s a pretty dangerous assumption to make. You need to be super careful with while loops.
  • It’s a section of code that’s pretty difficult to understand, and it doesn’t have any comments explaining what the loops are doing. It took me at least 15 minutes of playing with the code before I finally got the full grasp of what it was trying to do and why it was done this way.
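To make the second point concrete, here is a small sketch of how the loop condition behaves when a measurement is missing. The snapDown helper, its maxSteps guard and the sample numbers are all hypothetical (they are not part of react-grid-layout); the guard exists only so the example terminates instead of freezing.

```javascript
// Hypothetical helper that mirrors one of the while loops in calcXY.
// maxSteps is an illustrative guard so this demo can never hang.
function snapDown(value, step, maxSteps = 10000) {
  let steps = 0;
  while (value % step !== 0) {
    if (steps >= maxSteps) {
      return { value, steps, gaveUp: true };
    }
    value--;
    steps++;
  }
  return { value, steps, gaveUp: false };
}

const margin = [16, 16];

// All measurements present: 250 walks down to 232 (2 * 116) in 18 steps.
const healthy = snapDown(250, 100 + margin[0]);

// colWidth missing: undefined + 16 is NaN, and 250 % NaN is NaN,
// so the `!== 0` check stays true until the guard kicks in.
const missingProps = {};
const broken = snapDown(250, missingProps.colWidth + margin[0]);

console.log(healthy);        // { value: 232, steps: 18, gaveUp: false }
console.log(broken.gaveUp);  // true
```

Without the guard, the second call would spin forever, which is exactly the frozen-tab symptom described above.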

There are some obvious lessons to take from here already, but one thing is still not clear. Why do we get into an infinite loop, only on some computers, and only on our production server?

There are really only two ways that this code can get into an infinite loop:

  1. One of the relevant DOM measurements is undefined or null, or
  2. One of the relevant DOM measurements is not evenly divisible by the width of our tile

I put in some console.log statements to take a look at the widths of my elements and I manually stepped through the code, one iteration at a time. About 20 to 30 steps in, all widths were still correctly defined. So it doesn’t seem like it’s #1.

“I put some console.log statements to print out the widths of my elements…”

I then started resizing the grid container to all sorts of different dimensions to see if maybe the issue was caused by a very specific browser width. That would certainly explain why only some users were reporting the issue. If a user had their browser resized to just the right (aka. wrong) size so that it didn’t perfectly divide the width of the tile — we’d get an infinite loop. But nothing I tried caused any division issues.

Then I figured it might be easier to reproduce if I created a dashboard with a large number of tiles visible on the screen. Maybe one of them would result in a division problem. So I zoomed out via the browser’s zoom feature (Ctrl+-) to squeeze more tiles on the screen, and all of a sudden something unexpected happened…

“…I zoomed out via the browser’s zoom feature and all of a sudden something unexpected happened…”

My top value (which is the offset in pixels from the top of the drag container to the top of the tile that’s being dragged) was no longer giving me integers for the pixel value. I was getting fractional pixels. I guess when you zoom, the browser does some funky scaling stuff and 1px is not 1px anymore. It’s now 0.9965343475342px! And now that my top variable was a float value, the loop kept subtracting whole pixels from it, the fractional part never went away, the % operator would never evenly divide and, tada, infinite loop!
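The effect is easy to see in isolation. The sketch below uses made-up numbers (a 150px row height, a 16px margin and a 340.75px offset) and a hypothetical stepsToSnap helper with an iteration budget, so it reports Infinity where the real loop would simply never return.

```javascript
// Hypothetical numbers: a 150px row plus a 16px margin gives a 166px grid step.
const step = 150 + 16;

// Count how many decrements it takes to reach a multiple of step,
// giving up after maxSteps so this sketch always terminates.
function stepsToSnap(start, maxSteps = 1000000) {
  let y = start;
  for (let i = 0; i <= maxSteps; i++) {
    if (y % step === 0) return i;
    y--;
  }
  return Infinity; // never landed on a grid line within the budget
}

console.log(stepsToSnap(340));                // 8  (340 -> 332)
console.log(stepsToSnap(340.75));             // Infinity (339.75, 338.75, ... never a multiple)
console.log(stepsToSnap(Math.round(340.75))); // 9  (341 -> 332)
```

Subtracting 1 from 340.75 only ever produces values ending in .75 (or .25 once it goes negative), so the remainder can never be exactly 0.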

The fix and the lessons

The mysteries finally started to unravel.

Why was the issue occurring only for some users, on only some of their browsers?

Because only a few users used the zoom feature on the site, and each browser has its own zoom setting. If the user zoomed in on Chrome, but not on Firefox — he would only see the issue on Chrome. It also explains why a user said it was working fine in Safari, and then later in the day it stopped working. He zoomed out!

Why was the issue only happening on the production server?

It actually wasn’t. It just so happened that the zoom value is saved per URL and those users happened to have zoomed out only on the production URL. When they visited a different URL, or tried to run a production build locally (on localhost) — the zoom would reset back to default and the bug wouldn’t show up.

Why was only the top-left-most tile not affected by the bug?

Because it was the only tile whose pixel position was a clean [0, 0] during the zoom state, which meant that the calculation would proceed without any problems.

Everything else all of a sudden made sense, and there was no doubt in my mind that this was the problem.

I immediately issued a patch fix which was to wrap the pixel calculations in a Math.round() to ensure we always get our pixels as integers. A more robust solution to consider in the future (as stated earlier) would be to find a more efficient algorithm for tile placement that doesn’t involve while loops at all. We’ll be investigating something like this in the future.
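The patch itself isn’t reproduced here, so what follows is only a sketch of one possible shape for it, not the actual commit. It rounds the incoming pixel values and, as an assumption on my part, swaps the while loops for a closed-form calculation with Math.floor. The snapToGrid helper is hypothetical, and calcXY is written as a standalone function that takes props as an argument rather than reading this.props.

```javascript
// Loop-free snap: the largest multiple of step that is <= the rounded value.
// Hypothetical helper; it rejects bad input instead of looping on it.
function snapToGrid(value, step) {
  if (!Number.isFinite(value) || !Number.isFinite(step) || step <= 0) {
    throw new RangeError(`cannot snap ${value} to a grid step of ${step}`);
  }
  return Math.floor(Math.round(value) / step) * step;
}

// Standalone sketch of calcXY; the original is a component method
// that reads these values from this.props.
function calcXY(top, left, props) {
  const { colWidth, containerWidth, margin, rowHeight, w } = props;

  let x = left > 0 ? left : 0;
  if (x + w > containerWidth) {
    x = containerWidth - w;
  }
  const y = top > 0 ? top : 0;

  return {
    x: snapToGrid(x, colWidth + margin[0]),
    y: snapToGrid(y, rowHeight + margin[1]),
  };
}

// Made-up measurements, including the fractional offsets seen when zoomed.
const sampleProps = {
  colWidth: 100,
  containerWidth: 1200,
  margin: [16, 16],
  rowHeight: 150,
  w: 216,
};
const snapped = calcXY(340.75, 249.3, sampleProps);
console.log(snapped); // { x: 232, y: 332 }
```

For integer inputs this gives the same answer the decrementing loops would reach, but a fractional or missing measurement now produces either a rounded result or an immediate error rather than a frozen tab.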

For me, there were 3 important lessons to take away from this experience:

  1. For debugging issues with a slew of mysteries, I think there’s value in creating a troubleshooting checklist. Starting with obvious things like reproducing locally, trying different browsers, different OS’s, different screen resolutions, different zoom values, different users, etc… I don’t think every bug needs to go through this lengthy step-by-step routine, but in the event of an issue being difficult to reproduce — a checklist would be valuable to ensure you go through all possible combinations systematically and no conditions are forgotten. This checklist should also be frequently updated by all developers as new troubleshooting methods are used to diagnose real problems. The main goal of such a checklist would be to reduce the troubleshooting time. It took me nearly 2 working days from the time the issue was found, until it was patched and this was largely because I wasn’t always sure what to try next. I relied on my experience to randomly select the next troubleshooting step, rather than something that was going to get me more information or clear up some mystery.
  2. When trying to debug a complex issue, try to avoid getting frustrated and maintain a scientific approach. A big reason why it took me so long to get to the root of the problem was that I went on many tangents when new information was discovered. When our QA said the issue was specific to Linux, I dropped what I was doing (looking at Git diffs) and spent time looking for a Linux environment to test with. When it seemed like the issue was only occurring on production, I spent a lot of time reviewing the production build systems. All of those efforts were ultimately wasteful, and if I had approached the problem more systematically and kept reviewing the Git diffs, I would have probably found the issue much quicker. My approach wasn’t awful, but I let emotions and people’s excitement over new information disrupt my next steps and that led me on wild goose chases.
  3. All while loops should be reviewed with a much higher degree of scrutiny. The code that caused this particular bug went through a peer code review, unit tests, automated E2E tests and manual QA. Our manual QA actually did catch the issue, but it wasn’t categorized correctly. We didn’t realize how widespread it was and we shipped a broken version. The lack of comments and lack of detailed review on that section of code was also sloppy. An investigation into all while loop usage might be necessary, and perhaps even enabling an ESLint rule to disallow the use of while. There are very few legitimate reasons in JavaScript to use while loops, but the dangers of using them incorrectly are quite high as they have a real chance of bringing down the entire application.
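As a rough idea of what the lint rule from the third lesson could look like, ESLint’s built-in no-restricted-syntax rule can flag while and do...while statements by their AST node type. The config below is a hypothetical fragment in the legacy .eslintrc.js format, and the messages are placeholders, not something our team has settled on.

```javascript
// .eslintrc.js (hypothetical project config, legacy eslintrc format)
const eslintConfig = {
  rules: {
    'no-restricted-syntax': [
      'error',
      {
        // AST node type for `while (...) { ... }`
        selector: 'WhileStatement',
        message: 'Avoid while loops; use a bounded loop or a closed-form calculation.',
      },
      {
        // AST node type for `do { ... } while (...)`
        selector: 'DoWhileStatement',
        message: 'Avoid do...while loops; use a bounded loop or a closed-form calculation.',
      },
    ],
  },
};

module.exports = eslintConfig;
```

A rule like this only catches the syntax. A for loop with the same unbounded condition would pass the lint check, so it complements careful review rather than replacing it.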

My favorite part about finally putting this bug behind me is now — if I ever go for an interview — and somebody asks me “What is the worst bug you’ve ever encountered?” — I’ll have a great story to share.

If you found this story interesting, and you’d like to come work with us, visit http://cumul8.com/careers/ where we keep an active list of all of our open positions.
