The worst bug I’ve seen (in a while)

Day 1: And so it begins

Let’s start with the bug report that our Director of Operations, Georgina, who was working late into the night, posted to our Slack #onfire channel:

The Cumul8 Dashboard Module
  • The issue only happens on Google Chrome (Firefox and Safari were reported to be working fine)
  • The user who reported the issue was on a Mac OS laptop at the time
  • The behavior was that the browser completely freezes when a drag event occurs
I was able to drag tiles on my local environment without any issues
  • I cleared all browser cache and did a full browser restart
  • I tried various local browsers: Chrome, Chrome Canary, Firefox, Firefox Nightly, and Safari
  • I tried running the application with and without the dev tools open
  • I tried different application configurations (such as different types of dashboard tiles and orientations)
  • I tried logging into our application as different users with various permission levels
  • I tried building a local dev build as well as a local production build of the application
  • I opened a TeamViewer session to my Windows box at home to see if I could reproduce it there
“There it was. The bug!” (It’s a bit hard to tell, but I’m trying to drag the bottom tile, but it is stuck and the browser is completely frozen after the drag start event was initiated)
The highlighted (blue) Cumul8 tab showing 101.8% CPU usage during the frozen state
  • Why is this only happening for 2 of my Linux developers and 1 Mac laptop user? Why isn’t this happening for any of my other developers, or my own Mac and Windows environments?
  • And why is the issue not happening for the first (top-left-most) tile? That is truly strange!
  • Is there something special about this particular dashboard or these particular tiles? No — the issue occurs for all dashboards and a variety of different users I tried.
  • Is there something special about this browser? No — it’s an official build of Chrome. Upgrading to the latest version doesn’t help. Running in Incognito mode with all extensions disabled doesn’t help. Clearing cache doesn’t help.
  • Is there something special about this computer in general? Maybe? The issue occurs on Chrome, but not on Firefox. And all the other reported cases have been on Chrome as well. Could it be some kind of Chrome bug?
  • Manually inserting some breakpoints in logical spots like onDrag handlers and mouse events in the relevant code where the issue might be happening. Nope. Nothing was obviously out of place, and there were no signs of rogue loops.
  • Recording a performance profile in Chrome dev tools. Nope. The profiler would die as soon as the bug froze the app.
  • Recording a memory allocation profile in Chrome dev tools. Nope. The button to start a profile could not be pressed while the app was frozen.
“…as soon as I tried dragging tiles in the dev build — the issue disappeared.”
“… the list of mysteries was only getting longer”

Day 2: Now it’s personal!

The next day, with a clear head, a fresh cup of coffee and the report from QA that the bug was still occurring on the fresh build, I started going through the Git commits of past releases looking for anything of value.

“The dev tools paused the infinite loop and showed me exactly what code was currently being executed…”
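    /**
     * Translate x and y coordinates.
     * @param {Number} top Top position (relative to parent) in px
     * @param {Number} left Left position (relative to parent) in px
     * @return {Object} x and y in px.
     */
    calcXY (top, left) {
      const { margin, containerWidth, colWidth, rowHeight, w } = this.props;
      // Clamp the tile horizontally so it stays inside the container.
      let x = left > 0 ? left : 0;
      if (x + w > containerWidth) {
        x = containerWidth - w;
      }
      // Scan one pixel at a time until x lands on a column boundary.
      while (x % (colWidth + margin[0]) !== 0) {
        x--;
      }
      let y = top > 0 ? top : 0;
      // Same pixel-by-pixel scan for the row boundary.
      while (y % (rowHeight + margin[1]) !== 0) {
        y--;
      }
      return { x, y };
    }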
  • It’s incredibly inefficient, as it does a scan of every single pixel of our drag canvas, instead of using some other (more efficient) algorithm to find the closest available spot based on the tile’s current position.
  • It assumes that top, left, containerWidth, colWidth and rowHeight are always defined. If any of those values become null or undefined we will end up with an infinite loop. Luckily, this function was strongly typed via Flow in our source code, but still, it’s a pretty dangerous assumption to make. You need to be super careful with while loops.
  • It’s a section of code that’s pretty difficult to understand, and it doesn’t have any comments explaining what the loops are doing. It took me at least 15 minutes of playing with the code before I finally got the full grasp of what it was trying to do and why it was done this way.
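To make the first two complaints concrete, here is a minimal sketch of a loop-free alternative. It is not the patch we shipped. The name calcXYSnapped is hypothetical, the props are passed in as a plain argument so the sketch runs outside a component, and it assumes the intent is to snap down to the nearest grid line.

```javascript
// Hypothetical standalone variant of calcXY: props are passed in explicitly
// so the sketch can run outside a React component.
function calcXYSnapped(top, left, props) {
  const { margin, containerWidth, colWidth, rowHeight, w } = props;
  if (!Array.isArray(margin)) {
    throw new TypeError("calcXYSnapped: margin must be an [x, y] array");
  }

  // Refuse to compute with missing or non-numeric measurements
  // instead of looping on NaN.
  const inputs = [top, left, margin[0], margin[1], containerWidth, colWidth, rowHeight, w];
  if (!inputs.every((value) => Number.isFinite(value))) {
    throw new TypeError("calcXYSnapped: every measurement must be a finite number");
  }

  const stepX = colWidth + margin[0];
  const stepY = rowHeight + margin[1];
  if (stepX <= 0 || stepY <= 0) {
    throw new RangeError("calcXYSnapped: grid step must be positive");
  }

  // Clamp into the container, mirroring the original checks.
  let x = Math.max(left, 0);
  if (x + w > containerWidth) {
    x = containerWidth - w;
  }
  const y = Math.max(top, 0);

  // Snap down to the nearest grid line with one division, no scanning.
  return {
    x: Math.floor(x / stepX) * stepX,
    y: Math.floor(y / stepY) * stepY,
  };
}
```

With made-up props of margin [10, 10], containerWidth 1200, colWidth 100, rowHeight 50 and w 100, a call like calcXYSnapped(130, 250.5, props) snaps to x 220 and y 120 without scanning, even though left is fractional, and an undefined measurement throws instead of freezing the tab.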
  1. One of the relevant DOM measurements is undefined or null, or
  2. One of the relevant DOM measurements is not evenly divisible by the width of our tile
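Both conditions can be replayed safely outside the browser. The sketch below is a hypothetical helper, snapWithCap, that runs the same kind of modulo check but gives up after a fixed number of iterations. The one-pixel decrement and the numbers are assumptions for illustration only.

```javascript
// Hypothetical helper: replays a pixel-by-pixel snap loop, but stops after
// maxIterations so a runaway case is reported instead of hanging the process.
function snapWithCap(start, step, maxIterations = 10000) {
  let value = start;
  let iterations = 0;
  while (value % step !== 0) {
    if (iterations >= maxIterations) {
      return { terminated: false, iterations };
    }
    value -= 1; // assumed one-pixel step
    iterations += 1;
  }
  return { terminated: true, value, iterations };
}

const gridStep = 110; // made-up numbers: colWidth 100 + margin 10

// Whole-pixel input: the loop reaches a multiple of the step and stops.
const wholePixel = snapWithCap(250, gridStep);

// Fractional input: subtracting 1 can never remove the .5 remainder.
const fractionalPixel = snapWithCap(250.5, gridStep);

// Missing measurement: undefined + 10 is NaN, and NaN !== 0 is always true.
const missingMeasurement = snapWithCap(250, undefined + 10);

console.log(wholePixel, fractionalPixel, missingMeasurement);
```

Only the whole-pixel call terminates on its own. The other two hit the cap, and without the cap they would spin forever, which is consistent with a frozen tab pinned at full CPU.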
“I put some console.log statements to print out the widths of my elements…”
“…I zoomed out via the browser’s zoom feature and all of a sudden something unexpected happened…”

The fix and the lessons

The mysteries finally started to unravel.

  1. For debugging issues with a slew of mysteries, I think there’s value in creating a troubleshooting checklist. Start with obvious things like reproducing locally and trying different browsers, operating systems, screen resolutions, zoom values, users, and so on. I don’t think every bug needs to go through this lengthy step-by-step routine, but when an issue is difficult to reproduce, a checklist helps ensure you go through all possible combinations systematically and forget no conditions. This checklist should also be updated frequently by all developers as new troubleshooting methods are used to diagnose real problems. The main goal of such a checklist would be to reduce troubleshooting time. It took me nearly 2 working days from the time the issue was found until it was patched, largely because I wasn’t always sure what to try next. I relied on my experience to pick the next troubleshooting step more or less at random, rather than choosing the step that would get me more information or clear up some mystery.
  2. When trying to debug a complex issue, try to avoid getting frustrated and maintain a scientific approach. A big reason why it took me so long to get to the root of the problem was that I went on many tangents whenever new information was discovered. When our QA said the issue was specific to Linux, I dropped what I was doing (looking at Git diffs) and spent time looking for a Linux environment to test with. When it seemed like the issue was only occurring on production, I spent a lot of time reviewing the production build systems. All of those efforts were ultimately wasteful, and if I had approached the problem more systematically and kept reviewing the Git diffs, I would probably have found the issue much quicker. My approach wasn’t awful, but I let emotions and people’s excitement over new information disrupt my next steps, and that led me on wild goose chases.
  3. All while loops should be reviewed with a much higher degree of scrutiny. The code that caused this particular bug went through a peer code review, unit tests, automated E2E tests and manual QA. Our manual QA actually did catch the issue, but it wasn’t categorized correctly. We didn’t realize how widespread it was and we shipped a broken version. The lack of comments and lack of detailed review on that section of code was also sloppy. An investigation into all while loop usage might be necessary, and perhaps even enabling an ESLint rule to disallow the use of while. There are very few legitimate reasons in JavaScript to use while loops, but the dangers of using them incorrectly are quite high, as they have a real chance of bringing down the entire application.
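As a rough idea of what such a lint rule could look like, the sketch below uses ESLint’s built-in no-restricted-syntax rule to flag while and do...while loops. The file name and the messages are illustrative, not our actual configuration.

```javascript
// .eslintrc.js (sketch): report every while and do...while loop.
module.exports = {
  rules: {
    "no-restricted-syntax": [
      "error",
      {
        // ESTree node type for `while (...) { ... }`
        selector: "WhileStatement",
        message: "Avoid while loops. If one is truly needed, disable this rule on that line and document why the loop always terminates.",
      },
      {
        // ESTree node type for `do { ... } while (...)`
        selector: "DoWhileStatement",
        message: "Avoid do...while loops. If one is truly needed, disable this rule on that line and document why the loop always terminates.",
      },
    ],
  },
};
```

A loop that really is needed can still be allowed with an inline eslint-disable-next-line comment, which at least forces the exception to be visible during code review.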



Ev Haus

Head of Technology at, Founder/CEO at, Owner of Magna eSports, Musician, Libertarian, Human