We Almost Lost Our Production Database

Mohammad Faisal - Oct 18 '23 - Dev Community

To read more articles like this, visit my blog

Last Friday, I received a call from my friend (who is also my business partner). Some customers called him, saying they could not see any data on their dashboards.

“Are you sure that they can log in?” I asked my friend.

“Yes, they can but don’t see anything after logging in,” he replied.

He is not technical, and there was no developer to call, as it was a holiday in our country. He asked me to investigate the issue.

That’s how it started…

Possible Reasons

I have faced enough similar situations in my previous jobs. The probable causes that immediately came to mind were:

  • The user is doing something wrong. (very improbable)

  • The get-user-details API is failing for some reason

  • Credentials got mixed up somehow (maybe frontend caching or something)

  • Worst-case scenario: the data is lost

Now I had to go through each possibility and find out what was wrong.

Discarding the first possibility

Let me give a semi-technical overview of how the thing works.

We use Firebase for authentication. After a successful login, Firebase returns a token, and subsequent requests carry that token to our Node.js backend, which issues a JWT.

So if a user can log in, it means the credentials are correct because Firebase can’t mess up, right? 😛

So, something must be wrong on our end.

The worst nightmare!

In this scenario, the first thing to do was to get the user's credentials and check what was wrong. But asking a customer for their credentials seemed wrong to me; it could send the wrong message.

So I created a new user and tried to log in. It was successful, and to my surprise, I could see the details on the dashboard!

At this stage it was confirmed that authentication on Firebase was working, and our API was also working correctly! So that left only one option.

The data is lost!

Well, well, well. This is the worst thing that can happen to a company. On top of that, the customer in question was a paying customer, and there was sensitive information stored in that user's profile.

So it was essential to find out what was wrong.

It was not the end!

After discovering that the API was working correctly, I called my friend, collected more customer info, and checked their profile.

As it turned out, none of our customers from the last six months had any data on their dashboards. All the credentials were correct, but none of them could see any data.

I was out of options. No possible answer or reason came to my mind.

So we called a meeting

We called an emergency meeting with the stakeholders. As it was the weekend, getting the developers to the meeting was impossible.

“Is there any new feature that enables the user to delete everything?” I asked.

It was a stupid question, but I had not been involved in the development process for the last few months, so I was not up to date on the features.

As it turned out, a partial delete of a user's information had been required. The backend API was ready, but the changes had not been made to the frontend.

So there was no way the customer could have deleted data from the database themselves!

I thought, "Okay. Is it possible that somehow that API got called for all the users?" After checking the backend code, I found no change that deletes a user's vital information, so we had to discard this option.

New Clue!

One of our friends reviewed how the application had been deployed over the last few months. Our company is in transition, and no active development is happening right now.

The usual process we had designed deployed the backend through a GitHub pipeline into an AWS ECS cluster. It was fully automatic.

The friend who reviewed the deployment process told us it had been changed: we were now deploying to an EC2 instance.

Well, well, well. I recalled a previous incident where the environment file got swapped, causing the API to hit the wrong database.

Gotcha!

As we ran out of options, we finally decided to deploy the application manually again. I was not sure this was the cause, but after the new deployment with the correct environment file, everything magically worked!
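A cheap startup guard could have caught a swapped environment file much earlier: refuse to boot if the configured database host doesn't match what the deployment target expects. This is a hedged sketch, not our actual code; the variable names (`APP_ENV`, `DB_HOST`) and hostnames are made up for illustration.

```javascript
// Fail fast at startup if the environment file points the app
// at a database that doesn't belong to this environment.
function checkDbConfig(env) {
  // Hypothetical mapping of environment -> expected DB host.
  const expected = {
    production: 'prod-db.internal',
    staging: 'staging-db.internal',
  };
  const want = expected[env.APP_ENV];
  if (!want) {
    throw new Error(`Unknown APP_ENV: ${env.APP_ENV}`);
  }
  if (env.DB_HOST !== want) {
    throw new Error(
      `Refusing to start: APP_ENV=${env.APP_ENV} but DB_HOST=${env.DB_HOST}`
    );
  }
  console.log(`DB config OK: ${env.APP_ENV} -> ${env.DB_HOST}`);
}

// In a real app this would be called with process.env before the
// server starts listening.
checkDbConfig({ APP_ENV: 'production', DB_HOST: 'prod-db.internal' });
```

A guard like this turns "customers silently see no data" into a deployment that refuses to come up, which is a much louder and faster signal.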

Moment of self-reflection

So, was that all? We had wasted two days just to discover that a redeployment was the solution. What had happened?

As it turned out, my friend had made a clone of the database by hand. He was not an expert in data management, so he thought he would learn on the newly cloned database.

For some reason (I am still not sure why), some of the tables were not cloned, and the API had been pointing to that wrong database the whole time. We had also been searching for data in the wrong database.
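A simple post-clone sanity check would have exposed the missing tables immediately: list the tables in the source and the clone (for example from `information_schema.tables`) and diff them. A minimal sketch, with made-up table names:

```javascript
// Given two lists of table names (source DB and its clone),
// return the tables that failed to make it into the clone.
function missingTables(sourceTables, cloneTables) {
  const cloned = new Set(cloneTables);
  return sourceTables.filter((t) => !cloned.has(t));
}

// Illustrative table lists; in practice these would come from a
// query like: SELECT table_name FROM information_schema.tables
const source = ['users', 'profiles', 'dashboards', 'billing'];
const clone = ['users', 'billing'];

console.log(missingTables(source, clone)); // → ['profiles', 'dashboards']
```

Comparing row counts per table is a natural second step, but even this name-level diff would have flagged the bad clone before it ever served traffic.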

What did we learn?

This incident taught us, once again, what disaster a bad process can bring. As the company is going through a tough time and no active development is happening, my friend didn't feel the need to inform us about the changes he made to the database and the deployment process.

As a result, we lost some customer trust and wasted two days for nothing.

But even a bad incident has something good in it, right?

Maybe next time something like this happens, I will suggest redeploying without looking at anything :P And maybe I will encourage everyone, including myself, to follow good practices even more.

Whatever the case, I hope something like this never happens again.

Thanks for reading. Have a great day!

Have something to say? Get in touch with me via LinkedIn or Personal Website
