Duplicate Digest Email Incident Retro From January

Molly Struve (she/her) - Mar 16 '21 - - Dev Community

Sorry all that this incident retro is a couple of months late. We did the retro internally immediately after the incident occurred and then I put it on my TODO list to make it public on DEV and that's where it stayed for far too long. 😬

Background Context

Over the past few months, we have been working hard to make the Forem setup more configurable for our creators. In an effort to do this we recently updated how we create a Forem's community name. This was done by merging the following PR that appended the collective_noun, if it was not disabled for the particular Forem, to the SiteConfig.community_name through a DataUpdateScript:

Adds collective_noun Fields Back to SiteConfig Model #12055

What type of PR is this? (check all applicable)

  • [ ] Refactor
  • [ ] Feature
  • [x] Bug Fix
  • [ ] Optimization
  • [ ] Documentation Update

Description

This PR adds the collective_noun and collective_noun_disabled columns back to the SiteConfig model after running into a data_update script failure in another PR. This is necessary so that when running the data_update script responsible for updating the community_name, we have access to the collective_noun and collective_noun_disabled methods. Additionally, this PR adds a data_update script that conditionally updates SiteConfig.community_name if SiteConfig.collective_noun_disabled returns false. If, however, the forem has collective_noun_disabled set to true, then the community_name will remain as-is and untouched.

Related Tickets & Documents

Relates to PRs #11846 , PR #11973 , and PR #12054

QA Instructions, Screenshots, Recordings

To QA these changes, first ensure that all checks pass and the Travis build is green, then:

  • Run the data_update script manually and ensure that your Community's name updates accordingly

UI accessibility concerns?

N/A

Added tests?

  • [x] Yes
  • [ ] No, and this is why: please replace this line with details on why tests have not been included
  • [ ] I need help with writing tests

Added to documentation?

[optional] Are there any post-deployment tasks we need to perform?

The data_update script included in this PR needs to successfully run!

[optional] What gif best describes this PR or how it makes you feel?

The Office: Days of Nonsense

Data Update Script:

module DataUpdateScripts
  class AppendCollectiveNounToCommunityName
    def run
      return if SiteConfig.collective_noun_disabled || SiteConfig.collective_noun.blank?

      SiteConfig.community_name = "#{SiteConfig.community_name} #{SiteConfig.collective_noun}"
    end
  end
end
Enter fullscreen mode Exit fullscreen mode

The PR nicely addresses and updates all places in the code where we were using community_name and therefore went out without any problems.

However, 4 days later on Wednesday, December, 30th, we began to get reports of folks receiving duplicate digest emails.

I quickly jumped in and determined the problem to be an Emails::EnqueueDigestWorker that was continuously being restarted by deploys and sending the duplicate digest emails.

Bug Trigger*

  • NOTE: I did not label this section "Cause" because I believe the real, root cause of this problem came much earlier in the year. Scroll down for more details.

The Emails::EnqueueDigestWorker is a worker that we have "turned off" for DEV because it takes too long to complete given the high number of users we have. We know it will keep restarting on deploys so we choose to use a rake task to send digest emails. To keep the worker from running we have a guard clause at the beginning of the worker that tells it to exit early if the environment is DEV. The guard clause at the time of the incident looked like this:

return if SiteConfig.community_name == "DEV"
Enter fullscreen mode Exit fullscreen mode

The problem was, when the above PR was deployed we changed community_name from "DEV" to "DEV Community" causing this guard clause to no longer work which allowed the worker to start running.

Mitigation

The first thing I did to mitigate the number of those affected by this problem was to clear all of the existing digest worker jobs out of Sidekiq. Once that was done, I pushed out a quick fix to update the hardcoded string to the current community name by adding "Community" to it.

Bug Fix:Update dev.to check to use DEV Community #12082

Recently we changed our site name on dev.to from DEV to DEV Community. This is used in our code in a couple of places and needs to be updated. From a prod console

irb(main):002:0> SiteConfig.community_name == "DEV Community"
=> true

Without the guard clauses working properly we enqueued our Email Digest worker which takes hours to complete for DEV bc of all the users. Since we restart Sidekiq on every deployment, this worker was continuously restarting which is why the duplicate emails were getting sent.

While this fixed the immediate issue, the hardcoded string is what got us into this jam in the first place. To prevent future problems I quickly followed up with a second, more resilient PR, that used a more robust method to check for the DEV environment.

Refactor:Use dev_to? to Check for DEV #12083

What type of PR is this? (check all applicable)

  • [x] Refactor

Description

Rather than checking the community name which we just found out can change easily, lets check the domain like we do in other places since that is far less likely to change.

Related Tickets & Documents

https://github.com/forem/forem/pull/12082

alt_text

Root Cause and Learnings

Even though the the community name PR was the trigger for this issue, it was far from the root cause. This issue was really caused by the poor decision by myself to hard code this string in the first place.

Bug Fix:Run Individual Digest Workers inline for DEV #10070

What type of PR is this? (check all applicable)

  • [x] Bug Fix

Description

Followup to https://github.com/forem/forem/pull/10065, we need to also ensure the individual workers run inline and are not piling up in Sidekiq since running them in parallel is what is stressing the database.

Added tests?

  • [x] yes

alt_text

Hardcoded strings, especially those for variables like this that can change, should be avoided at all costs when writing code. At the very least you should be writing a method or using a CONSTANT to represent the string you want to check. This allows you to easily Find/Replace that method or CONSTANT in the future so it doesn't get lost in the shuffle as it did here.

The second problem was that I deployed that fix without a PR review. I announced the problem in Slack and linked the PR, but based on the Slack message our database was in a dire state which is why I chose to expedite the fix. In the future, trying to get at least 1 PR review should always be the goal even if it means waiting an extra minute or two. Even if you don't immediately take the advice of the PR review, at least you can go back and update things later.

The first fix that I pushed out in order to quickly update the digest worker worked but was un-ideal. This was pointed on in the PR review. I was still able to quickly push the hotfix but then immediately went back and wrote more thoughtful and robust code after. That flow is how the initial bug should have been handled. Our recently renewed commitment to sticking more to our processes now is what led to us fixing this problem correctly the second time around, while the first time we left the door open for future problems.

Finally, the more we can move away from this "DEV is a special snowflake" concept the better. If every environment is treated the same then you don't have to worry about edge cases popping up unexpectedly in production for specific environments. The final goal is to remove that guard clause altogether so every Forem sends its email digests the same way. Once again, this eliminates another special case solution that lowers the risk of future surprises.

Handling Hotfixes

Always try to get a PR review even for hotfixes! Most of us are not running life-saving software. If your site remains down for an extra min or two because you did the responsible thing and got a code review, it's not the end of the world. Having a good code review will decrease your chances of shipping unreliable code that could cause future issues.

If a fix MUST go out immediately and you cannot get a review beforehand(maybe it's super late and no one is online), then get a follow-up review. Ask someone to review the code after the fact to make sure it is sound and is not so hacky as to lead to future problems.

TL;DR

Don't hard code strings and get a PR review! And if all else fails, learn from your mistakes and strive not to make them again. Happy coding!

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .