Our experience with upgrading ElasticSearch

Eva Marija Banaj Gađa - Jul 6 '21 - Dev Community

Why upgrading ElasticSearch was not an easy task

ElasticSearch, they say, packs a “ton of goodness into each release”, and if you skip a few tons of goodness, it can lead to the kind of goodness overflow we experienced while upgrading.

One might say we had a peculiar idea of good usage of ElasticSearch mapping types, so we just used them for everything — keys in arrays, table names, search etc.

That was the primary reason why the upgrade waited so long. I mean, we were stuck on version 5.3.2 aiming to jump to 7.10.1. The code depended heavily on the mapping types.

Another problem entirely was the complete removal of custom plugins. One of our features had to be shut down completely because it needed a custom Elastic plugin to work. Luckily, it was never enabled in production, so it was no biggie, right?

No more mapping types, what now?

To give you a better idea of what I’m talking about, here is a small sample of what our mappings looked like before upgrading ElasticSearch:

mapping_type_1:
    active: {type: byte, index: 'true'}
    additional: {type: integer, index: 'true'}
    . . .
mapping_type_2:
    active: {type: byte, index: 'true'}
    additional: {type: integer, index: 'true'}
    . . .
. . .
mapping_type_36:
    active: {type: byte, index: 'true'}
    additional: {type: integer, index: 'true'}
    . . .

We had four indices in our 5.3.2 cluster, and three of them posed no problem to upgrade. We even managed to remove one index completely: it held only around 300 documents, so there was no reason that data could not be retrieved directly from the database.

The one remaining index had 36 mapping types that were same-same but different. At this point, we did what anyone would have done: we checked the official ElasticSearch documentation for the recommended procedure. That left us with two options:

  • creating 36 different indices, one for each mapping type
  • combining all the fields in one ultimate mapping that would cover all 36 mapping types.

We went with the second option, combining all the fields in one mapping. By doing that, we got one index with a lot, and I mean a LOT, of fields. But it was still better than the other option, creating 36 different indices with almost identical mappings. Another argument for the “one ultimate mapping” option was that with 36 indices we would have had to search across all of them without losing any performance.

One mapping to rule them all

Good. We have a course of action, what now?

Let’s summarize the situation:

  • there is a file that contains the index mappings → let’s call it the static mapping file, since those fields never change
  • there are 1000+ files that contain additional fields for each mapping type → let’s call these dynamic mapping files, because those fields change often
  • there are 36 tables in the database and 36 corresponding mapping types in the static mapping file
  • there are 36 tables in the database that correspond to one or more dynamic mapping files
  • the code depends on the mapping types in the index to retrieve data, search etc.

We started the great cleanup / refactor / rewrite session to merge all those numerous dynamic mapping files into one file which would then be combined with static mappings. The mapping types were removed in this step, and the mapping type name was added as a new field to the static mappings. That way we didn’t have to rewrite the entire application and we could use ElasticSearch 7.10.1. The new static mappings file ended up looking something like this:

_doc:
    class: {type: text, index: 'true'}
    active: {type: byte, index: 'true'}
    additional: {type: integer, index: 'true'}
    . . .
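To make the merge step concrete, here is a minimal sketch in Python. It is not the code we actually ran: the function name `merge_mapping_types` and the in-memory dictionaries are hypothetical, and the sketch assumes that a field appearing in several mapping types must have the same definition everywhere, since one index can hold only one definition per field name.

```python
from typing import Dict

FieldMapping = Dict[str, str]


def merge_mapping_types(types: Dict[str, Dict[str, FieldMapping]]) -> Dict[str, FieldMapping]:
    """Merge per-type field mappings into one mapping and add a `class` field."""
    # The former mapping type name becomes an ordinary field on every document.
    merged: Dict[str, FieldMapping] = {"class": {"type": "text", "index": "true"}}
    for type_name, fields in types.items():
        for field, definition in fields.items():
            existing = merged.get(field)
            if existing is not None and existing != definition:
                # Assumption: conflicting definitions are resolved by hand, not silently.
                raise ValueError(
                    f"conflicting definitions for {field!r} (seen again in {type_name!r})"
                )
            merged[field] = definition
    return merged


# Hypothetical input mirroring the sample mappings shown earlier.
old_types = {
    "mapping_type_1": {
        "active": {"type": "byte", "index": "true"},
        "additional": {"type": "integer", "index": "true"},
    },
    "mapping_type_2": {
        "active": {"type": "byte", "index": "true"},
        "additional": {"type": "integer", "index": "true"},
    },
}

single_mapping = {"_doc": merge_mapping_types(old_types)}
```

Failing loudly on a conflict is a design choice for the sketch; with 1000+ dynamic mapping files, a silent overwrite would be very hard to track down later.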

This “easy” part was followed by the removal of dependencies on mapping types across the entire code base. Hours turned to days, days to weeks, and a few weeks later we finally managed to refactor all the places that fetched mapping types from elastic and did magic with them.

Better indexing procedure with zero downtime

Indexing documents, creating and manipulating indices in any way was a whole procedure that required a hefty multi-step document. It seemed as good a time as any to refactor it.

Instead of a three-page procedure, we now had five console commands: Create, Delete, Index, Replay and AddToQueue, all of which used ruflin/elastica to communicate with the ElasticSearch cluster in the background.

Queue

The update queue is just one table in the database where the ID of the changed document and the name of the index are stored. Once the queue is enabled, any changes that go to the ElasticSearch index with the write alias are also recorded to the queue.

The AddToQueue command is intended to be used to easily add one or more IDs to the update queue table. This could be useful if for some reason some documents aren’t in sync with the database.
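As a rough illustration of such a queue table, here is a small sketch using Python's built-in `sqlite3` module. The table name `update_queue`, the column names, the integer IDs and the index name `products` are all assumptions made for the example, as is the choice to store each changed document only once.

```python
import sqlite3
from typing import Iterable

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE update_queue ("
    " index_name TEXT NOT NULL,"
    " document_id INTEGER NOT NULL,"
    " PRIMARY KEY (index_name, document_id))"
)


def add_to_queue(db: sqlite3.Connection, index_name: str, document_ids: Iterable[int]) -> None:
    """Record changed documents; a document changed twice is still queued once."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO update_queue (index_name, document_id) VALUES (?, ?)",
            [(index_name, doc_id) for doc_id in document_ids],
        )


add_to_queue(conn, "products", [1, 2, 3])
add_to_queue(conn, "products", [2])  # already queued, ignored
queued = conn.execute(
    "SELECT document_id FROM update_queue ORDER BY document_id"
).fetchall()
```

Storing only the ID and the index name keeps the queue small: the current state of the document is read from the database at replay time, so the queue never holds stale data.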

Replay

The Replay command then takes chunks of IDs from the update queue and bulk upserts (inserts or updates) that data into the appropriate index that has the write alias. Once the documents are updated or inserted, the records are simply deleted from the update queue table.
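The replay loop could be sketched roughly like this in Python. The helpers `load_docs` and `send_bulk` are hypothetical stand-ins for the database read and the HTTP call, and the queue is a plain list here. The body format is the newline-delimited JSON accepted by the Elasticsearch `_bulk` endpoint, with `doc_as_upsert` turning each update into an insert-or-update.

```python
import json
from typing import Callable, Dict, Iterator, List


def chunks(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def bulk_upsert_body(index: str, docs: Dict[int, dict]) -> str:
    """Build an NDJSON body: one action line and one document line per ID."""
    lines = []
    for doc_id, doc in docs.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"doc": doc, "doc_as_upsert": True}))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def replay(
    queue: List[int],
    index: str,
    load_docs: Callable[[List[int]], Dict[int, dict]],
    send_bulk: Callable[[str], None],
    chunk_size: int = 2,
) -> None:
    """Upsert queued documents chunk by chunk, then drop them from the queue."""
    for chunk in chunks(list(queue), chunk_size):
        send_bulk(bulk_upsert_body(index, load_docs(chunk)))
        for doc_id in chunk:
            queue.remove(doc_id)  # only removed after the chunk was sent


# Hypothetical usage: "write" stands for the write alias.
queue = [1, 2, 3]
sent: List[str] = []
replay(queue, "write", lambda ids: {i: {"active": 1} for i in ids}, sent.append)
```

Removing the IDs only after a chunk has been sent means a failed request leaves its IDs in the queue, so the next run picks them up again.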

Index

The Index command creates a new index with a write_new alias, enables syncing changes to the queue and bulk inserts data from the database to the index. After all documents are inserted, the write alias is switched to the new index, the update queue is replayed via the Replay command, the read alias is switched to the new index and the old one is deleted. And voila, indexing with zero downtime!
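The alias switch is what makes the zero downtime possible. A minimal sketch of the request body for the Elasticsearch `_aliases` endpoint, assuming the aliases are literally named `write` and `read` and using made-up index names, might look like this:

```python
from typing import Dict, List


def move_alias(alias: str, old_index: str, new_index: str) -> Dict[str, List[dict]]:
    """Body for POST /_aliases that moves `alias` from the old index to the new one.

    Both actions travel in one request, so the alias never points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical index names; the order follows the Index command described above.
switch_write = move_alias("write", "products_v1", "products_v2")
# ... the update queue is replayed here ...
switch_read = move_alias("read", "products_v1", "products_v2")
```

Switching the write alias first and the read alias last means readers keep hitting the old, complete index until the new one has caught up with the queued changes.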

Up and running

How are we going to deploy this huge change in a way that everything works? Once again, to the documentation! This left us with several possibilities:

Since we wanted to upgrade without downtime, we went with the second option → reindex from a remote cluster. For this to happen we had to have two parallel clusters:

  • the old 5.3.2 cluster that was still used in production; this cluster had 4 indices, and each index had both a read and a write alias pointing to it
  • a new, empty 7.10.1 cluster

We deployed the code overnight, when the site has the least traffic. To guide you through our deploy process, I will list the deploy actions.

Deploy actions:

  • We took one of the application servers out of the production pool, deployed new code on it and set it to connect to the new 7.10.1 cluster.

  • After that we created three new indices with the Create command. Each index had read and write alias pointing to it.

  • We enabled saving changes to the queue on the old cluster. These changes would later be replayed on the new cluster, ensuring that everything was up to date and that users would not notice the cluster switch.
  • Now that everything was ready, we ran the Index command for each index in our cluster. The Index command first created a new index with the alias write_new.

  • After the index creation, the command bulk inserted data fetched from the database into the new index. Indexing documents in all three indices took about three hours.

  • After indexing all documents into an index, the Index command switched the write and read aliases to the new index, and the write_new alias and the old index were deleted. This was done for all indices in the 7.10.1 cluster.

  • Half of the application servers had now been taken out of the production pool and the new code had been deployed on them.

  • After the deploy had finished, these servers were returned to the production pool and the other half was taken out.
  • We now ran the Replay command that updates documents from the update queue, making sure users don’t see stale data for more than a few minutes.
  • After replaying changes, we disabled syncing data to the update queue. Production now used the new cluster and all changes were saved directly into the new 7.10.1 cluster.
  • The code was then deployed to the other half of the servers that were now out of the production pool.

  • All servers were added back to the production pool and the 7.10.1 cluster was up and running.
  • No new data was saved to the old cluster at this point, and it could be shut down. We decided to leave it for 24 hours as backup in case something went wrong.

Nothing went wrong. Mission accomplished!

