Today I decided to run one more absurd test on Amazon Aurora, this time playing with different MySQL engines.
A little absurd experiment with MySQL and Aurora
Amazon RDS and Amazon Aurora fully support the InnoDB storage engine for MySQL DB instances. There are features like snapshot restore that are supported for the InnoDB storage engine only. But InnoDB is not the only engine available on RDS for MySQL or Aurora MySQL
You can see all the engines enabled running a simple:
MySQL> show engines;
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine | Support | Comment | Transactions | XA | Savepoints |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
| FEDERATED | NO | Federated MySQL storage engine | NULL | NULL | NULL |
| MEMORY | YES | Hash based, stored in memory, useful for temporary tables | NO | NO | NO |
| InnoDB | DEFAULT | Supports transactions, row-level locking, and foreign keys | YES | YES | YES |
| PERFORMANCE_SCHEMA | YES | Performance Schema | NO | NO | NO |
| MyISAM | YES | MyISAM storage engine | NO | NO | NO |
| MRG_MYISAM | YES | Collection of identical MyISAM tables | NO | NO | NO |
| BLACKHOLE | YES | /dev/null storage engine (anything you write to it disappears) | NO | NO | NO |
| CSV | YES | CSV storage engine | NO | NO | NO |
| ARCHIVE | YES | Archive storage engine | NO | NO | NO |
+--------------------+---------+----------------------------------------------------------------+--------------+------+------------+
InnoDB is the default engine and the one you should use for (almost) every single table you have. Unless you do not care about the data stored in the table.
But InnoDB is not the fastest and lightest storage engine on RDS. No, I am not talking about MyISAM, I am talking about the rubbish bin, the BLACKHOLE storage engine:
The BLACKHOLE storage engine acts as a “black hole” that accepts data but throws it away and does not store it. Retrievals always return an empty result.
What?
The best way to optimize IOPS and have fewer problems with data is not to keep the data in the first place. OK, we will store less data but do we really have benefits in CPU usage?
What is the impact on DBLoad and almost all on DBLoadNonCPU? How much does it really cost to store (useless) data?
Any real difference?
We are going to run a little experiment running an Aurora Serverless instance at a fixed size of 4 ACU (approximately 8GB).
Let's create a simple table and a procedure to populate it with a few (million and useless) records. Nothing fancy here, just some random numbers and 5 million records for every call.
CREATE TABLE data
(id bigint(20) NOT NULL AUTO_INCREMENT,
datetime TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
value float DEFAULT NULL,
PRIMARY KEY (id)) ENGINE=InnoDB;
DELIMITER $$
CREATE PROCEDURE generate_data()
BEGIN
DECLARE i INT DEFAULT 0;
WHILE i < 5000000 DO
INSERT INTO data (datetime,value) VALUES (
FROM_UNIXTIME(UNIX_TIMESTAMP('2022-01-01 01:00:00')+FLOOR(RAND()*31536000)),
ROUND(RAND()*100,2));
SET i = i + 1;
END WHILE;
END$$
DELIMITER ;
We can now call our simple procedure (twice, to see that results are not random and reproducible) and compare the metrics of the database with InnoDB and BLACKHOLE engines. We sleep a bit (200 seconds) between executions to have clearer metrics:
CALL generate_data();
SELECT sleep(200);
CALL generate_data();
SELECT sleep(200);
ALTER TABLE data ENGINE=BLACKHOLE;
CALL generate_data();
SELECT sleep(200);
CALL generate_data();
And the winner is...
As expected, the difference in CPU load of the database is massive, storing (useless) data is not cheap.
The insert throughput increases, reducing the time required to run the procedure.
Are you kidding, right?
No, there are some corner cases where running a simple:
ALTER TABLE myUselessTable ENGINE=BLACKHOLE;
is not totally crazy. Among the scenarios:
Emergency button. OK, your database is really running out of IOPS, you have no logical sharding to split the traffic to separate instances. And vertical scaling will not help. If you have some data you can lose, running a table on BLACKHOLE can be the out-of-jail card.
Speed up application changes. OK, that cool feature in your product is not so cool anymore, you have decided to stop collecting data and remove it but you will not have the fix / new version for days or weeks. You might be lucky enough and be able to drop the records in the meantime.
3 Test the impact of your new cool feature heavy on Write IOPS on a complex workload. Try with and without BLACKHOLE, as in these dummy tests. You might get some useful and scary numbers.
Sustainability?
As everybody talks about sustainability, let's add a final note. Last month I attended a talk about sustainability in serverless given by Sheen Brisals at the Serverless Architecture Conference in Berlin. One of Sheen's key messages was to get rid of useless data: if you don't need the data, don't store it.
Gert Leenders recently shared a similar view:
There's no value in a Petabyte data lake without data access.
I fully agree the best way to have fewer problems with data is not to have the data. And the cost of useless data is not only storage.
Don’t store data you do not need to store. You are not only taking storage you do not need. You are burning CPU cycles. You are warming the planet. You are draining your credit card. You are slowing your application.
You might not even need to change your product to rip some benefits. Go ahead with that simple ALTER TABLE.
Or drop the entire database and stop that RDS / Aurora instance. OK, maybe I am just kidding.