I kinda like the idea of writing about social media + databases, and it seems the readers here do too. So, let's take it in a new direction: looking at Twitch.tv, or any platform with instant messaging.
If you’re just getting started with databases in general, or databases + social media in particular, you might want to start off by reading my initial post, Database 101: Data Consistency for Beginners. That article captures my exploration of the many database paradigms out there as I look far beyond my previous experience with just SQL and MySQL. I’m keeping track of my studies in this Database 101 series.
Table Of Contents
- 1. Prologue
- 2. The Project: Twitch Sentinel
- 3. Making Good Decisions
- 4. Modeling our Ideas into Queries
- 5. Twitch as our payload!
- 6. Make it BURN!
- 7. Final Considerations
1. Prologue
One of my tasks as a Developer Advocate is to inspire developers to create and try new things. I also need to challenge myself to create interesting things that will engage people.
In the past few weeks, I was working on a talk called “ScyllaDB from the Developer Perspective: Building Applications.” My goal was to build a simple application with a database that could consume 1k to 3k operations per second. In this article, I’ll share what I learned, and by the end of it you, a Beginner Developer (or maybe not, let me know in the comments), should be able to build an application with a cool database that consumes tons of IOPS.
2. The Project: Twitch Sentinel
Some things that I really like to do in my free time are:
- Create useless PoCs (Proof of Concepts) that help me understand more about concepts in computer science.
- Do daily live coding at Twitch.tv (here is my channel).
- Write blog posts about my learnings and experiences.
Reflecting on that, I decided to start a Twitch project that involves high data ingestion; let’s call it Twitch Sentinel.
The idea is simple: collect all the messages from as many channels as possible on Twitch.tv, store the data, and retrieve some metrics from it.
Screenshot of my Twitch chat saying "hi" to you. lol
Can you imagine how you'd store more than 1k messages per second? It sounds like a cool challenge. Here are a few metrics that you could generate for your study:
- How many messages a specific streamer receives in their chat daily.
- How many messages a user sends per day.
- The hour a streamer or user is most likely to be online, based on their messages.
- Peak messages per hour, and so on.
But you might be asking, "Why should I create that if Twitch gives me the Analytics Panel?" The answer is: USELESS PoCs that will teach you something new!
Have you ever created an analytics dashboard before? With a really fast database? Thinking of new ways to model it? If the answer is YES, drop a like on this post! And if not, let me teach you what I've learned about this subject!
3. Making Good Decisions
When you start a new project, you should know that every part of the code and the tools that you will use to deliver it will directly impact the future of your project.
Also, the database that you choose directly impacts the performance of the entire application.
If you read the first post of my series Database 101, you probably remember discussions about the CAP Theorem and Database Paradigms. Now we're about to check some properties related to how fast or consistent a database can be.
3.1 ACID Acronym
If you're interested in Data Modeling, you’ve probably already heard about ACID. If not, ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. Together, these properties make up something called a Transaction in Relational Databases.
A Transaction is a single unit of work that tells your Database: if any of these ACID properties is violated, the query will not succeed. Let's explore this more:
- Atomicity: Each transaction is all-or-nothing; if any piece of it fails, the whole transaction fails and rolls back to the original state of that data.
- Consistency: Your data is only accepted if it fits the shape of the model, based on table modeling, triggers, constraints, etc.
- Isolation: Transactions are independent and will not interfere with one another while running.
- Durability: Once a transaction finishes, the data is persisted even if the system fails right after.
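To make this concrete, here's the classic money-transfer example in plain SQL (a minimal sketch; the accounts table is hypothetical):

BEGIN;

-- Atomicity: both updates apply together or not at all.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Consistency: a CHECK (balance >= 0) constraint would make the database
-- reject the whole transaction if the first account went negative.
-- Isolation: nobody else sees the money "in flight" between the two updates.

COMMIT; -- Durability: once COMMIT returns, the transfer survives a crash.
-- On any failure before COMMIT, everything rolls back to the original state.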
As you can see, ACID gives you an interesting safety guard for all your data, but this requires some tradeoffs:
- It's EXPENSIVE to maintain, since each transaction LOCKs (Isolation) your data, and coordinating those locks across many nodes gets costly.
- If your goal is to speed things up with higher throughput, ACID will hold you back, since all these operations need to succeed.
- You can't even think about Flexible Schemas, a.k.a. "Documents", since they break the Consistency property.
So what, do we just start writing JSON or TXT files as our database? Well, that's not as bad as it sounds, but let's do it properly with Non-Relational Databases (a.k.a. NoSQL).
3.2 BASE Acronym
When we jump into NoSQL databases, probably the most important thing to know is: there are paradigms and each paradigm is strong for something specific.
But something that is common between most NoSQL Database Paradigms is the BASE properties. This acronym stands for: Basically Available, Soft State, and Eventual Consistency.
Now things get interesting, because those properties allow you to do "less validation" on your data: you guarantee the integrity from the developer side. But before talking about that, let's understand what each property means:
- Basically Available: You can Read/Write data from your database anytime you want, but that doesn't mean the data you get is up to date.
- Soft State: You can mostly shape your schema on the fly, which turns data modeling into a Developer task, not a DB Administrator one.
- Eventual Consistency: All the data will be synced across nodes and datacenters over time until it reaches the Consistency Level needed.
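To make "eventually" concrete: databases in this family usually let you tune how much consistency you trade for speed, per session or even per query. A minimal sketch in cqlsh (the shell for CQL databases, which we'll get to soon):

-- Fastest: a single replica acknowledges the write; reads may return stale data.
CONSISTENCY ONE

-- Stronger: a majority of replicas must agree before the query succeeds.
CONSISTENCY QUORUM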
So the BASE properties give us a database that can receive any shape of data (with respect to your data modeling), that is always available to query, and that is a good option for high data ingestion. But there are also tradeoffs:
- If you need strong data consistency, it may not be the best approach;
- Higher complexity, since the DB Modeling is delegated to the Developer;
- Data conflicts can be a pain when nodes across your clusters need to sync.
Now that we know more about a few properties a Database can generally have, it's time to decide based on our project.
3.3 Concluding a Good Decision
ACID vs BASE? Who wins? The type of project decides! Often you can use multiple databases in a project, so it’s not a problem at all. But if you need to choose just one, choose wisely. To be clear:
- ACID should be chosen when you NEED TO HAVE Data Consistency with Transactions and performance isn’t a key consideration.
- BASE should be chosen when you have a higher demand for IOPS and are sure about your Data Modeling.
For our project, we're going to receive a huge amount of messages from Twitch.tv. We need to store them quickly and handle all of them. So, of course, we're going to abandon the ACID 'Safety Guard' and jump into BASE 'Do Whatever Wisely' xD
Thinking about that, I decided to use CQL and ScyllaDB, since it can handle our idea of receiving millions of messages per second while still giving us tunable Consistency. If Discord uses ScyllaDB for storing messages, why not use it with Twitch? :D
4. Modeling our Ideas into Queries
When you're using ScyllaDB, your main focus needs to be on which queries you want to run. With that in mind, we need to:
- Store messages and read them when necessary.
- Store and read often from a streamer list.
- Count all messages sent by chat users in a specific stream.
So our data modeling should be like:
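In CQL, that boils down to a single messages table. Here's a minimal sketch (the column names come from the queries below; the primary key shown is one reasonable choice, partitioning by streamer and clustering messages by time, not necessarily the exact schema from my talk):

-- Keyspace for the project (SimpleStrategy is fine for a single-datacenter PoC).
CREATE KEYSPACE IF NOT EXISTS dev_sentinel
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- One row per chat message.
CREATE TABLE IF NOT EXISTS dev_sentinel.messages (
    streamer_id      text,      -- Twitch "room-id"
    chatter_id       text,      -- Twitch "user-id"
    chatter_username text,
    message          text,
    sent_at          timestamp, -- Twitch "tmi-sent-ts"
    PRIMARY KEY (streamer_id, sent_at, chatter_id)
);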
No big deal here; it's supposed to be simple. We need the fastest throughput possible, and complex queries wouldn't let us reach that performance.
Here are a few queries we can run on the data stored in this model. One heads-up: plain CQL is stricter than SQL about aggregations (GROUP BY only works on primary key columns, and you can't ORDER BY an aggregate), so treat these as the metrics we're after; depending on your final schema, some of them would need an index, ALLOW FILTERING, or aggregation on the application side:
4.1 Number of messages per user (top 5)
SELECT
chatter_id,
chatter_username,
count(*) AS msg_count
FROM
dev_sentinel.messages
GROUP BY
chatter_id,
chatter_username
ORDER BY
msg_count DESC
LIMIT 5
4.2 Number of unique users per stream(er)
SELECT
streamer_id,
COUNT(DISTINCT chatter_id) AS unique_chatters
FROM
dev_sentinel.messages
GROUP BY
streamer_id
ORDER BY
unique_chatters DESC
4.3 Earliest and latest message of a given user
SELECT
min(sent_at) AS earliest_msg,
max(sent_at) AS latest_msg
FROM
dev_sentinel.messages
WHERE
chatter_username = 'danielhe4rt'
This is very similar to the modeling I discussed in that post, where you'll find a more detailed explanation of a few pieces of the code. Feel free to take a look :p
5. Twitch as our payload!
Ok, now we have the Database concept modeled, and we need a "real world" payload. I don't know if you like those tutorials that just mock all the data to show you a huge number that doesn't mean anything in the end... I simply don't, and that's why I wanted to bring something real for you to explore.
Twitch's streaming platform has a bunch of APIs that can interact with the streamer chat. The best known is called 'TMI' (Twitch Messaging Interface), a client that connects directly to any Twitch Streamer chat you want. Here's a list of clients for you to check out:
- tmi.js - Interface for NodeJS
- tmi.php - Interface for PHP (Easily integrated to Laravel)
- pytmi - Interface for Python
- twitch-irc-rs - Interface for Rust
Anyway, the idea is the same for all of these clients: you choose a channel and connect to it. With tmi.php, the code looks like this:
use GhostZero\Tmi\Client;
use GhostZero\Tmi\ClientOptions;
use GhostZero\Tmi\Events\Twitch\MessageEvent;

// Connect to one (or more) channels over a secure, auto-reconnecting connection.
$client = new Client(new ClientOptions([
    'connection' => [
        'secure' => true,
        'reconnect' => true,
        'rejoin' => true,
    ],
    'channels' => ['danielhe4rt'],
]));

// Fires for every chat message posted in the joined channels.
$client->on(MessageEvent::class, function (MessageEvent $e) {
    print "{$e->tags['display-name']}: {$e->message}";
});

$client->connect();
Each Twitch payload has a "tags" array that brings us a JSON with all the data related to that specific message:
{
"badge-info": {
"subscriber": "58"
},
"badges": {
"broadcaster": "1",
"subscriber": "3036",
"partner": "1"
},
"client-nonce": "3e00905ed814fb4d846e8b9ba6a9c1da",
"color": "#8A2BE2",
"display-name": "danielhe4rt",
"emotes": null,
"first-msg": false,
"flags": null,
"id": "b40513ae-efed-472b-9863-db34cf0baa98",
"mod": false,
"returning-chatter": false,
"room-id": "227168488",
"subscriber": true,
"tmi-sent-ts": "1686770892358",
"turbo": false,
"user-id": "227168488",
"user-type": null,
"emotes-raw": null,
"badge-info-raw": "subscriber/58",
"badges-raw": "broadcaster/1,subscriber/3036,partner/1",
"username": "danielhe4rt",
"message-type": "chat"
}
From this payload, we only need:
- room-id: identifier of the specific broadcaster channel.
- user-id: identifier of the user who sent the message.
- tmi-sent-ts: the message timestamp.
On the Message interface, you will also receive the message text itself as a string.
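Putting it together, the message handler can pull exactly those fields out of the payload. A sketch (field names match the payload above; the actual insert into dev_sentinel.messages depends on which CQL driver you pair with your client):

$client->on(MessageEvent::class, function (MessageEvent $e) {
    // Field names match the "tags" payload shown above.
    $streamerId = $e->tags['room-id'];           // broadcaster channel id
    $chatterId  = $e->tags['user-id'];           // id of the user who sent it
    $sentAt     = (int) $e->tags['tmi-sent-ts']; // epoch milliseconds
    $message    = $e->message;                   // the message text itself

    // From here, hand the row to dev_sentinel.messages
    // through whatever ScyllaDB/Cassandra driver you chose.
});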
This is a simple project, but seriously, try to pull more ideas out of it and let me know! I'll gladly help you create something bigger!
6. Make it BURN!
As I told you at the beginning of this article, my goal with this project was to build a highly scalable application with a really cool database that handles our needs by receiving a huge number of payloads per second.
So, we connected to the ~20k most accessed chats on Twitch.tv and got an average of 1,700 to 2,000 messages per second. That works out to roughly 6 MILLION messages per hour. Have you ever coded something with such high data ingestion?
While the application is receiving all this data and writing it to ScyllaDB, here are some statistics from a t3.micro cluster, one of the cheapest instance types on AWS.
It handles 1k requests/s like it's nothing, with sub-millisecond P99 latency. Also, the load on this lightweight machine at 1k/s is just 8%, so you can push something monstrously faster if you want.
Mostly, it depends on how many streamers' channels your bot is connected to and how many messages viewers send per second.
7. Final Considerations
This project taught me a lot about how important it is to choose the right tool for the right job. In this specific case, the database needs to be something you pick while already thinking at a higher scale.
Just remember that it is totally fine to have more than one database inside a project; each one solves a generic or specific problem in your environment. Always do proper research and a PoC with as many tools as possible if you have time!
If you want to build it yourself, check this tutorial on Killercoda, and don't forget to follow me on my socials!
Stay hydrated and see you soon.
Twitter DanielHe4rt PT-BR
Twitter Danielhe4rt EN
Twitch Channel (Daily LiveCoding Sessions)
Instagram
YouTube