Analytics involves the computation of data to find patterns. From simple calculations - such as the sum of values or its average to statistics calculated with machine learning algorithms, analytics frequently deals with big amounts of data that need to be processed as fast as possible.
We saw in the previous part of this series of articles, how indexes help to improve the speed of database operations. And in all honesty, the problem in the laboratory application shown in this article can be solved by merely adding an index. However, in real-world applications with many tables that have a lot of columns, dealing with indexes can become difficult. In this article, I explore one interesting technology that helps with analytic SQL queries: ColumnStore using Spring Boot and Vaadin Flow.
Read part 1 of this series if you want to learn how the example application that we'll use here works, and part 2 if you want to see the power of database indexes in action. You can find the code on GitHub. Also, and if you prefer, I recorded a video version of this article:
The Requirement
Let's say we are developing a web page (or view, in Vaadin terminology) to show analytical data; a report if you wish. In particular, we want to start by showing the top authors (from the book_demo
database) according to the number of books they have published. Let's say that a top author is an author with more than 12 published books. We have to show a grid with the name of the top author and the number of books published.
Adding a Native Query to the Repository
Let's start by adding a native query. We could use a JPQL query, but to understand the point of this article more easily, a native query will simplify things. Additionally, we get to see how to use native queries with Spring Data.
When using native queries with Spring Data, we don't have to use the JPA Entity classes. Instead, we can define a customized view of the result of a query in form of a Java interface with one getter for each column in the result of the query. Here's the query we need in the repository class:
@Repository
public interface BookRepository extends JpaRepository<Book, Integer> {
interface TopAuthor{
String getAuthor();
int getBookCount();
}
...
@Query(value = "select author, count(id) as bookCount from book group by author having count(id) > ?1", nativeQuery = true)
List<TopAuthor> findTopAuthors(int bookCount);
}
Notice that we used nativeQuery = true
in the @Query
annotation. Spring Data will create objects of type TopAuthor
to build the result List
.
We also need to update the service class to allow the UI to retrieve the top authors:
@Service
public class BookService {
private final BookRepository repository;
...
public List<BookRepository.TopAuthor> findTopAuthors(int bookCount) {
return repository.findTopAuthors(bookCount);
}
}
Adding a View (Vaadin Flow)
We need to add a new Vaadin Flow view, or web page, or route. They are all almost synonyms (except that technically it's not a web page, but let's not worry about it for now). Here's an implementation:
@Route("report")
public class ReportView extends VerticalLayout {
public ReportView(BookService service) {
var grid = new Grid<>(BookRepository.TopAuthor.class);
grid.setSizeFull();
grid.setItems(service.findTopAuthors(12));
add(grid);
setSizeFull();
}
}
This short class creates a vertical layout (with a grid) in it and exposes it at http://localhost:8080/report. The Grid
component is populated with the top authors returned by the service class.
Testing the Slow Behavior
Start the application by running the main
method in the Application
class and see the result in the browser. Depending on the number of books that your database has, the view is going to be rendered quickly or dramatically slowly! In my case, I added one million books to the database (I wrote an article that shows you how to generate many books for your database). With one million books, the view takes several seconds to appear in the browser as you can see in the video.
Using ColumnStore
Like I mentioned in the introduction, this can be solved by adding an index to the author
column. This works in this over-simplistic example, but in a real-life project, you could easily have hundreds of tables with dozens of columns each. Dealing with indexes becomes kind of difficult, to say the least. MariaDB's ColumnStore allows us, among other things, to gain the advantages of indexes without having to manually configure indexes.
Let's create a new table using the ColumnStore engine:
CREATE TABLE book_analytics
(
id int(11) NOT NULL,
author varchar(255) DEFAULT NULL,
pages int(11) DEFAULT NULL,
publish_date date DEFAULT NULL,
title varchar(255) DEFAULT NULL
) engine=ColumnStore;
I left out the image_data
column from the original book
table since it makes little sense to have it in a table used for analytical purposes. Notice that we cannot add indexes to tables that use the ColumnStore engine. We don't have to. That's one of the advantages.
We have to populate this new table from the original one. Since MariaDB uses pluggable engines, they can be mixed in the same database instance and in SQL statements, so we can just copy the data from one table to another using SQL even if they use different engines:
insert into book_analytics(id, author, pages, publish_date, title)
select id, author, pages, publish_date, title from book;
Testing the Improvement
Before testing the improvement, we need to use the book_analytics
table in the query we added to the repository:
@Query(value = "select author, count(id) as bookCount from book_analytics group by author having count(id) > ?1", nativeQuery = true)
List<TopAuthor> findTopAuthors(int bookCount);
Rerun the application and see how now the books are retrieved in less than a second!