This is a basic cheat sheet and glossary for getting started with Apache Spark. Every time we share a new post with terms or code snippets, they will also appear here in a generic form.
If you work with Apache Spark and are looking for a cheat sheet, this is for you as well!
First things first:
-1- The workspace:
First, we need to create the workspace. We are using a Databricks workspace, and here is a tutorial for creating it.
-2- Basic Apache Spark vocabulary:
DataFrame
A distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates. DataFrame data is often distributed across multiple machines, and it can live in memory or on disk.
Dataset
A strongly typed collection of objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row objects.
A Dataset is "lazy", meaning that computations are only triggered when an action is invoked.
RelationalGroupedDataset
A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup.
This is an evolving page; more terms, code snippets and architecture designs will be added.