Sets, sets are everywhere

stereobooster - Apr 20 '19 - - Dev Community

I strongly believe that it is ok to learn on the go (unless you have a life-critical profession, like a doctor of medicine or you writing software for auto-pilot). If you learn by doing you mostly get practical knowledge. Sometimes it is useful to pause and to summarise practical knowledge down to theory or study theory if this area is already well known.

In math (not always) they tend to teach theory without explaining what practical use of it.

In this post, I want to talk about the mathematical subject, which may seem as irrelevant for a developer at first glance - sets. I will explain basic ideas and how they appear in development in many different places, for example as a data structure, in types, in RDBMS.

I will give a short theoretical definition followed by a practical example. I took some of the definitions from the book The Grammar of Graphics by Leland Wilkinson.

A set

A set is a collection of unique objects. An object in a set is called an element or a member of the set. If we agree to denote sets by capital letters (for example, X) and members of the set by lower-case letters (for example x), then mathematical notation for "x is an element of X" would be

x ∈ X, where  x = 1 and X = {1}

Set as a data structure

In some programming languages set is a built-in data structure, for example in JS:

const x = 1
const X = new Set([1]);
X.has(x)

or in Python

x = 1
X = set([1])
print(x in X)

In practice, it can be used to find all unique values in a list, for example in JS

const unique = [...new Set(list)];

Null set and numbers

The null set ( or {}) is an empty set or set with no elements. In math, they denote set of real numbers by R (), the integers by Z (), natural numbers by N ()

Types as sets

Types can be modeled as sets (at least to some extent). We can say that type is a set of all values of this type, for example when we say that some variable is of type integer, we mean that it can take all values of the set of integer numbers Z (obviously with limitation of computer memory).

The empty set would correspond to type with no values. What can it be? It is a type of function with no return value (some people would call it procedure), for example in TypeScript:

function hello(name: string): void {
  console.log(`Hello, ${name}!`);
}

Don't be confused by the undefined in case of JavaScript or null in different languages. We say that void corresponds to the empty set, but in practice it may have one (or two in JavaScript, depending on how you treat it) value undefined/null, this is done for convenience, so that value which represents the absence of value would be possible to pass around as the result of function or variable. There are different approaches to represent the absence of a value, for example Maybe monad or Null Object Pattern.

Side note: Also in TypeScript, they have a real empty set

let x: never;
x = undefined; // error

it gets very confusing at this point, but don't be. In math, there is a straightforward definition but in JavaScript, they made an error by introducing two nulls (undefined and null). We need null for convenience otherwise we would be forced to use Maybe monad or similar solution. TypeScript is a superset of JavaScript, so it inherits its all quirks, but also because it is a modern type system they introduced proper empty set, for example, to use in an exhaustiveness checking.

Subset, superset

If every element of set A is also an element of a set B, then A is a subset of B, denoted A ⊆ B. If A is a subset of B, but there is at least one element of B that is not in A, then A is proper subset of B, denoted A ⊂ B. As well we can define the reverse version of subset, called superset and denoted and proper superset denoted as .

In JavaScript

const Zet = require('zet');
const A = new Zet([1, 2, 3]);
const B = new Zet([3, 4, 5]);
A.subset(B)

In Python

A = {1,2,3}
B = {3,4,5}
A <= B 

TypeScript as a superset of JavaScript

What do they mean when they say that TypeScript is a superset of JavaScript? They mean that if we model JavaScript and TypeScript languages as a set of all syntactically correct programs (in those languages), one of the sets would be a subset of the other one:

TypeScript ⊃ JavaScript

The Liskov Substitution Principle and subsets

In Object Oriented Design they have SOLID principles and L in the acronym stands for the Liskov Substitution Principle, which says that derived classes must be substitutable for their base classes.

We can say that class is the set of all instances of this class, then if we have class A and class B - a subclass of A:

class A {}
class B extends A {}

and they follow the Liskov Substitution Principle, we can see that set B is a subset of A because all instances of class B also considered instances of class A (from behavior point of view, nominally they are different).

Note: we just showed correspondence between sets, types (in programming) and classes.

Cardinality

If each element of A can be paired with an element of B such that each element of A occurs exactly once, and each element of B occurs exactly once, in the pairings we say that A and B are equivalent, denoted A ~ B. If a set A is equivalent to a subset of the set of integers, we say that set A is countable. If A is null or equivalent to the set {1, 2,.. m }, where m is a positive integer, we say A is a finite set. The cardinality of a finite set is m, the count of its elements.
An indexed set is set of the form {(1, a), (2, b),.. (m, n)}
(a, b) is called tuple (or ordered pair). (a, b,.. z) is called n-tuple, also sometimes people refer to n-tuples as tuples.

Relational database and DataFrames

Tables in relational databases meant to be set of tuples (set of tuples also can be called frame). Some databases ignore this idea, for example, MySQL.
This explains why it is nice to have a primary key - it is easy to guarantee the uniqueness of each tuple (no need to check all values, we can check the only one).
This also explains why it is ok that PostgreSQL returns rows in different order unless ORDER BY specified for each query - because the set doesn't provide order. MySQL, for comparison, returns rows in the order they were inserted (at least it did when I checked the last time).

The cardinality of a finite set in SQL

SELECT COUNT(*) FROM table_name;

In Python, they have pandas.DataFrame. Which can be modeled as an indexed set of tuples or as a tuple of indexed sets. It is implemented as a column-oriented store, for the mathematical model it doesn't matter, it changes performance, but not an abstraction.

Intersection and union operations

The intersection of two sets A and B, denoted by A ∩ B, is the set which elements belongs to A and B sets simultaneously:

A ∩ B = { x: x ∈ A and x ∈ B }

For example

A = {1, 2} and B = {2, 3, 4} then A ∩ B = {2}

The union of two sets A and B, denoted by A ∪ B, is the set which elements belongs to either A or B set:

A ∪ B = { x: x ∈ A or x ∈ B }

For example

A = {1, 2} and B = {2, 3, 4} then A ∪ B = {1, 2, 3, 4}

The disjoint union of two sets A and B, denoted by A ⊔ B, produces set whose members are tagged elements. A tagged element is one for the form x:$ where x ∈ X is the element and the symbol $ is the tag. A tag sometimes called an identifier or a color; it may be a string, a numerical value, or another piece of information. With a disjoint union, we tag element with the name of the set containing it. For example

A = {1, 2} and B = {2, 3, 4} then A ⊔ B = {1:A, 2:A, 2:B, 3:B, 4:B}

The (Cartesian) product of sets A and B, denoted by A × B, is set with combinations of elements from sets A and B

A × B = { (a,b): a ∈ A or b ∈ B }

For example

A = {1, 2} and B = {2, 3, 4} then A × B = {(1,2), (1,3), (1,4), (2,2), (2,3), (2,4)}

More about types

If we say there is a correspondence between sets and types there suppose to be correspondence for operations as well.

Let's use TypeScript for examples. Union:

type A = 'a' 
type B = 'b'
type C = A | B

In type systems, disjoint unions sometimes are called discriminated unions or tagged unions. For example

interface Rectangle {
    type: "rectangle";
    width: number;
    height: number;
}
interface Circle {
    type: "circle";
    radius: number;
}
type Shape = Rectangle | Circle;

in the given example type field is a tag. This kind of pattern often used in event processing and works well with pattern matching.

Before we can move on to Cartesian product we need to talk about how record types can be represented with sets

type A = { a: number };
type B = { b: number}

We can represent this as a set of tuples, where the first element in every tuple is the name of the property:

A = {(a, 1), (a, 2),.. (a,n)}
B = {(b, 1), (b, 2),.. (b,n)}

How would a Cartesian product look like?

C = A × B = {((a,1), (b,1)), ((a,1), (b,2)), ((a,2), (b,1)),.. ((a,n), (b,n)) }

As you can guess tuple of tuples represent Record with two fields:

type C = { a: number, b: number }

Then Cartesian product corresponds to "Intersection Types" (at least this is how they call it in TypeScript).

type C = A & B

PS

My idea was to introduce a bare minimum of the theory, but enough to see how it is used in real life. Tell me if it is graspable or not, especially if you don't have a CS degree.

Do you consider useful to understand a bit of theory behind those subjects? Did any of those ideas click for you?

Cover image: Construção, Ione Saldanha

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .