We recently migrated our Meeshkan product from Python TypedDict
to dataclasses
. This article explains why. We'll start with a general overview of types in Python. Then, we'll walk through the difference between the two typing strategies with examples. By the end, you should have the information you need to choose the one that's the best fit for your Python project.
Table of Contents
- Types in Python
-
Classes and
dataclasses
-
TypedDict
-
Migrating from
TypedDict
todataclasses
- Conclusion
Types in Python
PEP 484, co-authored by Python's creator Guido van Rossum, gives a rationale for types in Python. He proposes:
A standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information.
For me, static analysis is the strongest benefit of types in Python.
It takes code like this:
# exp.py
def exp(a, b):
return a ** b
exp(1, "result")
Which raises this error at runtime:
$ python exp.py
File "./exp.py", line 4, in <module>
exp(1, "result")
File "./exp.py", line 2, in exp
return a ** b
TypeError: unsupported operand type(s) for ** or pow(): 'int' and 'str'
And allows you to do this:
# exp.py
def exp(a: int, b: int) -> int:
return a ** b
exp(1, "result")
Which raises this error at compile time:
$ mypy exp.py # pip install mypy to install mypy
exp.py:4: error: Argument 2 to "exp" has incompatible type "str"; expected "int"
Found 1 error in 1 file (checked 1 source file)
Types help us catch bugs earlier and reduces the number of unit tests to maintain.
Classes and dataclasses
Python typing works for classes as well. Let's see how static typing with classes can move two errors from runtime to compile time.
Setting up our example
The following area.py
file contains a function that calculates the area of a shape using the data provided by two classes:
# area.py
class RangeX:
left: float
right: float
class RangeY:
up: float
down: float
def area(x, y):
return (x.right - x.lefft) * (y.right- y.left)
x = RangeX(); x.left = 1; x.right = 4
y = RangeY(); y.down = -3; y.up = 6
print(area(x, y))
The first runtime error this produces is:
$ python area.py
Traceback (most recent call last):
File "./area.py", line 14, in <module>
print(area(x, y))
File "./area.py", line 10, in area
return (x.right - x.lefft) * (y.right- y.left)
AttributeError: 'RangeX' object has no attribute 'lefft'
Yikes! Bitten by a spelling mistake in the area
function. Let's fix that by changing lefft
to left
.
We run again, and:
$ python area.py
Traceback (most recent call last):
File "./area.py", line 14, in <module>
print(area(x, y))
File "./area.py", line 10, in area
return (x.right - x.left) * (y.right- y.left)
AttributeError: 'RangeY' object has no attribute 'right'
Oh no! In the definition of area
, we have used right
and left
for y instead of up
and down
. This is a common copy-and-paste error.
Let's change the area
function again so that the final function reads:
def area(x, y):
return (x.right - x.left) * (y.up - y.down)
After running our code again, we get the result of 27
. This is what we would expect the area of a 9x3 rectangle to be.
Adding type definitions
Now let's see now how Python would have caught both of these errors using types at compile time.
We first add type definitions to the area
function:
# area.py
class RangeX:
left: float
right: float
class RangeY:
up: float
down: float
def area(x: RangeX, y: RangeY) -> float:
return (x.right - x.lefft) * (y.right - y.left)
x = RangeX(); x.left = 1; x.right = 4
y = RangeY(); y.down = -3; y.up = 6
print(area(x, y))
Then we can run our area.py
file using mypy
, a static type checker for Python:
$ mypy area.py
area.py:10: error: "RangeX" has no attribute "lefft"; maybe "left"?
area.py:10: error: "RangeY" has no attribute "right"
area.py:10: error: "RangeY" has no attribute "left"
Found 3 errors in 1 file (checked 1 source file)
It spots the same three errors before we even run our code.
Working with dataclasses
In our previous example, you'll notice that the assignment of attributes like x.left
and x.right
is clunky. Instead, what we'd like to do is RangeX(left = 1, right = 4)
. The dataclass
decorator makes this possible. It takes a class
and turbocharges it with a constructor and several other useful methods.
Let's take our area.py
file and use the dataclass
decorator.
# area.py
from dataclasses import dataclass
@dataclass # <----- check this out
class RangeX:
left: float
right: float
@dataclass # <----- and this
class RangeY:
up: float
down: float
def area(x: RangeX, y: RangeY) -> float:
return (x.right - x.left) * (y.up - y.down)
x = RangeX(left = 1, right = 4)
y = RangeY(down = -3, up = 6)
print(area(x, y))
According to mypy
, our file is now error-free:
$ mypy area.py
Success: no issues found in 1 source file
And it gives us the expected result of 27
:
$ python area.py
27
class
and dataclass
are nice ways to represent objects as types. They suffer from several limitations, though, that TypedDict
solves.
TypedDict
But first...
Brief introduction to duck typing
In the world of types, there is a notion called duck typing. Here's the idea: If an object looks like a duck and quacks like a duck, it's a duck.
For example, take the following JSON:
{
"name": "Stacey O'Hara",
"age": 42,
}
In a language with duck typing, we would define a type with the attributes name
and age
. Then, any object with these attributes would correspond to the type.
In Python, classes aren't duck typed, which leads to the following situation:
# person_vs_comet.py
from dataclasses import dataclass
@dataclass
class Person:
name: str
age: int
@dataclass
class Comet:
name: str
age: int
Person(name="Haley", age=42000) == Comet(name="Haley", age=42000) # False
This example should return False
. But without duck typing, JSON or dict
versions of Comet
and Person
would be the same.
We can see this when we check our example with asdict
:
from dataclass import asdict
asdict(Person(name="Haley", age=42000)) == asdict(Comet(name="Haley", age=42000)) # True
Duck typing helps us encode classes to another format without losing information. That is, we can create a field called type
that represents a "person"
or a "comet"
.
Working with TypedDict
TypedDict
brings duck typing to Python by allowing dict
s to act as types.
# person_vs_comet.py
from typing import TypedDict
class Person(TypedDict):
name: str
age: int
class Comet(TypedDict):
name: str
age: int
Person(name="Haley", age=42000) == Comet(name="Haley", age=42000) # True
An extra advantage of this approach is that it treats None
values as optional.
Let's imagine, for example, that we extended Person
like so:
# person.py
from dataclasses import dataclass
from typing import Optional
@dataclass
class Person:
name: str
age: int
car: Optional[str] = None
bike: Optional[str] = None
bank: Optional[str] = None
console: Optional[str] = None
larry = Person(name="Larry", age=25, car="Kia Spectra")
print(larry)
If we print a Person
, we'll see that the None
values are still present:
Person(name='Larry', age=25, car='Kia Spectra', bike=None, bank=None, console=None)
This feels a bit off - it has lots of explicit None
fields and gets verbose as you add more optional fields. Duck typing avoids this by only adding existing fields to an object.
So let's rewrite our person.py
file to use TypedDict
:
# person.py
from typing import TypedDict
class _Person(TypedDict, total=False):
car: str
bike: str
bank: str
console: str
class Person(_Person):
name: str
age: int
larry: Person = dict(name="Larry", age=25, car="Kia Spectra")
print(larry)
Now when we print our Person
, we only see the fields that exist:
Person(name='Larry', age=25, car='Kia Spectra')
Migrating from TypedDict
to dataclasses
You may have guessed by now, but generally, we prefer duck typing over classes. For this reason, we're very enthusiastic about TypedDict
. That said, in Meeshkan, we migrated from TypedDict
to dataclasses
for several reasons. Throughout the rest of this article, we'll explain why we made the move.
The two reasons we migrated from TypedDict
to dataclasses
are matching and validation:
- Matching means determining an object's class when there's a union of several classes.
- Validation means making sure that unknown data structures, like JSON, will map to a class.
Matching
Let's use the person_vs_comet.py
example from earlier to see why class
is better at matching in Python.
# person_vs_comet.py
from dataclasses import dataclass
from typing import Union
@dataclass
class Person:
name: str
age: int
@dataclass
class Comet:
name: str
age: int
def i_am_old(obj: Union[Person, Commet]) -> bool:
return obj.age > 120 if isinstance(obj, Person) else obj.age > 1000000000
print(i_am_old(Person(name="Spacey", age=1000))) # True
print(i_am_old(Comet(name="Spacey", age=1000))) # False
In Python, isinstance
can discriminate between union types. This is critical for most real-world programs that support several types.
In Meeshkan, we work with union types all the time in OpenAPI. For example, most object specifications can be a Schema
or a Reference
to a schema. All over our codebase, you'll see isinstance(r, Reference)
to make this distinction.
TypedDict
doesn't work with isinstance
- and for good reason. Under the hood, isinistance
looks up the class name of the Python object. That's a very fast operation. With duck typing, you'd have to inspect the whole object to see if "it's a duck." While this is fast for small objects, it is too slow for large objects like OpenAPI specifications. The isinstance
pattern has sped up our code a lot.
Validation
Most code receives input from an external source, like a file or an API. In these cases, it's important to verify that the input is usable by the program. This often requires mapping the input to an internal class. With duck typing, after the validation step, this requires a call to cast
.
The problem with cast
is that it allows incorrect validation code to slip through. In the following person.py
example, there is an intentional mistake. It asks if isinstance(d['age'], str)
even though age
is an int
. cast
, because it's so permissive, won't catch this error:
# person.py
from typing import cast, TypedDict, Optional
class Person(TypedDict):
name: str
age: Optional[int]
def to_person(d: dict) -> Person:
if ('name' in d) and isinstance(d['name'], str) and (('age' not in d) or (( 'age' in d) and (isinstance(d['age'], str))):
return cast(d, Person) # this will work at runtime even though it shouldn't
raise ValueError('d is not a Person')
However, a class
will only ever work with a constructor. So this will catch the error at the moment of construction:
# person.py
from typing import Optional
from dataclasses import dataclass
@dataclass
class Person:
name: str
age: Optional[int] = None
def to_person(d: dict) -> Person:
if ('name' in d) and isinstance(d['name'], str) and (('age' not in d) or (( 'age' in d) and (isinstance(d['age'], str))):
# will raise a runtime error for age when age is a str
# because it is `int` in `Person`
return Person(**to_person)
raise ValueError('d is not a Person')
The above to_person
will raise an error, whereas the TypedDict
version won't. This means that, when an error arises, it will happen later down the line. These types of errors are much harder to debug.
When we changed from TypedDict
to dataclasses
in Meeshkan, some tests started to fail. Looking them over, we realized that they never should have succeeded. Their success was due to the use of cast
, whereas the class
approach surfaced several bugs.
Conclusion
While we love the idea of TypedDict
and duck typing, it has practical limitations in Python. This makes it a poor choice for most large-scale applications. We would recommend using TypedDict
in situations where you're already using dicts
. In these cases, TypedDict
can add a degree of type safety without having to rewrite your code. For a new project, though, I'd recommend using dataclasses
. It works better with Python's type system and will lead to more resilient code.
Disagree with us? Are there any strengths or weaknesses of either approach that we're missing? Leave us a comment!