Using custom data types in Python: reflections on the use of type hints
Here, I discuss a benefit I have found after adopting type hinting in Python: thinking more about data type appropriateness, with an example application to participant identifiers in a research data context.
An important change to my Python programming approach this year has been the comprehensive use of type hints.
Although the type hints are not checked at runtime, I have found the routine execution of a type checker like mypy to be very valuable in picking up errors, unanticipated edge cases, and general sloppiness.
More unexpected, to me, has been how much using type hints has changed the way I structure my code.
For example, my code often requires the representation of participant identifiers—that is, identifiers that allow research data to be associated with particular human participants.
There are no field-level conventions for the format of participant IDs, but they are typically either short numeric or alphanumeric sequences if the research involves a physical component (e.g., recording IDs on paper) or UUIDs if the interaction is only ever digital.
My preference is for participant IDs to be five-character sequences beginning with a lowercase p and ending with an integer that is zero-padded to a length of four characters; for example, p0420 would be a valid participant ID under my system whereas a01 would be invalid.
Previously, I would have represented a participant ID in a variable with a str type, which is the built-in data type in Python for representing Unicode strings.
For example, a function using a participant ID might look something like:
def some_sort_of_analysis(participant_id: str) -> None:
    # some functionality using `participant_id` ...
    ...
However, a benefit of adding type hints that I have found is that it provides motivation to think more carefully about the type of variables.
Is a str really the most appropriate type for participant_id?
There are two main ways in which I think str is inappropriate here:
- It doesn’t enforce any of the known constraints. The flexibility of the str type means that it will happily represent sequences that we know cannot correspond to participants, as we have defined them. For example, "a01" is a perfectly valid str but an invalid participant identifier, according to our definition.
- It permits nonsensical operations. A variable of type str supports a large number of operations and methods, almost all of which do not make sense for participant identifiers. For example, it doesn’t make sense to add two participant IDs together, or to select individual characters in a participant ID, or to reverse the characters, etc.
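To make both points concrete, here is a small sketch of what a plain str will accept. The variable names are made up for illustration; the identifiers are the ones used as examples above.

```python
# Illustrative only: every operation below succeeds on a plain str,
# even though none is meaningful for a participant identifier.
first_id: str = "p0420"
second_id: str = "p0421"

invalid_id: str = "a01"  # accepted, despite breaking the 'pXXXX' format

concatenated = first_id + second_id  # "adding" two participants
first_character = first_id[0]  # selecting an individual character
reversed_id = first_id[::-1]  # reversing the characters

print(concatenated, first_character, reversed_id)  # p0420p0421 p 0240p
```

Neither Python nor a type checker has any reason to complain about any of these lines.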
I feel like there is a connection here between my finding the above objectionable and my shift towards Bayesian data analysis. It reminds me of the heuristic of checking a model by drawing prior-predictive samples and seeing whether the resulting values make sense within the particular domain of interest.
Rather than using the built-in str data type, we can use our own custom type by defining a new class called ParticipantID.
We can start with a simple definition, in which we accept a participant ID as a string and store it within the attribute _participant_id, using the single leading underscore as a ‘weak “internal use” indicator’, and build up functionality from there:
class ParticipantID:
    def __init__(self, participant_id: str):
        self._participant_id = participant_id
Should the parameter here be called something like participant_id_str rather than participant_id?
That is, should we reserve calling things participant_id only for variables that are instances of ParticipantID?
I think there is merit in that, but at the same time, the _str suffix feels odd and the type hint indicates what data type is required.
A key limitation of using a str
to represent a participant ID is that we had no assurance that the contents of the variable actually matched the requirements for a valid participant ID.
Hence, we can add some verification logic to our object initialiser to ensure that an object is only created if it matches our specifications:
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
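One thing worth noting about the int-based check is that int accepts a leading sign, surrounding whitespace, and underscores between digits, so a string like "p-123" would pass all three checks. As a rough sketch of a stricter alternative, assuming a regular expression is acceptable here (PARTICIPANT_ID_PATTERN and is_valid_participant_id are hypothetical names, not part of the class above):

```python
import re

# Hypothetical pattern: a lowercase 'p' followed by exactly four ASCII digits
PARTICIPANT_ID_PATTERN = re.compile(r"p[0-9]{4}")


def is_valid_participant_id(candidate: str) -> bool:
    """Return True only if the whole string matches the 'pXXXX' format."""
    return PARTICIPANT_ID_PATTERN.fullmatch(candidate) is not None


# int() is more permissive than the format requires, so these all convert
for suffix in ("-123", "+123", " 123", "1_23"):
    int(suffix)  # no ValueError is raised

print(is_valid_participant_id("p0420"))  # True
print(is_valid_participant_id("a01"))  # False
print(is_valid_participant_id("p-123"))  # False
```

A check like this could replace the three separate tests in the initialiser, at the cost of less specific error messages.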
For our object to be useful, we need it to be able to interact with other data types and operators in the Python ecosystem.
The most relevant way to facilitate such interactions for our object is to make it convertible to a string.
We do that by defining a __str__ method, which allows access to the Unicode representation of the participant ID using the str function:
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
Given our definition of a participant ID, it might also be useful to be able to readily convert it into an integer (so something like int(ParticipantID("p1001")) would return 1001):
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __int__(self) -> int:
        return int(self._participant_id[1:])
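A quick usage sketch of the two conversions, using a trimmed-down copy of the class (validation left out to keep it short) and a made-up filename pattern:

```python
class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])


participant_id = ParticipantID("p0042")

print(str(participant_id))  # p0042
print(int(participant_id))  # 42
print(f"data_{participant_id}.csv")  # data_p0042.csv
```

Note that the zero-padding is lost in the integer form, so the string form is the one to use for anything like filenames.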
It might also be useful to be able to compare the equality of two participant IDs.
For our purposes, two participant IDs are equal if they have the same _participant_id value; they do not need to refer to the same location in memory.
We thus need a way of representing _participant_id in a way that facilitates comparison, which we do by implementing the __hash__ method using the hash function:
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __int__(self) -> int:
        return int(self._participant_id[1:])
    def __hash__(self) -> int:
        return hash(self._participant_id)
We can then implement an __eq__ method that compares the hashes of the objects:
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __int__(self) -> int:
        return int(self._participant_id[1:])
    def __hash__(self) -> int:
        return hash(self._participant_id)
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return hash(self) == hash(other)
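One caveat with comparing hashes is that equal hashes do not strictly guarantee equal values, even if a collision is very unlikely for strings this short. A possible variation, sketched here on a trimmed-down copy of the class (validation omitted), is to compare the stored strings directly and keep __hash__ for use in sets and dictionaries:

```python
class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        # Variation on the version above: compare the stored strings directly
        return self._participant_id == other._participant_id


first = ParticipantID("p1001")
second = ParticipantID("p1001")

print(first is second)  # False: two distinct objects
print(first == second)  # True: same underlying identifier
print(len({first, second}))  # 1: equal and hashable, so the set de-duplicates
```

Either way, objects that compare equal also have equal hashes, which is what sets and dictionary keys rely on.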
Note that we applied a constraint within __eq__ that it can only be compared with another ParticipantID object.
That means that a statement like ParticipantID("p1001") == "p1001" will return False.
This actually gets really tricky, and I could imagine it leading to some pretty horrible bugs; I’d be tempted to raise a warning if other is a str (and maybe also if it is an int).
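A rough sketch of what that warning might look like, using the standard warnings module on a trimmed-down copy of the class (validation omitted). The warning text and the choice to still return False are my assumptions rather than a settled design:

```python
import warnings


class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if isinstance(other, (str, int)):
            # Hypothetical addition: flag a comparison that is probably a mistake
            warnings.warn(
                f"Comparing a ParticipantID with a {type(other).__name__}",
                stacklevel=2,
            )
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return hash(self) == hash(other)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = ParticipantID("p1001") == "p1001"

print(result)  # False
print(len(caught))  # 1
```

The comparison still evaluates to False, but it no longer does so silently.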
We might also like to be able to sort a collection of participant IDs, probably mostly to be able to have them in a consistent order rather than there being anything inherently ordered about them.
To support sorting, we need to be able to indicate whether a participant ID is ‘less than’ another.
We do this by defining a __lt__ method that compares their integer representations:
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __int__(self) -> int:
        return int(self._participant_id[1:])
    def __hash__(self) -> int:
        return hash(self._participant_id)
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return hash(self) == hash(other)
    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)
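As a quick check that __lt__ alone is enough for the built-in sorted function, here is a sketch using a trimmed-down copy of the class (validation omitted) and some made-up identifiers:

```python
class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)


unsorted_ids = [ParticipantID(raw) for raw in ("p0420", "p0007", "p1001")]
sorted_ids = sorted(unsorted_ids)

print([str(pid) for pid in sorted_ids])  # ['p0007', 'p0420', 'p1001']
```

Operators such as <= and >= are still undefined here; if they were ever needed, functools.total_ordering could derive them from __eq__ and __lt__.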
The last ‘special’ method that we will define is for good Python form: a __repr__ method that returns a string representation of how the object can be constructed.
This is useful in interactive Python sessions, in particular.
class ParticipantID:
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __repr__(self) -> str:
        # Use the actual class name so that the output shows a valid constructor call
        return f'ParticipantID(participant_id="{self}")'
    def __int__(self) -> int:
        return int(self._participant_id[1:])
    def __hash__(self) -> int:
        return hash(self._participant_id)
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return hash(self) == hash(other)
    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)
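A handy property of a constructor-style __repr__ is that evaluating its output rebuilds an equal object. A small sketch of that round trip, on a trimmed-down copy of the class (validation omitted) and assuming the repr names the class as ParticipantID:

```python
class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __repr__(self) -> str:
        return f'ParticipantID(participant_id="{self}")'

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return self._participant_id == other._participant_id


participant_id = ParticipantID("p0420")

print(repr(participant_id))  # ParticipantID(participant_id="p0420")
print([participant_id])  # [ParticipantID(participant_id="p0420")]

# Evaluating the repr should rebuild an equal object
rebuilt = eval(repr(participant_id))
print(rebuilt == participant_id)  # True
```

The second print shows where this matters most in practice: containers display their elements using __repr__ rather than __str__.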
I thought about adding __truediv__ and __rtruediv__ methods so that it could easily interact with pathlib.Path operations (e.g., base_dir / participant_id), but I think that probably over-complicates things.
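For the curious, this is roughly what those two methods might look like. It is only a sketch on a trimmed-down copy of the class (validation omitted); the decision to return a Path and to accept only str or os.PathLike operands is my assumption:

```python
import os
from pathlib import Path


class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __truediv__(self, other: object) -> Path:
        # Handles `participant_id / "session_1"`
        if not isinstance(other, (str, os.PathLike)):
            return NotImplemented
        return Path(str(self)) / other

    def __rtruediv__(self, other: object) -> Path:
        # Handles `base_dir / participant_id`, reached after Path declines the operand
        if not isinstance(other, (str, os.PathLike)):
            return NotImplemented
        return Path(other) / str(self)


base_dir = Path("data")
participant_id = ParticipantID("p0420")

print(base_dir / participant_id == Path("data", "p0420"))  # True
print(participant_id / "session_1" == Path("p0420", "session_1"))  # True
```

Writing base_dir / str(participant_id) achieves the same thing without any extra methods, which is part of why I left them out.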
We will make one final change to our object definition.
Given that we know that our object will only store the _participant_id attribute, we can enforce this by defining a __slots__ class variable.
This prevents the dynamic assignment of additional attributes and has memory and speed benefits (see the slots documentation for more information).
class ParticipantID:
    __slots__ = ("_participant_id",)
    def __init__(self, participant_id: str):
        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")
        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        try:
            _ = int(participant_id[1:])
        except ValueError:
            raise ValueError(
                "The last four digits in the participant ID must represent an integer"
            )
        self._participant_id = participant_id
    def __str__(self) -> str:
        return self._participant_id
    def __repr__(self) -> str:
        # Use the actual class name so that the output shows a valid constructor call
        return f'ParticipantID(participant_id="{self}")'
    def __int__(self) -> int:
        return int(self._participant_id[1:])
    def __hash__(self) -> int:
        return hash(self._participant_id)
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return hash(self) == hash(other)
    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)
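To see the effect of __slots__, here is a small sketch on a trimmed-down copy of the class (validation and the other methods omitted). The age attribute is made up purely to trigger the error:

```python
class ParticipantID:
    """Trimmed-down version of the class above (only the slot and initialiser)."""

    __slots__ = ("_participant_id",)

    def __init__(self, participant_id: str):
        self._participant_id = participant_id


participant_id = ParticipantID("p0420")

slot_error = None
try:
    participant_id.age = 30  # hypothetical attribute, not listed in __slots__
except AttributeError as error:
    slot_error = error

print(type(slot_error).__name__)  # AttributeError
print(hasattr(participant_id, "__dict__"))  # False
```

Without a per-instance __dict__, there is simply nowhere for an unexpected attribute to be stored.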
Having defined our ParticipantID object and implemented its functionality, we can now turn again to our example function that will use the information and update its type hint:
def some_sort_of_analysis(participant_id: ParticipantID) -> None:
    # some functionality using `participant_id` ...
    ...
Within this function, we can thus use participant_id with confidence (well, assuming that we are using type checking) that it matches the participant ID format that we have defined and that its functionality is limited to operations that are sensible in our context.
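In practice, that means converting a raw string into a ParticipantID once, at the point where it enters the program, and passing the object around from then on. A minimal sketch, using a trimmed-down copy of the class (validation omitted) and a placeholder function body that I have invented so that there is something to print:

```python
class ParticipantID:
    """Trimmed-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id


def some_sort_of_analysis(participant_id: ParticipantID) -> str:
    # Placeholder body (hypothetical): just report which participant was received
    return f"analysing {participant_id}"


# Convert the raw string once, at the point where it enters the program
raw_value = "p0420"
print(some_sort_of_analysis(ParticipantID(raw_value)))  # analysing p0420

# A type checker should flag the call below as an incompatible argument type,
# although it would still run because hints are not checked at runtime:
# some_sort_of_analysis(raw_value)
```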
Although it has required quite a bit of code to construct our ParticipantID class, I have found the confidence that it brings to often be worth it.