Using custom data types in Python: reflections on the use of type hints

2023-Nov-05

Here, I discuss a benefit I have found after adopting type hinting in Python: thinking more about data type appropriateness, with an example application to participant identifiers in a research data context.

An important change to my Python programming approach this year has been the comprehensive use of type hints. Although type hints are not checked at runtime, I have found routinely running a type checker like mypy to be very valuable in picking up errors, unanticipated edge cases, and general sloppiness. More unexpected, to me, has been how much using type hints has changed the way I structure my code.

For example, my code often requires the representation of participant identifiers—that is, identifiers that allow research data to be associated with particular human participants. There are no field-level conventions for the format of participant IDs, but they are typically either short numeric or alphanumeric sequences if the research involves a physical component (e.g., recording IDs on paper) or UUIDs if the interaction is only ever digital. My preference is for participant IDs to be five-character sequences beginning with a lowercase p and ending with an integer that is zero-padded to a length of four characters; for example, p0420 would be a valid participant ID under my system whereas a01 would be invalid.

Previously, I would have represented a participant ID in a variable with a str type—which is the built-in data type in Python for representing Unicode strings. For example, a function using a participant ID might look something like:

def some_sort_of_analysis(participant_id: str) -> None:
    # some functionality using `participant_id` ...
    ...

However, one benefit I have found of adding type hints is that it motivates me to think more carefully about the types of my variables. Is a str really the most appropriate type for participant_id? There are two main ways in which I think str is inappropriate here:

It doesn’t enforce any of the known constraints. The flexibility of the str type means that it will happily represent sequences that we know cannot correspond to participants, as we have defined them. For example, "a01" is a perfectly valid str but an invalid participant identifier, according to our definition.
It permits nonsensical operations. A variable of type str supports a large number of operations and methods, almost all of which do not make sense for participant identifiers. For example, it doesn’t make sense to add two participant IDs together, or to select individual characters in a participant ID, or to reverse the characters, etc.
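
To make the two objections concrete, here is a short illustration using nothing but plain strings. The variable names are made up for the example.

```python
# A plain `str` accepts values that cannot be participant IDs under the
# 'pXXXX' scheme described above ...
not_a_participant: str = "a01"

# ... and it supports operations that are meaningless for identifiers.
first_id = "p0420"
second_id = "p0421"

concatenated = first_id + second_id  # "p0420p0421"
reversed_id = first_id[::-1]         # "0240p"
shouted = first_id.upper()           # "P0420"

print(not_a_participant, concatenated, reversed_id, shouted)
```

None of these lines raises an error or a type checker complaint, which is exactly the problem.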

I feel like there is a connection here between my finding the above objectionable and my shift towards Bayesian data analysis. It reminds me of the heuristic of checking a model by drawing prior-predictive samples and seeing whether the resulting values make sense within the particular domain of interest.

Rather than using the built-in str data type, we can use our own custom type by defining a new class called ParticipantID. We can start with a simple definition, in which we accept a participant ID as a string and store it within the attribute _participant_id, using the single leading underscore as a ‘weak “internal use” indicator’, and build up functionality from there:

class ParticipantID:
    def __init__(self, participant_id: str):
        self._participant_id = participant_id

Should the parameter here be called something like participant_id_str rather than participant_id? That is, should we reserve the name participant_id for variables that are instances of ParticipantID? I think there is merit in that, but at the same time the _str suffix feels odd and the type hint already indicates what data type is required.

A key limitation of using a str to represent a participant ID is that we had no assurance that the contents of the variable actually matched the requirements for a valid participant ID. Hence, we can add some verification logic to our object initialiser to ensure that an object is only created if it matches our specifications:

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id
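
As an alternative sketch (not the approach used in this post), the separate checks could be collapsed into a single regular expression. The pattern name and the is_accepted helper below are hypothetical, introduced only for illustration. The pattern is strict about the suffix: it rejects strings such as "p-123", which a check based solely on converting the suffix with int() would let through.

```python
import re

# Assumed pattern: a lowercase 'p' followed by exactly four ASCII digits.
_PARTICIPANT_ID_PATTERN = re.compile(r"p[0-9]{4}")


class ParticipantID:
    def __init__(self, participant_id: str):
        if _PARTICIPANT_ID_PATTERN.fullmatch(participant_id) is None:
            raise ValueError(
                f"Participant ID must be in the form 'pXXXX'; got {participant_id!r}"
            )
        self._participant_id = participant_id


def is_accepted(candidate: str) -> bool:
    """Hypothetical helper: report whether `candidate` passes validation."""
    try:
        ParticipantID(candidate)
    except ValueError:
        return False
    return True


print(is_accepted("p0420"))  # True
print(is_accepted("a01"))    # False
print(is_accepted("p-123"))  # False
```

The trade-off is a single generic error message in place of a specific message for each failed check.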

For our object to be useful, it needs to be able to interact with other data types and operators in the Python ecosystem. The most relevant way to facilitate such interactions is to make it convertible to a string. We do that by defining a __str__ method, which gives access to the string form of the participant ID via the str function:

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

Given our definition of a participant ID, it might also be useful to be able to readily convert it into an integer (so something like int(ParticipantID("p1001")) would return 1001):

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])
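
A brief usage sketch of the two conversions. It uses a cut-down copy of the class, with validation left out to keep the example short.

```python
class ParticipantID:
    """Cut-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])


participant_id = ParticipantID("p0042")

# `str()` and f-strings both go through `__str__`
label = f"Data for participant {participant_id}"

# `int()` goes through `__int__`, dropping the prefix and the zero-padding
number = int(participant_id)

# Round trip: rebuild the string form from the integer
rebuilt = f"p{number:04d}"

print(label)    # Data for participant p0042
print(number)   # 42
print(rebuilt)  # p0042
```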

It might also be useful to be able to compare two participant IDs for equality. For our purposes, two participant IDs are equal if they have the same _participant_id value; they do not need to refer to the same location in memory. A class that defines its own equality should also define a matching __hash__ method (otherwise Python makes its instances unhashable, and they cannot be used in sets or as dictionary keys), so we first implement __hash__ by applying the hash function to the underlying string:

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __hash__(self) -> int:
        return hash(self._participant_id)
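
As a small aside on how hashing and equality interact in Python, the sketch below uses two hypothetical toy classes (not part of the participant ID code). A class that defines __eq__ without __hash__ cannot be placed in a set, whereas one that defines both, consistently, is de-duplicated by value.

```python
class WithEqOnly:
    """Hypothetical class: defines `__eq__` but not `__hash__`."""

    def __init__(self, value: str):
        self.value = value

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, WithEqOnly):
            return NotImplemented
        return self.value == other.value


class WithEqAndHash(WithEqOnly):
    """Hypothetical class: restores hashing, consistent with `__eq__`."""

    def __hash__(self) -> int:
        return hash(self.value)


try:
    {WithEqOnly("p0420")}
    eq_only_is_hashable = True
except TypeError:
    eq_only_is_hashable = False

unique = {WithEqAndHash("p0420"), WithEqAndHash("p0420"), WithEqAndHash("p0421")}

print(eq_only_is_hashable)  # False
print(len(unique))          # 2
```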

We can then implement an __eq__ method that defines when two participant IDs are equal:

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        # compare the stored strings; equal hashes alone do not guarantee equality
        return self._participant_id == other._participant_id

Note that __eq__ only compares against another ParticipantID object; for anything else it returns NotImplemented, at which point Python tries the other operand's comparison and, failing that, falls back to an identity check. That means that a statement like ParticipantID("p1001") == "p1001" will evaluate to False. This could get really tricky, and I could imagine it leading to some pretty horrible bugs; I’d be tempted to raise a warning if other is a str (and maybe also if it is an int).
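
A minimal sketch of what such a warning might look like, assuming that any comparison against a raw str or int is probably a mistake. The warning text is made up, and this cut-down class compares the stored strings directly.

```python
import warnings


class ParticipantID:
    """Cut-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if isinstance(other, (str, int)):
            # Assumption: comparing against a raw str or int is likely a bug
            warnings.warn(
                f"Comparing a ParticipantID with a {type(other).__name__}",
                stacklevel=2,
            )
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return self._participant_id == other._participant_id


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    same_type = ParticipantID("p1001") == ParticipantID("p1001")
    mixed_type = ParticipantID("p1001") == "p1001"

print(same_type)    # True
print(mixed_type)   # False
print(len(caught))  # 1
```

The comparison still evaluates to False, but it no longer does so silently. Because Python falls back to the reflected method, the warning would also fire for "p1001" == ParticipantID("p1001").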

We might also like to be able to sort a collection of participant IDs, probably mostly to be able to have them in a consistent order rather than there being anything inherently ordered about them. To support sorting, we need to be able to indicate whether a participant ID is ‘less than’ another. We do this by defining a __lt__ method that compares their integer representations:

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        # compare the stored strings; equal hashes alone do not guarantee equality
        return self._participant_id == other._participant_id

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)
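
A short usage sketch of what __lt__ alone does and does not provide, again with a cut-down copy of the class.

```python
class ParticipantID:
    """Cut-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)


unsorted_ids = [ParticipantID("p0420"), ParticipantID("p0007"), ParticipantID("p1001")]

# `sorted` (and `list.sort`) only rely on `<`
sorted_labels = [str(pid) for pid in sorted(unsorted_ids)]
print(sorted_labels)  # ['p0007', 'p0420', 'p1001']

# `>` also works because Python falls back to the reflected `__lt__` ...
is_greater = ParticipantID("p1001") > ParticipantID("p0420")
print(is_greater)  # True

# ... but `<=` and `>=` are not defined by `__lt__` alone
try:
    ParticipantID("p0007") <= ParticipantID("p0420")
    le_supported = True
except TypeError:
    le_supported = False
print(le_supported)  # False
```

If the full set of comparison operators were wanted, the standard library's functools.total_ordering decorator could derive them from __lt__ and __eq__, though for sorting alone that is not needed.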

The last ‘special’ method that we will define is for good Python form: a __repr__ method that returns a string representation of how the object can be constructed. This is useful in interactive Python sessions, in particular.

class ParticipantID:
    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __repr__(self) -> str:
        return f'ParticipantID(participant_id="{self}")'

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        # compare the stored strings; equal hashes alone do not guarantee equality
        return self._participant_id == other._participant_id

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)

I thought about adding __truediv__ and __rtruediv__ methods so that it could easily interact with pathlib.Path operations (e.g., base_dir / participant_id), but I think that probably over-complicates things.
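
For the curious, here is a rough sketch of what those two methods might look like. It is not part of the final class, and the directory and file names are hypothetical.

```python
from pathlib import Path, PurePath


class ParticipantID:
    """Cut-down version of the class above (validation omitted for brevity)."""

    def __init__(self, participant_id: str):
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __truediv__(self, other: object) -> PurePath:
        # Supports `participant_id / "session_1"`
        if not isinstance(other, (str, PurePath)):
            return NotImplemented
        return PurePath(str(self)) / other

    def __rtruediv__(self, other: object) -> PurePath:
        # Supports `base_dir / participant_id`: the path's own `/` declines
        # the unknown operand, so Python tries this reflected method
        if not isinstance(other, PurePath):
            return NotImplemented
        return other / str(self)


base_dir = Path("data")  # hypothetical directory name
participant_dir = base_dir / ParticipantID("p0420")
session_path = ParticipantID("p0420") / "session_1"

print(participant_dir.as_posix())  # data/p0420
print(session_path.as_posix())     # p0420/session_1
```

Writing base_dir / str(participant_id) achieves the same thing with no extra methods, which is part of why leaving them out seems reasonable.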

We will make one final change to our object definition. Given that we know that our object will only store the _participant_id attribute, we can enforce this by defining a __slots__ class variable. This prevents the dynamic assignment of additional attributes and has memory and speed benefits (see the slots documentation for more information).

class ParticipantID:

    __slots__ = ("_participant_id",)

    def __init__(self, participant_id: str):

        if not participant_id.startswith("p"):
            raise ValueError("Participant ID must start with 'p'")

        if len(participant_id) != 5:
            raise ValueError("Participant ID must be in the form 'pXXXX'")

        # `int()` alone would also accept signs, whitespace, and underscores
        if not (participant_id[1:].isascii() and participant_id[1:].isdigit()):
            raise ValueError(
                "The last four characters in the participant ID must be digits"
            )

        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id

    def __repr__(self) -> str:
        return f'ParticipantID(participant_id="{self}")'

    def __int__(self) -> int:
        return int(self._participant_id[1:])

    def __hash__(self) -> int:
        return hash(self._participant_id)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        # compare the stored strings; equal hashes alone do not guarantee equality
        return self._participant_id == other._participant_id

    def __lt__(self, other: object) -> bool:
        if not isinstance(other, ParticipantID):
            return NotImplemented
        return int(self) < int(other)
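
A quick sketch of the effect of __slots__, using a cut-down copy of the class. The misspelled attribute name is deliberate.

```python
class ParticipantID:
    """Cut-down version of the class above (validation omitted for brevity)."""

    __slots__ = ("_participant_id",)

    def __init__(self, participant_id: str):
        self._participant_id = participant_id


participant_id = ParticipantID("p0420")

try:
    # Without `__slots__`, a typo like this would silently create a new attribute
    participant_id.participant_idd = "p0421"
    assignment_allowed = True
except AttributeError:
    assignment_allowed = False

has_instance_dict = hasattr(participant_id, "__dict__")

print(assignment_allowed)  # False
print(has_instance_dict)   # False
```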

Having defined our ParticipantID object and implemented its functionality, we can now turn again to our example function that will use the information and update its type hint:

def some_sort_of_analysis(participant_id: ParticipantID) -> None:
    # some functionality using `participant_id` ...
    ...

Within this function, we can thus use participant_id with confidence (well, assuming that we are using type checking) that it matches the participant ID format that we have defined and that its functionality is limited to operations that are sensible in our context. Although it has required quite a bit of code to construct our ParticipantID class, I have found that the confidence it brings is often worth it.
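
One way this might play out in practice is to convert raw strings into ParticipantID objects once, at the point where data enters the program, so that everything downstream can rely on the type. The sketch below assumes a hypothetical parse_participant_ids helper and a placeholder analysis body; the class is cut down to validation and __str__.

```python
class ParticipantID:
    """Cut-down version of the class above: validation and `__str__` only."""

    def __init__(self, participant_id: str):
        suffix = participant_id[1:]
        if not (
            participant_id.startswith("p")
            and len(participant_id) == 5
            and suffix.isascii()
            and suffix.isdigit()
        ):
            raise ValueError("Participant ID must be in the form 'pXXXX'")
        self._participant_id = participant_id

    def __str__(self) -> str:
        return self._participant_id


def parse_participant_ids(raw_ids: list[str]) -> list[ParticipantID]:
    """Hypothetical helper: validate raw strings once, where data enters."""
    return [ParticipantID(raw_id.strip()) for raw_id in raw_ids]


def some_sort_of_analysis(participant_id: ParticipantID) -> str:
    # Placeholder body: no need to re-check the format here
    return f"analysed {participant_id}"


# Hypothetical raw input, e.g. a column read from a spreadsheet
for participant_id in parse_participant_ids(["p0420", " p1001\n"]):
    print(some_sort_of_analysis(participant_id))
```

Any malformed value raises a ValueError at the parsing step, well before the analysis code runs.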