Python disrememberings

There are a few things in Python that I struggle to remember — each time, I need to consult the documentation or do an internet search (typically finding pages that I have previously visited). Here, I go through some of these disrememberings in the hope that I will remember them better — or, at least, I will have a place to go when looking for the answers.

Iterables and iterators

As defined in the Python documentation, an iterable is “an object capable of returning its members one at a time”. Perhaps most notably, an iterable can be used in a for loop to iterate over the contents of a variable.

In the context of a custom object, an iterable can be created by specifying a __getitem__ method. This allows the members of the object to be indexed via the square-brackets notation (e.g., demo_iterable[2]) and automatically provides support for iteration. For example, this defines a valid iterable that could be used in a for loop:

import typing

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)

    def __getitem__(self, key: int) -> int:
        return self._items[key]
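As a quick check of that claim, the class really does support a for loop despite having no __iter__ method; iteration falls back to calling __getitem__ with 0, 1, 2, … until an IndexError is raised (the class is repeated here so the sketch runs on its own):

```python
import typing

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)

    def __getitem__(self, key: int) -> int:
        # Raises IndexError past the end, which terminates the for loop.
        return self._items[key]

demo = DemoIterable([1, 2, 3])
collected = [item for item in demo]  # iteration works with no __iter__
```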

However, a problem with this approach is that mypy (a static type checker) will complain that the object is not iterable. Although it is iterable in practice, this __getitem__-based protocol is difficult to detect via static type checking.

The alternative approach is to provide an __iter__ method that returns an iterator. An iterator is defined as an object that conforms to the iterator protocol by having __iter__ and __next__ methods. The __next__ method needs to return the next item or raise StopIteration if it is finished.

We could do this by making the object itself an iterator, meeting the requirements specified above:

import typing

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)
        self._curr_index = 0

    def __iter__(self) -> typing.Iterator[int]:
        return self

    def __next__(self) -> int:
        try:
            curr_item = self._items[self._curr_index]
        except IndexError:
            raise StopIteration()
        self._curr_index += 1
        return curr_item

A tricky issue with the above, noted in the documentation, relates to the fact that iterators are exhausted when they have completed their iteration; that is, they can only be iterated over once and will appear to be empty on subsequent attempts at iteration. Because the instance is both an iterable and an iterator in this approach, the instance is similarly exhausted. This can produce the unexpected behaviour that a second pass through iteration of an instance has no effect, since it has been exhausted by the first pass — potentially causing very annoying bugs. We could get around this by resetting the current item index when starting a new iteration:

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)
        self._curr_index = 0

    def __iter__(self) -> typing.Iterator[int]:
        self._curr_index = 0
        return self

    def __next__(self) -> int:
        try:
            curr_item = self._items[self._curr_index]
        except IndexError:
            raise StopIteration()
        self._curr_index += 1
        return curr_item

I don’t know of any way for a consumer of an iterable to know whether it will or will not be exhausted by iteration, without explicitly checking. A good reminder to be careful whenever iterating over a variable multiple times!
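This pitfall is easy to demonstrate with the built-in iter function alone, independent of the class above (a minimal sketch):

```python
items = [1, 2, 3]

# A list is an iterable: each pass (for loop or list() call) gets a
# fresh iterator, so repeated passes see all the items.
first = list(items)   # [1, 2, 3]
second = list(items)  # [1, 2, 3]

# An iterator, by contrast, is exhausted after a single pass.
it = iter(items)
first_pass = list(it)   # [1, 2, 3]
second_pass = list(it)  # [] -- the iterator is already exhausted
```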

However, this approach is in general unnecessarily complex for this object because it doesn’t itself need to be an iterator — it just needs to define an __iter__ method that returns an iterator. To do so, we could take advantage of the fact that generators are iterators:

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)

    def __iter__(self) -> typing.Iterator[int]:
        yield from self._items

But this is again more complex than we need in this circumstance. We can instead provide an __iter__ method that returns the iterator for our underlying list, using the iter built-in function:

class DemoIterable:

    def __init__(self, items: typing.Sequence[int]):
        self._items = list(items)

    def __iter__(self) -> typing.Iterator[int]:
        return iter(self._items)

Using typing.overload

The main thing I need to remember when using typing.overload is that it allows you to express correspondences between parameter types and return types — it is not about being able to execute different code depending on parameter types.

For example, say you have a function my_func that has a single parameter param that can be either a str or an int, and that returns either a str or None. It could be written as:

def my_func(param: str | int) -> str | None:
    pass

Let’s additionally assume that we know that if param is a str, then the return value is a str — and that if param is an int, then the return value is None. As written, our type annotation is unable to capture that relationship.

Instead, we can use typing.overload to express the correspondences:

import typing

@typing.overload
def my_func(param: str) -> str: ...
@typing.overload
def my_func(param: int) -> None: ...
def my_func(param: str | int) -> str | None:
    pass

This will allow the type checker to appropriately set the type of the value returned from calls to my_func.

What if you do want to be able to write type-dependent computation? Python has a built-in way to do this for a single parameter in functools.singledispatch.
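As a sketch of that, functools.singledispatch selects an implementation based on the type of the first argument (the describe function and its messages here are made up for illustration):

```python
import functools

@functools.singledispatch
def describe(value: object) -> str:
    # Fallback for any type without a registered implementation.
    return f"something else: {value!r}"

@describe.register
def _(value: str) -> str:
    # Selected when the first argument is a str.
    return f"a string of length {len(value)}"

@describe.register
def _(value: int) -> str:
    # Selected when the first argument is an int.
    return f"the integer {value}"

str_desc = describe("abc")
int_desc = describe(7)
other_desc = describe(3.5)  # falls through to the object implementation
```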

Restricting the values of an argument to a function parameter

Say you are defining a function my_func which has a single parameter arg. The function only supports passing an argument for arg that is one of a known set of values; for example, arg could be the string "a" or the string "b".

While the arg parameter could be typed as arg: str and then validated inside the function (e.g., checking arg in ["a", "b"]), you instead want to use the type system to impose the validation.

Something that always looks appealing when I encounter this situation is to use typing.Literal to encode the constraint. For example:

import typing

def my_func(arg: typing.Literal["a", "b"]) -> None:
    pass

This can even seem to work well with some testing:

my_func(arg="a")  # ok
my_func(arg="c")  # type check error

Unfortunately — and this is the part I always forget — this approach does not work when using a variable with a valid value:

arg_val = "a"
my_func(arg=arg_val)  # type check error

This does make sense — the type of variable arg_val (str) is not correct, even though its value matches the value of an expected type.
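One partial remedy, worth noting as an aside, is to give the variable itself a Literal type, either with an explicit annotation or (as I understand mypy's inference) via typing.Final; this only helps when the value is fixed at the definition site, though — a string arriving at runtime still will not type check:

```python
import typing

def my_func(arg: typing.Literal["a", "b"]) -> None:
    pass

# Annotating the variable preserves its Literal type for the checker.
arg_val: typing.Literal["a"] = "a"
my_func(arg=arg_val)  # ok

# mypy also infers a Literal type for Final variables.
other_val: typing.Final = "b"
my_func(arg=other_val)  # ok
```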

That still leaves us with the problem of how to specify the type to achieve the constraint. One way to do this is to define a different sort of type — an Enum — that can only take on the allowed values:

import typing
import enum

class Arg(enum.Enum):
    A = "a"
    B = "b"

def my_func(arg: Arg) -> None:
    pass

This gives us some flexibility in passing an allowed value, as long as we first pass it through the Arg constructor (at which point it would fail if an invalid value was provided):

my_func(arg=Arg.A)  # ok

arg_val = Arg.A
my_func(arg=arg_val)  # ok

my_func(arg=Arg("a")) # ok

arg_val_str = "a"
my_func(arg=Arg(arg_val_str))  # ok

This feels rather unsatisfactory, though, in that we can no longer simply pass a string with a valid value to the function:

my_func(arg="a")  # type check error

This seems to violate the general principle of being liberal with the accepted argument types (and strict with the annotated return types). Perhaps a rule-of-thumb is how ‘user-facing’ the function is; if it is a function that is expected to be called outside of the library or application, then maybe err on the side of accepting a string and then validating — if only being called internally, it seems better to require the argument to be of the Enum type.

It feels like there should be a better way to do this!

Handling boolean values in argparse

I quite like argparse, which is the built-in Python package for handling command-line arguments. I particularly appreciate its functionality to handle parameters that can be either true or false, but I often struggle to remember the syntax.

As an example of handling boolean values in argparse, say you have a command-line program (cli.py) with a parameter verbose that controls the amount of log-style output that is produced. You can set it up like this:

import argparse

parser = argparse.ArgumentParser()

parser.add_argument(
    "--verbose",
    default=False,
    action=argparse.BooleanOptionalAction,
)

args = parser.parse_args()

The benefit of using action=argparse.BooleanOptionalAction (available since Python 3.9) is that the program can now be called using either cli.py --verbose or cli.py --no-verbose.
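The behaviour can be checked without invoking the script from a shell, because parse_args accepts an explicit list of arguments (a small sketch):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--verbose",
    default=False,
    action=argparse.BooleanOptionalAction,
)

# parse_args takes an explicit argv list, which is handy for testing.
on_args = parser.parse_args(["--verbose"])
off_args = parser.parse_args(["--no-verbose"])
default_args = parser.parse_args([])  # falls back to default=False
```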

The design of effective command-line interfaces is tricky — the Command Line Interface Guidelines is a very useful resource.