Labelled multi-dimensional arrays for representing research data
Here, I work through an example of converting tabular data from a CSV file into a labelled multi-dimensional array data structure (using the xarray Python package).
Research data are often represented in tabular form as a set of rows and columns, similar to a spreadsheet, and stored in a comma-separated value (CSV) file.
However, a multi-dimensional structure can also provide a powerful and intuitive data representation for many research contexts.
In the Python ecosystem, numpy (Harris et al., 2020) provides a wonderful way of computing with multi-dimensional arrays, and the xarray package (Hoyer & Hamman, 2017) builds on numpy to allow multi-dimensional arrays to have labelled dimensions and additional metadata.
Here, I aim to walk through my process of converting an example dataset from a tabular CSV file into a multi-dimensional xarray data representation.
I will use the data from the recent study by Graham & Hobaiter (2023) titled “Towards a great ape dictionary: Inexperienced humans understand common nonhuman ape gestures”.
I chose this dataset because its associated article was in the media when I was preparing this post, it is open-access, it is a relatively large dataset with a few interesting features and quirks, it is a bit of a different sort of perception study, and I thought it might be fun to work through a Bayesian analysis in a future post.
It is unfortunately still not as common as it should be for the raw data to be available with a published article—thanks to the authors for doing so here!
Briefly, the experiment involved asking human participants to identify which of four potential ‘meanings’ were associated with a video of an ape making a gesture. There were 10 different gesture types and 2 different species (chimpanzee and bonobo), and each participant completed the 20 trials corresponding to all combinations of these two conditions. Participants either saw just the video or the video with an additional line of contextual information, in a between-subjects manipulation. The key experimental question of interest concerned the accuracy with which participants could correctly identify the meaning that was associated with each gesture.
Accessing the raw data
The authors have made the data available in a repository on Zenodo, so we could just download it from there and put it in a known location on our file system.
Here I am going to instead experiment with using the pooch package (Uieda et al., 2020) to download and manage the data. The advantage of this approach is that we make it explicit in the code where the raw data are coming from, and the hash-based integrity verification in pooch means that we only download the file when necessary and can be confident that it matches our expectations.
We can access the raw data in this fashion via:
import pooch

def get_raw_data_path():
    url = (
        "https://zenodo.org/record/7352326/files/"
        + "Wild-Minds/GreatApeDictionary-v1.1.zip"
    )
    raw_data_path = pooch.retrieve(
        url=url, known_hash="md5:8c40d1bce25619548f4daa16d63f36a3"
    )
    return raw_data_path
Ideally, we would use the functionality in pooch to download using a DOI rather than an HTTPS URL. However, the current version of pooch (at the time of writing) has a problem with accessing files in some repositories; I’ve opened an issue.
The zip file containing the raw data will then be downloaded to a cache directory (if not already present), and its location returned in the raw_data_path variable.
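For example, calling the function gives us the local path to the cached download (the exact cache location and file name depend on the operating system and on how pooch is configured):

raw_data_path = get_raw_data_path()
# the file sits in pooch's cache directory, with a name derived from the URL
print(raw_data_path)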
Reading the raw tabular data
Now, we will define a get_rows function that reads each row of the CSV file that is contained within the raw data archive. We define it as a generator function, which means that it contains a yield statement within it. This makes it behave as an iterator, returning one row at a time to its caller when iterated over—which is really neat for this use case.
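As a minimal illustration of the pattern, consider this toy count_up generator (a hypothetical example, not part of the analysis code):

def count_up(n):
    # execution pauses at each yield until the caller asks for the next value
    for i in range(1, n + 1):
        yield i

counter = count_up(3)
print(next(counter))  # 1
print(next(counter))  # 2
print(list(counter))  # [3] (the values that remain)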
The row that is returned has been parsed by the built-in csv module and is in the form of a dictionary where the keys are the column names and the values are the associated column values.

Most of the code in the get_rows function handles the parsing of the zip file that is necessary to get to the CSV file. Well, zip files plural, since the CSV file is within a zip file within the zip file. See my previous post, Benefits of storing and accessing research data in zip files, for more on processing zip files in Python.
import zipfile
import csv
import io

def get_rows():
    raw_data_path = get_raw_data_path()
    with zipfile.ZipFile(file=raw_data_path) as outer_zip:
        # path of the zip file containing the CSV, within the outer zip file
        csv_zip_path = "Wild-Minds-GreatApeDictionary-4ed8d77/GAD_FullDataset.csv.zip"
        with zipfile.ZipFile(file=outer_zip.open(name=csv_zip_path)) as csv_zip:
            # name of the CSV file within the inner zip file
            csv_path = "GAD_FullDataset.csv"
            with csv_zip.open(csv_path) as csv_handle:
                # wrap the binary handle so that csv receives text
                reader = csv.DictReader(io.TextIOWrapper(csv_handle))
                for row in reader:
                    yield row
We can have a look at its representation of the first data row by running next(get_rows()) in an interactive session:
>>> next(get_rows())
{'Participant Private ID': '83088',
 'Trial Number': '17',
 'UTC Date': '21/08/2017 02:25',
 'Local Timezone': '-4',
 'Condition': 'Gesture',
 'Reaction Time': '4208.64',
 'Reaction Time (M)': '0.070144',
 'Response': 'Move into a new position',
 'Attempt': '1',
 'Success': '0',
 'Spreadsheet Name': 'Task8',
 'Response1': 'Climb on my back',
 'Response2': 'Groom me',
 'Response3': 'Carry me',
 'Response4': 'Move into a new position',
 'Video': '<iframe src="https://player.vimeo.com/video/214671788" width="640" '
          'height="360" frameborder="0" webkitallowfullscreen '
          'mozallowfullscreen allowfullscreen></iframe>',
 'Meaning': 'Groom me',
 'Image': 'ArRaBo1.png',
 'Gesture': 'Arm Raise',
 'Ambiguous': '1',
 'Species': 'Bonobo',
 'Context': 'N/A',
 '': ''}
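Because get_rows is a generator, we can also peek at a few rows without reading the whole file into memory; for example, using itertools.islice (a small usage sketch):

import itertools

# print a few fields from the first three rows only
for row in itertools.islice(get_rows(), 3):
    print(row["Gesture"], row["Species"], row["Success"])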
Building a multi-dimensional representation of a single trial
Our next step is to use the raw row data to build up the ‘atoms’ of the multi-dimensional representation. To do so, we need to think about the form of multi-dimensional data structure that best matches the characteristics of our data.
In this example, the key data that we wish to represent are the correctness of the participant responses across trials. To decide how to structure such data, I often find it useful to think about the experimental design and identify factors that are crossed, in that each level of one factor is present in combination with each level of another factor; these often correspond to the within-subject or repeated measures factors. In this example, each participant completed trials involving a particular gesture (from a set of 10 gestures) from one of two species. Importantly, each participant completed trials involving all 20 possible combinations of gesture and species.
Hence, one potential form of data representation is to have three dimensions: participant, gesture, and species. With there being data from 5,656 participants in the raw data, that means our data will have 5,656 × 10 × 2 values.
As mentioned previously, participants were randomly allocated to either view only the video or the video with additional contextual information on each trial—a between-subjects manipulation. Could this ‘context’ factor also be a dimension in the data representation? Potentially, yes—particularly if there are the same number of participants in the two conditions. In that case, a natural structure might be 2 (context) × 2,828 (context participant) × 10 (gesture) × 2 (species). Even if they were not balanced, they could still be represented as separate dimensions—with the expectation that not all of the volume will be filled. However, I tend to prefer using one dimension for all participants and handling between-subjects manipulations in the way that I will expand on below.
We can think of each row in the raw data as relating to a single point in this three-dimensional space.
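To make the target structure concrete, here is a minimal sketch (not part of the final pipeline) of an empty array with these three named dimensions:

import numpy as np
import xarray as xr

# an empty participant x gesture x species array
skeleton = xr.DataArray(
    data=np.full(shape=(5656, 10, 2), fill_value=np.nan),
    dims=("ppant", "gesture", "species"),
)
print(skeleton.sizes)  # Frozen({'ppant': 5656, 'gesture': 10, 'species': 2})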
Our goal now is to create an xarray data structure from an individual row:
def form_trial_data(row):
    return trial_data
We will start by creating a variable to describe the position of the datum on each of the three dimensions. These positions are referred to as coordinates, and each coordinate is represented as a one-dimensional numpy array (using the helpful ndmin argument to indicate that we want a one-dimensional array rather than a scalar).
import numpy as np

def form_trial_data(row):
    dim_coords = {
        "ppant": np.array(row["Participant Private ID"], ndmin=1),
        "gesture": np.array(row["Gesture"].lower(), ndmin=1),
        "species": np.array(row["Species"].lower(), ndmin=1),
    }
    dims = tuple(dim_coords.keys())
    n_dims = len(dims)
    return trial_data
In addition to these “dimensional” coordinates, we can also identify “non-dimensional” coordinates. These record the position of the datum along axes that do not necessarily correspond to the nominated dimensions of the data. For example, each row has a particular ‘trial number’ that describes the order with which the participant completed the particular combination of gesture and species (so an integer between 1 and 20). We can thus label each combination of participant, gesture, and species with a trial number. We will create a new variable to handle these non-dimensional coordinates:
import numpy as np

def form_trial_data(row):
    dim_coords = {
        "ppant": np.array(row["Participant Private ID"], ndmin=1),
        "gesture": np.array(row["Gesture"].lower(), ndmin=1),
        "species": np.array(row["Species"].lower(), ndmin=1),
    }
    dims = tuple(dim_coords.keys())
    n_dims = len(dims)
    nondim_coords = {
        "trial_num": (
            ("ppant", "gesture", "species"),
            np.array(int(row["Trial Number"]), ndmin=3),
        ),
    }
    return trial_data
Notice that a non-dimensional coordinate is specified by first giving a tuple of the dimensions that it spans; the trial number value is then specified as a three-dimensional array (one dimension for each entry in the tuple).
We can identify a few other non-dimensional coordinates that are useful to represent:
- The between-subjects factor of whether the participant was given a text description in addition to the video.
- Whether a particular combination of gesture and species is considered to be ‘ambiguous’ in its meaning.
- Which of the two possible exemplar sets a participant received.
These can be represented via:
import numpy as np

def form_trial_data(row):
    dim_coords = {
        "ppant": np.array(row["Participant Private ID"], ndmin=1),
        "gesture": np.array(row["Gesture"].lower(), ndmin=1),
        "species": np.array(row["Species"].lower(), ndmin=1),
    }
    dims = tuple(dim_coords.keys())
    n_dims = len(dims)
    nondim_coords = {
        "trial_num": (
            ("ppant", "gesture", "species"),
            np.array(int(row["Trial Number"]), ndmin=3),
        ),
        # between-subjects manipulation: video only ('gesture') or video + text
        "context_cond": (
            ("ppant",),
            np.array(row["Condition"].lower(), ndmin=1),
        ),
        # whether this gesture/species combination has an ambiguous meaning
        "ambiguous": (
            ("gesture", "species"),
            np.array(row["Ambiguous"] == "1", ndmin=2),
        ),
        # infer the exemplar set (1 or 2) from the task spreadsheet number
        "exemplar_cond": (
            ("ppant",),
            np.array(
                int(row["Spreadsheet Name"].replace("Task", "")) % 2 + 1,
                ndmin=1,
            ),
        ),
    }
    return trial_data
Finally, we extract whether the participant responded correctly on the trial and put together the xarray data structure:
import numpy as np
import xarray as xr

def form_trial_data(row):
    dim_coords = {
        "ppant": np.array(row["Participant Private ID"], ndmin=1),
        "gesture": np.array(row["Gesture"].lower(), ndmin=1),
        "species": np.array(row["Species"].lower(), ndmin=1),
    }
    dims = tuple(dim_coords.keys())
    n_dims = len(dims)
    nondim_coords = {
        "trial_num": (
            ("ppant", "gesture", "species"),
            np.array(int(row["Trial Number"]), ndmin=3),
        ),
        "context_cond": (
            ("ppant",),
            np.array(row["Condition"].lower(), ndmin=1),
        ),
        "ambiguous": (
            ("gesture", "species"),
            np.array(row["Ambiguous"] == "1", ndmin=2),
        ),
        "exemplar_cond": (
            ("ppant",),
            np.array(
                int(row["Spreadsheet Name"].replace("Task", "")) % 2 + 1,
                ndmin=1,
            ),
        ),
    }
    # whether the participant chose the correct meaning on this trial
    resp_correct = np.array(row["Success"] == "1", ndmin=n_dims)
    trial_data = xr.DataArray(
        data=resp_correct,
        dims=dims,
        coords=dim_coords | nondim_coords,
    )
    return trial_data
We can have a look at the representation of the first data row by running form_trial_data(row=next(get_rows())) in an interactive session:
>>> form_trial_data(row=next(get_rows()))
<xarray.DataArray (ppant: 1, gesture: 1, species: 1)>
array([[[False]]])
Coordinates:
  * ppant          (ppant) <U5 '83088'
  * gesture        (gesture) <U9 'arm raise'
  * species        (species) <U6 'bonobo'
    trial_num      (ppant, gesture, species) int64 17
    context_cond   (ppant) <U7 'gesture'
    ambiguous      (gesture, species) bool True
    exemplar_cond  (ppant) int64 1
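Even at the single-trial level, the labels already allow for explicit, label-based indexing; for example (a usage sketch):

trial_data = form_trial_data(row=next(get_rows()))
# select by coordinate label rather than by position
value = trial_data.sel(ppant="83088", gesture="arm raise", species="bonobo")
print(value.item())  # False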
Assembling and examining the study data
Now that we are able to represent each trial, we can stitch all the trial data together using the combine_by_coords function in xarray:
import xarray as xr

def form_data():
    trials = [form_trial_data(row) for row in get_rows()]
    data = xr.combine_by_coords(data_objects=trials)
    return data
However, one aspect that we need to be mindful of is that the combine_by_coords function will happily proceed if there is more than one value at a particular coordinate. For example, if there were (erroneously) different rows in the raw data that had the same participant, gesture, and species, then combine_by_coords would silently record only one of these and we would not be alerted to the problem. Hence, I think it is best to do a manual verification that each trial datum has a unique position on the three dimensions:
import xarray as xr

def form_data():
    trials = [form_trial_data(row) for row in get_rows()]
    # check that each trial occupies a unique (ppant, gesture, species)
    # position; a set keeps the membership test fast given the number of trials
    trial_positions = set()
    for trial_data in trials:
        trial_position = (
            trial_data.ppant.item(),
            trial_data.gesture.item(),
            trial_data.species.item(),
        )
        if trial_position in trial_positions:
            raise ValueError(
                f"Trial with position {trial_position} already seen."
            )
        trial_positions.add(trial_position)
    data = xr.combine_by_coords(data_objects=trials)
    return data
This will probably take a while to run. Let’s assume that we have done so and stored the result in the variable data. We can have a look at its representation by running print(data) in an interactive session:
>>> print(data)
<xarray.DataArray (ppant: 5656, gesture: 10, species: 2)>
array([[[ 1.,  0.],
        [ 1.,  1.],
        [ 1.,  1.],
        ...,
        [ 0.,  1.],
        [ 0.,  1.],
        [ 1.,  1.]],

       [[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [ 0., nan],
        [nan, nan]],

       [[ 0.,  0.],
        [ 1.,  0.],
        [ 1.,  0.],
        ...,
...
        ...,
        [ 1.,  1.],
        [ 0.,  0.],
        [ 0.,  1.]],

       [[ 0.,  0.],
        [ 0.,  1.],
        [ 0.,  1.],
        ...,
        [ 1.,  1.],
        [ 0.,  1.],
        [ 1.,  1.]],

       [[ 1.,  0.],
        [ 0.,  1.],
        [ 1.,  0.],
        ...,
        [ 1.,  1.],
        [ 0.,  1.],
        [ 1.,  1.]]])
Coordinates:
  * ppant          (ppant) <U6 '101018' '101104' '101234' ... '99348' '99407'
  * gesture        (gesture) <U16 'arm raise' 'big loud scratch' ... 'touch'
  * species        (species) <U6 'bonobo' 'chimp'
    trial_num      (ppant, gesture, species) float64 17.0 3.0 5.0 ... 4.0 2.0
    context_cond   (ppant) object 'gesture' 'context' ... 'context' 'context'
    ambiguous      (gesture, species) bool True True False ... True True True
    exemplar_cond  (ppant) float64 1.0 2.0 2.0 1.0 2.0 ... 1.0 1.0 2.0 1.0 1.0
Now, let’s check whether some aspects of the data conform to what we would expect based on the following description from the published article:
We analysed n = 112,648 responses (Video only, n = 59,001; Context, n = 53,647) from n = 5,656 participants who completed the full set of videos (Video only, n = 2,962; Context, n = 2,694). Participants correctly interpreted the meanings of chimpanzee and bonobo gestures with or without additional Contextual information (Context: Success rate mean = 57.3 ± 11.9%; binomial, n = 53,647, p < 0.0001; Video only: Success rate mean = 52.1 ± 11.0%; binomial, n = 59,001, p < 0.0001) significantly higher than expected by chance (0.25).
If we prepare some straightforward data summarisation code:
def compare_against_paper(data):
    print(f"Total responses: {data.size}")
    print(f"n = {data.sizes['ppant']}")
    label_lut = {"gesture": "Video only", "context": "Context"}
    for (context_label, context_data) in data.groupby(group=data.context_cond):
        print(f"Condition: {label_lut[context_label]}")
        print(f"\tn (participants) = {context_data.sizes['ppant']}")
        print(f"\tn (responses) = {context_data.size}")
        print(f"\tMean = {context_data.mean().item() * 100:.1f}%")
we get:
Total responses: 113120
n = 5656
Condition: Context
    n (participants) = 2694
    n (responses) = 53880
    Mean = 57.3%
Condition: Video only
    n (participants) = 2962
    n (responses) = 59240
    Mean = 52.1%
Hmm. It seems like most of the values match the description given in the paper—but not all. Specifically, those relating to the total number of responses differ by being larger in our data representation. The reason for the discrepancy is evident in the summary of the data variable printed above—notice how some of the data points have the value of nan rather than 0 or 1. These nan values are being ignored when calculating the means, but are still contributing to the total response counts.
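We can confirm this directly by counting the non-finite values, which accounts for the difference of 113,120 - 112,648 = 472 responses (a quick sketch, assuming the combined array is stored in the data variable):

import numpy as np

# count the data points that are nan rather than 0 or 1
n_missing = np.sum(~np.isfinite(data.values))
print(n_missing)  # 472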
If we do not consider the nan responses:
import numpy as np

def compare_against_paper(data):
    print(f"Total responses: {data.size}")
    print(f"Total responses (!nan): {np.sum(np.isfinite(data.values))}")
    print(f"n = {data.sizes['ppant']}")
    label_lut = {"gesture": "Video only", "context": "Context"}
    for (context_label, context_data) in data.groupby(group=data.context_cond):
        print(f"Condition: {label_lut[context_label]}")
        print(f"\tn (participants) = {context_data.sizes['ppant']}")
        print(f"\tn (responses) = {context_data.size}")
        print(f"\tn (responses; !nan) = {np.sum(np.isfinite(context_data.values))}")
        print(f"\tMean = {context_data.mean().item() * 100:.1f}%")
we recover the values described in the paper:
Total responses: 113120
Total responses (!nan): 112648
n = 5656
Condition: Context
    n (participants) = 2694
    n (responses) = 53880
    n (responses; !nan) = 53647
    Mean = 57.3%
Condition: Video only
    n (participants) = 2962
    n (responses) = 59240
    n (responses; !nan) = 59001
    Mean = 52.1%
The presence of nan values indicates that some combinations of participant, gesture, and species are not contained within the raw data. The data are described as relating to “participants who completed the full set of videos”, but there were also data exclusions involving response times that likely operated on a trial-by-trial basis. How to handle those nan values depends on the approach of the analyst.
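To gauge how many participants are affected, we can count those with at least one nan value (a quick sketch):

# participants with at least one missing trial
n_incomplete = data.isnull().any(dim=["gesture", "species"]).sum().item()
print(n_incomplete)  # 230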
With such a large number of participants, it seems fine to me to remove those participants who have any excluded trials—which we can do by dropping those participants who have any nan values in their data:
def check_data(data):
    data = data.dropna(dim="ppant")
    return data
Having nan values in the data can also affect the data types, because nan is a floating-point value and so forces any array that contains it to a floating-point dtype. Hence, we can also fix up the data types while we are there:
def check_data(data):
    data = data.dropna(dim="ppant")
    # with the nan values removed, the responses can be boolean again
    data.data = data.data.astype(bool)
    nondim_dtypes = {
        "trial_num": int,
        "context_cond": str,
        "exemplar_cond": int,
    }
    for (var_name, var_dtype) in nondim_dtypes.items():
        data[var_name] = data[var_name].astype(var_dtype)
    return data
This then gives us our final data representation (with 230 fewer participants):
>>> print(data)
<xarray.DataArray (ppant: 5426, gesture: 10, species: 2)>
array([[[ True, False],
        [ True,  True],
        [ True,  True],
        ...,
        [False,  True],
        [False,  True],
        [ True,  True]],

       [[False, False],
        [ True, False],
        [ True, False],
        ...,
        [False, False],
        [False,  True],
        [ True, False]],

       [[ True, False],
        [ True,  True],
        [ True, False],
        ...,
...
        ...,
        [ True,  True],
        [False, False],
        [False,  True]],

       [[False, False],
        [False,  True],
        [False,  True],
        ...,
        [ True,  True],
        [False,  True],
        [ True,  True]],

       [[ True, False],
        [False,  True],
        [ True, False],
        ...,
        [ True,  True],
        [False,  True],
        [ True,  True]]])
Coordinates:
  * ppant          (ppant) <U6 '101018' '101234' '101978' ... '99348' '99407'
  * gesture        (gesture) <U16 'arm raise' 'big loud scratch' ... 'touch'
  * species        (species) <U6 'bonobo' 'chimp'
    trial_num      (ppant, gesture, species) int64 17 3 5 14 20 ... 11 16 10 4 2
    context_cond   (ppant) <U7 'gesture' 'context' ... 'context' 'context'
    ambiguous      (gesture, species) bool True True False ... True True True
    exemplar_cond  (ppant) int64 1 2 1 2 1 2 1 1 2 1 2 ... 1 1 2 2 2 2 1 1 2 1 1
We can save and load these data to and from disk, in NetCDF format, using:
def save(data_path, data):
    data.to_netcdf(path=data_path)

def load(data_path):
    with xr.open_dataarray(filename_or_obj=data_path) as handle:
        data = handle.load()
    return data
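Putting the pieces together, a typical usage might look like this (the gad_data.nc file name is just an example):

# assemble, verify, and tidy the data, then round-trip it through NetCDF
data = check_data(data=form_data())
save(data_path="gad_data.nc", data=data)
data = load(data_path="gad_data.nc")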
We are then ready to perform some data analysis! Although it has been a fair bit of work to get to this point, we now have the data in a form that is amenable to powerful computation (via its multi-dimensional representation) and hopefully more robust to programming errors (via its dimensional labelling).
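As a small taste of that, the labelled dimensions turn common summaries into one-liners (a sketch; results not shown here):

# mean accuracy for each gesture, averaged over participants and species
per_gesture = data.mean(dim=["ppant", "species"])

# mean accuracy for the chimpanzee videos only, selected by label
chimp_accuracy = data.sel(species="chimp").mean().item()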
Software versions
These are the versions of the key software and packages that were used to generate the material in this post:
Python implementation: CPython
Python version: 3.9.16
pooch: 1.6.0
xarray: 2023.1.0
numpy: 1.24.1
License
This work is licensed under CC BY-NC-ND 4.0.
References
- Graham, K.E. & Hobaiter, C. (2023) Towards a great ape dictionary: Inexperienced humans understand common nonhuman ape gestures. PLoS Biology, 21(1), e3001939.
- Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., & Oliphant, T.E. (2020) Array programming with NumPy. Nature, 585, 357–362.
- Hoyer, S. & Hamman, J. (2017) xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software, 5(1), 10.
- Uieda, L., Soler, S.R., Rampin, R., van Kemenade, H., Turk, M., Shapero, D., Banihirwe, A., & Leeman, J. (2020) Pooch: A friend to fetch your data files. Journal of Open Source Software, 5(45), 1943.