Data—arrays and numpy
Objectives
- Be able to create arrays using numpy functions.
- Know how to access subsets of array elements using indexing.
- Understand how operations are applied to both the items in an array and over items in an array.
- Be able to save/load arrays to/from disk.
In this lesson, we will be introducing a critical data type for data analysis—the array.
We will be using an external Python package called numpy for our array functionality.
As you may recall, we need to use the import command to make such additional functionality available to our Python scripts.
For numpy, we do something slightly different; because we will be using it so much, it is conventional to shorten numpy to np in our code.
We modify our usual import statement to be:
import numpy as np
This code imports the numpy functionality and allows us to refer to it as np.
Typically, this line will appear at the top of all the Python scripts we will use in these lessons.
Creating arrays
There are quite a few ways of producing arrays.
First, we will consider the array analogue of the range function that we used previously to produce a list.
The equivalent function in numpy is arange:
import numpy as np
r_list = range(10)
print type(r_list), r_list
r_array = np.arange(10)
print type(r_array), r_array
<type 'list'> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<type 'numpy.ndarray'> [0 1 2 3 4 5 6 7 8 9]
As you can see, both methods have produced a collection of integers from 0 through 9.
However, they are different data types: the range function produces a list, whereas the arange function produces an array.
This distinction is important, because the different data types have different functionality associated with them.
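As a minimal sketch of why the distinction matters, the example below applies the same operator to a list and to an array. The variable names are illustrative only, and print is called with parentheses so that the code runs under both Python 2 and Python 3.

```python
import numpy as np

r_list = list(range(3))   # a plain Python list: [0, 1, 2]
r_array = np.arange(3)    # a numpy array holding the same values

# multiplying a list repeats its contents
doubled_list = r_list * 2
# multiplying an array multiplies each item
doubled_array = r_array * 2

print(doubled_list)
print(doubled_array)
```

The list ends up with six items, whereas the array still has three items, each twice its original value.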
One of the most useful things about arrays is that they can have multiple dimensions.
A frequently encountered form of data is a table, with a number of rows and a number of columns.
We can represent such a structure by creating a two-dimensional array.
For example, the ones function creates an array of a particular size with all elements having the value of 1. We can use this to create a data structure with 5 rows and 3 columns:
import numpy as np
data = np.ones(shape=[5, 3])
print data
[[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]
[ 1. 1. 1.]]
Here, we’ve provided the ones function with a keyword argument called shape, which is a list that specifies the number of items along each dimension.
As you can see, it creates a structure with 5 rows and 3 columns.
Tip
You can see how this would be a useful data structure if you think that the rows might represent individual participants in an experiment and the columns might represent 3 conditions in a within-subject design.
Once created, we can access various useful properties of the array.
For example, .shape returns a representation of the number of items along each dimension of the array, .size returns the total number of items in the array, and .ndim returns the number of dimensions in the array:
import numpy as np
data = np.ones(shape=[5, 3])
print data.shape
print data.size
print data.ndim
(5, 3)
15
2
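The same properties apply to arrays with more than two dimensions. The sketch below assumes a hypothetical layout of 2 sessions, 5 participants, and 3 conditions; that interpretation is an illustration rather than part of the lesson's data.

```python
import numpy as np

# hypothetical layout: 2 sessions x 5 participants x 3 conditions
data = np.ones(shape=[2, 5, 3])

print(data.shape)
print(data.ndim)
print(data.size)

# the total number of items is the product of the items along each dimension
n_items = 2 * 5 * 3
print(data.size == n_items)
```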
Indexing arrays
We can access the items in an array using techniques similar to those we used with lists:
import numpy as np
r_array = np.arange(10)
print r_array[:5]
[0 1 2 3 4]
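Other list-style slices work in the same way. The following sketch shows a few common forms; the variable names are illustrative.

```python
import numpy as np

r_array = np.arange(10)

# the last three items
last_three = r_array[-3:]
# every second item, starting from the first
every_second = r_array[::2]
# items at positions 3, 4 and 5 (the stop position is excluded)
middle = r_array[3:6]

print(last_three)
print(every_second)
print(middle)
```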
Indexing becomes more advanced when we have arrays with more than one dimension.
Here, we will use another function to generate a multidimensional array: the numpy equivalent of random.random that we encountered earlier.
We can access individual items in the array by separating the dimensions by a comma:
import numpy as np
data = np.random.random(size=[5, 3])
print data
# first row, first column
print data[0, 0]
# second row, first column
print data[1, 0]
# second row, last column
print data[1, -1]
[[ 0.90168308 0.17133111 0.57730204]
[ 0.83406998 0.38666967 0.3698428 ]
[ 0.48163748 0.43250003 0.95742637]
[ 0.17254113 0.40056304 0.35827753]
[ 0.95301857 0.13373673 0.69812848]]
0.9016830831
0.834069979484
0.36984279585
Importantly, we can also access all the items along a particular dimension:
import numpy as np
data = np.random.random(size=[5, 3])
print data
# all rows, first column
print data[:, 0]
# first row, all columns
print data[0, :]
[[ 0.1961776 0.53541907 0.7453791 ]
[ 0.57018969 0.62871156 0.04582693]
[ 0.83923676 0.29456444 0.72562237]
[ 0.77830901 0.50678319 0.18410882]
[ 0.56276205 0.0579545 0.92955741]]
[ 0.1961776 0.57018969 0.83923676 0.77830901 0.56276205]
[ 0.1961776 0.53541907 0.7453791 ]
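Slices can also be combined across dimensions to pull out a block of rows and columns. The sketch below uses np.arange together with the reshape method (which is not covered in this lesson) purely so that the values are predictable and easy to check by eye.

```python
import numpy as np

# 5 rows and 3 columns holding the integers 0 to 14, filled row by row
data = np.arange(15).reshape(5, 3)
print(data)

# first two rows, all columns
first_two_rows = data[:2, :]
# all rows, last column
last_column = data[:, -1]
# second and third rows, first two columns
block = data[1:3, :2]

print(first_two_rows)
print(last_column)
print(block)
```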
We can also extract items using arrays of boolean values.
For example, if we use a > operator on an array, it returns an array of booleans.
If we then use this boolean array to index the data, it returns those items where the corresponding item in the boolean array is True:
import numpy as np
data = np.random.random(10)
print data
gt_point_five = data > 0.5
print gt_point_five
print data[gt_point_five]
[ 0.41374836 0.20518716 0.78958201 0.97271562 0.37304847 0.42976821
0.02028967 0.91801355 0.05767821 0.73583786]
[False False True True False False False True False True]
[ 0.78958201 0.97271562 0.91801355 0.73583786]
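The comparison can also be written directly inside the square brackets, and summing a boolean array counts how many of its items are True. The sketch below uses fixed values rather than random ones so that the result is reproducible; the variable names are illustrative.

```python
import numpy as np

# fixed values 0.0, 0.1, ..., 0.9 in place of random data
data = np.arange(10) / 10.0

# comparison and indexing in a single step
above = data[data > 0.5]
# each True counts as 1 when summed
n_above = np.sum(data > 0.5)

print(above)
print(n_above)
```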
Operations on arrays
We can use the conventional maths operators with arrays where, unlike lists, they operate on each item in the array. For example:
import numpy as np
data = np.ones(4)
print data + 1
print data * 3
print data - 2
[ 2. 2. 2. 2.]
[ 3. 3. 3. 3.]
[-1. -1. -1. -1.]
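The same operators also work between two arrays of the same shape, in which case the items are paired up by position. A minimal sketch, with illustrative variable names:

```python
import numpy as np

a = np.arange(4)      # [0 1 2 3]
b = np.ones(4) * 2    # [ 2. 2. 2. 2.]

# each item in a is combined with the item at the same position in b
total = a + b
product = a * b

print(total)
print(product)
```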
We can also use operators that operate over items in an array. For example, we could add together all the items in an array:
import numpy as np
data = np.ones(4)
print data
print np.sum(data)
[ 1. 1. 1. 1.]
4.0
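When applied to multidimensional arrays, such functions typically can be given an axis argument.
This argument specifies the axis over which the operation is applied: axis=0 collapses the first dimension (summing over the rows to give one total per column), whereas axis=1 collapses the second dimension (summing over the columns to give one total per row).
For example, to sum over rows and columns: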
import numpy as np
data = np.ones([4, 3])
data[1, :] = 2
data[2, :] = 3
data[3, :] = 4
print data
print "Rows:"
print np.sum(data, axis=0)
print "Columns:"
print np.sum(data, axis=1)
[[ 1. 1. 1.]
[ 2. 2. 2.]
[ 3. 3. 3.]
[ 4. 4. 4.]]
Rows:
[ 10. 10. 10.]
Columns:
[ 3. 6. 9. 12.]
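Other functions accept the axis argument in the same way. As a sketch, np.mean (which is not introduced in this lesson) is applied below to the same data; printing the shape of each result shows which dimension has been collapsed.

```python
import numpy as np

data = np.ones([4, 3])
data[1, :] = 2
data[2, :] = 3
data[3, :] = 4

# average over the rows: one value for each of the 3 columns
column_means = np.mean(data, axis=0)
# average over the columns: one value for each of the 4 rows
row_means = np.mean(data, axis=1)

print(column_means)
print(column_means.shape)
print(row_means)
print(row_means.shape)
```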
Loading and saving arrays from/to disk
When using one- or two-dimensional arrays, a straightforward way to load and save data is in the form of a text file.
This can be opened with any editor, and maximises the interoperability of the data with other programs.
To do so, we use np.savetxt and np.loadtxt.
For example, we can save some random data to a text file and then load it back in again to verify its contents have not changed:
import numpy as np
data = np.random.random([3, 2])
print data
np.savetxt("data.txt", data)
saved_data = np.loadtxt("data.txt")
print saved_data
[[ 0.92961609 0.31637555]
[ 0.18391881 0.20456028]
[ 0.56772503 0.5955447 ]]
[[ 0.92961609 0.31637555]
[ 0.18391881 0.20456028]
[ 0.56772503 0.5955447 ]]
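Rather than comparing the printed output by eye, the check can be made in code. The sketch below uses np.allclose, which is not covered in this lesson, to test whether the two arrays match to within a small tolerance. It writes to a temporary directory (an assumption made here so that no existing file is overwritten).

```python
import os
import tempfile

import numpy as np

data = np.random.random([3, 2])

# save into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "data.txt")
np.savetxt(path, data)
saved_data = np.loadtxt(path)

# True if every loaded item matches the original to within a small tolerance
matches = np.allclose(data, saved_data)
print(matches)
```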
We can also inspect the file that is saved on disk, data.txt:
9.296160928171478544e-01 3.163755545817859005e-01
1.839188116770944514e-01 2.045602785530397094e-01
5.677250290816866496e-01 5.955447029792515501e-01
Two aspects of this file are worth noting:
- The data are represented in exponential notation.
- Columns are separated by a space character.
To make it easier to visually inspect the data file, and for compatibility with other programs that are expecting a CSV (‘comma separated values’) format, we can provide arguments to the np.savetxt function:
import numpy as np
data = np.random.random([3, 2])
np.savetxt("data2.txt", data, fmt="%.4f", delimiter=",")
0.9296,0.3164
0.1839,0.2046
0.5677,0.5955
- The fmt argument allowed us to specify 4 decimal places (see the Python documentation on its format specification mini-language for more details).
- The delimiter argument allowed us to specify a comma as the separator (‘delimiter’) between columns.
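When reading such a file back in, np.loadtxt needs to be given the same delimiter. Because the values were written with only 4 decimal places, the loaded array matches the original only to within rounding error. The sketch below illustrates both points; the temporary directory and the tolerance of 1e-4 are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

data = np.random.random([3, 2])

# save into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "data2.txt")
np.savetxt(path, data, fmt="%.4f", delimiter=",")

# the delimiter must match the one used when saving
loaded = np.loadtxt(path, delimiter=",")

# the values were rounded on saving, so compare with a loose tolerance
close_enough = np.allclose(data, loaded, atol=1e-4)
print(loaded)
print(close_enough)
```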
If the array has more than two dimensions, then saving as a plain text file often isn’t practical.
In such circumstances, you can use np.save and np.load with array files, which are typically given the extension .npy.
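As a minimal sketch, a three-dimensional array can be saved and loaded as follows. The array shape and the use of a temporary directory are assumptions made for the example; np.array_equal, which is not covered in this lesson, tests whether two arrays have the same shape and items.

```python
import os
import tempfile

import numpy as np

# a three-dimensional array, which is awkward to store as plain text
data = np.random.random([2, 5, 3])

# save into a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, data)
loaded = np.load(path)

print(loaded.shape)
# the binary format stores the values exactly
same = np.array_equal(data, loaded)
print(same)
```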