Create, recreate and remove duplicates in array manipulation, obviously in Python!:)

I would like to “pin” here a pair of quick solution to everyday problems I encounter manipulating arrays.

Create

First, the creation of a structured array (an array composed of records made by different data types) array from a file too big to be read at once with np.genfromtxt. The new array will be stored in an HDF5 file, so this is a conversion from .csv to .h5 file.

import numpy as np
import tables as tb
from StringIO import StringIO

f = open("filename.csv", 'r')
line = f.readline()
values = np.genfromtxt(StringIO(line), dtype=([('column_1', 'i8'), ('column_2', 'f4'), ('column_3', 'f4')]), delimiter=',')
values.shape = 1
h5 = tb.openFile('filename.h5', 'w')
table = h5.createTable(h5.root, description=values, name=table_name', title="table_description", expectedrows=12158536)
table.flush()

for line in f:
	values = np.genfromtxt(StringIO(line), dtype=([('column_1', 'i8'), ('column_2', 'f4'), ('column_3', 'f4')]), delimiter=',')
	values.shape = 1
	table.append(values)
	
table.flush()
h5.flush()
h5.close()
f.close()

The use of StringIO is necessary to convert the string containing the line read in a “I/O” object that np.genfromtxt can eat.

Remove duplicates

Consider the previous file, if there are duplicates row, np.unique can help in removing them. Note that we use the first column to identify the duplicates and that the result will be sorted respect to this column.

new_indexes = np.unique(table['column_1'], return_index=True, return_inverse=False)[1]
new_array = np.transpose(table[new_indexes])

In general, numpy.unique(array, return_index=True, return_inverse=True)
returns an array sorted and without duplicates, the indexes of the original array to create the new array and the indexes of the new one to recreate the old one:

In: a = np.array([5,3,3,7,2,9,1])
In: np.unique(a, return_index=True, return_inverse=True)
Out: 
(array([1, 2, 3, 5, 7, 9]),
	array([6, 4, 1, 0, 3, 5]),
	array([3, 2, 2, 4, 1, 5, 0]))

Recreate

Sometimes it’s useful to split a structured array in different arrays, manipulate them and recreate the structured array, or maybe you need to create a structured array from different arrays to fill a Pytables table.
To do this a possible solution is:

values = np.array(zip([column_1[0]], [column_2[0]]))
print "Creating table..."
table = h5.createTable(h5.root, description=values, name='fof_data_snap67', title="fof_data_snap67", expectedrows=11697806)

for i in xrange(1, fof.size):
	values = np.array(zip([column_1[i]], [column_2[i]]))
	table.append(values)

table.flush()
h5.flush()
h5.close()

It’s also possible to zip the entire arrays if they fit into memory:

values = np.array(zip(column_1, column_2))
Tags// , , , ,
comments powered by Disqus