What’s the most efficient way to convert a MySQL result set to a NumPy array?

Posted on

Question :

What’s the most efficient way to convert a MySQL result set to a NumPy array?

I’m using MySQLdb and Python. I have some basic queries such as this:

c=db.cursor()
c.execute("SELECT id, rating from video")
results = c.fetchall()

I need “results” to be a NumPy array, and I’m looking to be economical with my memory consumption. It seems like copying the data row by row would be incredibly inefficient (double the memory would be required). Is there a better way to convert MySQLdb query results into the NumPy array format?

The reason I’m looking to use the NumPy array format is because I want to be able to slice and dice the data easily, and it doesn’t seem like python is very friendly to multi-dimensional arrays in that regard.

e.g. b = a[a[:,2]==1] 

Thanks!

Answer #1:

The fetchall method actually returns an iterator, and numpy has the fromiter method to initialize an array from an interator. So, depending on what data is in the table you could combine the two easily, or use an adapter generator.

Answered By: Keith

Answer #2:

This solution uses Kieth’s fromiter technique, but handles the two dimensional table structure of SQL results more intuitively. Also, it improves on Doug’s method by avoiding all the reshaping and flattening in python data types. Using a structured array we can read pretty much directly from the MySQL result into numpy, cutting out python data types almost entirely. I say ‘almost’ because the fetchall iterator still produces python tuples.

There is one caveat though, but it’s not a biggie. You must know the data type of your columns and the number of rows in advance.

Knowing the column types should be obvious, since you know what the query is presumably, otherwise you can always use curs.description, and a map of the MySQLdb.FIELD_TYPE.* constants.

Knowing the row count means you have to use client side cursor (which is the default). I don’t know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client side memory when using client side cursors, although I suspect there’s actually some buffering and caching involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it’s probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.

Strictly speaking, you don’t have to provide the number of rows in advance, but doing so means the array memory is allocated once off in advance, and not continuously resized as more rows come in from the iterator which is meant to provide a huge performance boost.

And with that, some code

import MySQLdb
import numpy

conn = MySQLdb.connect(host='localhost', user='bob', passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() is the iterator as per Kieth's answer
#count=numrows means advance allocation
#dtype='i4,i4' means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print A #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second colum

See the numpy documentation for dtype and the link above about structured arrays for how to specify column data types, and column names.

Answered By: sirlark

Answer #3:

NumPy’s fromiter method seems best here (as in Keith’s answer, which preceded this one).

Using fromiter to recast a result set, returned by a call to a MySQLdb cursor method, to a NumPy array is simple, but there are a couple of details perhaps worth mentioning.

import numpy as NP
import MySQLdb as SQL

cxn = SQL.connect('localhost', 'some_user', 'their_password', 'db_name')
c = cxn.cursor()
c.execute('SELECT id, ratings from video')

# fetchall() returns a nested tuple (one tuple for each table row)
results = cursor.fetchall()

# 'num_rows' needed to reshape the 1D NumPy array returend by 'fromiter' 
# in other words, to restore original dimensions of the results set
num_rows = int(c.rowcount)

# recast this nested tuple to a python list and flatten it so it's a proper iterable:
x = map(list, list(results))              # change the type
x = sum(x, [])                            # flatten

# D is a 1D NumPy array
D = NP.fromiter(iterable=x, dtype=float, count=-1)  

# 'restore' the original dimensions of the result set:
D = D.reshape(num_rows, -1)

Note that fromiter returns a 1D NumPY array,

(This makes sense, of course, because you can use fromiter to return just a portion of a single MySQL Table row, by passing a parameter for count).

Still, you’ll have to restore the 2D shape, hence the predicate call to the cursor method rowcount. and the subsequent call to reshape in the final line.

Finally, the default argument for the parameter count is ‘-1’, which just retrieves the entire iterable

Answered By: doug

Leave a Reply

Your email address will not be published.