HDF5: where have you been all my life?

Nathaniel introduced me to HDF5 around the winter holidays (because, yes, that's the kind of thing we talk about while on vacation), but I just started using it in earnest this past week via h5py.  I may never go back to plain text data storage if I can help it. We'll see if I can convince you too.

This is a script that simulates a big matrix of random data, writes it in both formats, and reads it back from both formats.  The syntax is similar for both.
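A minimal sketch of such a script, assuming NumPy's savetxt/loadtxt for the plain-text side and a single h5py dataset for the HDF5 side. The matrix size, the file names, and the dataset name "data" are all placeholders picked for illustration, not the values behind the timings in this post.

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed size, kept small so the sketch runs quickly.
n_rows, n_cols = 1000, 100
data = np.random.rand(n_rows, n_cols)

# Hypothetical output locations in a scratch directory.
out_dir = tempfile.mkdtemp()
txt_path = os.path.join(out_dir, "data.txt")
h5_path = os.path.join(out_dir, "data.h5")

# Plain text: write the matrix, then read the whole thing back into memory.
np.savetxt(txt_path, data)
txt_data = np.loadtxt(txt_path)

# HDF5: write the matrix as one dataset ("data" is an assumed name).
with h5py.File(h5_path, "w") as f:
    f.create_dataset("data", data=data)

# HDF5: open the file and slice only the pieces that are needed.
with h5py.File(h5_path, "r") as f:
    dset = f["data"]
    h5_shape = dset.shape
    h5_row = dset[0, :]   # one row, read from disk
    h5_col = dset[:, 0]   # one column, read from disk
```

Note the difference in what "reading" means: np.loadtxt parses the entire file into memory, while opening an HDF5 file only gives a handle, and data is pulled from disk when a slice is requested.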

To compare h5py against plain text, I ran the above (with timing code added) 100 times with different random data.  Here are the average results.

              plain text    HDF5 (h5py)
write         9.44 sec      0.0634 sec
read          7.94 sec      0.0051 sec
row access    3e-5 sec      6e-4 sec
col access    1e-6 sec      0.016 sec
file size     239 MB        77 MB

If your data is small enough or you need to access almost all of it repeatedly, plain text files might still be good for you.  I usually sample rows from large datasets that eat memory like chocolate cake. So for almost everything I do, h5py is the clear winner.
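For the row-sampling case, here is a rough sketch of how a random subset of rows could be pulled from an HDF5 file without loading the whole matrix. The file name, dataset name "data", and sizes are made up for illustration. It relies on h5py's fancy indexing, which expects the index list in increasing order with no repeats.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file; the name and sizes are assumptions.
path = os.path.join(tempfile.mkdtemp(), "sample.h5")
full = np.random.rand(500, 20)
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=full)

rng = np.random.default_rng(0)
with h5py.File(path, "r") as f:
    dset = f["data"]
    # Pick 10 distinct row indices and sort them, since h5py
    # wants fancy-index lists in increasing order.
    idx = np.sort(rng.choice(dset.shape[0], size=10, replace=False))
    # Only the selected rows are read from disk.
    sample = dset[idx.tolist(), :]
```

With plain text, the same sample would normally mean parsing the entire file first, which is where the memory goes.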

1 comment:

Anonymous said...

So romantic, ain't he a sweet-talker!