low performance: creating & reading files with many lines of data rather slow
I noticed that creating and reading large files (e.g. 10k+ lines of data, FFI 1001) is rather slow, taking on the order of seconds to minutes.
creation
ict = icartt.Dataset(format=icartt.Formats.FFI1001)
# ...
I create the data section from a 2D np array. Internally, that calls add for each line along the vertical axis of the array. That function has two conditionals, a loop, and two nested conditionals. It also calls np.append for each line of data, which is inefficient compared to appending to a native Python list.
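As a rough illustration (not the icartt code itself), here is a minimal comparison of the two growth patterns; the array shape and line count below are made up:

import numpy as np
import time

data = np.random.rand(10_000, 5)   # hypothetical data section: 10k lines, 5 variables

# pattern 1: grow the array with np.append, one line at a time
t0 = time.perf_counter()
arr = np.empty((0, data.shape[1]))
for row in data:
    arr = np.append(arr, [row], axis=0)   # reallocates and copies the whole array each call -> O(n^2) overall
print(f"np.append per line: {time.perf_counter() - t0:.2f} s")

# pattern 2: collect lines in a Python list, convert once at the end
t0 = time.perf_counter()
rows = []
for row in data:
    rows.append(row)                      # amortized O(1) per line
arr2 = np.asarray(rows)                   # single conversion to a 2D array
print(f"list + single conversion: {time.perf_counter() - t0:.2f} s")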
reading
ict = icartt.Dataset(f)
# ...
For the data section, this calls addBulkFromTxt; from there on, it's basically the same as for creation above.
At the moment, I'm not sure what the best way to approach this would be. What I am certain about is that it needs improvement, as I think it can scare off users. Speaking for myself, this kind of slowness is exactly what makes using nappy, the only "sort-of-official" NASA Ames Python package, painful ;-)
ideas
- load the data section with np.loadtxt or np.genfromtxt? (see the first sketch below)
  - we have that dependency already, and it would offer convenient features, such as "vectorized" methods to replace vmiss with np.nan etc.
- load the data section line by line, but use a list of lists as an intermediate structure (see the second sketch below)
  - Python's list methods are very efficient, so reading the data into a list of lists first, then converting to a numpy array once, then using vectorized methods (e.g. for vmiss-to-nan) should improve performance
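A minimal sketch of the first idea, assuming the data section has already been isolated as text and that vmiss is known from the header (the values and tiny input below are made up):

import io
import numpy as np

vmiss = -9999.0                          # hypothetical missing-data value from the header
text = "1 2 -9999\n3 -9999 6\n7 8 9\n"   # stand-in for the raw data section

data = np.genfromtxt(io.StringIO(text))  # parse all lines in one call
data[data == vmiss] = np.nan             # vectorized vmiss-to-nan replacement
print(data)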
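And a sketch of the second idea, again with made-up input: each line is split with plain Python, appended to a list, and only converted to a numpy array once at the end:

import numpy as np

vmiss = -9999.0
lines = ["1 2 -9999", "3 -9999 6", "7 8 9"]          # stand-in for the data section lines

rows = []
for line in lines:
    rows.append([float(x) for x in line.split()])    # cheap per-line parsing into a list of lists

data = np.asarray(rows)          # one conversion to a 2D array
data[data == vmiss] = np.nan     # vectorized vmiss-to-nan, as above
print(data)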