Read and write DataFrames up to ten times faster than Parquet with StaticFrame NPZ
The Apache Parquet format provides an efficient binary representation of columnar tabular data, as seen with widespread use in Apache Hadoop and Spark, AWS Athena and Glue, and Pandas DataFrame serialization. While Parquet offers broad interoperability with performance superior to text formats (such as CSV or JSON), it is as much as ten times slower than NPZ, an alternative DataFrame serialization format introduced in StaticFrame.
StaticFrame (an open-source DataFrame library of which I am an author) builds on the NumPy NPY and NPZ formats to encode DataFrames. The NPY format (a binary encoding of array data) and the NPZ format (zipped bundles of NPY files) were defined in a 2007 NumPy Enhancement Proposal. By extending the NPZ format with specialized JSON metadata, StaticFrame provides a complete DataFrame serialization format that supports all NumPy dtypes.
This article extends work first presented at PyCon USA 2022 with further performance optimizations and broader benchmarking.
DataFrames are not just collections of columnar data with string column labels, such as found in relational databases. In addition to columnar data, DataFrames have labelled rows and columns, and those row and column labels can be of any type or (with hierarchical labels) of many types. Further, it is common to store metadata in a name attribute, either on the DataFrame or on the axis labels.
As Parquet was originally designed just to store collections of columnar data, the full range of DataFrame characteristics is not directly supported. Pandas supplies this additional information by adding JSON metadata to the Parquet file.
Further, Parquet supports a minimal selection of types; the full range of NumPy dtypes is not directly supported. For example, Parquet does not natively support unsigned integers or date types.
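To make the dtype claim concrete, here is a minimal sketch of round-tripping, through NPZ, dtypes that Parquet handles awkwardly. The file path and sample values are illustrative only (not from the benchmarks), and the sketch assumes Frame.from_fields and Frame.equals behave as in current StaticFrame releases.

import numpy as np
import static_frame as sf

# columns with dtypes outside Parquet's default type system
f1 = sf.Frame.from_fields(
        (
        np.array([1, 2, 3], dtype=np.uint8),  # unsigned integer
        np.array(['2012-03', '2012-04', '2012-05'], dtype='datetime64[M]'),  # date type
        ),
        columns=('a', 'b'),
        )
f1.to_npz('/tmp/dtypes.npz')

# both dtypes survive the NPZ round trip unchanged
f2 = sf.Frame.from_npz('/tmp/dtypes.npz')
assert f1.equals(f2, compare_dtype=True)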
While Python pickles are capable of efficiently serializing DataFrames and NumPy arrays, they are only suitable for short-term caches from trusted sources. While pickles are fast, they can become invalid due to code changes and are insecure to load from untrusted sources.
Another alternative to Parquet, originating in the Arrow project, is Feather. While Feather supports all Arrow types and succeeds in being faster than Parquet, it is still at least two times slower at reading DataFrames than NPZ.
Parquet and Feather support compression to reduce file size. Parquet defaults to "snappy" compression, while Feather defaults to "lz4". As the NPZ format prioritizes performance, it does not yet support compression. As will be shown below, NPZ significantly outperforms both compressed and uncompressed Parquet files.
Numerous publications offer DataFrame benchmarks by testing just one or two datasets. McKinney and Richardson (2020) is an example, where two datasets, Fannie Mae Loan Performance and NYC Yellow Taxi Trip data, are used to generalize performance. Such idiosyncratic datasets are insufficient, as both the shape of the DataFrame and the degree of columnar type heterogeneity can significantly differentiate performance.
To avoid this deficiency, I compare performance with a panel of nine synthetic datasets. These datasets vary along two dimensions: shape (tall, square, and wide) and columnar heterogeneity (columnar, mixed, and uniform). Shape variations alter the distribution of elements between tall (e.g., 10,000 rows and 100 columns), square (e.g., 1,000 rows and columns), and wide (e.g., 100 rows and 10,000 columns) geometries. Variations of columnar heterogeneity alter the diversity of types between columnar (no adjacent columns have the same type), mixed (some adjacent columns have the same type), and uniform (all columns have the same type).
The frame-fixtures library defines a domain-specific language to create deterministic, randomly-generated DataFrames for testing; the nine datasets are generated with this tool, as sketched below.
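As an illustration, a minimal sketch of the frame-fixtures DSL follows; the specification string here is my own and is not one of the nine benchmark fixtures.

import frame_fixtures as ff

# s(...) sets the shape; v(...) cycles the given value types across columns
f = ff.parse('s(1000,1000)|v(int,bool,float)')
print(f.shape)              # (1000, 1000)
print(f.dtypes.values[:4])  # the per-column dtypes, cycling through the given types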
To demonstrate some of the StaticFrame and Pandas interfaces evaluated, the following IPython session performs basic performance testing with %time. As shown below, a square, uniformly-typed DataFrame can be written and read with NPZ many times faster than uncompressed Parquet.
>>> import numpy as np
>>> import static_frame as sf
>>> import pandas as pd
>>> # a square, uniform float array
>>> array = np.random.random_sample((10_000, 10_000))
>>> # write performance
>>> f1 = sf.Frame(array)
>>> %time f1.to_npz('/tmp/frame.npz')
CPU times: user 710 ms, sys: 396 ms, total: 1.11 s
Wall time: 1.11 s
>>> df1 = pd.DataFrame(array)
>>> %time df1.to_parquet('/tmp/df.parquet', compression=None)
CPU times: user 6.82 s, sys: 900 ms, total: 7.72 s
Wall time: 7.74 s
>>> # read performance
>>> %time f2 = sf.Frame.from_npz('/tmp/frame.npz')
CPU times: user 2.77 ms, sys: 163 ms, total: 166 ms
Wall time: 165 ms
>>> %time df2 = pd.read_parquet('/tmp/df.parquet')
CPU times: user 2.55 s, sys: 1.2 s, total: 3.75 s
Wall time: 866 ms
The performance tests below extend this basic approach by using frame-fixtures for systematic variation of shape and type heterogeneity, and by averaging results over ten iterations. While hardware configuration affects performance, relative characteristics are retained across diverse machines and operating systems. For all interfaces, default parameters are used, except for disabling compression where necessary. The code used to perform these tests is available on GitHub.
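The full harness is at the GitHub link above; as a minimal sketch of the approach (the names and the fixture specification here are my own), a write can be averaged over ten iterations with the standard library:

import time
import frame_fixtures as ff

f = ff.parse('s(1000,1000)|v(float)')  # one shape/heterogeneity variation

# average wall-clock write time over ten iterations
total = 0.0
for _ in range(10):
    start = time.perf_counter()
    f.to_npz('/tmp/benchmark.npz')
    total += time.perf_counter() - start
print(f'mean write time: {total / 10:.4f} s')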
Read Performance
As data is often read more frequently than it is written, read performance is a priority. As shown for all nine DataFrames of one million (1e+06) elements, NPZ significantly outperforms Parquet and Feather with every fixture. NPZ read performance is more than ten times faster than compressed Parquet. For example, with the Uniform Tall fixture, compressed Parquet reading takes 21 ms, compared to 1.5 ms with NPZ.
The chart below shows processing time, where lower bars correspond to faster performance.
This impressive NPZ performance is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to be at least twice as fast as Parquet and Feather, regardless of whether compression is used.
Write Performance
In writing DataFrames to disk, NPZ outperforms Parquet (both compressed and uncompressed) in all scenarios. For example, with the Uniform Square fixture, compressed Parquet writing takes 200 ms, compared to 18.3 ms with NPZ. NPZ write performance is generally comparable to uncompressed Feather: in some scenarios NPZ is faster, in others Feather is faster.
As with read performance, NPZ write performance is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to be at least twice as fast as Parquet, regardless of whether compression is used.
Idiosyncratic Performance
As an additional reference, we will also benchmark the same NYC Yellow Taxi Trip data (from January 2010) used in McKinney and Richardson (2020). This dataset contains almost 300 million (3e+08) elements in a tall, heterogeneously-typed DataFrame of 14,863,778 rows and 19 columns.
NPZ read performance proves to be around four times faster than Parquet and Feather (with or without compression). While NPZ write performance is faster than Parquet, Feather writing here is fastest.
File Size
As shown below for both one million (1e+06) element and 100 million (1e+08) element DataFrames, uncompressed NPZ is generally equal in size on disk to uncompressed Feather and always smaller than uncompressed Parquet (sometimes smaller than compressed Parquet, too). As compression provides only modest file-size reductions for Parquet and Feather, the speed benefit of uncompressed NPZ might easily outweigh the cost of greater size.
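On-disk sizes can be compared with the standard library alone; a minimal sketch follows (the paths and fixture are my own, and writing Parquet assumes pyarrow is installed):

import os
import frame_fixtures as ff

f = ff.parse('s(1000,1000)|v(float)')
f.to_npz('/tmp/size.npz')

df = f.to_pandas()
df.columns = df.columns.astype(str)  # Parquet requires string column names
df.to_parquet('/tmp/size.parquet', compression=None)

# report the size of each file in bytes
for fp in ('/tmp/size.npz', '/tmp/size.parquet'):
    print(fp, os.path.getsize(fp))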
StaticFrame stores data as a collection of 1D and 2D NumPy arrays. Arrays represent columnar values as well as variable-depth index and column labels. In addition to NumPy arrays, information about component types (i.e., the Python class used for the index and columns), as well as the component name attributes, are needed to fully reconstruct a Frame. Completely serializing a DataFrame requires writing and reading these components to a file.
DataFrame components can be represented by the following diagram, which isolates arrays, array types, component types, and component names. This diagram will be used to demonstrate how an NPZ encodes a DataFrame.
The components of that diagram map to components of a Frame string representation in Python. For example, given a Frame of integers and Booleans with hierarchical labels on both the index and columns (downloadable from GitHub with StaticFrame's WWW interface), StaticFrame provides the following string representation:
>>> frame = sf.Frame.from_npz(sf.WWW.from_file('https://github.com/static-frame/static-frame/raw/master/doc/source/articles/serialize/frame.npz', encoding=None))
>>> frame
<Frame: p>
<IndexHierarchy: q>       data    data    data    valid  <<U5>
                          A       B       C       *      <<U1>
<IndexHierarchy: r>
2012-03 x                 5       4       7       False
2012-03 y                 9       1       8       True
2012-04 x                 3       6       2       True
<datetime64[M]> <<U1>     <int64> <int64> <int64> <bool>
The components of the string representation can be mapped to the DataFrame diagram by color:
Encoding an Array in NPY
An NPY stores a NumPy array as a binary file with six components: (1) a "magic" prefix, (2) a version number, (3) a header length, (4) a header (where the header is a string representation of a Python dictionary), and (5) padding, followed by (6) raw array byte data. These components are shown below for a three-element binary array stored in a file named "__blocks_1__.npy".
Given an NPZ file named "frame.npz", we can extract the binary data by reading the NPY file from the NPZ with the standard library's ZipFile:
>>> from zipfile import ZipFile
>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.open('__blocks_1__.npy').read())
b'\x93NUMPY\x01\x006\x00{"descr":"|b1","fortran_order":True,"shape":(3,)}   \n\x00\x01\x01'
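Those six components can be separated by hand with the standard library; a minimal sketch for version 1.0 NPY files, reusing the file from the session above:

import struct
from zipfile import ZipFile

with ZipFile('/tmp/frame.npz') as zf:
    data = zf.open('__blocks_1__.npy').read()

assert data[:6] == b'\x93NUMPY'                # (1) the "magic" prefix
version = (data[6], data[7])                   # (2) version number, here (1, 0)
header_len, = struct.unpack('<H', data[8:10])  # (3) header length, little-endian
header = data[10:10 + header_len]              # (4) header dict with (5) padding
values = data[10 + header_len:]                # (6) raw array byte data
print(version, header_len, header, values)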
As NPY is well supported in NumPy, the np.load() function can be used to convert this file to a NumPy array. This means that underlying array data in a StaticFrame NPZ is easily extractable by alternative readers.
>>> with ZipFile('/tmp/frame.npz') as zf: print(repr(np.load(zf.open('__blocks_1__.npy'))))
array([False,  True,  True])
As an NPY file can encode any array, large two-dimensional arrays can be loaded from contiguous byte data, offering excellent performance in StaticFrame when multiple contiguous columns are represented by a single array.
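A minimal sketch of this consolidation follows; the array shape and path are my own, and the block file naming assumes current StaticFrame behavior. Ten same-typed columns entering a Frame as one 2D array are written as a single block NPY:

import numpy as np
import static_frame as sf
from zipfile import ZipFile

# ten same-typed columns, represented internally by one contiguous 2D array
f = sf.Frame(np.arange(100).reshape(10, 10))
f.to_npz('/tmp/consolidated.npz')

# a single block NPY holds all ten columns
with ZipFile('/tmp/consolidated.npz') as zf:
    print([name for name in zf.namelist() if name.startswith('__blocks')])
# ['__blocks_0__.npy']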
Building an NPZ File
A StaticFrame NPZ is a standard uncompressed ZIP file that contains array data in NPY files and metadata (containing component types and names) in a JSON file. Reading the NPZ file for the Frame above, we can list its contents with ZipFile. The archive contains six NPY files and one JSON file.
>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.namelist())
['__values_index_0__.npy', '__values_index_1__.npy', '__values_columns_0__.npy', '__values_columns_1__.npy', '__blocks_0__.npy', '__blocks_1__.npy', '__meta__.json']
In the image below, these files are mapped to components of the DataFrame diagram.
StaticFrame extends the NPZ format with metadata in a JSON file. This file defines name attributes, component types, and depth counts.
>>> with ZipFile('/tmp/frame.npz') as zf: print(zf.open('__meta__.json').read())
b'{"__names__": ["p", "r", "q"], "__types__": ["IndexHierarchy", "IndexHierarchy"], "__types_index__": ["IndexYearMonth", "Index"], "__types_columns__": ["Index", "Index"], "__depths__": [2, 2, 2]}'
In the image below, components of the __meta__.json file are mapped to components of the DataFrame diagram.
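As the metadata is plain JSON, it can be decoded with the standard library alone; a minimal sketch, reusing the file from above (interpreting __depths__ as block count followed by index and columns depths, per the diagram):

import json
from zipfile import ZipFile

with ZipFile('/tmp/frame.npz') as zf:
    meta = json.loads(zf.open('__meta__.json').read())

print(meta['__names__'])   # ['p', 'r', 'q']: Frame, index, and columns name attributes
print(meta['__depths__'])  # [2, 2, 2]: block count, index depth, columns depth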
As a simple ZIP file, tools to extract the contents of a StaticFrame NPZ are ubiquitous. On the other hand, the ZIP format, given its history and broad features, incurs performance overhead. StaticFrame implements a custom ZIP reader optimized for NPZ usage, which contributes to the excellent read performance of NPZ.
The performance of DataFrame serialization is critical to many applications. While Parquet enjoys widespread support, its generality compromises type specificity and performance. StaticFrame NPZ can read and write DataFrames up to ten times faster than Parquet, with or without compression, and with comparable (or only modestly larger) file sizes. While Feather is an attractive alternative, NPZ read performance is still generally twice as fast. If data I/O is a bottleneck (and it often is), StaticFrame NPZ offers a solution.