The Best In-Browser Data Processing Framework is SQL

Continuing my explorations of in-browser processing capabilities, I recently wrote a benchmark to evaluate various frameworks, including Arquero, Danfo.js, SQLite WASM, and DuckDB.

The conclusion? WebAssembly ports of high-performance libraries like SQLite WASM and DuckDB are extremely fast and are probably the best approach for in-browser data processing, from both a performance and a developer-experience perspective.

Dive straight to the results and run the benchmarks in your browser, or continue reading to learn why in-browser computation opens interesting possibilities in client-side computing.

The Evolution of the Browser as a Data Sandbox

Data Streams in the Browser

While it might be a little unusual to consider the browser a computing environment for data-intensive tasks, it is not without its merits. It has long been a popular medium for shipping games, from the good old days of Macromedia Flash to more recent WebGL-accelerated applications.

But the browser's potential doesn't stop at gaming; it extends to data processing as well. With the confluence of hardware improvements, new web and data standards, and the emergence of WebAssembly, the browser is now capable of handling increasingly complex tasks, even for large datasets.

I will just briefly outline how these technologies have enabled the browser to become a powerful data computation tool.

Modern devices, including laptops and phones, are now packed with powerful processors, enabling users to run generative AI applications that were previously thought impossible. I think this parallels the trend of enterprises moving workloads away from distributed computing onto more powerful single machines, and we shall see more workloads moving to the browser as well.

One of the pain points of working with data frameworks across various languages is the lack of interoperability.1 With the semi-convergence of data standards towards Parquet and Arrow for analytical workloads, it is now possible to read and write data across languages in a performant manner (i.e. without resorting to CSV and JSON). Coupled with WebAssembly (WASM) coming into the picture, many of these native libraries can now be compiled to run in the browser. This is a huge win for developers, who can now leverage the same performant libraries across languages and platforms instead of having to re-implement them in JavaScript (JS).

We have begun to see the emergence of in-browser data processing tools, from libraries like Perspective.js, which provides interactive analytics and visualization components for large streaming datasets2, to compute applications like Observable, a web-based notebook akin to Jupyter, and Graphistry, a GPU-accelerated graph analysis platform.

Benefits of In-Browser Data Processing

As a platform, the browser is ubiquitous and is a great sandbox for data exploration and visualization.

In addition, in-browser computing is a paradigm shift from large powerful backends that do the heavy lifting to local computation on individual client systems. This shift towards performing data processing directly in the browser brings the following benefits.

1. Reduced Server Load

Traditionally, data processing tasks have been offloaded to remote servers due to their resource-intensive nature. However, as browsers become more powerful and efficient, they can now handle a significant portion of this workload. This translates to reduced server load, lower operational costs, and quicker response times for users.

2. Enhanced Privacy

Processing data in the browser can bolster user privacy. By keeping data on the client side, sensitive information doesn't need to be transmitted to a remote server, reducing the risk of data breaches and enhancing data security.

3. Offline Functionality

Browsers have introduced sophisticated service workers and caching mechanisms, allowing web applications to work seamlessly offline or with intermittent connectivity. In-browser data processing enables applications to continue functioning even when there's no internet connection, offering a smoother user experience.

4. Real-time Interactivity

Modern web applications demand real-time interactivity. In-browser data processing can facilitate dynamic updates, interactive visualizations, and instant responses to user input. This leads to engaging and responsive user interfaces, making applications more user-friendly.

The Benchmark

Now that we've established the browser's potential as a data computation environment, let's dive into the benchmarks. There are two prior related benchmarks:

  • wa-sqlite benchmark - Focuses on SQLite variants and primarily transactional queries. A very useful guide to gauge the capabilities of the different SQLite variants. I adapted most of the approach, layout, and design from Roy Hashimoto's benchmark.
  • DuckDB WASM comparisons - Compares DuckDB-Wasm, sql.js3, Arquero, and Lovefield on the TPC-H benchmark, which consists of mostly analytical queries. While it compares the WASM libraries, the benchmark runs on Node.js and not directly in the browser.

For this benchmark, I wanted to test a mix of transactional and analytical queries within the browser environment. I selected four libraries to benchmark - Arquero, Danfo.js, SQLite WASM, and DuckDB. This is intended to cover a broad range of approaches, from JavaScript-native libraries (Danfo.js), to JS libraries that work with modern data formats (Arquero with Arrow.js), to WebAssembly-ported staples: an embedded database (SQLite WASM) and an analytical database (DuckDB).

benchmark candidates

A quick discussion of each and the different variations benchmarked is in order.

Arquero

Arquero is a JavaScript library created by the University of Washington Interactive Data Lab. It provides a fluent API for manipulating tabular data in the browser, taking inspiration from the famous dplyr library. It supports loading data from the Apache Arrow format, but not Parquet, so as an intermediate step I used Parquet-WASM to convert from Parquet into Arrow.
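To make the loading path concrete, here is a rough sketch of the Parquet-to-Arrow-to-Arquero pipeline. It is not the benchmark's actual code: it assumes a parquet-wasm build whose readParquet returns a table object exposing intoIPCStream(), and the URL and column names are placeholders.

```js
import * as aq from "arquero";
import { op } from "arquero";
import { tableFromIPC } from "apache-arrow";
import initWasm, { readParquet } from "parquet-wasm";

// Fetch the Parquet file (placeholder URL) and read it into a byte array
const resp = await fetch("https://example.com/bandcamp.parquet");
const bytes = new Uint8Array(await resp.arrayBuffer());

// Initialise parquet-wasm, decode Parquet -> Arrow IPC, then wrap in Arquero
await initWasm();
const ipc = readParquet(bytes).intoIPCStream();
const dt = aq.fromArrow(tableFromIPC(ipc));

// Example analytical query: total sales per country (column names assumed)
const byCountry = dt
  .groupby("country")
  .rollup({ sales: (d) => op.count() })
  .orderby(aq.desc("sales"));

byCountry.print();
```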

While the API is convenient for analytical processing, it is not a database, and adding, removing, or updating data is a hassle and definitely not as performant.4 As such, the benchmark skips the transactional queries and focuses on analytical queries.

Danfo.js

I was interested in including Danfo.js as it supports TensorFlow.js out of the box. I thought it would be interesting to see how a GPU-accelerated framework might perform on the benchmark, but on a closer look at the source code, the core data structures and transformations are handled in plain JS.5 Nevertheless, it's still a good library to include as it is a popular pandas-esque data frame library implemented in JS.

While writing the queries, I had to fall back to regular JS for some of the operations, as the library does not implement the full set of pandas APIs (e.g. transform). Similar to Arquero, there is no efficient API to add, remove, or update rows, so the benchmark skips those tests.
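For comparison, a minimal sketch of the Danfo.js path might look like the following; the URL and column names are illustrative rather than the benchmark's actual code.

```js
import * as dfd from "danfojs";

// Read the (gzipped) CSV into a DataFrame; Danfo uses Papa Parse under the hood
const df = await dfd.readCSV("https://example.com/bandcamp.csv");

// A pandas-esque groupby/agg: total revenue per country (column names assumed)
const byCountry = df
  .groupby(["country"])
  .agg({ amount_paid_usd: "sum" });

byCountry.print();
```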

SQLite WASM

SQLite WASM is a port of SQLite to WebAssembly. SQLite probably needs no introduction, being one of the most widely adopted in-process, self-contained database engines. There are multiple compilations out in the wild each with varying capabilities.6 This benchmark is based on the official release.

Other than benchmarking the in-memory WASM library, I have also included three variants of interest: one with additional indexes defined, one that stores the database in the Origin Private File System (OPFS), and one that stores the database in OPFS with Sync Access Handles (OPFS-SAH). The latter two are interesting as they allow the database to persist in a file system specially carved out for application storage as part of the new file system standard.

From the previous wa-sqlite benchmarks, the OPFS approaches seem to be more performant than IndexedDB, an older web standard that is another option for persisting the database.
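For reference, here is a rough sketch of how the three storage variants are opened with the official SQLite WASM build (the oo1 API). This is a simplified illustration: the OPFS variants need to run in a worker (and, for the plain OPFS VFS, with cross-origin isolation), and the file and column names are placeholders.

```js
import sqlite3InitModule from "@sqlite.org/sqlite-wasm";

const sqlite3 = await sqlite3InitModule();

// 1. Plain in-memory database
const memDb = new sqlite3.oo1.DB(":memory:");

// 2. Database persisted in the Origin Private File System (OPFS VFS)
const opfsDb = new sqlite3.oo1.OpfsDb("bandcamp.sqlite3");

// 3. OPFS with Sync Access Handles via the opfs-sahpool VFS
const poolUtil = await sqlite3.installOpfsSAHPoolVfs({});
const sahDb = new poolUtil.OpfsSAHPoolDb("bandcamp.sqlite3");

// Queries use the same API regardless of the backing storage
memDb.exec(
  "CREATE TABLE IF NOT EXISTS sales (column00 INTEGER PRIMARY KEY, country TEXT)"
);
```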

DuckDB

DuckDB WASM is a relatively new in-process database that excels at analytical workloads. While traditional database systems like SQLite, Postgres, and MySQL are row-oriented stores optimized for transactional workloads, DuckDB is a columnar store that accelerates analytical workloads with its vectorized execution engine.7 As a columnar store, it has zero-copy integration between its internal format and Apache Arrow, as well as out-of-the-box functions for reading and writing Parquet files.

Other than the default DuckDB WASM in-memory library, I have included a benchmark of the library using the HTTP File System (HTTPFS) extension, which allows reading and writing remote files. For Parquet files, DuckDB uses the file metadata and HTTP range requests to download only the parts required by the query. In this variant, we skip the "Fetch" step and define a CREATE VIEW in the load step. Thus, there is no large up-front network load, but each query sends multiple network requests to fetch the required data. This is intended to benchmark DuckDB WASM as a remote query engine for Parquet files.
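A sketch of this HTTPFS-style setup with DuckDB-Wasm is shown below, assuming the data is exposed as a remote Parquet file; the URL and column names are placeholders, and the real benchmark code lives in the repository.

```js
import * as duckdb from "@duckdb/duckdb-wasm";

// Pick a bundle appropriate for the current browser
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());

// Wrap the worker script in a same-origin blob URL (as in the duckdb-wasm docs)
const workerUrl = URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: "text/javascript" })
);
const worker = new Worker(workerUrl);
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
URL.revokeObjectURL(workerUrl);

const conn = await db.connect();

// Define a view over the remote Parquet file; subsequent queries issue HTTP
// range requests and only download the row groups they actually need
await conn.query(`
  CREATE VIEW sales AS
  SELECT * FROM read_parquet('https://example.com/bandcamp.parquet')
`);

const res = await conn.query(
  "SELECT count(*) AS n, sum(amount_paid_usd) AS revenue FROM sales"
);
console.table(res.toArray());
```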

Dataset and Queries

To benchmark the libraries discussed above, I run 9 different queries on the 1,000,000 Bandcamp sales dataset. The dataset contains 1,000,000 items with 24 fields from Bandcamp's sales feed between 2020-09-09 and 2020-10-02.

Since each library requires different import formats, I converted the raw CSV file to the required formats before serving them from Cloudflare's R2 storage. The raw CSV file is 301MB uncompressed. The gzipped CSV is 81MB and is used by Danfo.js. The zstd-compressed Parquet file is 74MB and is used by both Arquero and DuckDB, while the SQLite tests fetch and load a 100MB gzipped database file.

The queries are a mix of transactional and analytical queries. The fetch and load times are included as a reference, but they should be interpreted carefully as they depend on network speed and caching behaviour. There are 9 queries in total. The first 3 are analytical queries in increasing order of difficulty:

  1. Obtain the overall count, mean, and total sales across the entire dataset
  2. Obtain the total sales and revenue by day (some transformation is required to convert the Unix timestamp to a date)
  3. Window query where we obtain the top 5 countries by sales for each item / slug type combination (sketched below)
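As an illustration, the window query (query 3) in its SQL form looks roughly like the following, assuming the data has been loaded into a table or view named sales; `conn.query` stands in for whichever library's execute call is used, and the exact SQL in the benchmark may differ slightly.

```js
// Query 3: top 5 countries by sales for each item type / slug type combination
const top5 = await conn.query(`
  SELECT item_type, slug_type, country, sales
  FROM (
    SELECT
      item_type,
      slug_type,
      country,
      count(*) AS sales,
      row_number() OVER (
        PARTITION BY item_type, slug_type
        ORDER BY count(*) DESC
      ) AS rn
    FROM sales
    GROUP BY item_type, slug_type, country
  ) AS ranked
  WHERE rn <= 5
`);
```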

The next 3 queries are intended to benchmark random row reads:

  1. Select 10 random rows
  2. Create an index on column00 (where applicable)
  3. Select 1000 random rows on the indexed column

The final 3 queries are intended to benchmark update, insert, and delete operations:

  1. Update 1000 random rows
  2. Insert 1000 random rows
  3. Delete 1000 random rows

Note that for queries 4, 6, 7, 8, and 9, the operations are conducted sequentially and are not batched (see the sketch below). For the full source code, check out the GitHub repository.
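To be explicit about what "sequential" means here, the row-level tests issue one statement per row rather than a single bulk statement, roughly like the sketch below; runQuery is a hypothetical stand-in for the library-specific execute call, and the table and column names are illustrative.

```js
// Hypothetical helper: runQuery(sql) stands in for db.exec / conn.query, etc.
// e.g. test 7 (update 1000 random rows): one statement per row, awaited in sequence
const randomIds = Array.from({ length: 1000 }, () =>
  Math.floor(Math.random() * 1_000_000)
);

for (const id of randomIds) {
  await runQuery(
    `UPDATE sales SET amount_paid_usd = amount_paid_usd + 1 WHERE column00 = ${id}`
  );
}
```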

Results

I have included two sets of results - one from an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz Windows laptop running Chrome 116, and the other from an Apple M2 MacBook Air on Firefox 117.

11th Gen Intel(R) Core(TM) i7-1165G7 @2.80GHz Windows and Chrome 116

| Test | arquero | danfo | sqlite | sqlite (indexed) | sqlite (OPFS) | sqlite (OPFS + SAH) | duckdb | duckdb (HttpFS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fetch data | 3.009 | 16.86 | 2.661 | 2.483 | 2.438 | 4.951 | 1.508 | n/a |
| Load data | 2.866 | n/a | 0.893 | 3.907 | 0.832 | 2.089 | 4.309 | 0.463 |
| Test 1: SELECT top level metrics - overall count, mean and total sales | 0.067 | 0.193 | 0.376 | 0.103 | 2.402 | 0.72 | 0.014 | 0.859 |
| Test 2: SELECT group by day and count daily sales and total revenue | 1.05 | 4.068 | 0.638 | 0.005 | 2.603 | 1.181 | 0.163 | 1.648 |
| Test 3: SELECT for each item type, slug type combination the top 5 countries by overall counts | 4.847 | 3.413 | 1.432 | 0.165 | 3.311 | 1.938 | 0.114 | 1.477 |
| Test 4: SELECT 10 random rows | 0.517 | 1.665 | 0.991 | 0.002 | 12.033 | 2.412 | 0.032 | 7.325 |
| Test 5: CREATE an index | n/a | n/a | 0.573 | n/a | 2.51 | 0.86 | 0.24 | n/a |
| Test 6: SELECT 1000 random rows with an index | n/a | n/a | 0.054 | 0.065 | 3.795 | 0.1 | 1.048 | n/a |
| Test 7: UPDATE 2 fields in 1000 rows with an index | n/a | n/a | 0.038 | 0.062 | 42.316 | 16.411 | 0.588 | n/a |
| Test 8: INSERT 1000 rows with an index | n/a | n/a | 0.041 | 0.078 | 51.042 | 15.851 | 1.397 | n/a |
| Test 9: DELETE 1000 rows with an index | n/a | n/a | 0.035 | 0.064 | 48.147 | 15.546 | 2.376 | n/a |

Apple M2 MacBook Air and Firefox 117

| Test | arquero | danfo | sqlite | sqlite (indexed) | sqlite (OPFS) | sqlite (OPFS + SAH) | duckdb | duckdb (HttpFS) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Fetch data | 2.257 | 9.487 | 2.847 | 1.968 | 1.498 | 1.221 | 1.084 | n/a |
| Load data | 1.707 | n/a | 0.206 | 2.788 | 0.12 | 0.69 | 4.081 | 0.303 |
| Test 1: SELECT top level metrics - overall count, mean and total sales | 0.051 | 0.082 | 0.259 | 0.074 | 1.846 | 1.091 | 0.007 | 0.554 |
| Test 2: SELECT group by day and count daily sales and total revenue | 0.634 | 2.73 | 0.476 | 0.001 | 2.185 | 1.35 | 0.169 | 1.123 |
| Test 3: SELECT for each item type, slug type combination the top 5 countries by overall counts | 0.852 | 3.53 | 1.025 | 0.119 | 2.725 | 1.909 | 0.132 | 0.903 |
| Test 4: SELECT 10 random rows | 0.374 | 1.364 | 0.76 | 0.001 | 10.292 | 5.691 | 0.02 | 4.356 |
| Test 5: CREATE an index | n/a | n/a | 0.415 | n/a | 2.101 | 1.344 | 0.207 | n/a |
| Test 6: SELECT 1000 random rows with an index | n/a | n/a | 0.02 | 0.025 | 0.923 | 0.089 | 0.748 | n/a |
| Test 7: UPDATE 2 fields in 1000 rows with an index | n/a | n/a | 0.021 | 0.034 | 10.986 | 0.673 | 0.291 | n/a |
| Test 8: INSERT 1000 rows with an index | n/a | n/a | 0.025 | 0.043 | 14.697 | 0.735 | 0.55 | n/a |
| Test 9: DELETE 1000 rows with an index | n/a | n/a | 0.019 | 0.036 | 13.757 | 0.717 | 0.576 | n/a |

As it is a browser-based benchmark, feel free to try it out yourself and run it on your own machine. It would be interesting to see how it fares across various devices and browsers.

Takeaways

  1. Having a specialized format to read and write data significantly reduces the fetch and load time. After deserializing, loading data for both Parquet and SQLite works at the memory buffer level and is >2x faster than Danfo.js, which uses Papa Parse to parse and type-infer the CSV.
  2. As a Swiss Army knife of data processing, in-memory DuckDB takes the cake. Without any additional indexes, it performs analytical queries faster than the other libraries. Transactional queries are also reasonably performant and above my expectations for a columnar store.
  3. SQLite excels in transactional queries (no surprises there). If the queries are known in advance, it can even outperform DuckDB on analytical tasks by creating the appropriate indexes. This requires a bit more effort to benchmark and test the queries.
  4. For a single-user browser application, the OPFS-SAH approach is very attractive as it maintains full transactional capabilities while being reasonably performant. As highlighted to me by a user in the DuckDB Discord group, on Firefox the read and write performance of OPFS-SAH is even comparable with in-memory DuckDB (in Chrome, it's only about 2x faster than the OPFS benchmark). Due to an existing bug in Safari, the OPFS-SAH approach fails to run, though the bug is expected to be fixed in Safari 17. It's still a relatively new spec, so I expect performance and support to converge across browsers over time, but it's definitely a promising approach.
  5. There's still quite a bit of data moved for DuckDB in HTTPFS mode. This might be improved if additional indexes are defined.

Conclusion

At the end of the day, the choice of which library to use is a matter of trade-offs:

Working on a small dataset with mostly analytical queries? Arquero at 105kB might be a decent choice, though reading and writing Parquet would require an additional, heavier dependency, e.g. Parquet-WASM (1.3MB).

For everything else, the SQL frameworks provide superior performance, developer experience (it's just SQL), and are packed with features. The usual differentiating factor of transactional vs analytical workloads probably does not matter as much here, and the choice likely depends on whether there's a need (or requirement) to work with Parquet files, other file system requirements, and ecosystem tooling. DuckDB packs in a whole bunch of modern processing capabilities, with many nice-to-have features at the cost of a larger bundle size (3.5MB WASM), while SQLite WASM manages to squeeze all its goodies into a 400kB WASM package.

Footnotes

  1. Not to mention that many of the libraries have differing and confusing syntaxes and are not very performant - hint: pandas.

  2. It's also built on Apache Arrow, so +1 for interoperability.

  3. Probably the first port of SQLite to WASM (2012). For this benchmark, we use the "official" SQLite WASM release.

  4. Here's a notebook on row manipulation with Arquero from the authors showing that it is possible to update, insert, and delete rows via joins.

  5. I am still looking for a GPU accelerated data processing library in JS, similar to cuDF or one that is built on GPU.js. If you know of any, feel free to drop a note on the project's Github page.

  6. The SQLite WASM homepage includes a list of the different versions.

  7. In the H2O benchmarks, DuckDB ranks as one of the top performers alongside other top OLAP databases and libraries like ClickHouse and Polars.