Underpinnings of our Approach

3. Underpinnings of our Approach
1. 3.1. Embedded Databases
2. 3.2. File System Performance

3. Underpinnings of our Approach

This section presents the fundamental concepts of our approach as shown in Figure 1.

Figure 1. Underpinnings of our approach

3.1. Embedded Databases

Embedded databases are databases that are included in an application rather than as a separate server [12]. Since they are embedded in the application itself, they do not require a separate server or internet access to use. Embedded databases are meant for scenarios where an application is storing data that is only needed on the local machine.

Some embedded databases, such as SQLite, are full relational databases. However, simpler NoSQL alternatives have become popular such as key-value databases. Keyvalue databases only allow efficient access of each entry by a unique key, and do not support more complex queries. While this is restrictive, it is sufficient for many use cases and allows the database to be much simpler and optimized.

SQLite3¹ is one of the most popular embedded databases. SQLite3 is an open source C library that implements an embedded SQL database engine. It it used extensively in mobile devices, Python apps, and browsers, among many more applications². While SQLite is a fully featured SQL database, for our purposes we treat it as keyvalue store by just having one table with a primary key.

LevelDB³ is a open-source, key-value embedded database sponsored by Google. As a key-value database, it maps arbitrary binary string keys and values and does not support SQL queries. LevelDB offers features such as high performance, compression of data, and “snapshots” to provide consistent read-only views of the data. LevelDB is used in many applications including Chrome’s IndexedDB⁴ , Bitcoin Core⁵ , and Minecraft Bedrock Edition⁶.

¹. https://www.sqlite.org ². https://www.sqlite.org/mostdeployed.html ³. https://github.com/google/leveldb ⁴.https://chromium.googlesource.com/chromium/src/+/a77a9a1/thirdparty/blink/renderer/modules/indexeddb/docs/idbdatapath.md ⁵. https://bitcoindev.network/understanding-the-data ⁶. https://github.com/Mojang/leveldb-mcpe

RocksDB⁷ is another open-source, key-value embedded database. RocksDB is sponsored by Facebook, and is used in much of their infrastructure. RocksDB actually started as a fork of LevelDB, and has a similar API. Facebook found that LevelDB didn’t perform well with massive parallelism or datasets larger than RAM, so they created RocksDB with a focus on scaling to large applications.

Berkeley DB⁸ is an open source embedded database provided by Oracle. Berkeley DB advertises itself as “a family of embedded key-value database libraries that provides scalable, high-performance data management services to applications.” Berkeley DB is very feature rich, and encompasses multiple different database technologies. Our benchmark makes use of the efficient key-value API, but Berkeley DB can also create relational SQL databases, and use multiple different index types, among many other features. Some programs that use Berkeley DB are Bogofilter, Citadel, and SpamAssassin.

3.2. File System Performance

File systems can have issues when handling large numbers of small files. For instance, most modern file systems, including FAT, ext3, and ext4, allocate space for files in a unit of a cluster or block, no matter how small the file is. On large files this is inconsequential, but if there is a very large number of files smaller than the cluster size it can lead to a large amount of space being wasted [1]. Directory lookup can also take a performance hit if a large number of files are directly under a single folder [2].

This file system performance bottleneck has led many applications, including browsers and web caches, to use “nested hash file structures” with files placed into intermediate sub-directories based on their hash [13]. This avoids placing too many files directly under a single folder and can improve performance. As seen in Figure 3, the top level directory may contain folders 00 through ff, where each sub-directory contains files whose hash digest starts with those characters. Files would be spread evenly across all 256 sub-directories. This process can be repeated for as many levels of nested as is required to get a reasonable number of files in the “leaf” directories.

3. Underpinnings of our Approach

3.1. Embedded Databases

3.2. File System Performance

Table of Contents