Main Memory vs. RAM-Disk Databases:
A Linux-based Benchmark
Abstract:
It stands to reason that accessing data from memory will be faster than from
physical media. A new type of database management system, the main memory database
(MMDB), claims breakthrough performance and availability via memory-only
processing. But doesn't database caching achieve the same result? And if complete
elimination of disk access is the goal, why not deploy a traditional database on a RAM-
disk, which creates a file system in memory?
This paper tests the eXtremeDB main memory database against the db.linux embedded
database in both traditional (disk-based) and RAM-disk modes, running on Red Hat
Linux 6.2. Deployment in RAM cuts db.linux's execution time by as much as 74 percent.
But even then, the traditional database lags the MMDB. Fundamental architectural
differences explain the disparity. Overhead hard-wired into disk-based databases includes
data transfer and duplication, unneeded recovery functions and, ironically, caching logic
intended to avoid disk access.
McObject LLC
22525 SE 64th Place
Suite 302
Issaquah, WA 98027
Phone: 425-831-5964
E-mail:
info@mcobject.com
www.mcobject.com
Copyright 2001, McObject LLC
Introduction
It makes sense that maintaining data in memory, rather than retrieving it from disk, will
improve application performance. After all, disk access is one of the few mechanical (as opposed to electronic) functions integral to processing, and suffers from the slowness of
moving parts. On the software side, disk access also involves a "system call" that is
relatively expensive in terms of performance. The desire to improve performance by
avoiding disk access is the fundamental rationale for database management system
(DBMS) caching and file system caching methods.
This concept has been extended recently with a new type of DBMS, designed to reside
entirely in memory. Proponents of these main memory databases (MMDBs) point to
groundbreaking improvements in database speed and availability, and claim the
technology represents an important step forward in both data management and real-time
systems.
But this raises a seemingly obvious question: since caching is available, why not extend its
use to cache entire databases to realize desired performance gains? In addition, RAM-
drive utilities exist to create file systems in memory. Deploying a traditional database on
such a RAM-disk eliminates physical disk access entirely. Shouldn't its performance
equal the main memory database?
This white paper tests the theory. Two nearly identical database structures and
applications are developed to measure performance in reading and writing 30,000
records. The main difference is that one database, eXtremeDB, is an MMDB, and the
other, db.linux, is designed for disk storage. The result: while RAM-drive deployment
makes the disk-based database significantly faster, it cannot approach the main memory
database performance. The sections below present the comparison and explain how
caching, data transfer and other overhead sources inherent in a disk-based database (even
on a RAM-drive) cause the performance disparity.
The Emergence of Main Memory Databases
Main memory databases are relative newcomers to database management. The
technology first arose to enhance business application performance and to cache Web
commerce sites for handling peak traffic. In keeping with this enterprise focus, the initial
MMDBs were similar to conventional SQL/relational databases, stripped of certain
functionality and stored entirely in main memory.
Another new focus for database technology is embedded systems development.
Increasingly, developers of network switches and routers, set-top boxes, consumer
electronics and other hardware devices turn to commercial databases to support new
features. Main memory databases have emerged to serve this market segment, delivering
the required real-time performance along with additional benefits such as exceptional
frugality in RAM and CPU resource consumption, and tight integration with embedded
systems developers' preferred third-generation programming languages (C/C++ and
Java).
The Comparison: eXtremeDB vs. db.linux
McObject's eXtremeDB is the first main memory database created for the embedded
systems market. This DBMS is similar to disk-based embedded databases, such as
db.linux, Berkeley DB, Empress, C-tree and others, in that all are intended for use by
application developers to provide database management functionality from within an
application. They are "embedded" in the application, as opposed to being a separately
administered server like Microsoft SQL Server, DB2 or Oracle. Each also has a
relatively small footprint when compared to enterprise class databases, and offers a
navigational API for precise control over database operations.
This paper compares eXtremeDB to db.linux, a disk-based embedded database. The open
source db.linux DBMS was chosen because of its longevity (first released in 1986 under
the name db_VISTA) and wide usage. eXtremeDB and db.linux also have similar
database definition languages.
The tests were performed on a PC running Red Hat Linux 6.2, with a 400 MHz Intel
Celeron processor and 128 megabytes of RAM.
Database Design
The following simple database schema was developed to compare the two databases'
performance writing 30,000 similar objects to a database and reading them back via a
key.
/**********************************************************
 *                                                         *
 *  Copyright (c) 2001 McObject, LLC. All Rights Reserved. *
 *                                                         *
 **********************************************************/

#define int1  signed<1>
#define int2  signed<2>
#define int4  signed<4>
#define uint4 unsigned<4>
#define uint2 unsigned<2>
#define uint1 unsigned<1>

declare database mcs[1000000];

struct stuff {
    int2 a;
};

class Measure {
    uint4  sensor_id;
    uint4  timestamp;
    string spectra;
    stuff  thing;

    tree <sensor_id, timestamp> sensors;
};
Figure 1. eXtremeDB schema
/**********************************************************
 *                                                         *
 *  Copyright (c) 2001 McObject, LLC. All Rights Reserved. *
 *                                                         *
 **********************************************************/

struct stuff {
    short a;
};

database mcs [8192]
{
    data file "mcs.dat" contains Measure;
    key file "mcs.key" contains sensors;

    record Measure
    {
        long sensor_id;
        long m_timestamp;
        char spectra[1000];
        struct stuff thing;

        compound key sensors {
            sensor_id;
            m_timestamp;
        }
    }
}
Figure 2. db.linux schema
The only meaningful difference between the two schemas is the field `spectra'. In the case of eXtremeDB it is defined as a `string' type, whereas with db.linux it is defined as char[1000]. The db.linux implementation will consume 1,000 bytes for the spectra field regardless of how many bytes are actually stored in it. In eXtremeDB, a string is a variable-length field. db.linux has no direct equivalent to the eXtremeDB string type, though there is a technique that uses db.linux network model sets to emulate variable-length fields with varying degrees of granularity (trading performance for space efficiency). Doing so, however, would have caused significant differences in the two sets of implementation code, making a side-by-side comparison more difficult. eXtremeDB has a fixed-length character data type; however, the variable-length field was used for the comparison because it is the data type explicitly designed for this task.
(An interesting exercise for the reader may be to alter the eXtremeDB implementation to
use a char[1000] type for spectra, and to alter the db.linux implementation to employ the
variable length field implementation. The pseudo-code for implementing this is shown in
Appendix A).
Benchmark Application
The first half of the test application populates the database with 30,000 instances of the
`Measure' class/record.
The eXtremeDB implementation allocates memory for the database, allocates memory
for randomized strings, opens the database, and establishes a connection to it.
void *start_mem = malloc( DBSIZE );
if ( !start_mem ) {
    printf( "\nToo bad ..." );
    exit( 1 );
}

make_strings();

rc = mco_db_open( dbName, mcs_get_dictionary(), start_mem,
                  DBSIZE, (uint2) PAGESIZE );
if ( rc ) {
    printf( "\nerror creating database" );
    exit( 1 );
}

/* connect to the database, obtain a database handle */
mco_db_connect( dbName, &db );
Figure 3. eXtremeDB startup implementation
The db.linux implementation allocates memory for randomized strings, initializes a
DB_TASK structure, and opens the database in the "s" shared mode that enables multi-
threaded access and requires transactions for assured data integrity.
make_strings();
stat = d_opentask(&task);
if((stat = d_dbuserid("rdmtest", &task))) return;
if((stat = d_open("mcs", "s", &task))) return;
Figure 4. db.linux startup implementation
From this point, both implementations enter two loops: 100 iterations for the outer loop,
300 iterations for the inner loop (total 30,000).
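The listings in this paper omit the timing code. A minimal harness along the following lines (a sketch, with run_writes() as a hypothetical stand-in for the nested write loops shown below) is enough to capture the elapsed seconds reported in Figures 9 and 10; the read phase is timed the same way.

#include <stdio.h>
#include <sys/time.h>

/* Hypothetical stand-in for the nested loops of Figure 5 or Figure 6. */
static void run_writes(void)
{
    /* ... 100 x 300 = 30,000 insertions ... */
}

static double elapsed_seconds(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) +
           (end->tv_usec - start->tv_usec) / 1000000.0;
}

int main(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    run_writes();
    gettimeofday(&t1, NULL);

    printf("write phase: %.2f seconds\n", elapsed_seconds(&t0, &t1));
    return 0;
}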
To add a record to eXtremeDB, a write transaction is started and space is reserved for a new object in the database (Measure_new). Then the sensor_id and timestamp fields are put to the object, a random string is taken from the pool created earlier and put to the object, and the transaction is committed.
for ( sensor_num = 0; sensor_num < SENSORS; sensor_num++ ) {
    for ( measure_num = 0; measure_num < MEASURES; measure_num++ ) {
        mco_trans_start( db, MCO_READ_WRITE, MCO_TRANS_FOREGROUND, &t );
        rc = Measure_new( t, &measure );
        if ( MCO_S_OK == rc ) {
            Measure_sensor_id_put( &measure, (uint4) sensor_num );
            Measure_timestamp_put( &measure, sensor_num + measure_num );
            get_random_string( str );
            Measure_spectra_put( &measure, str, (uint2) strlen(str) );
            rc = mco_trans_commit( t );
            if ( rc != 0 )
                goto rep1;
        }
        else {
            mco_trans_rollback( t );
            printf( "\n\n\tOops, error allocating object: %d\n", rc );
            goto rep1;
        }
    }
    putchar( '.' );
}
Figure 5. eXtremeDB `write' implementation
In the db.linux implementation a transaction is started, requiring a write-lock. The code
next assigns values to a local structure for sensor_id and timestamp, copies a random
string from the pool of strings created earlier, writes the record to the database
(d_fillnew), and commits the transaction.
for( sensor_num = 0; sensor_num < NSENSORS; sensor_num++ ) {
    for( measure_num = 0; measure_num < NMEASURES; measure_num++ ) {
        if((stat = d_trbegin( "tid", &task )))
            break;
        if((stat = d_reclock( MEASURE, "w", &task, CURR_DB )))
            break;

        mr.sensor_id = sensor_num;
        mr.m_timestamp = measure_num + sensor_num;
        get_random_string( &mr.spectra[0] );

        if((stat = d_fillnew( MEASURE, &mr, &task, CURR_DB )))
            break;

        if( stat == S_OKAY ) {
            if((stat = d_trend( &task )))
                break;
        } else if((stat = d_trabort( &task ))) {
            break;
        }
        putchar( '.' );
    }
}
Figure 6. db.linux `write' implementation
Because eXtremeDB is a multi-threaded database, all database operations, including read
access, are carried out within the scope of a transaction, so there is no need to specify the
open-mode when opening the database. In contrast, db.linux has distinct single-user (so-called one-user) and multi-user modes. Transactions are optional in the db.linux one-user
mode, but required with the multi-user mode in order to ensure multi-user cache
consistency.
A second pair of nested loops is set up to conduct the performance evaluation of reading
the 30,000 objects previously created.
for ( sensor_num = 0; sensor_num < SENSORS; sensor_num++ ) {
    uint2 len;

    for ( measure_num = 0; measure_num < MEASURES; measure_num++ ) {
        mco_trans_start( db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND, &t );
        rc = Measure_sensors_index_cursor( t, &csr );
        rc = Measure_sensors_find( t, &csr, MCO_EQ, sensor_num,
                                   sensor_num + measure_num );
        if ( rc != 0 ) {
            rc = mco_trans_commit( t );
            goto rep2;
        }
        rc = Measure_from_cursor( t, &csr, &measure );

        /* read the spectra */
        rc = Measure_spectra_get( &measure, str, sizeof(str), &len );
        rc = Measure_sensor_id_get( &measure, &id );
        rc = Measure_timestamp_get( &measure, &ts );

        rc = mco_trans_commit( t );
    }
}
Figure 7. eXtremeDB `read' implementation
The eXtremeDB implementation sets up the loops and, for each iteration, starts a read
transaction, instantiates a cursor, and finds the Measure object by its key fields. Upon
successfully finding the object, an object handle is initialized from the cursor and the
object's fields are read from the object handle. Lastly, the transaction is completed.
for( sensor_num = 0; sensor_num < NSENSORS; sensor_num++ ) {
    for( measure_num = 0; measure_num < NMEASURES; measure_num++ ) {
        mr.sensor_id = sensor_num;
        mr.m_timestamp = measure_num + sensor_num;

        if((stat = d_reclock( MEASURE, "r", &task, CURR_DB )))
            break;
        if((stat = d_keyfind( SENSORS, &mr, &task, CURR_DB )))
            break;
        if((stat = d_recread( &mr, &task, CURR_DB )))
            break;
        if((stat = d_recfree( MEASURE, &task, CURR_DB )))
            break;
    }
}
Figure 8. db.linux `read' implementation
For the db.linux implementation, the two loops are set up and on each iteration, the key
search values are assigned to a structure's fields. db.linux does not use transactions for
read-only access, but requires that the record-type be explicitly locked. Upon
successfully acquiring the record lock, the structure holding the key lookup values is
passed to the d_keyfind function. If the key values are found, the record is read into the
same structure by d_recread and the record lock is released.
As alluded to above, the key implementation differences revolve around transactions and multi-user (multi-threaded) concurrent access. (There is also a philosophical difference between eXtremeDB's object-oriented approach to database access and db.linux's record-oriented approach, but it is unrelated to in-memory versus disk-based databases, so we do not explore it here.)
With eXtremeDB, all concurrency controls are implicit, only requiring that all database
access occur within the scope of a read or write transaction. In contrast, db.linux requires
the application to explicitly acquire read or write record type locks, as appropriate, prior
to attempting to access the record type. Because db.linux requires explicit locking, it
does not require a transaction for read-only access.
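As an illustration of what "implicit" means in practice, the sketch below shows a second reader thread that uses only the calls already seen in the listings above. The handle types (mco_db_h, mco_trans_h) and the one-connection-per-thread pattern are assumptions, since the listings never show the declarations; the point is simply that no explicit lock calls appear.

#include <pthread.h>

static void *reader_thread(void *arg)
{
    const char *dbName = (const char *) arg;
    mco_db_h    db;   /* assumed handle types */
    mco_trans_h t;

    /* each thread obtains its own connection to the shared in-memory database */
    mco_db_connect(dbName, &db);

    /* all access happens inside a transaction; locking is implicit */
    mco_trans_start(db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND, &t);
    /* ... cursor search and field reads as in Figure 7 ... */
    mco_trans_commit(t);

    return NULL;
}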
The following graph depicts the relative performance of eXtremeDB and db.linux in a
multi-threaded, transaction-controlled environment, with db.linux maintaining the
database files on disk, as it naturally does.
eXtremeDB (main memory) vs. db.linux (disk drive)
Elapsed time, in seconds:

                  write      read
    db.linux      3118.25    16.25
    eXtremeDB     2.6        1

Figure 9. eXtremeDB and a disk-bound database
Clearly, processing in main memory led to dramatically better performance for
eXtremeDB. Will deploying db.linux on a RAM-disk allow its performance to equal, or at least approximate, that of an in-memory database?
Figure 10 shows the performance of the same eXtremeDB implementation used above,
alongside db.linux with the database files on a RAM-disk, completely eliminating
physical disk access (for details on the implementation of this RAM-disk on Red Hat
Linux 6.2, see Appendix B).
eXtremeDB (main memory) vs. db.linux (RAM-drive)
Elapsed time, in seconds:

                  write      read
    db.linux      1093       4.2
    eXtremeDB     2.6        1

Figure 10. eXtremeDB and a RAM-disk database
Figure 10 demonstrates that RAM-drive deployment improves db.linux performance by
almost 4X for read access and approximately 3X for writing the database. Clearly,
moving a disk-based database's files to a RAM-drive can improve performance.
However, it is equally obvious that the database fundamentally designed for in-memory
use delivers superior performance. The main memory database still outperforms the
RAM-deployed, disk-based database by 420X for database writing, and by more than 4X
for database reads. The following section analyzes the reasons for this disparity.
Analysis – Where's the Overhead?
The RAM-drive approach eliminates physical disk access. So why does the disk-based
database still lag the main memory database in performance? The problem is that disk-
based databases incorporate processes that are irrelevant for main memory processing,
and the RAM-drive deployment does not change such internal functioning. These
processes "go through the motions" even when no longer needed, adding several distinct
types of performance overhead.
Caching overhead
Due to the significant performance drain of physical disk access, virtually all disk-based
databases incorporate sophisticated techniques to minimize the need to go to disk.
Foremost among these is database caching, which strives to keep the most frequently
used portions of the database in memory. Caching logic includes cache synchronization,
which makes sure that an image of a database page in cache is consistent with the
physical database page on disk, to prevent the application from reading invalid data.
Another process, cache lookup, determines if data requested by the application is in cache
and, if not, retrieves the page and adds it to the cache for future reference. It also selects
data to be removed from cache, to make room for incoming pages. If the outgoing page
is "dirty" (holds one or more modified records), additional logic is invoked to protect
other applications from seeing the modified data until the transaction is committed.
These caching functions present only minor overhead when considered individually, but
present significant overhead in aggregate. Each process plays out every time the
application makes a function call to read a record from disk (in the case of db.linux,
examples are d_recfrst, d_recnext, d_findnm, d_keyfind, etc.). In the demonstration
application above, this amounts to some 90,000 function calls: 30,000 d_fillnew, 30,000
d_keyfind and 30,000 d_recread. In contrast, all records in a main memory database such as eXtremeDB are always in memory, and therefore require no caching logic at all.
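As a rough illustration of the caching logic described above (a generic sketch, not db.linux's actual code), consider what a cache lookup does on every record access. Even on a hit, the probe and the LRU bookkeeping run; on a miss, eviction and dirty-page write-back run as well, and the page I/O still goes through the file system whether the file lives on disk or on a RAM-disk.

#define CACHE_PAGES 128
#define PAGE_SIZE   4096

/* Simplified page-cache frame; a real DBMS cache tracks much more.
   Database page numbers start at 1, so page_no == 0 marks an empty frame. */
typedef struct {
    long          page_no;
    int           dirty;        /* modified since it was read in? */
    unsigned long last_used;    /* for LRU eviction               */
    char          data[PAGE_SIZE];
} page_t;

static page_t        cache[CACHE_PAGES];
static unsigned long clock_tick;

/* Stand-ins for the layer that moves pages through the file system.
   On a RAM-disk these calls still occur; they are merely faster. */
static void fs_read_page(long page_no, char *buf)        { (void) page_no; (void) buf; }
static void fs_write_page(long page_no, const char *buf) { (void) page_no; (void) buf; }

static page_t *cache_lookup(long page_no)
{
    int i, victim = 0;

    /* 1. Probe the cache for the requested page (hit path). */
    for (i = 0; i < CACHE_PAGES; i++) {
        if (cache[i].page_no == page_no) {
            cache[i].last_used = ++clock_tick;      /* LRU bookkeeping */
            return &cache[i];
        }
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;
    }

    /* 2. Miss: evict the least recently used frame, writing it back
       first if it is dirty, then read the wanted page into it.      */
    if (cache[victim].dirty)
        fs_write_page(cache[victim].page_no, cache[victim].data);
    fs_read_page(page_no, cache[victim].data);
    cache[victim].page_no   = page_no;
    cache[victim].dirty     = 0;
    cache[victim].last_used = ++clock_tick;
    return &cache[victim];
}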
Transaction Processing Overhead
Transaction processing logic is a major source of processing latency. In the event of a
catastrophic failure such as loss of power, a disk-based database recovers by committing
or rolling back complete or partial transactions from one or more log files when the
system is restarted. Disk-based databases are hard-wired to keep transaction logs, and to
flush transaction log files and cache to disk after the transactions are committed. A disk-
based database doesn't know that it is running in a RAM-drive, and this complicated
processing continues, even when the log file exists only in memory and cannot aid in
recovery should system failure occur.
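A generic sketch of the commit-time logging path (illustrative only, not db.linux's actual code) shows where the time goes: the log record is formatted, copied, written through the file system and flushed with fsync(), and every one of those steps still executes when the log file sits on a RAM-disk.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>

/* Generic write-ahead-log commit path. Every step below runs for every
   committed transaction, whether "mcs.log" lives on disk or on a RAM-disk. */
static int log_fd = -1;

static int log_commit_record(const void *image, size_t len)
{
    char header[16];

    if (log_fd < 0) {
        log_fd = open("mcs.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log_fd < 0)
            return -1;
    }

    /* format and write the log record: one more copy of the data */
    memcpy(header, "TXN", 4);
    memcpy(header + 4, &len, sizeof len);
    if (write(log_fd, header, sizeof header) != (ssize_t) sizeof header ||
        write(log_fd, image, len) != (ssize_t) len)
        return -1;

    /* force the record to "stable" storage: a system call that buys
       nothing when the storage itself is volatile RAM               */
    return fsync(log_fd);
}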
Main memory databases must also provide transactional integrity, or so-called ACID
compliant transactions. In plain English, a main memory database application thread
must be able to commit or abort a series of updates as a single unit. To do this,
eXtremeDB maintains a before-image of the objects that are updated or deleted, and a list
of database pages added during a transaction. When the application commits the
transaction, the memory for the before-images and page references is returned to the memory pool (a very fast and efficient process). If an in-memory database must abort a transaction (for example, if the in-bound data stream is interrupted), the before-images are returned to the database and the newly inserted pages are returned to the memory pool.
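A minimal sketch of the before-image scheme just described (an illustration, not eXtremeDB's actual internals): objects are updated in place after their original bytes are saved, so commit merely releases the saved copies, while abort copies them back.

#include <stdlib.h>
#include <string.h>

/* one saved before-image per object touched by the transaction */
typedef struct before_image {
    void                *object;   /* location of the live object  */
    void                *saved;    /* copy taken before the update */
    size_t               size;
    struct before_image *next;
} before_image_t;

typedef struct { before_image_t *images; } transaction_t;

/* called before an object is modified for the first time in a transaction */
static int save_before_image(transaction_t *t, void *object, size_t size)
{
    before_image_t *bi = malloc(sizeof *bi);
    if (!bi) return -1;
    bi->saved = malloc(size);
    if (!bi->saved) { free(bi); return -1; }
    memcpy(bi->saved, object, size);
    bi->object = object;
    bi->size   = size;
    bi->next   = t->images;
    t->images  = bi;
    return 0;
}

/* commit: the updates are already in place; just release the images */
static void commit(transaction_t *t)
{
    before_image_t *bi, *next;
    for (bi = t->images; bi; bi = next) {
        next = bi->next;
        free(bi->saved);
        free(bi);
    }
    t->images = NULL;
}

/* abort: copy every before-image back over the modified object */
static void rollback(transaction_t *t)
{
    before_image_t *bi, *next;
    for (bi = t->images; bi; bi = next) {
        next = bi->next;
        memcpy(bi->object, bi->saved, bi->size);
        free(bi->saved);
        free(bi);
    }
    t->images = NULL;
}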
In the event of catastrophic failure, the in-memory database image is lost, which suits
MMDBs' intended applications. If the system is turned off or some other event causes
the in-memory image to expire, the database is simply re-provisioned upon restart.
Examples of this include a program guide application in a set-top box that is continually
downloaded from a satellite or cable head-end, a network switch that discovers network
topology on startup, or a wireless access point that is provisioned by a server upstream.
This does not preclude the use of saved local data. The application can open a stream (a
socket, pipe, or a file pointer) and instruct eXtremeDB to read or write a database image
from, or to, the stream. This feature could be used to create and maintain boot-stage data,
i.e. an initial starting point for the database. The other end of the stream can be a pipe to
another process, or a file system pointer (any file system, whether it's magnetic, optical,
or FLASH). However, eXtremeDB's transaction processing operates independently from
these capabilities, limiting its scope to main memory processing in order to provide
maximum availability.
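A sketch of how such a stream might be wired up follows. The db_image_save() entry point and its callback signature are placeholders invented for this illustration (the paper does not show eXtremeDB's actual streaming functions); only the application-side writer is meant literally, and the same pattern works for a pipe or a socket instead of a file.

#include <stdio.h>

/* application-supplied writer: the database runtime would call this
   repeatedly with chunks of the in-memory image */
static int file_stream_writer(void *stream, const void *chunk, unsigned size)
{
    return fwrite(chunk, 1, size, (FILE *) stream) == size ? 0 : -1;
}

/* placeholder for the database's save entry point (hypothetical name and
   signature); here it just emits a dummy image so the sketch is complete */
static int db_image_save(int (*writer)(void *, const void *, unsigned), void *stream)
{
    const char fake_image[] = "database image bytes ...";
    return writer(stream, fake_image, sizeof fake_image);
}

/* create a boot-stage image of the database on any file system */
static int save_boot_image(void)
{
    FILE *fp = fopen("mcs.img", "wb");
    int   rc;

    if (!fp)
        return -1;
    rc = db_image_save(file_stream_writer, fp);
    fclose(fp);
    return rc;
}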
Data Transfer Overhead
With a disk-based database, data is transferred and copied extensively. In fact, the
application works with a copy of the data contained in a program variable that is several
times removed from the database. Consider the "handoffs" required for an application to
read a piece of data from the disk-based database, modify it, and write that piece of data
back to the database.
1. The application requests the data item from the database runtime through some database API (e.g. db.linux's d_recread function).
2. The database runtime instructs the file system to retrieve the data from the physical media (or memory-based storage location, in the case of a RAM-disk).
3. The file system makes a copy of the data for its cache and passes another copy to the database.
4. The database keeps one copy in its cache and passes another copy to the application.
5. The application modifies its copy and passes it back to the database through some database API (e.g. db.linux's d_recwrite function).
6. The database runtime copies the modified data item back to database cache.
7. The copy in the database cache is eventually written to the file system, where it is updated in the file system cache.
8. Finally, the data is written back to the physical media (or RAM-disk).
In this scenario there are 4 copies of the data (application copy, database cache, file
system cache, file system) and 6 transfers to move the data from the file system to the
application and back to the file system. And this simplified scenario doesn't account for
additional copies and transfers that are required for transaction logging!
In contrast, a main memory database such as eXtremeDB requires little or no data transfer. The application may make copies of the data in local program variables for its own purposes or convenience, but is not required to by eXtremeDB. Instead, eXtremeDB gives the application a pointer that refers directly to the data item in the database, enabling the application to work with the data directly. The data is still protected because the pointer is only used through the eXtremeDB-provided API, which ensures that it is used properly.
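The difference can be shown with a toy record store (an illustration, not either product's code): the disk-database style of access copies the record into a caller-supplied buffer, after it has already passed through the file system and database caches, while the in-memory style simply returns a pointer into the database's own memory.

#include <string.h>

typedef struct {
    unsigned long sensor_id;
    unsigned long timestamp;
    char          spectra[1000];
} measure_t;

/* toy in-memory "database": an array of records */
static measure_t db_arena[1024];

/* disk-database style: the record is copied out to the caller's buffer
   (and has typically been copied twice already before reaching here)  */
static void read_copy(int i, measure_t *out)
{
    memcpy(out, &db_arena[i], sizeof *out);
}

/* in-memory style: no copy, just a pointer into the database; safety
   comes from touching it only through the database API, inside a
   transaction                                                        */
static const measure_t *read_ref(int i)
{
    return &db_arena[i];
}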
Operating System Dependency
A RAM-disk database still uses the underlying file system to access data within the
database. Therefore, it still relies on the file system function lseek() to locate the data.
Differing implementations of lseek() (for disk file systems as well as RAM disks) will
exhibit better or worse performance based on the quality of the implementation, but the
DBMS has no knowledge of, or control over, this performance factor. In contrast, eXtremeDB has complete control over its access methods, which are highly optimized.
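Schematically (again, not db.linux's actual code), fetching one record through the file system costs at least an lseek() and a read() system call per access, even on a RAM-disk, and the quality of those calls is outside the DBMS's control; an in-memory design replaces them with direct pointer arithmetic. The record size below is purely illustrative.

#include <sys/types.h>
#include <unistd.h>

#define RECORD_SIZE 1016   /* illustrative fixed record size */

/* file-system path: locating and reading a record means two system
   calls per access, RAM-disk or not                                */
static int fetch_record(int fd, long slot, void *buf)
{
    if (lseek(fd, (off_t) slot * RECORD_SIZE, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, RECORD_SIZE) == RECORD_SIZE ? 0 : -1;
}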
db.linux, in particular, is heavily dependent upon inter-process communication (IPC) for
synchronization of concurrent access and transaction log recovery in the event of the
failure of one or more clients, or the failure of the lock manager itself. The quality of the
IPC implementation will impact the performance of db.linux but even the best
implementation represents an area of significant processing overhead. Other embedded
databases may or may not be dependent on inter-process communication.
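To make that cost concrete, here is a generic System V semaphore lock/unlock pair (an illustration of IPC-based synchronization in general, not db.linux's actual lock manager protocol). Each record-type lock and unlock is at least one semop() trip into the kernel, repeated tens of thousands of times in the benchmark above.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* generic System V semaphore lock/unlock: each call is a kernel round
   trip, and a lock-manager-based DBMS makes such a trip (or an
   equivalent message exchange) for every lock it acquires            */
static int lock_record_type(int semid)
{
    struct sembuf op = { 0, -1, SEM_UNDO };   /* P operation */
    return semop(semid, &op, 1);
}

static int unlock_record_type(int semid)
{
    struct sembuf op = { 0, +1, SEM_UNDO };   /* V operation */
    return semop(semid, &op, 1);
}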
Conclusion
This paper confirms two points:
* Deploying a disk-based database on a RAM-drive improves DBMS performance.
* This performance significantly lags that of a main memory database, given an identical application task and processing environment.
The reason boils down to fundamental architectural differences between main memory
databases and traditional databases. Ironically, a major reason for disk-based databases
lagging, even on a RAM-disk, is logic that has been incorporated to avoid disk access, which continues to operate even though it is irrelevant in this setting. Other traditional
database functions, such as sophisticated recovery from catastrophic failure, are similarly
unnecessary in a memory-only environment, but cannot be "turned off" to achieve higher
performance. MMDBs, while perhaps not suited for every application, offer a compelling
alternative when high availability and performance are required.
While not this paper's primary focus, two other benefits of the main memory database
emerge from the experiment above. One is database footprint: the absence of caching
functions and other unnecessary logic means that memory and storage demands are
correspondingly low. In fact, the eXtremeDB database maintained a total RAM footprint
of 108K in this test and 20.85MB when fully loaded with data (the raw data size is
16.7MB), compared to db.linux's footprint of 323K and 31.8MB with data (raw data is
the same, 16.7MB). The second benefit is greater reliability stemming from a less
complex database system architecture. It stands to reason that with fewer interacting
processes, this streamlined database system should result in fewer negative surprises for
end-users and developers.
Appendix A – db.linux variable-length string emulation
To emulate a variable length string field with db.linux, alter the database schema as
follows:
/**********************************************************
 *                                                         *
 *  Copyright (c) 2001 McObject, LLC. All Rights Reserved. *
 *                                                         *
 **********************************************************/

struct stuff {
    short a;
};

database mcs
{
    data file "mcs.dat" contains Measure;
    key file "mcs.key" contains sensors;

    record Measure
    {
        long sensor_id;
        long m_timestamp;
        struct stuff thing;

        compound key sensors {
            sensor_id;
            m_timestamp;
        }
    }

    record Text100 {
        char spectra100[100];
    }
    record Text200 {
        char spectra200[200];
    }
    record Text300 {
        char spectra300[300];
    }

    set Spectra {
        order last;
        owner Measure;
        member Text100;
        member Text200;
        member Text300;
    }
}
When populating the database, the following pseudo-code is used:
d_fillnew( MEASURE )
d_setor( SPECTRA )
char *p = spectra
do {
    if strlen(p) >= sizeof_spectra300
        strncpy( Text300.spectra300, p, sizeof_spectra300 )
        d_fillnew the Text300 record
        d_connect( SPECTRA )
        p += sizeof_spectra300
    else if strlen(p) >= sizeof_spectra200
        strncpy( Text200.spectra200, p, sizeof_spectra200 )
        d_fillnew the Text200 record
        d_connect( SPECTRA )
        p += sizeof_spectra200
    else if strlen(p) >= sizeof_spectra100
        strncpy( Text100.spectra100, p, sizeof_spectra100 )
        d_fillnew the Text100 record
        d_connect( SPECTRA )
        p = NULL
} while (p)
Note: the above pseudo-code is greatly simplified and does not cover all of the border conditions. The general idea is to break off the largest possible piece of the spectra string, store it in the appropriately sized TextNNN record, and link these records together with db.linux's multi-member network model set, named SPECTRA in this example.
When retrieving the data, the linked list is traversed, concatenating the segmented spectra
string back together into the whole:
d_keyfind( MEASURE )
d_recread( MEASURE )
d_setor( SPECTRA )
char spectra[1000];
spectra[0] = '\0';
for( stat = d_findfm(SPECTRA); stat != S_EOS; stat = d_findnm(SPECTRA) ) {
    d_recread( &text300rec )
    strcat( spectra, text300rec.spectra300 )
}
The code to reassemble the string iterates over the set, reading each set member record and concatenating the string segment to the whole. Again, the pseudo-code is simplified to illustrate the primary logic of the variable-length string technique.
Appendix B – RAM-Disk configuration
For the Red Hat Linux 6.2 operating system.
RAM disk setup procedures:
1. Add a line to the /etc/lilo.conf file:
ramdisk=38000
Here's an example of lilo.conf:
boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
image=/boot/vmlinuz-2.2.5-15
label=linux
root=/dev/hda6
read-only
ramdisk=38000
2. Type /sbin/lilo and reboot
3. Create a mount point for the ram disk, for example:
mkdir /tmp/ramdisk0
Make sure to give appropriate access rights to this directory.
4. Create a file system on the block device:
/sbin/mke2fs /dev/ram0
Once the ramdisk is mounted (step 5), running df -k /tmp/ramdisk0 tells you how much space can be used (the file system itself takes some space, too).
5. Mount the ramdisk
mount /dev/ram0 /tmp/ramdisk0
You are set to go.