Object-Oriented Development in RDBMS, Part 1

Vladimir Kotlyarevsky

Thanks and apologies

This article is mostly a compilation of methods that are already well-known, though many times it turned out that I on my own have reinvented a well-known and quite good wheel. I have endeavored to provide readers with links to publications I know of that are discussing the problem. However, if I missed someone’s work in the bibliography, and thus violated copyright, please drop me a message at vlad@contek.ru. I apologize beforehand for possible inconvenience, and promise to add any necessary information to the article.

The sources in the bibliography are listed in the order of their appearance in my mind.

The described database structures have been simplified in order to illustrate the problems under consideration as much clearly as possible, leaving out more unimportant elements. I have tested the viability of all the methods taken from the articles, and of course I have tested all my own methods.

Mixing of object-oriented programming and RDBMS use is always a compromise. I have endeavored to recommend several approaches in which this compromise is minimised for both components. I have also tried to describe both advantages and disadvantages of such a compromise.

I should make it clear that the object approach to database design as described is not appropriate for every task. It is still true for the OOP as a whole, too, no matter what OOP apologists may say! :). I would recommend using it for such tasks as document storage and processing, accounting, etc.

And the last, but not least , I am very thankful to Dmitry Kuzmenko, Alexander Nevsky and other people who helped me in writing this article.

The problem statement

What is the problem?

Present-day relational databases were developed in times when the sun shone brighter, the computers were slower, mathematics was in favour, and OOP had yet to see the light of day. Due to that fact most RDBMSs’ have the following characteristics in common:

  1. Everyone got used to them and felt comfortable with them.
  2. They are quite fast (if you use them according to certain known standards).
  3. They use SQL, which is an easy, comprehensible and time-proved data manipulation method.
  4. They are based upon a strong mathematical theory.
  5. They are convenient for application development - if you develop your applications just like 20-30 years ago.

As you see, almost all these characteristics sound good, except, probably, the last one. Today you can hardly find a software product (in almost any area) consisting of more than few thousand of lines which is written without OOP technologies. OOP languages have been used for a long time for building visual forms, i.e. in UI development. It is also quite usual to apply OOP at the business logic level, if you implement it either on a middle-tier, or on a client. But things fall apart when the deal comes closer to the data storage issues… During the last ten years there were several attempts to develop an object-oriented database system, and, as far as I know, all those attempts were rather far from being successful. The characteristics of an OODBMS are the antithesis of those for an RDBMS. They are unusual and slow; there are no standards for data access and no underlying mathematical theory. Perhaps the OOP developer feels more comfortable with them, although I am not sure…

As a result, everyone continues using RDBMS, combining object-oriented business logic and domain objects with relational access to the database, where these objects are stored.

What do we need?

The thing we need is simple - to develop a set of standard methods that will help us to simplify the process of tailoring the OO-layer of business logic and a relational storage together. In other words, our task is to find out how to store objects in a relational database, and how to implement links between the objects. At the same time we want to keep all the advantages provided by the relational database design and access: speed, flexibility, and the power of relation processing.

RDBMS as an object

First let’s develop a database structure that would be suitable for accomplishing the specified task.

The OID

All objects are unique, and they must be easily identifiable. That is why all the objects stored in the database should have unique ID-keys from a single set (similar to object pointers in run-time). These identifiers are used to link to an object, to load an object into a run-time environment, etc. In the [1] article these identifiers are called OIDs (i.e. Object IDs), in [2] - UINs (Unique Identification Number), or “hyperkey”. Let us call them OIDs, though “hyperkey” is also quite a beautiful word, isn’t it? :) .

First of all, I would like to make a couple of points concerning key uniqueness. Database developers who are used to the classical approach to database design would probably be quite surprised at the idea that sometimes it makes sense to make a table key unique not only within a single table (in terms of OOP - not only within a certain class), but also within the whole database (all classes). However, such strict uniqueness offers important advantages, which will become obvious quite soon. Moreover, it often makes sense to provide complete uniqueness in a Universe, which provides considerable benefits in distributed databases and replication development. At the same time, strict uniqueness of a key within the database does not have any disadvantages. Even in the pure relational model it does not matter whether the surrogate key is unique within a single table or the whole database.

OIDs should never have any real world meaning. In other words, the key should be completely surrogate. I will not list here all the pros and cons of surrogate keys in comparison with natural ones: those who are interested can refer to the [4] article. The simplest explanation is that everything dealing with the real world may change (including the vehicle engine number, network card number, name, passport number, social security card number, and even sex :).

Nobody can change their date of birth - at least not their de facto date of birth. But birth dates are not unique, anyway.)

Remember the maxim “everything that can go bad will go bad” (”consequently, everything that cannot go bad…”. hum!. But let’s not talk about such gloomy things :) ). Changes to some OIDs would immediately lead to changes in all identifiers and links, and thus, as Mr. Scott Ambler wrote [1], could result in a “huge maintenance nightmare.” As for the surrogate key, there is no need to change it, at least in terms of dependency on the changing world.

And what is more, nobody requires run-time pointers to contain some additional information about an object except for a memory address. However, there are some people who vigorously reject usage of surrogates. The most brilliant argument against surrogates I’ve ever heard is that “they conflict with the relational theory”. This statement is quite arguable, since surrogate keys, in some sense, are much closer to that theory than natural ones.

Those who are interested in more strong evidence supporting the use of OIDs with the characteristics described above (pure surrogate, unique at least within the database), should refer to [1], [2], and [4] articles.

The simplest method of OID implementation in a relational database is a field of “integer” type, and a function for generating unique values of this type. In larger or distributed databases, it probably makes sense to use “int64″ or a combination of several integers.

ClassId

All objects stored in a database should have a persistent analogue of RTTI, which must be immediately available through the object identifier. Then, if we know the OID of an object, keeping in mind that it is unique within the database, we can immediately figure out what type the object is. This is the first advantage of OID uniqueness. Such an objective may be accomplished by a ClassId object attribute, which refers to the known types list, which basically is a table (let us call it “CLASSES”). This table may include any kind of information - from just a simple type name to detailed type metadata, necessary for the application domain.

Name, Description, creation_date, change_date, owner

Imagine a file system where user has to remember the handles or the inodes of files instead of their filenames :). It is frequently convenient, though not always necessary, to have an object naming method that is independent of the object type. For example, each object may have a short name and a long name. It is also sometimes convenient to have other object attributes for different purposes, akin to file attributes, such as “creation_date,” “change_date,” and “owner” (the “owner” attribute can be a string containing the owner’s name, as well as a link to an object of the “user” type). The “Deleted” attribute is also necessary as an indicator of the unavailability of an object. The physical deletion of records in a database, full of direct and indirect links, is often a very complicated task, to put it mildly :).

Thus each object has to have mandatory attributes (”OID” and “ClassId”) and desirable attributes (”Name,” “Description,” “creation_date,” “change_date,” and “owner”). Of course, when developing an application system, you can always add extra attributes specific to your particular requirements. This issue will be considered a little later.

The OBJECTS Table

Reading this far leads us to the conclusion that the simplest solution is to store the standard attributes of all objects in a single table. You could support a set of these fixed attributes separately for each type, duplicating 5 or 6 fields in each table, but why? It is not a duplication of information, but it is still a duplication of entities. Besides, a single table would allow us to have a central “well-known” entry point for searching of any object, and that is one of the most important advantages of the described structure.

So let us design this table:

create domain TOID as integer not null;
create table OBJECTS(
   OID TOID primary key,
   ClassId TOID,
   Name varchar(32),
   Description varchar(128),
   Deleted smallint,
   Creation_date timestamp default CURRENT_TIMESTAMP,
   Change_date timestamp default CURRENT_TIMESTAMP ,
   Owner TOID);

The ClassId attribute (the object type identifier) refers to the OBJECTS table, that is, the type description is also an object of a certain type, which has, for example, well-known ClassId = -100. (You need to add a description of the known OID). Then the list of all persistent object types that are known to the system is sampled by a query: select OID, Name, Description from OBJECTS where ClassId = -100).

Each object stored in our database will have a single record in the OBJECTS table referring to it, and hence a corresponding set of attributes. Wherever we see a link to a certain unidentified object, this fact about all objects enables us to find out easily what that object is - using a standard method.

Quite often, only basic information is needed. For instance, we need just a name to display in a lookup combobox. In this case we do not need anything but the OBJECTS table and a standard method of obtaining the name of an object via the link to it. This is the second advantage.

There are also types, simple lookup dictionaries, for example, which do not have any attributes other than those which already are in OBJECTS. Usually it is a short name (code) and a full long name that can easily be stored in the “Name” and “Description” fields of the OBJECTS table. Do you remember how many simple dictionaries are in your accounting system? It is likely, that not less than a half! Thus you may consider that you have implemented half of these dictionaries - that is the third advantage! Why should different dictionaries (entities) be stored in different tables (relations), when they have the same attributes? It does not matter that you were told to do so when you were a student! :)

A simple plain dictionary can be retrieved from the OBJECTS table by the following query

select OID, Name, Description 
   from OBJECTS 
   where ClassId = :LookupId

if you know ClassId for this dictionary element. If you only know the type name, the query becomes a bit more complex:

select OID, Name, Description 
   from OBJECTS 
   where ClassId = (select OID from OBJECTS where ClassId = -100 
   and Name = :LookupName)

Later I will demonstrate how to add a little more intelligence to such simple dictionaries.

Storing of more complex objects

It is clear that some objects in real databases are more complex than those which can be stored in the OBJECTS table. The method for storing them depends on the application domain and the object’s internal structure. Let’s look at three well-known methods of object-relational mapping.

Method 1. The objects are stored just as in a standard relational database, with the type attributes mapped to table attributes. For example, document objects of the “Order” type with such attributes as “order number,” “comments,” “customer,” and “order amount” are stored in the table Orders

create table Orders (
   OID TOID primary key, 
   customer TOID, 
   sum_total NUMERIC(15,2)),

which relates one-to-one to the OBJECTS table by the OID field. The “order number” and “comments” attributes are stored in the “Name” and “Description” fields of the OBJECTS table. “Orders” also refers to “OBJECTS” via the “customer” field, since a customer is also an object, being for example, a part of the “Partners” dictionary. You can retrieve all attributes of “Order” type with the following query:

select o.OID, 
   o.Name as Number, 
   o.Description as Comment, 
   ord.customer, 
   ord.sum_total
   from Objects o, Orders ord
   where o.OID = ord.OID and ord.OID = :id

As you see, everything is simple and usual. You could also create a view “orders_view”, and make everything look as it always did. :)

If an order has a “lines” section, and a real-world order should definitely have such a section, we can create separate table for it, call it e.g. “order_lines”, and relate it with the “Orders” table by a relation 1:M.

create table order_lines (
   id integer not null primary key,
   object_id TOID, /* reference to order object - 1:M relation */
   item_id TOID, /* reference to ordered item */ 
   amount numeric(15,4),
   cost numeric(15,4),
   sum numeric(15,4) computed by (amount*cost))

One very important advantage of this storage method is that it allows you to work with object sets as you would with normal relational tables (which they actually are). All the advantages of the relational approach are present.

Nevertheless, there are two main disadvantages: implementation of the system of object-relational mapping for this method is less than simple, and there are some difficulties in the organization of type inheritance. This method is described in detail in [1] and [3]. These articles also describe the implementation of type inheritance methods in a database.

Method 2. (See. [5]) All object attributes of any type are stored in the form of a record set in a single table. A simple example would look like this:

create table attributes (
   OID TOID, /* link to the master-object of this attribute */
   attribute_id integer not null, 
   value varchar(256), 
   constraint attributes_pk primary key (OID, attribute_id)); 

connected 1:M with OBJECTS by OID. There is also a table

create table class_attributes (
 OID TOID, /*here is a link to a description-object of the type    */
 attribute_id integer not null,
 attribute_name varchar(32),
 attribute_type integer,
 constraint class_attributes_pk primary key (OID,attribute_id))

which describes type metadata - an attribute set (their names and types) for each object type.

All attributes for particular object where the OID is known are retrieved by the query:

select attribute_id, value 
   from attributes
   where OID = :oid

or, with names of attributes

select a.attribute_id, ca.attribute_name a.value 
   from attributes a, class_attributes ca, objects o
   where a.OID = :oid and 
   a.OID = o.OID and 
   o.ClassId = ca.OID and 
   a.attribute_id = ca.attribute_id

In the context of this method, you can also emulate a relational standard. Instead of selecting object attributes in several records (one attribute per a record) you can get all attributes in a single record by joining or by using subqueries:

select o.OID, 
   o.Name as Number, 
   o.Description as Comment,
   a1.value as customer,
   a2.value as sum_total
   from OBJECTS o 
   left join attributes a1 on a1.OID = o.OID and 
   a1.attribute_id = 1
   left join attributes a2 on a2.OID = o.OID and 
   a2.attribute_id = 2
   where o.OID = :id;

Clearly, the more attributes object has, the slower the loading process will be, since each attribute requires an additional join in the query.

Returning to the example with the “Order” document, we see that the “order number” and “comments” attributes are still stored in the OBJECTS table, but “customer” and “order amount” are stored in two separate records of the “attributes” table. This approach is described by Anatoliy Tentser in article [5]. Its advantages are rather important: a standardized method of retrieving and storing object attributes; ease of extending and changing a type; ease in implementing object inheritance; very simple structure for the database, a significant benefit benefit since, with this approach the number of tables would not increase, no matter how many different types were stored in a database,

The main disadvatage is that this method is so different from the standard relational model that many standard techniques used on relational databases cannot be applied. The speed of object retrieval is significantly lower than the previous method 1 and it decreases also with the growth of the attribute count of a type. It is not very suitable for working with objects using SQL from inside the database, in stored procedures, for example. There is also a certain level of data redundancy because three fields (”OID,” “attribute_id,” and “value”) are applicable to every attribute instead of just the one field described in Method 1.

Method 3. Everything is stored in BLOB, and one of the persistent formats is applied - a custom format, or, for example, dfm (VCL streaming) from the Borland VCL, or XML, or anything you like. There is nothing to comment on here. The advantages are obvious: object retrieval logic is simple and fast; no extra database structures are necessary - just a single BLOB field; you can store any custom objects, including absolutely unstructured objects (such as MS Word documents, HTML pages, etc). The disadvantage is also obvious: there is nothing relational about this approach and you would have to perform all data processing outside of the database, using the database only for object storage.

Editor’s note: Even more important, it is impossible to query the database for objects with certain attributes, as the attribute data is stored in the BLOB.

It was not difficult to come to the following conclusions (well I know - everyone already knows them :)). All the three methods described are undoubtedly useful and have a right to live. Moreover, it sometimes makes sense to use all three of them in a single application. But when choosing among these three methods, take into account the listed advantages and disadvantages of each method. If your application involves massive relational data processing like searching or grouping, or it is likely to be added in the future, using a certain object attributes, it would be better to to use method 1, since it is the closest to the standard relational model and you will retain all the power of SQL. If, on the other hand, data processing is not particularly complex and data sets are not too large and/or you need a simple database structure, then it makes sense to use method 2. If a database is used only as an object storage, and all operations are performed with run-time instances of objects without using native database tools like SQL (except work with attributes stored in the OBJECTS table), the third method would probably be the best choice due to the speed and ease of implementation.

To be continued…

Leave a Reply

You must be logged in to post a comment.