Monday, May 9

My meta-database dilemma

I'm working on a pet project, the mundane details of which I'm tracking primarily in my personal blog for the entertainment of my friends. However, for this particular discussion, I'm posting to my "professional" blog since it reaches a wider audience and I'm curious to get some quality feedback.

This entry is about the design of the database schema.

It's hard for me to put my database design process into words. It's all mental with me. Insert your own jokes here. I first envision all the entities of data: companies, products, etc. Then I envision their relationships. Then I start mapping them to each other. Then I have a little Q&A with myself to shoot holes in my design. If I come up with a question that I can't answer, I repeat the relationship process. Then I try to envision all the possible ways I expect to need to query the data, and I see if my design facilitates them. I'll attempt to narrate this process...

"I serve companies."

Company
---
Company ID
Name

"A company has one or more product lines."

Product Line
---
Product Line ID
Company ID
Name

"A product line contains one or more products."

Product
---
Product ID
Product Line ID

"A product has certain properties (i.e. name, description, price, etc.)."

Why don't I create a column for each property? Because...

"Companies maintaining their own product databases need to define new product properties on the fly."

This means I'm essentially defining meta-columns.

OK, granted, but why don't you at least create columns for the "obvious" pieces of data, like "name"? Because something that seems "obvious" is sometimes much more complex. What if there's an internal company name versus a retail name? What if the product is sold under different names in different regions? Nothing is as simple as it seems.

Property
---
Property ID
Format ID
Name

What is that Format ID column? I would like to enforce a consistent format across data entries of the same property type. For example, all the name entries should be text, all the price entries should be decimals, all the look-up entries (explained later) should be integers, etc. The formatting will be enforced in code.

Property Format
---
Format ID
Name

"Products between different product lines share some properties."

Product Line Property Xref
---
Product Line ID
Property ID

Now that I have the meta-database defined, I need to start assigning values to product properties.

Property Value
---
Product ID
Property ID
Created
Value

Some property values will be restricted, for example you might have a property named "size" and you only want the values to be "small", "medium", or "large".

Property Value Look-up
---
Property Value ID
Property ID
Value

In this case the pre-defined "look-up" values are stored in the following table and the Value column of the Property Value record would hold the Property Value ID. This would all be enforced by code.

So that's my plan so far. I expect it to change a little before and during implementation.

One nagging question I haven't yet resolved is: How do I manage sub-product lines? For example, the company Hanes might have a product line named T-shirts with sub-product lines for male and female. I can create a parent-child relationship with the Product Line table like so:

Product Line
---
Product Line ID
Parent Product Line ID
Company ID
Name

But that makes the Company ID column redundant. I could use a sister table like so:

Product Line Xref
---
Parent Product Line ID
Child Product Line ID

But that affords an unnecessary many-to-many relationship, and allows silly things like circular relationships, ad nauseam.

Or, dare I consider using the meta-data level of property value look-ups to manage this? Where do I draw the line between data and meta-data? What decides whether data is worthy of a column or has to settle for a meta-column? This could prove to be quite challenging, frustrating, and enlightening.

For the sake of argument, let's work through this and see what becomes of it. Forget sub-product lines for a moment and let's step back into top level product lines and companies. Could I make them meta-data columns? Sure. I simply create a Property record for "company" and "product line". But then how do I tie the two together? How do I map a product line to a company? I would need some sort of referential key between the two property types, and then a value assigned to such a key in the Property Value Look-up records. Are you confused yet? I think I might be. Let's revisit the Property table, forgetting the Format column for now since it's not relevant to this discussion:

Property
---
Property ID
Relative Property ID
Name

Our records might look like this:

Property ID Relative Property ID Name
----------- -------------------- ----
7 Null company
11 7 product line

Note that the Property "product line" has a Relative Property ID of 7 (products relate to companies). I'm essentially saying any Property Value Look-up record representing a "product line" must somehow refer to a "company". So let's revisit the Property Value Look-up table:

Property Value Look-up
---
Property Value ID
Property ID
Relative Property Value ID
Value

Our records might look like this:

Property Value ID Property ID Relative Property Value ID Value
----------------- ----------- -------------------------- -----
82 7 Null Hanes
91 11 82 T-shirts

OK, I've gone way off the deep-end here. I'm essentially creating a database within a database. I'm creating a set of table and columns and relationships to represent a set of tables and columns and relationships. And why? Because I need some of those columns and relationships to be dynamic. Customers that use my product are going to need to share some columns and relationships that are common to other customers and also create their own columns and relationships unique to their particular product lines.

What are the alternatives? Well the first and most obvious is to maintain different databases (or at least tables) for each customer (and possibly each product line). I don't want to do that for several reasons. I want this service to be as hands-off from my perspective as possible. I want the customer to be able to manage their own product lines. I don't want to be creating and editing tables, columns, and relationships for each client and their product lines. So that option is right out.

The second alternative is to let the customers actually add columns to tables. I shuddered just typing that. But perhaps it's feasible. When a customer first sets up the system for their product line, I can present them with a list of all the already existing properties (columns) available. They can then select which of those properties are appropriate for their product line. But they might need to add some new ones, which means new columns in the main product table, and possibly a new look-up table. For example, the customer wants to add a new property "size" and only allow the values "small", "medium", and "large". That means a new column named "Size ID" and a new table named something like "Size Look-up" with the three records for "small", "medium", and "large".

Size Look-up
---
Size ID
Name

The third option is an arguably cleaner variation on the former: a new table for each property instead of new columns. Using the same "size" example above, a new table would be created like "Size Value" and the same look-up table would be referenced.

Size Value
---
Product ID
Size ID

Note that the Product ID column would have to be unique to avoid a one-to-many relationship.

One downfall of this option is that the database could quickly become cluttered with tiny tables, and somehow the system needs to keep track of them all. I'd need a table containing a list of all the available property tables.

Property Table
---
Property Table ID
Name

And I would need to keep track of which product lines referenced which property tables:

Product Line Property Tables
---
Product Line ID
Property Table ID

Here I'm dancing with the whole meta-data issue again.

Another downfall to this option is the atrocious complexity of queries, joins, sorting, etc. all of which have to be built "on the fly" since the code has to account for and adjust to a constantly changing set of target tables.

This is going to require some more thought, and lot of it. I'm starting to realize why nobody has done this yet.

1 comment:

Richard Harms (rich@darkrealms.com) said...

Been through something similar several times now. Years ago, put together a product for online catalogs (an example would be at Right Stuf), where we allowed customer to define item attributes for their items in the database on the fly. Worked for creating catalogs of everything from anime, cars, to airplane parts. After that, went through a phase where we were going to attempt to recreating something like that built with database tables doing nothing but defining types of entities, and attributes of them. That failed. Third time now, starting over, we're building a system that's a hybrid - data that is used by the core system is all defined using standard database tables (XML which is used to generate Java classes and database schema), and then users of the system may define extensions to the core system on the fly. Those attributes may then be accessed used to control presentation of the information, or anything else the user wants to do with them. Its working out much better. The OFBiz folks have also mentioned on their mailing lists that they're working on a system that allows user extensions to their base database schema. They also use XML to setup the database schema, and it sounds like they'll just allow users to define their own extensions which will lay on top of their base.