Monday, May 9

My meta-database dilemma

I'm working on a pet project, the mundane details of which I'm tracking primarily in my personal blog for the entertainment of my friends. However, for this particular discussion, I'm posting to my "professional" blog since it reaches a wider audience and I'm curious to get some quality feedback.

This entry is about the design of the database schema.

It's hard for me to put my database design process into words. It's all mental with me. Insert your own jokes here. I first envision all the entities of data: companies, products, etc. Then I envision their relationships. Then I start mapping them to each other. Then I have a little Q&A with myself to shoot holes in my design. If I come up with a question that I can't answer, I repeat the relationship process. Then I try to envision all the possible ways I expect to need to query the data, and I see if my design facilitates them. I'll attempt to narrate this process...

"I serve companies."

Company
---
Company ID
Name

"A company has one or more product lines."

Product Line
---
Product Line ID
Company ID
Name

"A product line contains one or more products."

Product
---
Product ID
Product Line ID

"A product has certain properties (i.e. name, description, price, etc.)."

Why don't I create a column for each property? Because...

"Companies maintaining their own product databases need to define new product properties on the fly."

This means I'm essentially defining meta-columns.

OK, granted, but why don't you at least create columns for the "obvious" pieces of data, like "name"? Because something that seems "obvious" is sometimes much more complex. What if there's an internal company name versus a retail name? What if the product is sold under different names in different regions? Nothing is as simple as it seems.

Property
---
Property ID
Format ID
Name

What is that Format ID column? I would like to enforce a consistent format across data entries of the same property type. For example, all the name entries should be text, all the price entries should be decimals, all the look-up entries (explained later) should be integers, etc. The formatting will be enforced in code.

Property Format
---
Format ID
Name

"Products between different product lines share some properties."

Product Line Property Xref
---
Product Line ID
Property ID

Now that I have the meta-database defined, I need to start assigning values to product properties.

Property Value
---
Product ID
Property ID
Created
Value

Some property values will be restricted, for example you might have a property named "size" and you only want the values to be "small", "medium", or "large".

Property Value Look-up
---
Property Value ID
Property ID
Value

In this case the pre-defined "look-up" values are stored in the following table and the Value column of the Property Value record would hold the Property Value ID. This would all be enforced by code.

So that's my plan so far. I expect it to change a little before and during implementation.

One nagging question I haven't yet resolved is: How do I manage sub-product lines? For example, the company Hanes might have a product line named T-shirts with sub-product lines for male and female. I can create a parent-child relationship with the Product Line table like so:

Product Line
---
Product Line ID
Parent Product Line ID
Company ID
Name

But that makes the Company ID column redundant. I could use a sister table like so:

Product Line Xref
---
Parent Product Line ID
Child Product Line ID

But that affords an unnecessary many-to-many relationship, and allows silly things like circular relationships, ad nauseam.

Or, dare I consider using the meta-data level of property value look-ups to manage this? Where do I draw the line between data and meta-data? What decides whether data is worthy of a column or has to settle for a meta-column? This could prove to be quite challenging, frustrating, and enlightening.

For the sake of argument, let's work through this and see what becomes of it. Forget sub-product lines for a moment and let's step back into top level product lines and companies. Could I make them meta-data columns? Sure. I simply create a Property record for "company" and "product line". But then how do I tie the two together? How do I map a product line to a company? I would need some sort of referential key between the two property types, and then a value assigned to such a key in the Property Value Look-up records. Are you confused yet? I think I might be. Let's revisit the Property table, forgetting the Format column for now since it's not relevant to this discussion:

Property
---
Property ID
Relative Property ID
Name

Our records might look like this:

Property ID Relative Property ID Name
----------- -------------------- ----
7 Null company
11 7 product line

Note that the Property "product line" has a Relative Property ID of 7 (products relate to companies). I'm essentially saying any Property Value Look-up record representing a "product line" must somehow refer to a "company". So let's revisit the Property Value Look-up table:

Property Value Look-up
---
Property Value ID
Property ID
Relative Property Value ID
Value

Our records might look like this:

Property Value ID Property ID Relative Property Value ID Value
----------------- ----------- -------------------------- -----
82 7 Null Hanes
91 11 82 T-shirts

OK, I've gone way off the deep-end here. I'm essentially creating a database within a database. I'm creating a set of table and columns and relationships to represent a set of tables and columns and relationships. And why? Because I need some of those columns and relationships to be dynamic. Customers that use my product are going to need to share some columns and relationships that are common to other customers and also create their own columns and relationships unique to their particular product lines.

What are the alternatives? Well the first and most obvious is to maintain different databases (or at least tables) for each customer (and possibly each product line). I don't want to do that for several reasons. I want this service to be as hands-off from my perspective as possible. I want the customer to be able to manage their own product lines. I don't want to be creating and editing tables, columns, and relationships for each client and their product lines. So that option is right out.

The second alternative is to let the customers actually add columns to tables. I shuddered just typing that. But perhaps it's feasible. When a customer first sets up the system for their product line, I can present them with a list of all the already existing properties (columns) available. They can then select which of those properties are appropriate for their product line. But they might need to add some new ones, which means new columns in the main product table, and possibly a new look-up table. For example, the customer wants to add a new property "size" and only allow the values "small", "medium", and "large". That means a new column named "Size ID" and a new table named something like "Size Look-up" with the three records for "small", "medium", and "large".

Size Look-up
---
Size ID
Name

The third option is an arguably cleaner variation on the former: a new table for each property instead of new columns. Using the same "size" example above, a new table would be created like "Size Value" and the same look-up table would be referenced.

Size Value
---
Product ID
Size ID

Note that the Product ID column would have to be unique to avoid a one-to-many relationship.

One downfall of this option is that the database could quickly become cluttered with tiny tables, and somehow the system needs to keep track of them all. I'd need a table containing a list of all the available property tables.

Property Table
---
Property Table ID
Name

And I would need to keep track of which product lines referenced which property tables:

Product Line Property Tables
---
Product Line ID
Property Table ID

Here I'm dancing with the whole meta-data issue again.

Another downfall to this option is the atrocious complexity of queries, joins, sorting, etc. all of which have to be built "on the fly" since the code has to account for and adjust to a constantly changing set of target tables.

This is going to require some more thought, and lot of it. I'm starting to realize why nobody has done this yet.

Thursday, May 5

Java Job in South Florida

- Required knowledge and experience in Java, SQL, HTTP, HTML, CSS; you will be tested during the interview.
- Must be able to commute daily to Weston (a.k.a. Fort Lauderdale), Florida; we will NOT support relocation or telecommuting.
- Salary based entirely on your qualifications.

I run a very small tight-knit development team. We're very informal when it comes to dress code and office hours, but we're hard-core when it comes to design and coding standards. We don't tolerate crap code. Poorly designed database schemas make us retch. We're the type of people that keep up on the latest technologies because it's our passion, not because it's our job. If the first thing you do in the morning is check out Cafe au Lait, java.blogs, and Slashdot, then you're probably going to get along fine with us. If you know who Tog, Nielsen, and Norman are, you're off to a good start. If you know who Joel Spolsky is, we won't hold it against you ;-) If you can list and explain a half-dozen Design Patterns without breaking a sweat, you just might impress us. If you know and love Refactoring, you're going to enjoy working here. If you've read this far and you think you've got The Right Stuff, shoot your resume off to trak3r@gmail.com

Sunday, May 1

Play it safe; copy, don't innovate

Based on both personal experience and casual observation, it seems that launching a start-up based on a new idea is a good recipe for disaster and heart break. Here's my logic:

1. You dump a butt-load of VC money into R&D to make your idea a reality.

2. You dump a butt-load of VC money into PR to make the world aware of your hot new idea.

3. You launch your pride and joy, charging an insane amount of money in a desperate attempt to compensate for steps 1 and 2.

4. Somebody else comes along, sees your idea, sees all the feedback about where it's lacking in consumer appeasement, and builds their own copy. They save a bundle on R&D because you've already done the hard/long/expensive stuff. They put minimal investment into offering it with a cleaner interface, sell it for a lot less money, and steal your thunder along with your customer base, leaving you with a mountain of debt and the fruitless options of filing for bankruptcy or pursuing litigation.

My last company managed to cover both sides of this coin. They performed steps 1 through 3 with their own product, then executed step 4 on another company's product. Unfortunately the latter product they decided to copy and improve upon wasn't quite mature enough and they ended up repeating steps 2 and 3.

It seems that being an innovative DIY'er can provide you with pride and sometimes a book deal, but in the end you need to sell your work off to a bigger fish to cash out your stock, or watch in vein as the late-to-markets pick away at your profits while you struggle to get out of the red. Sure, there are a few rare exceptions to this, like the obligatory Amazon, eBay, Google, etc. But they are just a percentage of a percentage of the plethora of companies coming and going like the wind.

It seems to me to be a safer and saner bet that if you want to start a product or service based software company (as opposed to consulting or contracting) you're better off copying and improving another successful product or service rather than inventing and entirely new one. Yeah, a lot less exciting, and arguably less challenging, I'll admit. But if you're in the game for the money, that's where I would hedge my bets.

But then again, I'm just a lowly technology executive. What do I know? I don't have an MBA. I don't have to report to the Board of Directors. I don't have to "make the sale" to prospective clients. Perhaps someday I'll get the chance to do that. I do love a challenge, especially in unfamiliar territory.