merge, metadata and the data mart etl

I come across clients, usually smaller corporations or smaller teams of larger organisations where, in order to provide the most value for their dollar spent, the data layer supporting the BI solution ends up looking more like a data mart than a full data warehouse.

It seems to me that this is usually because these type of clients are business managers that are most concerned with the programs they are trying to implement now and the decisions they need to make in the very near future rather then elegance of the decision support system. But for that’s for another post.

Today I’d like to cover how the merge statement in SQL version 2005 and database system metadata can allow a BI developer to create extremely robust and flexible ETLs using just SQL. Being a SQL guy I’ve found that using SSIS is extremely cumbersome; I’d much rather code it than create a package! But I’ve actually found that using SQL has some huge advantages. With SQL, you can essentially synchronize a large number of data sets with very little code and a very light load on source and production servers. I know this approach has some serious limitations, but if you are providing a solution that feels more like a data mart than a full blown data warehouse, it might be helpful!

All that’s required is some well written dynamically generated SQL, some knowledge of the SQL Server DMVs and a few other system tables, the wonderful merge statement, and source data sets accessible from SQL Server.

// Metadata

You’ll want to start with metadata that contains a row for the 3- or 4- part notation of both your source and destination tables. If you are already used to maintaining an admin database this should be easy. If not, something simple will help us get started:

USE Warehouse
CREATE TABLE dbo.Object (
Object_ID int not null -- I hate underscores but use them here to be consistent with the system tables
, ObjectName sysname not null
, SchemaName varchar(50)
, DatabaseName varchar(50)
, SourceObjectName varchar(50)
, SourceSchemaName varchar(50)
, SourceDatabaseName varchar (50)
, SourceServerName varchar(50)
CONSTRAINT [PK_Object] PRIMARY KEY CLUSTERED
(
ObjectID ASC
)
)

// Source and Target Tables

Let’s now create a table we can use as an example, a simple sales fact table:

CREATE TABLE dbo.Sales (
TransactionID int, DateID int, ProductID int, CustomerID int, Quantity int, Amount decimal(18,2)
CONSTRAINT PK_Sales PRIMARY KEY CLUSTERED (TransactionID)
)


NOTE: For this method to work, all destination tables (even if only raw data tables in the feed store) should have a primary key. This will make sense as we construct the merge statement.

And then insert a row in to the object table for the sales fact. I’ll do it the long way by querying the system tables for the object ID and name, as setting up these kinds of views can be extremely helpful in simplifying maintenance:

INSERT dbo.Object
SELECT object_id, o.name, s.name, 'Warehouse', 'Sales', 'dbo', 'SourceDatabase', 'SourceServer'
FROM sys.objects o
JOIN sys.schemas s on o.schema_id = s.schema_id
WHERE o.name = 'sales'

Assuming SourceServer.SourceDatabase.dbo.Sales actually exists we can begin constructing the merge SQL that will be used to synchronize the two tables.

SELECT 'MERGE '+O.DatabaseName+'.'+O.SchemaName+'.'+O.ObjectName+' AS Tgt
USING '+O.SourceServerName+'.'+O.SourceDatabaseName+'.'+O.SourceSchemaName+'.'+O.SourceObjectName+' AS Src'
FROM dbo.Object O

The result should produce the following code:

MERGE Warehouse.dbo.Sales AS Tgt
USING SourceServer.SourceDatabase.dbo.Sales AS Src

// Join Criteria

It’s important that the target table for the merge statement have on it a primary key. Using the system tables and the FOR XML clause we can create a view that will help us generate the join criteria

CREATE VIEW dbo.ColumnSQL AS

WITH ColumnList AS (

SELECT o.object_id, ColumnName = c.name, PrimaryKey = CASE WHEN ic.index_column_id IS NOT NULL THEN 1 ELSE 0 END
, C.column_id, IsIdentity = c.is_identity
FROM sys.objects o
JOIN sys.columns c ON o.object_id = c.object_id
JOIN sys.schemas s ON o.schema_id = s.schema_id
LEFT JOIN sys.indexes i ON o.object_id = i.object_id AND i.type_desc = ‘CLUSTERED’
LEFT JOIN sys.index_columns ic ON i.object_id = ic.object_id and i.index_id = ic.index_id and c.column_id = ic.column_id
WHERE o.type_desc = ‘USER_TABLE’
)

Select C.object_id
, JoinColumns = REPLACE(Left(C.JoinColumns,Len(C.JoinColumns)-4),”+cast(C.object_id as varchar(16))+”,”)
, UpdateColumns = REPLACE(Left(C.UpdateColumns,Len(C.UpdateColumns)-1),”+cast(C.object_id as varchar(16))+”,”)
, AllColumns = REPLACE(Left(C.AllColumns,Len(C.AllColumns)-1),”+cast(C.object_id as varchar(16))+”,”)
From(
Select distinct C.Object_ID
— Equality Criteria of PK column(s) for on clause
, JoinColumns = (
Select C2.Object_ID, ‘Tgt.[‘+C2.ColumnName + ‘]=Src.[‘ + C2.ColumnName + ‘] AND ‘ AS [text()]
From ColumnList C2
Where C.Object_ID = C2.Object_ID
AND C2.PrimaryKey = 1
ORDER BY C2.Object_ID
For XML PATH (”)
)
— non-key columns for update clause
, UpdateColumns = (
Select C2.Object_ID, ‘Tgt.[‘+C2.ColumnName + ‘]=Src.[‘ + C2.ColumnName + ‘],’ AS [text()]
From ColumnList C2
Where C.Object_ID = C2.Object_ID
AND C2.PrimaryKey = 0
AND C2.IsIdentity = 0 — exclude identity columns in case they are used as surrogate keys, etc.
ORDER BY C2.Object_ID
For XML PATH (”)
)
— all columns for insert clause
, AllColumns = (
Select C2.Object_ID, ‘[‘+C2.ColumnName + ‘],’ AS [text()]
From ColumnList C2
Where C.Object_ID = C2.Object_ID
AND C2.IsIdentity = 0 — exclude identity columns in case they are used as surrogate keys, etc.
ORDER BY C2.Object_ID
For XML PATH (”)
)
From ColumnList C
) C

If you query this view for the sales table:

SELECT *
FROM dbo.ColumnSQL
WHERE object_name(object_id) = 'Sales'

You should get the following results:

object_id
85575343
JoinColumns
Tgt.[TransactionID]=Src.[TransactionID]
UpdateColumns
Tgt.[DateID]=Src.[DateID],Tgt.[ProductID]=Src.[ProductID],Tgt.[CustomerID]=Src.[CustomerID],Tgt.[Quantity]=Src.[Quantity],Tgt.[Amount]=Src.[Amount]
AllColumns
[TransactionID],[DateID],[ProductID],[CustomerID],[Quantity],[Amount]

NOTE: This query can get expensive if executed once for each table in the data mart. I’ve found it best to create a table in the admin database that gets updated at the beginning of a batch.

Now join this view to the object table and concatenate an on statement:

SELECT 'MERGE '+O.DatabaseName+'.'+O.SchemaName+'.'+O.ObjectName+' AS Tgt
USING '+O.SourceServerName+'.'+O.SourceDatabaseName+'.'+O.SourceSchemaName+'.'+O.SourceObjectName+' AS Src
ON '+C.JoinColumns
FROM dbo.Object O
JOIN dbo.ColumnSQL C
ON O.Object_ID = C.Object_ID

Which will produce the following:

MERGE Warehouse.dbo.Sales AS Tgt
USING SourceServer.SourceDatabase.dbo.Sales AS Src
ON Tgt.[TransactionID]=Src.[TransactionID]

// Update Condition

Now we want to check to see if any rows in the source table have changed and we use the WHEN MATCHED:

SELECT 'MERGE '+O.DatabaseName+'.'+O.SchemaName+'.'+O.ObjectName+' AS Tgt
USING '+O.SourceServerName+'.'+O.SourceDatabaseName+'.'+O.SourceSchemaName+'.'+O.SourceObjectName+' AS Src
ON '+C.JoinColumns + '
WHEN MATCHED AND '+REPLACE(REPLACE(C.UpdateColumns,'=',''),',',' OR ')+'
THEN UPDATE SET '+C.UpdateColumns
FROM dbo.Object O
JOIN dbo.ColumnSQL C
ON O.Object_ID = C.Object_ID

And our generated code looks like:

MERGE Warehouse.dbo.Sales AS Tgt
USING SourceServer.SourceDatabase.dbo.Sales AS Src
ON Tgt.[TransactionID]=Src.[TransactionID]
WHEN MATCHED AND Tgt.[DateID]Src.[DateID] OR Tgt.[ProductID]Src.[ProductID] OR Tgt.[CustomerID]Src.[CustomerID] OR Tgt.[Quantity]Src.[Quantity] OR Tgt.[Amount]Src.[Amount]
THEN UPDATE SET Tgt.[DateID]=Src.[DateID],Tgt.[ProductID]=Src.[ProductID],Tgt.[CustomerID]=Src.[CustomerID],Tgt.[Quantity]=Src.[Quantity],Tgt.[Amount]=Src.[Amount]

Note: The columnSQL view only returns one list of non-key columns and here we’re using it for both the update condition and the update statement itself. This could easily be replaced with a checksum or binary_checksum condition using all non-key columns. Both methods have pros and cons but performance differences are negligible. Watch out for the length of the code being generated though as if the table has too many columns the code could be longer than what is allowed by varchar(max). Using a checksum or binary_checksum or removing the []’s can help save space.

// Insert and Delete

The rest is quite simple! Update the code generating our merge statement with the not matched and not matched by source clauses in order to manage new and deleted records:

SELECT 'MERGE '+O.DatabaseName+'.'+O.SchemaName+'.'+O.ObjectName+' AS Tgt
USING '+O.SourceServerName+'.'+O.SourceDatabaseName+'.'+O.SourceSchemaName+'.'+O.SourceObjectName+' AS Src
ON '+C.JoinColumns + '
WHEN MATCHED AND '+REPLACE(REPLACE(C.UpdateColumns,'=',''),',',' OR ')+'
THEN UPDATE SET '+C.UpdateColumns + '
WHEN NOT MATCHED THEN INSERT VALUES('+C.AllColumns+')
WHEN NOT MATCHED BY SOURCE THEN DELETE;'
FROM dbo.Object O
JOIN dbo.ColumnSQL C
ON O.Object_ID = C.Object_ID

To produce the final code:

MERGE Warehouse.dbo.Sales AS Tgt
USING SourceServer.SourceDatabase.dbo.Sales AS Src
ON Tgt.[TransactionID]=Src.[TransactionID]
WHEN MATCHED AND Tgt.[DateID]Src.[DateID] OR Tgt.[ProductID]Src.[ProductID] OR Tgt.[CustomerID]Src.[CustomerID] OR Tgt.[Quantity]Src.[Quantity] OR Tgt.[Amount]Src.[Amount]
THEN UPDATE SET Tgt.[DateID]=Src.[DateID],Tgt.[ProductID]=Src.[ProductID],Tgt.[CustomerID]=Src.[CustomerID],Tgt.[Quantity]=Src.[Quantity],Tgt.[Amount]=Src.[Amount]
WHEN NOT MATCHED THEN INSERT VALUES ([TransactionID],[DateID],[ProductID],[CustomerID],[Quantity],[Amount])
WHEN NOT MATCHED BY SOURCE THEN DELETE;

And then use this in a cursor or while loop in order to execute the statement for each of the tables in your data mart!

// Summary

The key to using this method will be the extra bits needed in order to make it work for your solution. You’ll need to add metadata to control the creating of surrogate keys for type 2 slowly changing dimensions, for updates that are only meant to be incremental and a plethora of other concerns and more code to manage logging, handle errors and all the rest. If done thoroughly you can even track which dimensions and partitions in your cube need to be updated based on the results of the data changes and dynamically generate a job and XMLA command for cube processing.

Thx!

// Links
MSDN merge
MSDN checksum
MSDN binary_checksum

2 comments

  1. Pingback: merge and the data mart etl « christopher ross

  2. D.M.

    A couple of notes:
    1. merge doesn’t exist in 2005. Only 2008 and above.
    2. missed ‘ != ‘ in condition Tgt.[DateID]Src.[DateID] etc.
    3. operator priority – > WHEN MATCHED AND (Tgt.[DateID] != Src.[DateID] OR Tgt.[ProductID] != Src.[ProductID] OR Tgt.[CustomerID] != Src.[CustomerID] OR Tgt.[Quantity] != Src.[Quantity] OR Tgt.[Amount] != Src.[Amount])

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s