Monday, June 17, 2013

Reverse Engineering an Analysis Services Cube using SSIS

     When using the Microsoft BI Stack, cubes are usually seen as the end of the road(kind of). They are usually built on top of facts and dimensions in your data warehouse/data marts. These represent the culmination of all of your team's hard work. The requirements gathering/analysis, ETL/ELT, data warehouse design, reports/analytics and UAT, etc. However; these same end user destinations may, in certain situations, be the only point of exposure to other data systems. If you are tasked with pulling down data into your data warehouse from some other data store's cube then, in essence, their end is your beginning. Luckily SSIS can help with this.

     For this example, we're are going to be modeling data that tracks the complaint to event ratio for a certain range of products over a given time frame. We're going to be using a cube called "Events":

Figure 1. Event Cube

     In order to extract the data we want, we need to focus on measures COMPLAINT QT, EVENT QTY and dimensions ADD DAY and DIM PRODUCT. Now, when looking at cubes we have to remember that they are multidimensional and not 2 dimensional as tables in a database are. So we simply can't use a SQL statement to get the data we need, we'll have to use Multidimensional Expressions (MDX). For this, we're going to need to use 2, one to get the measures(facts) and one for the product(dimension). For the measures we're going to use this MDX statement:

WITH MEMBER [MEASURES].[COMPLAINT RATIO] as '([MEASURES].[COMPLAINT QTY] / [MEASURES].[EVENT COUNT])',FORMAT_STRING='0.00'
MEMBER [MEASURES].[COMPLAINT_QTY] as '[MEASURES].[COMPLAINT QTY] +0'
MEMBER [MEASURES].[EVENT_COUNT] as '[MEASURES].[EVENT COUNT] +0'
MEMBER [MEASURES].[COMPLAINT_RATIO] as '[MEASURES].[COMPLAINT RATIO] +0'

SELECT

{
       [MEASURES].[COMPLAINT_QTY],
       [MEASURES].[EVENT_COUNT],
       [MEASURES].[COMPLAINT_RATIO]
ON COLUMNS,

(
       {
              [DIM PRODUCT].[MODEL].CHILDREN
       },
        
       {
              [ADD DAY].[Hierarchy].[FULL DATE].MEMBERS
       }
 ) ON ROWS
FROM
        EVENTS
WHERE
    (
       [ADD DAY].[YEARMO].[201301] : [ADD DAY].[YEARMO].[201303]
    )

     This statement will return the number of complaints, events and complaint to event ratio(non-additive) by product model for first quarter 2013. We want all the product model children in this result set and by adding this we make sure we don't return a summation set. We also want all of the full date members, which will return all the dates at that level of the hierarchy.

Figure 2. Measures MDX Result Sample
     This takes care of our measures. Now we'll have to write a statement that will retrieve  the attributes for our products. This statement will have no measure involved:

SELECT
NULL
ON COLUMNS,
(
 {
 [DIM PRODUCT].[PRODUCT BUSINESS UNIT].CHILDREN
 },
 {
 [DIM PRODUCT].[PRODUCT FAMILY].CHILDREN
 },
 {
 [DIM PRODUCT].[MODEL].CHILDREN
 }
)ON ROWS

FROM EVENTS

     This statement will retrieve the master data for products out of the cube:

Figure 3. Dimension MDX Result Sample

However; there's a problem with this. Since there is no measure, there technically is no records retrieved. Putting this MDX statement in an OLE DB source in SSIS will add 0 records to your data flow.  To get around this we will add a bogus measure to the statement and just ignore it when we load the data:


WITH MEMBER [MEASURES].[FOO] AS ('0')

SELECT
[MEASURES].[FOO]
ON COLUMNS,
({
[DIM PRODUCT].[PRODUCT BUSINESS UNIT].CHILDREN
},
{
 [DIM PRODUCT].[PRODUCT FAMILY].CHILDREN
},{
 [DIM PRODUCT].[MODEL].CHILDREN }
)ON ROWS

FROM EVENTS

This will now return the same result set, but with a bogus measure. This will add records to a data flow when were ready to process:


Figure 4. Dimension MDX Result Sample with Measure
With our MDX queries ready, let's create some tables that can act as a data destination for these 2 pieces of data:


CREATE TABLE
P_STG_PRODUCT_COMPLAINT
(
 MODEL VARCHAR(50),
 COMPLAINT_QTY SMALLINT,
 EVENT_COUNT SMALLINT,
 COMPLAINT_RATIO DECIMAL(3,2),
 FULL_DATE DATETIME
)

GO


CREATE TABLE
P_STG_PRODUCT
(
 MODEL VARCHAR(50),
 BUSINESS_UNIT VARCHAR(50),
 FAMILY VARCHAR(50)
)

GO

     These tables represent data destinations you may have in a persistent staging database, where data is stored untransformed before its added to the data warehouse(for sake of example I did not add the step of staging the data in a staging database).  The data flow for our SSIS package is going to look something like this:


Figure 5. SSIS Data Flow
     
     The OLE DB Source connection manager is going to use the OLE DB Provider for Analysis Services. For this we need to create one under connection managers:


Figure 6. OLAP Data Provider
     After confirming our connection to Analysis Services we have to perform a few further configurations to make sure we can get the data into our data flow. Couple things we need to consider when using a cube as a source and using custom formatting in our MDX statements. First is that MDX is a multidimensional expression language, while SQL server is 2 dimensional (rows and columns). So we need to make sure the data coming through SSIS is tabular, or nothing will happen. The second thing, formatting, is not handled by default in the provider. For our statement the returns measures we added FORMAT_STRING='0.00' to make sure we only return 2 decimal places. To make sure we get our data in a tabular format with our custom formatting, we need to edit the extended properties of the data provider. Click the Data Links button-->All Tab-->Extended Properties and add ReturnCellProperties=true; Format=Tabular:



Figure 7. OLAP Data Provider Extended Properties
     With this done we can configure the OLE DB sources. Select the OLAP provider as the connection manage and SQL command as the data access mode. We can enter our MDX statement into the SQL command text box just like an SQL statement:  


Figure 8.  OLE DB Source Editor

          With this done, we need to click Columns to select what we want to add to our data flow. Usually, we can just select the columns we want. However; we have selections that are formatted, and these come in separate columns in our source. When clicking on columns you'll notice a warning will pop up that says "[OLE DB Source Output] references an external data type that cannot be mapped to a Data Flow task data type. The Data Flow task data type DT_WSTR will be used instead". It appears the data provider cannot read the data type from the cube. Its not a problem in that this can be cleaned up in a Data Conversion transformation. I have not found a way to get around this, if anyone knows let me know and ill update this post. Click OK on the message and make sure to select the measure columns that have the .FORMATTED_VALUE suffix:

Figure 9. Measures Source Column Selections

     For the dimension source we don't need to worry about formatting, but we do need to deselect our bogus measure "foo":

Figure 10. Dimension Source Column Selections

     Since all of our columns are by default set to DT_WSTR, we need to convert to data types that are compatible with our destination tables using the Data Conversion transformation. We will have to do this for both:

Figure 11. Data Conversion Transformation

     For the data destinations, make sure to select the converted columns rather than the unconverted source columns:

Figure 12. Data Destination
     Now we can slap a couple Data Viewers on the data flow and run the package:


Figure 13. Data Viewers

     With these 2 pieces of data in our persistent staging tables, we can join on the model columns to link them. We can also now incorporate them into a data warehouse schema like this:


Figure 14. Snapshot Schema