Normalize data: from User-friendly to Pivot-friendly

If possible I always use pivot tables to create reports in Excel. Pivot tables need a well-formed, normalized data table as their source, and such tables are usually not suitable for data entry. In this post I describe two techniques to bridge the gap between user-friendly and pivot-friendly data structures.

Case

Suppose you need to enter quarterly figures for three departments over several years. One way to set up your input table is as follows:

Source table

For easy data entry this table is well formed and user-friendly. Unfortunately, it is not well formed for pivot tables: it has four columns with values of the same ‘type’, Q1 through Q4. The table is not ‘normalized’ and is therefore usually not suitable to produce quality pivot reports. A well-formed, normalized source table has only one value column:

Normalize Target

The first three columns of this table somehow specify or describe the Value field; they are sometimes also called ‘attributes’ or ‘dimensions’.
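Since the screenshots don't come across here, a schematic of the two layouts (column names taken from the code comments further down; each source row produces four target rows, one per quarter):

    Source (user-friendly):
        Year | Department | Q1 | Q2 | Q3 | Q4

    Target (normalized):
        Year | Department | Quarter | Value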
With the above table it is easy to create pivot reports and charts like these:

Normalize ChartReport

Also, with a normalized source table it is usually easy to reproduce the original table as a pivot report:

Normalize Report

The question is: how do we bridge the gap between the user-friendly source table and the pivot-friendly normalized table above? Below I describe two VBA-based techniques. See the attached workbooks for details.

The Moving Ranges technique

The basic idea is to define a few ranges as ‘windows’ on the source and target tables, and to transfer values from source to target as a ‘block’.
In this case we define four ranges:

Moving Ranges

Explanation:

  • In the source table the ranges RowLabels (red) and ColumnLabels (light blue) will not change;
  • Range Values (green) will move every round, from Q1 to Q4;
  • In the target table we define only one BaseRange which will be moved and horizontally resized as required.

To transfer the values of quarter Q1 to the target table we perform three steps:

Normalize Steps 123

Explanation:

  • Step 1: resize the target BaseRange to two columns and transfer the values of range RowLabels;
  • Step 2: move BaseRange to column Quarter and enter the current quarter from range ColumnLabels;
  • Step 3: again move the base range to column Value and transfer the green range Values.

In the next round we move the green range Values to Q2, move BaseRange nine rows down (the number of source rows) and repeat steps 1 to 3. Etcetera.

Implementation

This process is implemented in a public procedure Update, located in the code module of the target sheet; Update is, so to speak, a method that refreshes the target table. It has three blocks of code. In block 1 we prepare some things based on the source table, after declaring some variables:

    Set tbSource = SSource.ListObjects(1)
    Set tbTarget = Me.ListObjects(1)
    lNRowLabels = 2 ' Year and Department
    lNColumnLabels = 4 ' Q1 - Q4
    
    With tbSource ' prepare source things
    
        ' if source is empty clear target too and we're done
        If .ListRows.Count = 0 Then
            If tbTarget.ListRows.Count > 0 Then tbTarget.DataBodyRange.EntireRow.Delete
            Exit Function
        End If
        
        ' setup ranges
        Set rColumnLabels = .HeaderRowRange.Offset(, lNRowLabels).Resize(, lNColumnLabels)
        Set rRowLabels = .ListColumns(1).DataBodyRange.Resize(, lNRowLabels)
        Set rValues = .ListColumns(3).DataBodyRange
        lNSourceRows = .ListRows.Count
    
    End With
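The declarations themselves are not shown in the post; inferred from how the variables are used, they might look like this sketch (SSource is assumed to be the code name of the source sheet):

    ' sketch: declarations as inferred from usage elsewhere in Update
    Dim tbSource As ListObject, tbTarget As ListObject
    Dim rColumnLabels As Range, rRowLabels As Range
    Dim rValues As Range, rBaseRange As Range
    Dim lNRowLabels As Long, lNColumnLabels As Long
    Dim lNSourceRows As Long, iColumn As Long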

The comments speak for themselves, I hope.
In block 2 we prepare some things based on the target table.

    With tbTarget ' prepare target things
    
        ' if we need too many rows exit
        If rColumnLabels.Cells.Count * lNSourceRows > Rows.Count - .HeaderRowRange.Row Then
            MsgBox "Too big.", vbCritical, "Error"
            Exit Function
        End If
        
        ' clear target
        If .ListRows.Count > 0 Then .DataBodyRange.EntireRow.Delete
        
        ' set baserange = databodyrange of first column after first transfer
        Set rBaseRange = .HeaderRowRange.Offset(1).Resize(lNSourceRows, 1)
        
        ' make sure table will expand automatically when transferring to row below table
        Application.AutoCorrect.AutoExpandListRange = True
        
    End With

And finally in block 3 the hard work is done:

    ' transfer: for each Quarter move the BaseRange down and transfer values
    For iColumn = 0 To lNColumnLabels - 1 ' note: zero based
        With rBaseRange.Offset(iColumn * lNSourceRows)
            .Resize(, lNRowLabels).Value = rRowLabels.Value
            .Offset(, lNRowLabels).Value = rColumnLabels.Cells(iColumn + 1).Value
            .Offset(, lNRowLabels + 1).Value = rValues.Offset(, iColumn).Value
        End With
    Next iColumn
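How Update gets called is not shown here; one hypothetical way to wire it up is from the source sheet's change event (STarget stands for the code name of the target sheet; the name is made up for this sketch):

    ' sketch: in the code module of the source sheet
    Private Sub Worksheet_Change(ByVal Target As Range)
        ' refresh the normalized table whenever the source table changes
        If Not Intersect(Target, Me.ListObjects(1).Range) Is Nothing Then
            STarget.Update
        End If
    End Sub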

The Moving Ranges technique is usually not difficult to set up and is very efficient, especially when you can define large ranges. The direct block transfer of values from range to range performs very well and, unlike the technique below, doesn’t need extra memory.

The Arrays technique

The basic idea of this technique is well known.

Normalize with Arrays

Explanation:

  • Step 1: read the Source table into a two dimensional Source array;
  • Step 2: redimension the Target array; in this case we need an array of 9*4=36 rows and 4 columns. Then for each Source row iterate over the four quarters, and transfer the values from source to target array. Every source row results in four target rows;
  • Step 3: clear the target table and dump the target array to a range of the same size below the header of the target table.

Implementation

This process is implemented in procedure Update of the enclosed workbook. This is the final part of this procedure:

    ' convert Source array to Target array
    For iSourceRow = 1 To lNSourceRows
        lTargetBaseRow = (iSourceRow - 1) * lNColumnLabels + 1
        For iColumn = 0 To lNColumnLabels - 1
            aTarget(lTargetBaseRow + iColumn, 1) = aSource(iSourceRow, 1)
            aTarget(lTargetBaseRow + iColumn, 2) = aSource(iSourceRow, 2)
            aTarget(lTargetBaseRow + iColumn, 3) = rColumnLabels.Cells(iColumn + 1).Value
            aTarget(lTargetBaseRow + iColumn, 4) = aSource(iSourceRow, 3 + iColumn)
        Next iColumn
    Next iSourceRow

    ' dump array to Target
    With Me.ListObjects(1).HeaderRowRange.Offset(1).Resize(lNTargetRows, lNTargetColumns)
        .Value = aTarget
    End With
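The earlier part of the procedure (steps 1 and 2 above: reading the source and sizing the target array) is not shown in the post; a sketch consistent with the variable names used in the listing:

    ' sketch of the preparation code (not shown in the post)
    Dim aSource As Variant, aTarget() As Variant
    Dim lNSourceRows As Long, lNColumnLabels As Long
    Dim lNTargetRows As Long, lNTargetColumns As Long
    Dim iSourceRow As Long, iColumn As Long, lTargetBaseRow As Long
    Dim rColumnLabels As Range

    With SSource.ListObjects(1)
        lNSourceRows = .ListRows.Count
        lNColumnLabels = 4 ' Q1 - Q4
        Set rColumnLabels = .HeaderRowRange.Offset(, 2).Resize(, lNColumnLabels)
        aSource = .DataBodyRange.Value ' step 1: read source into array
    End With

    lNTargetRows = lNSourceRows * lNColumnLabels ' 9 * 4 = 36
    lNTargetColumns = 4 ' Year, Department, Quarter, Value
    ReDim aTarget(1 To lNTargetRows, 1 To lNTargetColumns) ' step 2

    ' clear the target table before the dump (start of step 3)
    If Me.ListObjects(1).ListRows.Count > 0 Then _
        Me.ListObjects(1).DataBodyRange.EntireRow.Delete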

This technique is usually easy to set up and performs well. An important benefit is that it is easy:

  • to modify individual values (e.g. change a date format);
  • to enrich the source data (e.g. add the location of every department to the target array);
  • to filter the source data (e.g. we are only interested in the departments in locations X and Y);
  • to aggregate the source data (e.g. sum the values and count the number of departments per location);
  • to validate the source data (e.g. if a value is not a date, give some feedback to the user).
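For instance, validation could be a few lines inside the conversion loop (a sketch; here I simply check that the value is numeric and stop with a message, but the feedback mechanism is up to you):

    ' sketch: validate each value before transferring it
    If Not IsNumeric(aSource(iSourceRow, 3 + iColumn)) Then
        MsgBox "Source row " & iSourceRow & ", quarter " & _
               rColumnLabels.Cells(iColumn + 1).Value & _
               " does not contain a number.", vbExclamation, "Validation"
        Exit Function
    End If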

A drawback could be the use of memory, especially when the source table is large. Notice that without filtering or aggregating you need about twice the size of the source table in memory, and that all array elements are Variants. If memory is an issue you could decide to only read the source table into an array, and write the target rows directly into the target table.
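A sketch of that memory-light variant, with variable names borrowed from the listings above: only the source is read into an array, and each target row is written straight into the table (slower than a single array dump, but it avoids the second array):

    ' sketch: read only the source, write target rows one by one
    aSource = tbSource.DataBodyRange.Value
    Set rTargetRow = Me.ListObjects(1).HeaderRowRange.Offset(1).Resize(1, 4)
    For iSourceRow = 1 To lNSourceRows
        For iColumn = 0 To lNColumnLabels - 1
            rTargetRow.Value = Array(aSource(iSourceRow, 1), _
                                     aSource(iSourceRow, 2), _
                                     rColumnLabels.Cells(iColumn + 1).Value, _
                                     aSource(iSourceRow, 3 + iColumn))
            Set rTargetRow = rTargetRow.Offset(1)
        Next iColumn
    Next iSourceRow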

Showcase app Decathlon is an example of the Arrays technique which enriches the data (by adding Event type).

Final notes

  • It is possible to develop more general functions from the methods above. However, it’s my experience that every project has its own unique characteristics, so personally I prefer to use the code above as a template and modify it as required;
  • In the procedures above we check whether the target needs more rows than Excel has available. I have never been in a situation where I needed more than a million rows, so to me this check is a bit overdone. Of course, there are situations where you could potentially need more than a million rows, but usually we only need a subset of the source data to create the report we want, so we can filter and/or aggregate the source data;
  • For the above case we could use a userform. Then we wouldn’t need a source table: from the userform we can directly update four rows in the target table and normalize ‘on the fly’. A userform has many advantages. However, it also makes things more complicated, since we need to implement functions like inserting, deleting, searching and filtering rows, which are built into a table on the worksheet. There’s no best choice; it depends on the demands of the project you’re working on;
  • There are a few other techniques to normalize source data. See for example Unpivot Shootout, with a few references to other posts on this subject;
  • Some use the term ‘unpivot’ instead of ‘normalize’. Unpivot has been a well-known function of database management systems like Oracle Database and SQL Server for more than a decade, but since Microsoft introduced the Unpivot feature in Power Query in 2013 the term has gained popularity in the Excel/BI community. Personally I prefer ‘normalize’, since ‘unpivot’ somehow suggests that it will return the source rows. Which of course it never will.
  • I hope this post helps. Excel!

    Downloads

    NormalizeMovingRanges-v1.xlsm
    NormalizeArrays-v1.xlsm

Comments

  1. AlexJ says:

    Frans,
    Why are we not using Power Query to do this? Once set up to use a source range, PQ can deliver the unpivoted result to a result table by a single VBA refresh command on the result table. I have had the same question regarding Unpivot Shootout.

    • Hi Alex,
      Thanks for your comment.
      Yes, of course, if you have Power Query installed use it. Unfortunately, most users I develop for are still using 2007, or don’t have a 2010 volume license, or are not allowed to install the Power BI tools.
      Or have a look at Decathlon (link above). Would PQ really make things easier? I don’t think so.
      Regards,
      – Frans
