As part of extending the performance profiling abilities of FastExcel, I wanted to develop a Function Profiler Map. A key component of this is to extract the names of the functions embedded in Excel formulas.
So I experimented with some different approaches:
- using Rob Van Gelder’s AudXL formula parser
- using the MID function to scan through character by character
- using byte arrays
The Test Data
For performance testing I am using a large worksheet with 638K used cells, 103K constants and 413K formulas. There are 3 functions on this sheet (710K INDEX, 5 SUM and 51K IF).
For validity testing I am using a small worksheet with some tricky formulas:
Rob Van Gelder’s AudXL formula parser
Rob converted a javascript-based Excel formula parser written by Eric Bachtal to VBA. You can download AudXL.xla from here. It is a useful tool for breaking formulas apart so that they are more readable.
So I started with that as a performance baseline: it takes 130 seconds to parse the large test worksheet.
But of course a general purpose formula parser should be slower than a purpose-built parser that only finds the functions.
Rules for Finding Functions in Formula strings
- Function names are always followed by open bracket “(“
- Function Names are always preceded by an operator
+-,/*(=><& :!^ - Formula text enclosed in single or double quotes cannot contain a function
Assumptions
- Formulas to be parsed will be obtained using Range.FormulaR1C1.
- This means that the formula string will will use American English settings and native function names.
- And I only have to parse formulas that have been validated by Excel.
- The full range of Unicode characters are allowed, so I can’t just use ASCII codes.
Download the VBA Code
You can download the routines containing the VBA from here
The First attempt: using Byte arrays
In a previous post I discussed using Byte arrays as an efficient method for processing all the characters in a string.
This is a good method for handling Unicode strings – each character gets converted into 2 separate bytes (0 to 255) which uniquely define the character and cover all the possible National Language characters in the world.
So “(” for instance is always 40 and 0 and
Dim abFormula() as Byte
Dim str1 as string
str1="=NA()"
abFormula=str1
produces:
Note that the Byte arrays produced by assigning a string to a byte array are always zero-based.
You can also do the reverse: assigning a byte array to a string converts back.
The First Algorithm
So here is an algorithm that follows the rules:
Scan each character in the byte array from left to right:
If the character is a single or double quote ignore all subsequent characters until there is another single or double quote.
if the character is in the list of operators (abStartChars) then set a flag
if the character is NOT in the list of operators check if its a “(” and
if it IS a “(” and there are one or more non-operator characters preceding it we have found a function name.
The code looks like this:
Function GetFunc1(abFormula() As Byte, abStartChars() As Byte, abEndChar() As Byte, _ abQuotes() As Byte, abSQ() As Byte, jStart As Long, jEnd As Long) As String ' ' search for a function name within a byte array starting at the jStart byte position ' byte array is zero based pairs of bytes ' abFormula is a byte array version of the formula ' abStartChars is a byte array containing characters that can precede the function name ' abEndChar is a byte array containing "(" ' abQuotes is a byte array contianing a double quote ' abSQ is a byte array containing a single quote ' ' returns name of function as a string and jEnd as the byte position of "(" ' Dim j As Long Dim k As Long Dim jStartChar As Long Dim jFirst As Long Dim blStart As Boolean Dim blString As Boolean Dim abFunc() As Byte ' jFirst = jStart + 2 blString = False For j = jStart + 2 To (UBound(abFormula) - 2) Step 2 ' ' skip text strings ' If (abFormula(j) = abQuotes(0) And abFormula(j + 1) = abQuotes(1)) _ Or (abFormula(j) = abSQ(0) And abFormula(j + 1) = abSQ(1)) Then blString = Not blString End If If Not blString Then ' ' look for non startchar ' blStart = False For jStartChar = 0 To UBound(abStartChars) Step 2 If abFormula(j) = abStartChars(jStartChar) _ And abFormula(j + 1) = abStartChars(jStartChar + 1) Then blStart = True jFirst = j + 2 Exit For End If Next jStartChar If Not blStart Then If abFormula(j) = abEndChar(0) And abFormula(j + 1) = abEndChar(1) Then ' ' we have a ( ' If j > jFirst Then ' ' we have a function ' jEnd = j ' ' jend points to first byte of the ( character ' jfirst points to the first byte of the function name ' convert slice of formula to function name string ' ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte For k = 0 To UBound(abFunc) abFunc(k) = abFormula(jFirst + k) Next k GetFunc1 = abFunc Exit Function ' ElseIf abFormula(jFirst) = abEndChar(0) _ And abFormula(jFirst + 1) = abEndChar(1) Then jFirst = jFirst + 2 End If End If End If End If Next j End Function
And the driver routine just loops on all the formulas in the worksheet, calling the function parsing routine and storing/counting the functions found in a dictionary.
Sub testing1() Dim strFormula As String Dim strFunc As String Dim abFormula() As Byte Dim abStartChars() As Byte Dim abEndChar() As Byte Dim abQuotes() As Byte Dim abSQ() As Byte Dim jStart As Long Dim jEnd As Long Dim oFormulas As Range Dim oCell As Range Dim dTime As Double Dim dicFuncs As New Dictionary ' ''' characters that can come before the start of a Function name Const strStartChars1 As String = "+-,/*=><& :!^" & vbLf ' dTime = MicroTimer ' abStartChars = strStartChars1 abEndChar = "(" abQuotes = Chr(34) abSQ = Chr(39) ' dicFuncs.CompareMode = TextCompare ' Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23) ' For Each oCell In oFormulas strFormula = oCell.FormulaR1C1 abFormula = strFormula jStart = 0 jEnd = 0 Do strFunc = UCase(GetFunc1(abFormula, abStartChars, abEndChar, abQuotes, abSQ, jStart, jEnd)) If LenB(strFunc) = 0 Then Exit Do jStart = jEnd ' ' add Function to dictionary and count occurrences ' If dicFuncs.Exists(strFunc) Then dicFuncs(strFunc) = dicFuncs(strFunc) + 1 Else dicFuncs.Add strFunc, 1 End If Loop Next oCell ' MsgBox MicroTimer - dTime End Sub
This version takes 48 seconds: better but with room for improvement.
The (nearly) Final Algorithm
- I don’t need to worry about upper-lower case Function names, since Excel already takes care of that, so I can use the default Binary Compare for the dictionary
- All the special characters for operators etc always have byte 2 =0 so I only need to test that once per character rather on each comparison.
- If I first search left to right for the “(” character and then work backwards looking for an operator character the function does many fewer comparison operations, because there is only 1 end character but many start characters.
The resulting VBA code looks like this:
Function GetFunc3(abFormula() As Byte, abStartChars() As Byte, jStart As Long, jEnd As Long) As String ' ' search for a function name within a byte array bstarting at the jStart byte position ' byte array is zero based pairs of bytes ' abFormula is a byte array version of the formula ' abstartchars is a byte array containing characters that can precede the function name ' ' returns a string name of function and jEnd as the position of "(" ' Dim j As Long Dim k As Long Dim jj As Long Dim jStartChar As Long Dim jFirst As Long Dim blStart As Boolean Dim blDoubleQ As Boolean Dim blSingleQ As Boolean Dim abFunc() As Byte ' jFirst = jStart + 2 blDoubleQ = False For j = jStart + 2 To (UBound(abFormula) - 2) Step 2 ' ' start and end characters always have byte 2 =0 ' If abFormula(j + 1) = 0 Then ' ' skip text strings ' If abFormula(j) = 39 Then blSingleQ = Not blSingleQ If Not blSingleQ Then If abFormula(j) = 34 Then blDoubleQ = Not blDoubleQ If Not blDoubleQ Then ' ' look for ( ' If abFormula(j) = 40 Then ' ' we have a ( ' look backwards for a startchar ' blStart = False For jj = j - 2 To jStart Step -2 For jStartChar = 0 To UBound(abStartChars) Step 2 If abFormula(jj) = abStartChars(jStartChar) Then blStart = True jFirst = jj + 2 Exit For End If Next jStartChar If blStart Then Exit For Next jj If blStart Then If j > jFirst Then ' ' we have a function ' jEnd = j Exit For ElseIf abFormula(jFirst) = 40 Then jFirst = jFirst + 2 End If End If End If End If End If End If Next j If blStart And jEnd > jFirst Then ' ' convert slice of formula to function name string ' jend points to first byte of the ( character ' jfirst points to the first byte of the function name ' ReDim abFunc(0 To (jEnd - jFirst - 1)) As Byte For k = 0 To UBound(abFunc) abFunc(k) = abFormula(jFirst + k) Next k GetFunc3 = abFunc End If End Function
And the corresponding driver routine looks like this:
Sub testing3() Dim strFunc As String Dim abFormula() As Byte Dim abStartChars() As Byte Dim jStart As Long Dim jEnd As Long Dim oFormulas As Range Dim oCell As Range Dim dTime As Double Dim dicFuncs As New Dictionary ' ''' characters that can come before the start of a Function name Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf ' dTime = MicroTimer ' abStartChars = strStartChars2 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23) ' For Each oCell In oFormulas abFormula = oCell.FormulaR1C1 jStart = 0 jEnd = 0 Do strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd) If LenB(strFunc) = 0 Then Exit Do jStart = jEnd If dicFuncs.Exists(strFunc) Then dicFuncs(strFunc) = dicFuncs(strFunc) + 1 Else dicFuncs.Add strFunc, 1 End If Loop Next oCell MsgBox MicroTimer - dTime End Sub
This version takes 11.6 seconds: but maybe I can make it faster?
Using MID$() instead of a Byte array
What happens if I forget about all this Byte array stuff and just implement the same algorithm using MID$() to extract each character from the formula and INSTR to check against the list of start operators (start characters)?
The GetFunc VBA code looks like this:
Function GetFunc4(strFormula As String, strStartChars As String, strEndChar As String, strQuotes As String, strSQ As String, jStart As Long, jEnd As Long) As String ' ' search for a function name within a formula string starting at the jStart character position ' ' strStartChars is a string containing characters that can precede the function name ' ' returns a string name of function and jEnd as the position of "(" ' Dim j As Long Dim k As Long Dim jj As Long Dim jStartChar As Long Dim jFirst As Long Dim blStart As Boolean Dim blDoubleQ As Boolean Dim blSingleQ As Boolean Dim abFunc() As Byte Dim strChar As String Dim iStartChar As Long ' jFirst = jStart + 1 blDoubleQ = False For j = jStart + 1 To (LenB(strFormula) - 1) strChar = Mid$(strFormula, j, 1) ' ' skip text strings ' If strChar = strSQ Then blSingleQ = Not blSingleQ If Not blSingleQ Then If strChar = strQuotes Then blDoubleQ = Not blDoubleQ If Not blDoubleQ Then ' ' look for ( ' If strChar = strEndChar Then ' ' we have a ( ' look backwards for a startchar ' blStart = False For jj = j - 1 To jStart Step -1 strChar = Mid$(strFormula, jj, 1) iStartChar = InStrB(strStartChars, strChar) If iStartChar > 0 Then blStart = True jFirst = jj + 1 Exit For End If Next jj If blStart Then If j > jFirst Then ' ' we have a function ' jEnd = j ' ' convert slice of formula to function name string ' jend points to first byte of the ( character ' jfirst points to the first byte of the function name GetFunc4 = Mid$(strFormula, jFirst, jEnd - jFirst) Exit Function ElseIf Mid$(strFormula, jFirst, 1) = strEndChar Then jFirst = jFirst + 1 End If End If End If End If End If Next j End Function
Using MID$ and INSTR is slower: it takes 21.8 seconds.
Optimising the Driver Routine.
OK so I am reasonably happy with the parsing routine: I have gone from 130 seconds for a generic parsing routine to 11.6 seconds for the specialised Byte array routine.
But the driver routine looks at every single formula on the worksheet, and we know that on most large worksheets a large percentage of the formulas are copied.
So I can use the dictionary approach to find the distinct formulas (use R1C1 rather than A1 mode because copied formulas are identical in R1C1) and just parse those:
Testing3C also makes some more speed improvements by looping on each area and getting the formulas into a variant array rather than looping directly on the cells.
Sub testing3c() Dim strFunc As String Dim abFormula() As Byte Dim abStartChars() As Byte Dim jStart As Long Dim jEnd As Long Dim oFormulas As Range Dim oCell As Range Dim oArea As Range Dim vArr As Variant Dim vF As Variant Dim dTime As Double Dim dtotTime As Double Dim dicFuncs As New Dictionary Dim dicFormulas As New Dictionary Dim j As Long Dim k As Long ' Const strStartChars2 As String = "+-,/*(=><& :!^" & vbLf ''' characters that can come before the start of a Function name ' dTime = MicroTimer ' abStartChars = strStartChars2 Set oFormulas = ActiveSheet.UsedRange.SpecialCells(xlCellTypeFormulas, 23) ' ' find distinct formulas in areas and count occurrences ' For Each oArea In oFormulas.Areas ' ' get a block of formulas ' vArr = oArea.FormulaR1C1 If IsArray(vArr) Then For k = 1 To UBound(vArr, 2) For j = 1 To UBound(vArr) If dicFormulas.Exists(vArr(j, k)) Then dicFormulas(vArr(j, k)) = dicFormulas(vArr(j, k)) + 1 Else dicFormulas.Add vArr(j, k), 1 End If Next j Next k Else If dicFormulas.Exists(vArr) Then dicFormulas(vArr) = dicFormulas(vArr) + 1 Else dicFormulas.Add vArr, 1 End If End If Next oArea ' ' parse only the distinct formulas ' For Each vF In dicFormulas abFormula = vF jStart = 0 jEnd = 0 Do strFunc = GetFunc3(abFormula, abStartChars, jStart, jEnd) If LenB(strFunc) = 0 Then Exit Do jStart = jEnd If dicFuncs.Exists(strFunc) Then dicFuncs(strFunc) = dicFuncs(strFunc) + dicFormulas(vF) Else dicFuncs.Add strFunc, dicFormulas(vF) End If Loop Next vF MsgBox MicroTimer - dTime End Sub
Now this takes only 3.5 seconds: adding a formula string to a dictionary is much faster than parsing it!
Conclusion
- Byte Arrays can be significantly faster for character by character operations rather than using MID and INSTR.
- Looking at how the first attempt at an algorithm works can give you clues about how to improve it.
- Even using Byte arrays string parsing operations are slow in VBA.
- Using a dictionary its really fast to find the distinct formulas even on a large worksheet.
Challenge
OK guys: who can write a faster VBA function parsing routine?
Download the routines from here
