Channel: Excel and UDF Performance Stuff

Writing Efficient UDFs Part 12: Getting Used Range Fast using Application Events and a Cache

In the previous post I suggested that one good way to speed up retrieval of the Used Range last row would be to use a Cache and the AfterCalculate Application event.

I have now tested this approach and it works well: here is the code for the demo function GetUsedRows3:
Option Explicit
'
' create module level array for cache
'
Dim UsedRows(1 To 1000, 1 To 2) As Variant

Public Function GetUsedRows3(theRng As Range)
    ' store & retrieve used range rows if Excel 2007 & later
    Dim strBookSheet As String
    Dim j As Long
    Dim nFilled As Long
    Dim nRows As Long
    ' create label for this workbook & sheet
    strBookSheet = Application.Caller.Parent.Parent.Name & "_" & Application.Caller.Parent.Name
    If Val(Application.Version) >= 12 Then
        ' look in cache
        For j = LBound(UsedRows) To UBound(UsedRows)
            If Len(UsedRows(j, 1)) > 0 Then
                nFilled = nFilled + 1
                If UsedRows(j, 1) = strBookSheet Then
                    ' found
                    GetUsedRows3 = UsedRows(j, 2)
                    Exit Function
                End If
            Else
                ' exit loop at first empty row
                Exit For
            End If
        Next j
    End If
    ' find used rows
    nRows = theRng.Parent.UsedRange.Rows.Count
    '
    If Val(Application.Version) >= 12 Then
        ' store in cache
        nFilled = nFilled + 1
        If nFilled <= UBound(UsedRows) Then
            UsedRows(nFilled, 1) = strBookSheet
            UsedRows(nFilled, 2) = nRows
        End If
    End If
    '
    GetUsedRows3 = nRows
End Function

Sub ClearCache()
    '
    ' empty the first row of the used-range cache
    '
    UsedRows(1, 1) = ""
End Sub

Note: there is no error handling in this code!

Start by defining a module level array (UsedRows) with 1000 rows and 2 columns. Each row will hold a key in column 1 (book name and sheet name) and the number of rows in the used range for that sheet in that book in column 2. I have assumed that we will only cache the first 1000 worksheets containing these UDFs!
The key or label is created by concatenating the name of the workbook (the parent of the parent of the calling cell) and the name of the worksheet (the parent of the calling cell), separated by an underscore.
Then loop down the UsedRows array looking for the key, but exit the loop at the first empty row.

If the key is found, retrieve the number of rows in the used range from column 2, return it as the result of the function and exit the function.

Otherwise find the number of rows in the used range, store it in the next row of the UsedRows cache and return it as the result of the function.
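The same cache-and-clear pattern can be sketched outside VBA. The Python below is only an illustration, not the code from this post: slow_used_rows is a hypothetical stand-in for the expensive UsedRange call, the sheets dictionary is made-up test data, and a dictionary keyed by (book, sheet) replaces the 1000-row array.

```python
# Illustrative sketch only: names and data here are assumptions, not Excel APIs.
calls = {"count": 0}          # counts how often the slow path runs
used_rows_cache = {}          # (book, sheet) -> cached row count


def slow_used_rows(book, sheet, sheets):
    """Hypothetical stand-in for theRng.Parent.UsedRange.Rows.Count."""
    calls["count"] += 1
    return sheets[(book, sheet)]


def get_used_rows(book, sheet, sheets):
    """Return the cached row count, computing it only on a cache miss."""
    key = (book, sheet)
    if key not in used_rows_cache:
        used_rows_cache[key] = slow_used_rows(book, sheet, sheets)
    return used_rows_cache[key]


def clear_cache():
    """What an AfterCalculate handler would call: forget every entry."""
    used_rows_cache.clear()
```

Using a (book, sheet) pair as the key also avoids two different book and sheet names accidentally producing the same concatenated label.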

Only for Excel 2007 or later

You can see that the function only operates the cache for Excel 2007 and later versions. There are two reasons for this:

  • Excel 2003 and earlier have a maximum of 64K rows so finding the used range is relatively fast anyway.
  • Only Excel 2007 and later have the AfterCalculate event which will be used to empty the cache after each calculate.

We need to empty the cache after each calculate because the user might alter the used range and so the safe thing to do is to recreate the cache at each calculation.
AfterCalculate is an Application Level event which is triggered after completion of a calculation and associated queries and refreshes. (A BeforeCalculate event would be even more useful but does not exist!)

Using the AfterCalculate Application Event.

Chip Pearson has an excellent page on Application Events. I always consult it when I need application events because I can never remember exactly how to do it!

First I added a Class Module called AppEvents with code like this:
Option Explicit
Private WithEvents App As Application

Private Sub Class_Initialize()
    Set App = Application
End Sub

Private Sub App_AfterCalculate()
    ClearCache
End Sub

Then I added some code to the ThisWorkbook module:

Option Explicit
Private XLAppEvents As AppEvents

Private Sub Workbook_Open()
    Set XLAppEvents = New AppEvents
End Sub

This sets up the hooks that are needed for Application level events. Quite a lot of code just to run the ClearCache sub after each calculation!
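ClearCache just empties the first key in the cache so that the find loop in GetUsedRows3 exits straight away. Note that the later rows keep their old contents, so once a new key has been stored in row 1 a stale entry for a different sheet in row 2 onwards can still be found; clearing the whole array (for example with Erase UsedRows) avoids this.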

This code is ignored in Excel 2003 and earlier: since the AfterCalculate event does not exist it never gets called but still compiles OK.

Performance of GetUsedRows3

For 640K rows of data, 1000 calls to GetUsedRows3 take 66 milliseconds. The original CountUsedRows function took 33 seconds.
That’s a speedup factor of 500!



Pinot Noir Tasting: Old World 95 to 2004 versus New World 2006 to 2008

Some of you will know that I am a bit of a fanatic about Pinot Noir. Once you get hooked by the (often expensive) wine made from this grape everything else seems slightly second-best.
Once or twice a year a group of us get together for a wine-tasting evening. The last event was Côtes-du-Rhône and other Grenache-Syrah-Mourvèdre wines from around the world, but last Saturday was an event we had been talking about for a few years: the up-market Pinot Noir evening.

We selected 8 Pinot Noirs, 4 from France, 3 from New Zealand and 1 from America:

The oldest 4 wines are French, and these wines are made for lengthy cellaring with the aim of developing the complexity that only time (usually 10 years+) can bring. I included the 2004 Echezeaux even though it has not yet reached its recommended drinking window just to make the age comparison more interesting.
The youngest 4 wines are all New World, and are made with a more fruit-driven approach for earlier drinking (but these up-market New World Pinots will also continue to improve for a long time). The Bald Hills and Cornish Point both come from the Bannockburn district of New Zealand’s Central Otago in South Island, whereas the Schubert comes from Martinborough in North Island. The Au Bon Climat comes from California.

We usually taste the wines 4 at a time in 4 separate wine-glasses to make it easier to compare side-by-side, and this time we decided to go in strict chronological order.
The bottles were opened at about 6PM and the tasting started at 8.30.

We use a simple scoring system:

Tick one word, score points for pleasure, circle descriptions.

Name of wine:

SIGHT: Score (max 4)
  CLARITY: cloudy, bitty, dull, clear, brilliant
  DEPTH of COLOUR: watery, pale, medium, deep, dark
  COLOUR: purple, purple/red, red, red/brown
  VISCOSITY: slight sparkle, watery, normal, heavy, oily
  Descriptions: starbright, tuile, straw, amber, tawny, ruby, garnet, oeil de perdrix, hazy, opaque

SMELL: Score (max 4)
  GENERAL APPEAL: neutral, clean, attractive, outstanding
  Off (yeasty, acetic, oxidized, woody, …)
  FRUIT AROMA: none, slight, positive, identifiable
  BOUQUET: none, pleasant, complex, powerful
  Descriptions: cedarwood, corky, woody, dumb, flowery, smoky, honeyed, lemony, spicy, mouldy, peardrops, sulphury

TASTE: Score (max 9)
  TANNIN: astringent, hard, dry, soft
  ACIDITY: flat, refreshing, marked, tart
  BODY: very light & thin, light, medium, full bodied, heavy
  LENGTH: short, acceptable, extended, lingering
  BALANCE: unbalanced, good, v well balanced, perfect
  Descriptions: appley, bitter, burning, blackcurrants, caramel, dumb, earthy, fat, flinty, green, heady, inky, flabby, mellow, metallic, mouldy, nutty, salty, sappy, silky, spicy, fleshy, woody, watery

OVERALL QUALITY: Score (max 3)
  Coarse, poor, acceptable, fine, outstanding
  Descriptions: supple, finesse, breed, elegance, harmonious, rich, delicate

Total Score (total out of 20)

In the first flight of 4 we all found the 1995 Volnay disappointing, although its colour was still a healthy red. The 1996 Chassagne Montrachet was well liked by everyone and was the winner of this group. The 2004 Echezeaux showed more fruit than the others and was excellent, but undoubtedly will be better drunk in a few years’ time.

The second flight of 4 were all well-liked. The 2008 Cornish Point was the winner in this group.

The 8 of us in the wine-tasting group have slightly different tastes: just over half tend to prefer the Old World style and the rest tend to prefer New World. But the scoring was reasonably consistent (well OK it did get slightly ragged towards the end of the evening …).

The overall winner by a narrow margin (declared sometime around midnight) was Felton Road’s Cornish Point.
The first time I drank this wine (maybe 6 years ago?) I was amazed at its depth of flavour and length of taste, and it’s still one of my all-time favourites.

[Image: Cornish_Point]


The SpeedTools FILTERIFS function: Design and Implementation Part 1

Excel users have been using SUMPRODUCT and array formulas to create multiple-condition formulas for many years. This is a powerful technique, but can be painfully slow with large amounts of data. Pivot Tables and Excel 2013’s PowerPivot can provide good solutions in some instances, and the introduction of SUMIFS in Excel 2007 gave a fast alternative for some scenarios.

But there is still a need for a powerful, dynamic function that can perform better than SUMPRODUCT/array formulas, so let me introduce my attempt at creating one: FILTERIFS.

FILTERIFS Design Objectives

  • Speed of calculation – multi-threaded, exploit sorted data and clustered data.
  • Extended criterion types to include AND/OR, Lists, Wild Card Patterns, Regular Expressions, Calculated Columns and Arrays etc.
  • Dynamic calculation in the same way as other Excel functions.
  • Extend multiple conditions to many more functions by outputting an array to other functions such as SUM, MEDIAN, LISTDISTINCTS, VSORT etc, or directly as a multi-cell array formula.

The original implementation was done using a VB6 automation addin, but lack of multi-threading and 64-bit support in VB6 led me to re-implement as a C++ XLL.

So how do you make it fast?

The idea is to process each criterion in turn using only the rows that meet all the criteria processed so far, thus avoiding the SUMPRODUCT/array formula approach of evaluating all the criteria for all the rows.
Criteria operating on sorted columns are processed first using a fast High-Low binary search modified for relational operators.
Non-sorted columns and criterion types like Regex are then processed using linear search in a sequence designed to minimise data transfer/coercion time.
And using a C++ XLL allows multi-threading and fast execution.
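To make the idea concrete, here is a minimal Python sketch of processing criteria one at a time on a shrinking set of rows. It is not the FILTERIFS implementation (which is a C++ XLL): the function name filter_rows, the equality-only sorted criterion and the sample data in the usage note are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def filter_rows(sorted_col, sorted_value, other_tests):
    """Return the row indexes where sorted_col == sorted_value and every other test passes.

    sorted_col  -- a list sorted ascending (the "sorted criterion column")
    other_tests -- list of (column, predicate) pairs for unsorted columns
    """
    # Step 1: binary search narrows the candidates to one contiguous block.
    lo = bisect_left(sorted_col, sorted_value)
    hi = bisect_right(sorted_col, sorted_value)
    rows = list(range(lo, hi))
    # Step 2: each later criterion only looks at the rows that survived so far.
    for column, predicate in other_tests:
        rows = [r for r in rows if predicate(column[r])]
        if not rows:
            break  # nothing left to test
    return rows
```

For example, with state = ["FL", "FL", "NY", "NY", "NY", "TX"] and amount = [10, 200, 5, 300, 150, 400], the call filter_rows(state, "NY", [(amount, lambda v: v > 100)]) only evaluates the amount test on the three NY rows and returns rows 3 and 4 (zero-based).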

FILTERIFS Syntax

The syntax uses a similar approach to SUMIFS to pass the criterion as a string concatenation of a relational operator and a value. Because the value is passed as a string FILTERIFS has to do some datatype conversions of the value to match the datatype of the criterion column (and hopefully avoid some of the SUMIFS bugs in this area).

FILTERIFS( nSortedCols, InputRange, ReturnCol, CriteriaColumn1, Criteria1,
CriteriaColumn2, Criteria2, … , ["#OR#", nSortedCols,] CriteriaColumnx, Criteriax, …)

nSortedCols gives the number of columns which are sorted in the InputRange

InputRange is a range reference to the data containing the sorted columns and return column.
The data can contain a header row of names for the columns.

ReturnCol is the header name or number of the column within InputRange to return results from.

Criteria Column gives either the name/number of a column in InputRange, or a range reference to an independent column, or an array or an expression returning a column of data to be used as the criterion column.

Criteria is the expression used to filter the criterion column.
This can be a relational operator (=, >=, <=, >, <, ¬=, ~, ¬~, ~~) and a value, where ¬ means NOT, ~ means LIKE, and ~~ means Regex.
It can also be a LIST of alternatives to look for, given either as an array ({"FL","NY","TX"} or, with a relational operator, {"~ABC*","~DEF*"}) or as a range reference.

#OR# allows you to have multiple alternative sets of criteria.

FILTERIFS Components

To deliver this fairly complex set of capabilities the function is broken down into a number of component blocks. These are the major ones:

  • Handle any header row column names & translate column names and numbers to column indexes.
  • Parse and analyse the criteria, storing the result in an array of Criterion structures
  • Data Type detection and type-casting of the criteria values
  • Find the optimum sequence to process the criteria
  • Row-Pairs class to store first-row last-row pairs for the rows that meet the criteria. Methods for this class include Append, Merge, Condense, CountPairs, CountRows etc.
  • High-Low binary search for the sorted criteria
  • Translate High-Low to rowpairs using the relational operators
  • Determine optimum data-transfer/coercion strategy and sequencing for the non-sorted criteria
  • Linear Search on row-pairs for non-sorted criteria
  • Comparison functions for the various Criterion operators.
  • Conversion of row-pairs to results

These components currently result in just under 5000 lines of code.
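As a rough illustration of the "High-Low to row-pairs" step, the Python sketch below turns a binary search result on a sorted column into first-row/last-row pairs for each relational operator. The function names row_pairs and count_rows are hypothetical, and the real Row-Pairs class (with Append, Merge, Condense and so on) is certainly more elaborate than this.

```python
from bisect import bisect_left, bisect_right


def row_pairs(sorted_col, op, value):
    """Return inclusive, zero-based (first_row, last_row) pairs matching op and value."""
    n = len(sorted_col)
    low = bisect_left(sorted_col, value)    # first row with col >= value
    high = bisect_right(sorted_col, value)  # first row with col > value
    half_open = {
        "=": [(low, high)],
        "<": [(0, low)],
        "<=": [(0, high)],
        ">": [(high, n)],
        ">=": [(low, n)],
        "¬=": [(0, low), (high, n)],        # NOT equal gives up to two blocks
    }[op]
    # Convert half-open ranges to inclusive pairs and drop the empty ones.
    return [(a, b - 1) for a, b in half_open if a < b]


def count_rows(pairs):
    """Total number of rows covered by a list of inclusive pairs."""
    return sum(last - first + 1 for first, last in pairs)
```

Storing pairs rather than individual row numbers keeps the result small when the matching rows are contiguous, which is always the case for a sorted column.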

FILTERIFS Status

As at December 2012 the function is coded and the first phase of testing has been completed. It has taken considerably longer than planned, mainly because the VB6 version made extensive use of EVALUATE, which turned out not to be allowed to multi-thread in C++ and so I had to redesign most of the approach for non-sorted columns.

There is still some performance testing, refactoring and rework to be done but the target is to start Beta3 in early January 2013.

(Assuming that the Mayan calendar is wrong in predicting the end of the world today, Friday 21 December 2012).


The SpeedTools FILTER.IFS Function Design Part 2: Excel Data Types – When is a Number a String?

Excel Data Types

Excel has only 4 or 5 native data types:

  • Numbers (which can be formatted as Dates, Times, Currency, Integers, Doubles etc, but are all held internally as floating point doubles)
  • Strings (Text including zero length strings like “”)
  • Booleans (True or False)
  • Errors (#N/A, #DIV/0! etc)
  • Empty (which annoyingly is only partly supported: for instance you can’t return it from a function or a formula)

You can format all these data types in lots of different ways so that they look different, but a Cell’s underlying value is always going to be one of these types.

And unlike most Database systems Excel allows the cells in a column to contain multiple data types.
This can lead to problems: the most frequent one being a column of numbers some of which have been entered as text strings and some as real numbers. Usually you can visually see them because the numbers that are text are left-aligned in the cell and the real numbers are right-aligned.

Numbers as text can arrive in Excel in various ways:

  • Start by entering a ‘ followed by the number
  • Format the cell as text
  • Data imported from external sources

Sorting Columns containing multiple data types

When Excel sorts data containing different data types it uses this relationship between types:

Numbers<Strings<Booleans<Errors

Empty cells are always sorted last, both in Ascending and Descending sorts!
When you sort data containing numbers stored as text strings Excel asks you if you want to sort Text numbers as text or as numbers. Usually it’s better to sort text numbers as text rather than risk confusing any subsequent operation that relies on things being properly sorted.
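The ordering rule can be written down as a sort key. This Python sketch is only a model of the rule stated above, under several assumptions: XlError is a hypothetical class standing in for Excel error values, None stands in for an empty cell, text is compared case-insensitively with casefold, and errors keep their original relative order.

```python
class XlError:
    """Hypothetical stand-in for an Excel error value such as #N/A."""

    def __init__(self, code):
        self.code = code

    def __repr__(self):
        return self.code


def sort_key(value):
    """Rank by type first (Numbers < Strings < Booleans < Errors), then by value."""
    if isinstance(value, XlError):
        return (3, 0)
    if isinstance(value, bool):      # test bool before numbers: bool is an int subclass
        return (2, value)
    if isinstance(value, str):
        return (1, value.casefold())
    return (0, value)                # int or float


def excel_sort(values, descending=False):
    """Sort the non-empty cells and always put the empty cells (None) last."""
    filled = [v for v in values if v is not None]
    empties = [v for v in values if v is None]
    filled.sort(key=sort_key, reverse=descending)
    return filled + empties
```

Because the empty cells are split off before sorting, they end up last for both ascending and descending sorts, as described above.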

Comparing Data Types

If you use a simple formula (=A6<A5)  to compare data types you get this:

[Image: DataTypes1]

You can see that XYZ is >= the empty cell above it, but ABCD is <XYZ.
Numeric 1234 in A9 is less than text string 1234 in A8
A12 entered as a ‘ is a zero-length string and is > the number in A11.
The errors in A16:A18 propagate so you can’t see how Excel compares them.

So if you get rid of the error cells, sort the data and change the formula so it’s looking for A6>A5 you get this:

[Image: DataTypes2]

So the formula comparison precedence rules are the same as the Excel sorting rules, except for empty cells!

FILTER.IFS Data Type Comparison Operators

It’s useful to be able to filter by data type (although the standard Excel Filter command does not have this option), so I added some type filtering operators:

  • #ERR – filters all the error cells
  • #TXT – filters all the string/text cells
  • #N – filters all the number cells
  • #BOOL – filters all the True/False cells
  • #EMPTY – filters all the empty cells
  • #ZLS – filters all the cells containing a zero-length string
  • #TYPE – filters all the cells that have the same data type as the first cell in the filtered range
  • #BLANK – filters all the cells that contain one or more blanks or spaces

You can prefix these operators with ¬ to filter out everything that does NOT match the data type.

And you can have a list of multiple filtering operators: {"¬","#EMPTY","#ZLS","#BLANK","#ERR"} would exclude empty cells, cells with zero length strings or blanks, and cells with errors.

Here is an example:

[Image: DataTypes4]

The FILTER.IFS formula is =FILTER.IFS(0,$A$6:$B$17,2,1,D$5) entered as an array formula (Control-Shift-Enter) and copied across.

  • The 0 says there are no sorted Criterion columns (because the type filters don’t care if the data is sorted or not).
  • $A$6:$B$17 gives the range to be filtered
  • 2 gives the column within the data range to be returned as the answer
  • 1 gives the column within the data range to be filtered using the criterion
  • D$5 gives the cell containing the criterion itself

Handling Data Types with the Relational Operators <,<=,>,>=,¬=

Suppose you create a FILTER.IFS formula like this: =FILTER.IFS(1,$A$6:$B$17,2,1,"<1235")

The criterion says less than 1235, but which 1235 – the numeric one or the string one or both?

I don’t think there is necessarily a “correct” answer to this, so I invented a rule!

If the Criterion value can be converted into more than one data type (in this case a string and a number) choose the same data type as the first cell in the column.

In this case the first cell is a number, so FILTER.IFS chose to use numeric 1235, which results in a single result, the 1 from row 6.
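Here is a minimal Python sketch of that rule, written only to illustrate the decision and not taken from the FILTER.IFS source. The function name choose_criterion_value is hypothetical, only the number/string ambiguity is modelled, and falling back to the string when the first cell is neither a number nor a string is my assumption.

```python
def choose_criterion_value(text, column):
    """Pick the typed value a criterion string should be compared as.

    text   -- the value part of the criterion, e.g. "1235" from "<1235"
    column -- the criterion column as a list of cell values
    """
    try:
        number = float(text)
    except ValueError:
        return text                  # only one interpretation: a string
    first = column[0]
    # bool is excluded explicitly because Python treats it as an int subclass
    first_is_number = isinstance(first, (int, float)) and not isinstance(first, bool)
    return number if first_is_number else text
```

So choose_criterion_value("1235", [1, 1234, "1234"]) gives the number 1235.0 because the column starts with a number, while the same criterion against a column starting with the text "1234" stays the string "1235".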

Because the data is sorted the binary search routine has to use a single datatype, so looking for both the string 1235 and the numeric 1235 is not an option.

But if the data is NOT sorted a linear search can find both: so you can tell FILTER.IFS to compare using ALL the available datatypes by using an & prefix.

=FILTER.IFS(0,$A$6:$B$17,2,1,"&<1235")

you get ALL the data which is less than numeric 1235 AND all the data that is less than string 1235.

Of course, if you don’t use any of the Criterion operators it finds only the matching data type: =FILTER.IFS(1,$A$6:$B$17,2,1,"1235") or =FILTER.IFS(1,$A$6:$B$17,2,1,1235).

Conclusion

Using mixed data types with relational operators can be tricky – sometimes it’s difficult to work out what Excel is doing.
A drawback of following the same kind of syntax as SUMIFS (a string containing both the relational operator and the value) is that there is no clear datatype choice.

But I was not sure that the previous FILTER.IFS design, which could give different results for sorted and unsorted data, made sense, so I changed it so that sorted and unsorted data give the same results and added the & prefix.
What do you think?


The SpeedTools FILTER.IFS Function Design Part 3: Excel Data Types – Strange COUNTIF behaviour

The previous post discussed Excel’s data types and how FILTER.IFS was designed to handle them.

Colin Legg suggested that a good starting point for the design choices could be what COUNTIF/SUMIF do. So here is an example of some of the problems with COUNTIF, and what the equivalent SpeedTools function ACOUNTIFS does. (ACOUNTIFS uses the same filtering engine as FILTER.IFS).

Using COUNTIF with Number Strings

Suppose you have a list of zero-prefixed numbers, headed DATA, and you want to count how many of each of the numbers there are:
Each zero-prefixed number is unique apart from 0012345 which appears twice, in the second and third rows (51 and 52).
So I created a COUNTIF formula to count the number of occurrences in the list for each number, using each of the different criteria operators.

If COUNTIF works correctly in this situation the answer should be {1;2;2;1;1}, but as you can see below it gets it wrong!

[Image: DataTypes6]

Each row in the table tries to count how many of the corresponding cell can be found using the relational operator.

COUNTIF always gives 5 when using =, so I think it must be converting ALL the text in both the data and the criteria to numbers.
And < and > always give zero because COUNTIF thinks all the data cells contain the same thing (a number 12345).

But COUNTIF($A$50:$A$54,"<>" & $A50) also gives 5! This looks like a BUG to me.

Here is what the SpeedTools ACOUNTIFS function gives:

[Image: DataTypes7]

ACOUNTIFS treats the text numbers as text numbers and so gives what looks to me like a more “correct” answer for all the relational operators.
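The difference between the two behaviours for the = operator can be modelled in a few lines of Python. This is a sketch of the hypothesis above (that COUNTIF coerces text to numbers), not a description of how either function is actually implemented, and the five data values are made up to match the description of the list.

```python
# Hypothetical data: zero-prefixed text numbers, with 0012345 appearing twice.
data = ["012345", "0012345", "0012345", "00012345", "000012345"]


def count_as_numbers(cells, criterion):
    """Count matches after converting both sides to numbers (the suspected COUNTIF behaviour)."""
    target = float(criterion)
    return sum(1 for cell in cells if float(cell) == target)


def count_as_text(cells, criterion):
    """Count matches comparing the text exactly as entered."""
    return sum(1 for cell in cells if cell == criterion)


by_number = [count_as_numbers(data, cell) for cell in data]
by_text = [count_as_text(data, cell) for cell in data]
```

Here by_number is [5, 5, 5, 5, 5], because every cell converts to 12345, while by_text is [1, 2, 2, 1, 1], the answer expected above.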

Conclusion

Using COUNTIF/SUMIF/COUNTIFS/SUMIFS with mixed data types looks very unwise to me!

But maybe you can figure out a way to make them work sensibly?


Volatile Dependencies, Indirect Dependencies, False Dependencies – When Dependencies Don’t Work the Way You Think They Should

It’s always a convenient shorthand to say that UDFs and formulas are recalculated when one of their arguments (or a precedent further upstream in the calculation chain) changes.

But in fact that turns out to be a bit of an oversimplification of how Excel works.

The Test Setup

I have 2 UDFs in a standard VBA module:

[Image: Depends1]

I have used F9 in the VBE to set a breakpoint in each UDF, so that the VBE switches to debug mode whenever either of these UDFs executes.
The first UDF (Depends) has 2 arguments (Arg1 and Arg2), but only the first of them (Arg1) is actually used by the UDF. The second UDF (Depends2) uses both the arguments.

The Excel sheet has 2 sets of data for Arg1 and Arg2 and then calls both the UDFs. Calculation is set to Automatic.

[Image: Depends2]

The result of Depends is 6, and of Depends2 is 24.

When you press F9 nothing happens because nothing has changed to cause a recalculation.

Changing Upstream Precedents

  • When you change cell A2 from 1 to 2 the Depends2 UDF calculates first and then Depends calculates second (assuming you entered the Depends formula in D3 before the Depends2 formula in D6 – Excel calculates formulas last entered first calculated unless this sequence gets changed by dependencies or other factors).
    The values change from 24 to 25 and from 6 to 7.
  • If you change cell A2 from 2 to 2 nothing happens – Excel recognises that nothing has changed.
  • When you change B2 from 5 to 50 both UDFs recalculate, even though Depends does not need to since its result is not dependent on B2.

I call the Arg2 dependency in the first UDF (Depends) a False Dependency since it’s not actually needed.

Volatile Dependencies

Things work differently if you make one of the dependencies volatile. Let’s change cell B2 to =RAND()*100

As expected both UDFs recalculate.

Now press F9 to recalculate again without changing anything else.

Depends2 recalculates, but Depends does NOT recalculate even though a value in Arg2 of the Depends UDF has changed.

In other words if the False Dependency is Volatile it is ignored in a recalculation.
This also happens with built-in Excel functions like INDEX().
If A1 contains =NOW(), and A2:A5 contain the numbers 2 to 5 then

  • =INDEX(A1:A5,1,1) is directly dependent on volatile cell A1 and will always be recalculated.
  • =INDEX(A1:A5,3,1) is only indirectly dependent on volatile cell A1 and will NOT always be recalculated, but it will be recalculated once if, for example, cell A5 is changed, even though the answer will not change.

I call Volatile False Dependencies Indirect Dependencies.

EVALUATE and Volatile Dependencies

Stephen Gersuk discovered what looks like another bug with the EVALUATE method and volatile dependencies.

If you have a UDF like this:

Function MySum2(r As Range) As Double
    MySum2 = Evaluate("sum(" & r.Address(External:=True) & ")")
End Function


then it does not get recalculated when it has a volatile precedent and you press F9.
So this case gives you the wrong answer, because it’s not a true False Volatile Dependency: the result really does depend on the argument.

You can bypass this bug by adding anything that references the Value of a cell in the argument:

Function MySum2(r As Range) As Double
    ' touching the Value of the argument makes Excel treat it as a real dependency
    If IsEmpty(r) Then Exit Function
    MySum2 = Evaluate("sum(" & r.Address(External:=True) & ")")
End Function

But just referencing properties of the range object is not sufficient:

Function MySum3(r As Range) As Double
    Dim strAdd As String
    strAdd = r.Address(External:=True)
    MySum3 = Evaluate("sum(" & strAdd & ")")
End Function

MySum3 has the same problem.

False Dependencies and Calculation Sequence

It has been suggested that you can use False Dependencies to control the sequence in which Excel calculates formulas.

This is a dangerous idea because false dependencies on uncalculated cells cannot be recognised by Excel since it does not get a chance to discover that they are uncalculated.

Conclusion

  • Yet more reasons to avoid Volatile Functions!
  • Another EVALUATE bug!

Do you have any bad experiences with volatile functions?

 


SpeedTools Beta 3 – Win a SpeedTools Coffee Mug and SpeedTools License

FastExcel SpeedTools Beta 3

FastExcel SpeedTools Beta 3 is a state-of-the-art set of tools to help you speed up calculation of slow Excel workbooks.

Download the 30-day trial of FastExcel SpeedTools Beta 3

Download the SpeedToolsHelp file.
(you may need to unblock the downloaded .CHM file – Right-Click->Properties->Unblock)

Win one of 20 exclusive FastExcel SpeedTools coffee mugs plus a free SpeedTools License for the best 20 Beta Test reports submitted before the end of March 2013

The mug will look something like this!

Send your Beta Test reports to Charles@DecisionModels.com

Good beta test reports include the following:

  • Windows Version used
  • Excel Version used
  • SpeedTools functions tested
  • Bugs found (include enough information to enable duplication)
  • Positive and negative comments (likes & dislikes)
  • Documentation & Help file problems

You can also post in the SpeedTools Beta Test Google Groups Forum

Supercharge Excel’s calculation engine with SpeedTools

With FastExcel V3 SpeedTools you can calculate what you need, when you need, faster:

  • 90 Powerful faster-calculating functions to unblock your calculation bottlenecks.
  • New Calculation methods and modes give you greater control of calculation.
  • FastExcel V3 high-resolution timers so that you can accurately compare and contrast the calculation performance of your formulae, UDFs, worksheets and workbooks.

Choose just the speed up components you need with 4 separate SpeedTools products;

  • Or from the Toolbar in Excel 2003 and earlier:

Supports Excel 2013 and previous versions

  • Fast Multi-threaded calculation with Excel Versions 2013 64 and 32 bit, 2010 64 and 32 bit, 2007
  • Excel 2003, 2002, 2000 also supported
  • Fast calculating functions written in C++ using the XLL interface
  • Windows 8, Windows 7, Vista and XP

Want to know more?

Download the SpeedToolsHelp file
(you may need to unblock the downloaded .CHM file – Right-Click->Properties->Unblock)


SpeedTools AVLOOKUP2 & MEMLOOKUP versus VLOOKUP – Performance, Power and Ease-of-Use Shootout Part 1

It’s time for some performance tests to see how the new functions in SpeedTools stack up against the standard Excel functions. First up is MEMLOOKUP and AVLOOKUP2 vs VLOOKUP!

SpeedTools Lookups are easier to use, more powerful and less error prone than VLOOKUP or INDEX/MATCH

Having the right Default Parameters helps Ease-of-Use

Most people want their LOOKUPs to tell them when the thing they are looking for does not exist in the lookup table. And most of the time people are working with unsorted data.
Unfortunately VLOOKUP’s default settings don’t do that: it defaults to trying to give you an approximate match on sorted data.

So if you use the VLOOKUP defaults you will probably get the wrong answer!
MEMLOOKUP always does an exact match, even with sorted data (but it will still use fast binary search if you have sorted data and tell MEMLOOKUP about it).
AVLOOKUP2 also defaults to unsorted data and exact match, with an option for approximate match on sorted data if you are sure that’s what you want.

Here is an example of VLOOKUP getting the wrong answers when using its defaults:

[Image: Vlookup1]

And here is the same example showing MEMLOOKUP getting the correct results.

[Image: Memlookup1]

Simplify Lookups with built-in error handling, header labels and more!

Use both Exact Match and Approximate Match with Sorted Data

AVLOOKUP2 has separate parameters for sorted data and exact match, and can use the superfast binary search algorithm on sorted data for both exact match and approximate match.

Header Labels

Both MEMLOOKUP and AVLOOKUP2 allow you to use column labels from a header row instead of column numbers. This is easier to use and understand, and also makes your LOOKUP formulas more resistant to changes such as rearrangement of the data or extra columns appearing. And you can also use this to do 2-dimensional lookup.

Built-in Error Handling

AVLOOKUP2 allows you to specify what you want returned if no exact match can be found, avoiding the need for wrapping the LOOKUP inside an IFERROR function.

The lookup column does not have to be first

You can tell AVLOOKUP2 which column to use for the lookup using a column label or a column number.

Use multiple lookup columns without requiring slow, complicated concatenation or array formulas

AVLOOKUP2 makes it simple to use multiple lookup columns (you can use a constant array {"Jess","Ben"} or a range of cells).

Find the first, last, Nth or all Lookup matches

AVLOOKUPNTH extends AVLOOKUP2 with an extra parameter so that you can find the first, last or Nth match when you have duplicates, for text, numbers and dates etc.
AVLOOKUPS2 returns ALL the records that match the lookup criteria. You can use AVLOOKUPS2 either as a multi-cell array formula or embedded inside an aggregating function like MAX, SUM, MEDIAN etc.

Also MATCH, Case-Sensitive and Regular Expression Lookups

The SpeedTools Lookup family also includes variations for MATCH as opposed to LOOKUP, Case-Sensitive lookups and lookups using Regular Expressions.

Here are some examples of using AVLOOKUP2 to do things that are complicated, inefficient or difficult to do with VLOOKUP:

[Image: AVLOOKUP1]

Try it out yourself!

You can download a free 30-day trial of SpeedTools from the Decision Models website.

And you can also download a workbook VLOOKUP1.xlsx in MemLookup2.zip that contains all the examples used above.



SpeedTools AVLOOKUP2 & MEMLOOKUP versus VLOOKUP – Performance, Power and Ease-of-Use Shootout Part 2

In part 1 I looked at how FastExcel SpeedTools MEMLOOKUP and AVLOOKUP2 compared to VLOOKUP and INDEX/MATCH for ease of use and power.
This post will benchmark the performance of the SpeedTools lookups against the standard Excel functions.

Download the Test Workbooks to your system

You can download a free 30-day trial of SpeedTools from the Decision Models website.

The test workbooks are VLOOKUP2.xlsx and MEMLOOKUP2.xlsx in the downloadable file MemLookup2.zip .

The LOOKUP Dependency Problem

A problem with all Excel LOOKUP formulas is that even if only one of the values in the Lookup Table changes, every single LOOKUP formula that refers to the lookup table gets recalculated, although most of them will return a completely unchanged answer. When you have large amounts of data (tens or hundreds of thousands of rows) this can be very slow.

Exact Match with Sorted Data

SpeedTools MEMLOOKUP and AVLOOKUP2 both use a variation of the superfast binary search algorithm that can do exact match searches on sorted data. You can make Excel’s VLOOKUP do a similar thing by using two VLOOKUPS and an IF (see Why 2 VLOOKUPS are better than 1 VLOOKUP).

If you sort the data in the test workbook and use the sorted data option it takes about 0.14 seconds to do 20000 MEMLOOKUPs on 70000 rows on my system. This compares with about 4.25 seconds to do the same thing with VLOOKUP using the VLOOKUP exact match option. (The 2 VLOOKUPS trick is faster than MEMLOOKUP but more complicated!).
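
To show why a binary search on sorted data needs so few comparisons (at most 17 for 70000 rows, since 2^17 is larger than 70000), here is a minimal VBA sketch of an exact-match binary search. It is an illustration only and not the SpeedTools implementation: the function name BinaryExactMatch is hypothetical, and it assumes a single column sorted ascending that contains only numbers (no errors or mixed types).

```vba
Option Explicit

' Hypothetical illustration: exact-match binary search on a sorted column.
' Returns the 1-based position within theCol, or #N/A if not found.
' Assumes theCol is one column of numbers sorted in ascending order.
Public Function BinaryExactMatch(lookFor As Variant, theCol As Range) As Variant
    Dim vData As Variant
    Dim vKey As Variant
    Dim iLow As Long, iHigh As Long, iMid As Long

    vKey = lookFor                      ' takes the value if a cell was passed
    BinaryExactMatch = CVErr(xlErrNA)   ' default answer: not found

    If theCol.Rows.Count = 1 Then
        ' a single cell is not returned as an array, so handle it separately
        If theCol.Cells(1, 1).Value2 = vKey Then BinaryExactMatch = 1
        Exit Function
    End If

    vData = theCol.Columns(1).Value2    ' read the column once into an array
    iLow = 1
    iHigh = UBound(vData, 1)
    Do While iLow <= iHigh
        iMid = iLow + (iHigh - iLow) \ 2
        If vData(iMid, 1) = vKey Then
            BinaryExactMatch = iMid     ' exact match found
            Exit Function
        ElseIf vData(iMid, 1) < vKey Then
            iLow = iMid + 1             ' discard the lower half
        Else
            iHigh = iMid - 1            ' discard the upper half
        End If
    Loop
End Function
```

Each pass through the loop halves the remaining search range, which is why the row count of the table has so little effect on the time taken.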

Exact Match with Unsorted Data

But if your data is not sorted you are stuck with doing a slow linear search from the start until a match is found. The VLOOKUP2.xlsx file has 20000 VLOOKUPs on a lookup table with 70000 rows.
This calculates in 4.25 seconds on my desktop system (Intel i7 quad core 870 2.93GHz with 4 GB RAM and using Excel 2013 32-bit and Windows 7). This is actually quite fast if you consider that Excel has to make about 1100 million comparisons (so that’s 258 MXIPS – Million eXcel Instructions Per Second).

But if you do exactly the same thing (see test workbook MEMLOOKUP2.xlsx) using SpeedTools MEMLOOKUP it only takes 0.12 seconds! That’s about 35 times faster.

MemLookup2

So how does it work?

Multi-threaded XLL

The MEMLOOKUP and AVLOOKUP family of functions are implemented using a multi-threaded C++ XLL. This is the fastest available technology for extending Excel’s function library, and allows the functions to support all the Excel versions from Excel 2013 64-bit to Excel 2000.

Using Lookup Memory with MEMLOOKUP and AVLOOKUP2

MEMLOOKUP and AVLOOKUP2 store in memory the index of the lookup result for each row.
So suppose for the MEMLOOKUP on row 3 the result was found in the 47th row of the lookup table. Then MEMLOOKUP stores in memory 47 for row 3.
At the next recalculation of that formula MEMLOOKUP first looks in the memory, finds 47 and checks if the lookup column row 47 still gives the correct result.
If it does then MEMLOOKUP returns the result from the answer column of row 47 in the lookup table.
If row 47 no longer gives the correct result (because the data in the lookup column on that row in the lookup table has changed) then MEMLOOKUP does a full lookup.

This is a fail-safe and very efficient process.
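
The check-the-memory-first logic described above can be sketched in VBA. This is only an illustration of the idea, not how the C++ XLL is written: MemLookupSketch is a hypothetical name, the memory here is a Scripting.Dictionary keyed by the calling cell, and the sketch assumes it is called from a worksheet cell and that the lookup column contains no error values.

```vba
Option Explicit

' Hypothetical illustration of lookup memory: remember where the last match
' was found and re-check that row before doing a full search.
Private mMemory As Object   ' Scripting.Dictionary: calling cell -> row index

Public Function MemLookupSketch(lookFor As Variant, lookCol As Range, _
                                answerCol As Range) As Variant
    Dim sKey As String
    Dim iRow As Long
    Dim vKey As Variant
    Dim vMatch As Variant

    If mMemory Is Nothing Then Set mMemory = CreateObject("Scripting.Dictionary")
    vKey = lookFor   ' takes the value if a cell was passed

    ' one memory slot per calling cell (assumes a worksheet caller)
    sKey = Application.Caller.Address(External:=True)

    If mMemory.Exists(sKey) Then
        iRow = mMemory(sKey)
        If iRow <= lookCol.Rows.Count Then
            If lookCol.Cells(iRow, 1).Value2 = vKey Then
                ' the remembered row still matches: no search needed
                MemLookupSketch = answerCol.Cells(iRow, 1).Value2
                Exit Function
            End If
        End If
    End If

    ' memory is missing or stale: do a full exact-match lookup
    vMatch = Application.Match(vKey, lookCol, 0)
    If IsError(vMatch) Then
        If mMemory.Exists(sKey) Then mMemory.Remove sKey
        MemLookupSketch = CVErr(xlErrNA)
    Else
        mMemory(sKey) = CLng(vMatch)
        MemLookupSketch = answerCol.Cells(CLng(vMatch), 1).Value2
    End If
End Function
```

Because the remembered row is always re-checked against the lookup value before it is used, a stale memory entry can only cost time, never a wrong answer.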

Built-in Optimisation

If (as often happens) you have more than one lookup on the same row returning data from different columns then the lookup memory can be reused for the subsequent lookups. This built-in optimisation is similar to creating an extra MATCH column with several INDEX formulas referring to the MATCH, but is much simpler and more automatic.

Memory is stored with the workbook.

The lookup memory is automatically stored and retrieved with the workbook so that when you reopen a workbook your MEMLOOKUP and AVLOOKUP2 formulas will reuse the lookup memory from the previous calculation.

Memory Type Options

SpeedTools has options for 4 different kinds of lookup memory:

  • Book-Sheet-Row memory (default option): This option stores the index separately for each Workbook, Worksheet and row. This works well unless you are using lookups on multiple tables within the same row on a worksheet.
  • Named Memory: this option stores the index separately for each combination of Name, Workbook and row. Usually you would use the same name for the memory as the lookup table. This allows for optimising the re-use of the lookup memory across all the worksheets in a workbook for each lookup table, and for multiple lookups on different tables within a single formula.
  • Global memory for rows or columns: This option stores the index globally for each row or column so that it can be re-used across all open workbooks and worksheets. This is the most efficient option for a single lookup table.
  • Book-Sheet-Cell memory: this option provides the most tightly scoped memory.

Summary

The SpeedTools MEMLOOKUP and AVLOOKUP family of Lookup functions provide significant performance advantages compared to the standard Excel lookup functions, together with enhanced ease of use and extended capability.

Please try them out and let me know what you think.


SpeedTools now Live – but more Feedback needed!


SpeedTools Beta 3 has completed and SpeedTools is now live. You can download the 30-day trial from here, or purchase a license from here.

If you have not already submitted your feedback on SpeedTools it’s not too late to win a SpeedTools Coffee mug and free license!

Send your feedback to Charles@DecisionModels.com

SpeedTools Mug


Writing Efficient UDFs Part 11 – Full-Column References in UDFs: Used Range is Slow


Excel users often find it convenient to use full-column references in formulas to avoid having to adjust the formulas every time new data is added. So when you write a User Defined Function (UDF) you can expect that sooner or later someone will try to use it with a full-column reference:

=MyUDF(A:A,42)

When Excel 2007 introduced the “Big Grid” with just over 1 million rows it became even more important to handle these full-column references efficiently. The standard way to handle this in a VBA UDF is to get the INTERSECT of the full-column reference and the used-range so that the UDF only has to process the part of the full-column that has actually been used. The example VBA code below does this intersection and then returns the smaller of the number of rows in the input range and the number of rows in the used range.

Public Function GetUsedRows(theRng As Range)
Dim oRng As Range
Set oRng = Intersect(theRng, theRng.Parent.UsedRange)
GetUsedRows = oRng.Rows.Count
End Function

The parent of theRng is the worksheet that contains it, so theRng.Parent.UsedRange gets the used range of the worksheet you want.

Two problems with this technique are:

  • Getting the Used Range can be slow.
  • The XLL interface does not have a direct way to access the Used Range, so you have to get it via a single-thread-locked COM call. (More on this later).

So just how slow is it to get the used Range?

I created a very simple UDF and timed the calculation of 1000 calls to this UDF for filled used ranges of between 10K rows and 640K rows.

Public Function CountUsedRows()
CountUsedRows = ActiveSheet.UsedRange.Rows.Count
End Function

It turns out that the time taken to execute this UDF is a linear function of the number of used rows in the used range.

Used_Range_Times

And it’s quite slow: 1000 calls to this UDF with 640K rows of data take 33 seconds!

When the used range is small you won’t notice the time taken, but for large used ranges with the big grid you certainly will. And the problem is that your UDF will do this check on every range that is passed to the UDF, even if it’s not really needed.

Colin points out that what affects the time is actually the number of cells containing data or formatting (or that previously contained data or formatting) rather than the last cell in the used range.

Speeding up finding the used range.

So you could start by only doing the used-range check when theRng parameter has a large number of rows:

Public Function GetUsedRows2(theRng As Range)
Dim oRng As Range
' only pay for the used-range check on very large input ranges
If theRng.Rows.Count > 500000 Then
Set oRng = Intersect(theRng, theRng.Parent.UsedRange)
GetUsedRows2 = oRng.Rows.Count
Else
GetUsedRows2 = theRng.Rows.Count
End If
End Function

This example only does the check if the user gives the UDF a range referring to more than half a million rows.

Another, more complicated, way of minimising the time is to store the number of rows in the used range in a cache somewhere and retrieve it from the cache when needed. The tricky part of this is to make sure that the used-range row cache always is either empty (in which case go and get the number) or contains an up-to-date number.

One way of doing this would be to use the Application AfterCalculate event (which was introduced in Excel 2007) to empty the cache. Then only the first UDF that requested the used range for each worksheet would use time to find the used range, and (assuming that the calculation itself did nothing to alter the used range) the correct number would always be retrieved.

The equivalent for Excel versions before Excel 2007 would be to use the Application SheetCalculate event to empty the cache for that particular worksheet. This technique would be less efficient since a worksheet may well be calculated several times in each calculation cycle.
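
As a rough sketch of how the event-driven cache could be wired up in Excel 2007 or later: the code below needs a class module to receive Application events plus a standard module holding the cache. The names clsAppEvents, HookAppEvents, ClearUsedRowsCache and CachedUsedRows are all hypothetical, and the cache is a Scripting.Dictionary chosen for brevity. The two parts must be pasted into separate modules, and HookAppEvents has to be run once (for example from Workbook_Open) before the event will fire.

```vba
' ===== Class module, assumed to be named clsAppEvents =====
Option Explicit
Public WithEvents XlApp As Application

' Excel 2007 and later: fires when calculation has completed
Private Sub XlApp_AfterCalculate()
    ClearUsedRowsCache
End Sub

' ===== Standard module =====
Option Explicit
Private mEvents As clsAppEvents
Private mCache As Object   ' Scripting.Dictionary: "book|sheet" -> used rows

' run once, for example from Workbook_Open
Public Sub HookAppEvents()
    Set mEvents = New clsAppEvents
    Set mEvents.XlApp = Application
End Sub

Public Sub ClearUsedRowsCache()
    Set mCache = Nothing
End Sub

Public Function CachedUsedRows(theRng As Range) As Long
    Dim sKey As String
    If mCache Is Nothing Then Set mCache = CreateObject("Scripting.Dictionary")
    sKey = theRng.Parent.Parent.Name & "|" & theRng.Parent.Name
    ' only the first request per sheet pays for the slow UsedRange call
    If Not mCache.Exists(sKey) Then
        mCache(sKey) = theRng.Parent.UsedRange.Rows.Count
    End If
    CachedUsedRows = mCache(sKey)
End Function
```

For earlier Excel versions the handler would instead be XlApp_SheetCalculate(ByVal Sh As Object), removing only the entry for that sheet.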

As Colin points out, if you want to find the last row containing data it is faster to use Range.Find when you have many cells containing data.
Note that you can only use Range.Find in UDFs in Excel 2002 and later, and you cannot use the Find method at all from an XLL except in a command macro or via COM.

Public Function CountUsedRows2()
CountUsedRows2 = ActiveSheet.Cells.Find(What:="*", LookIn:=xlFormulas, SearchOrder:=xlByRows, SearchDirection:=xlPrevious).Row
End Function

So have you got any better ideas on how to process full-column references efficiently?


Catchup Post


The blog has been quiet for too long …

That’s because I have been too busy with a large customer project and summer activities.

This summer after 32 years we decided to sell our lovely Norfolk holiday cottage:

DSCF3600

DSCF3602

and downsize to a static caravan on the Norfolk coast, closer to the sailing and the beach. So August was spent in Norfolk sailing, beaching, walking, barbecuing etc.

sailing

beach

Back in Leamington we have just had a lovely family birthday celebration at Le Manoir

DSCF3902

followed by some gentle exercise on the Le Manoir croquet lawn.

croquet

So whilst there IS life outside the Excel world, I think it’s time to get back to work.

So how was your summer?


Custom Excel Worksheet Templates in Departmental Solutions – Pros and Cons


Excel allows you to create and use custom workbook and worksheet templates (you can also change the default workbook and worksheet templates).
This post explores the pros and cons of using custom Templates in the sort of general purpose departmental application-level Excel solutions I develop for my clients.

Custom Workbook Templates

For simple solutions where all the code, formulas and formatting can be contained within a single workbook (for example a simple expense sheet) a custom workbook template can be a good solution. Excel gives the user the chance to create a new workbook that is a copy of the Template. The Template can be self-customised by VBA code contained within it. Excel has a large number of these pre-built application templates available when you create a new workbook, and it’s easy to create your own custom workbook template.

But workbook templates don’t work too well when the solution starts to need merging or creating worksheets within an existing workbook, and can be a maintenance nightmare if the created workbooks are supposed to have any shelf-life as the code contained in the template gets proliferated across a large number of workbooks.

Custom Worksheet Templates

Excel also allows you to create custom worksheet templates that the user can copy into the active workbook by right-clicking a worksheet tab and choosing to insert the template. The worksheet template can be a single sheet or a self-contained set of interlinked sheets: a single right-click insertion will copy all the sheets in the template into the active workbook. And you can do the same thing from VBA using

Sheets.Add Type:=TemplatePathandBookName

This works fine for formatting and formulas that only reference the other sheets within the template.
You can even add on-sheet controls that use VBA code within the VBA sheet modules (but this rapidly leads to maintenance problems).

If the user can copy these template sheets into the active workbook just using a right-click they can do this more than once and create multiple copies within a single workbook.
In many circumstances this will not be a good idea: the template needs to be built to handle multiple copies of itself in one workbook. For example all the defined names contained within the worksheet template should be local in scope (otherwise the second copy makes local names from any global names).
So it may be best to use a VBA command to copy the template sheets into the active workbook instead of asking the user to do this with a right-click.
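
A minimal sketch of such a VBA command, which refuses to insert a second copy, might look like this. The template path and the marker sheet name are assumptions made up for the example; substitute your own.

```vba
Option Explicit

' Hypothetical sketch: insert the template sheets only if they are not
' already present in the active workbook.
Public Sub InsertTemplateOnce()
    Const TEMPLATE_PATH As String = "C:\Templates\MyTemplate.xltx"  ' assumed path
    Const MARKER_SHEET As String = "TemplateMain"                    ' assumed sheet name
    Dim oSht As Object

    ' look for a sheet that only exists once the template has been inserted
    On Error Resume Next
    Set oSht = ActiveWorkbook.Sheets(MARKER_SHEET)
    On Error GoTo 0

    If oSht Is Nothing Then
        ' add all the template's sheets after the last sheet in the workbook
        ActiveWorkbook.Sheets.Add _
            After:=ActiveWorkbook.Sheets(ActiveWorkbook.Sheets.Count), _
            Type:=TEMPLATE_PATH
    Else
        MsgBox "This workbook already contains the template sheets."
    End If
End Sub
```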

But in that case I don’t see any real advantage to using worksheet templates compared with ordinary worksheets embedded within an addin.
Unless you want to allow the administrator to maintain and modify the templates!

My preferred Solution Architecture

For the reasons outlined above I tend not to use templates in general purpose departmental application-level Excel solutions.
Here is an outline of my favoured approach, but as always YMMV!
The approach assumes that there is a developer, an administrator and several end-users of the solution.

The main XLA/XLAM addin

This contains:

  • The VBA Code that embodies the application-level addin in general and class modules.
  • Workbook Open code to create and manage the end-user interface commands (Ribbon or Addins menu/toolbars)
  • Worksheets in the addin that contain:
    • Addin specific constants and data (for example version numbers and best-before dates) that will be maintained by the developer.
    • Addin specific formulas and formatting that will be copied to user worksheets
  • Defined Names used by the VBA to reference the stuff on the addin worksheets

The XLA/XLAM is parked somewhere on a departmental server in a specific folder dedicated to the application, and its name contains a sequential build number.

The Addin Loader

To simplify maintenance of the solution I use a stub addin loader. This is a small piece of code packaged as an XLA/XLAM whose sole purpose is to open the latest version (highest build number) of the main XLA/XLAM addin on the user’s PC. The addin loader XLA is installed on each user’s PC using the Excel addin manager and is usually located in the folder on the departmental server. So when the main addin has to be updated all the administrator has to do is to copy the latest build of the main XLA to the folder on the server, and the next time a user starts an Excel session they automagically get the updated version. This way the main addin is never actually installed on the user’s PC.
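
Here is a minimal sketch of what the loader’s code could look like. The server folder, the file-name prefix and the naming pattern (for example MainAddin_0123.xlam) are assumptions invented for this example, not my actual conventions.

```vba
Option Explicit

' Hypothetical sketch of a stub loader: open the highest-numbered build of
' the main addin found in the application folder.
Private Const ADDIN_FOLDER As String = "\\DeptServer\MyApp\"   ' assumed path
Private Const ADDIN_PREFIX As String = "MainAddin_"            ' assumed naming

' call this from the loader's Workbook_Open event
Public Sub LoadLatestBuild()
    Dim sFile As String, sBest As String
    Dim nBuild As Long, nBest As Long

    nBest = -1
    sFile = Dir(ADDIN_FOLDER & ADDIN_PREFIX & "*.xlam")
    Do While Len(sFile) > 0
        ' Val reads the leading digits that follow the prefix
        nBuild = CLng(Val(Mid$(sFile, Len(ADDIN_PREFIX) + 1)))
        If nBuild > nBest Then
            nBest = nBuild
            sBest = sFile
        End If
        sFile = Dir()   ' next matching file
    Loop

    If Len(sBest) > 0 Then
        Workbooks.Open Filename:=ADDIN_FOLDER & sBest, ReadOnly:=True
    Else
        MsgBox "No build of the main addin was found in " & ADDIN_FOLDER
    End If
End Sub
```

Opening the addin read-only means that many users can share the same file on the server while the administrator copies a newer build alongside it.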

The Control File

Any solution data that needs to be maintained by the administrator (for example: this period’s plan exchange rates or the paths to shared databases or Templates) is contained in one or more control files that are also located in the departmental folder.

Conclusion

Templates work well for self-contained workbook-level applications as long as you can avoid maintenance problems.

I prefer not to use them for application-level solutions.

But maybe I should use them more?

What’s your experience with Templates?


Finding missing items in lists: VLOOKUP vs COMPARE.LISTS performance and ease of use


Returning to the subject of finding the mismatches between 2 lists I want to compare using VLOOKUP with using SpeedTools COMPARE.LISTS.

Test Data

My test data consists of 2 lists of 500000 6-digit numbers. Most of these numbers match, but 5000 of them are different. The lists are not sorted. What I want to do is:

  • Filter out the mismatches showing ** for each miss
  • Count the mismatches
  • Produce a list of the mismatches

The first list is in A2:A500001 and the second list is in D2:D500001

Using SpeedTools COMPARE.LISTS

(If you want to try this on your PC you can download a full-featured trial version of SpeedTools, and you can download the sample data and examples of using COMPARE.LISTS here.).

COMPARE.LISTS allows you to control what kind of output you want from the comparison.

  • A count of either the matches or the mismatches
  • Either True/False or Blank/** (** means not found)
  • A count of the matches and a list of the matches
  • A count of the mismatches and a list of the mismatches

To get just the count of mismatches you enter COMPARE.LISTS into a single cell as an ordinary (non-array) formula:

=COMPARE.LISTS(D2:D500001,A2:A500001,3)

This formula looks for each of the cells in D2:D500001 in the list A2:A500001 and counts the number of items that can’t be found (5000 in this case).

And it only takes 0.6 seconds on my PC!
That’s fast enough for you to add the formula as a safety check that all items match.

To get a count and a list of the mismatches you enter the same formula as a multi-cell array formula (select a vertical range of cells, type the formula in the formula bar and press Control-Shift-Enter). The count appears in the first row and the following rows contain the list of missing items. And it only takes 0.7 seconds on my PC!

To filter out the rows containing the mismatches you enter the following formula into the 500000 cells in E2:E500001 as a multi-cell array formula, and then filter for **:

{=COMPARE.LISTS($D$2:$D$500001,$A$2:$A$500001,2)}

This formula checks each of the cells in D2:D500001 against the range A2:A500001 and returns either blank for a hit or ** for a miss to the 500000 cells in E2:E500001. And it only takes 0.9 seconds on my PC! That’s fast enough for you to make corrections and recalculate until all the errors are fixed.

You can also get counts and lists of matching items as well as missing items.

And there is an option to do a case-sensitive text compare if you need to find mismatches caused by upper-lower case differences.

Using VLOOKUP

I can use the unsorted (exact match) range lookup option of VLOOKUP for each number I want to check, and it will return #N/A if the number can’t be found in the other list, so I can check for the #N/A error and show ** for the mismatches:

=IF(ISERROR(VLOOKUP(D2,$A$2:$A$500001,1,FALSE)),"**","")
and copy down for 500000 rows.

This works, but it’s very slow: it takes over 500 seconds on my quad-core machine even with multi-threaded calculation, and on a single core machine it would be up to 4 times slower!
But that’s not surprising if you think about how many MXIPS (millions of eXcel instructions per second) this is using.

Each VLOOKUP that finds a match is doing a linear search and on average is comparing with 250000 rows (and the ones that don’t have a match are comparing all 500000 rows): so for 500000 VLOOKUPs that’s roughly 500000 x 250000 = 125000 million compares in 500 seconds = 250 MXIPS.

Then I can use autofilter to filter on ** to show only the rows with mismatches,
and use COUNTIF to count the mismatches:

=COUNTIF($E$2:$E$500001,"~**")

Because COUNTIF treats * as a wild-card character I need to add a ~ to stop this happening.

Using VLOOKUP this way works OK for small amounts of data but is just not practical for large numbers of rows.
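
The VLOOKUP approach is slow because every lookup is a separate linear search. As a rough illustration of the alternative (this is not the COMPARE.LISTS implementation, which is not shown in this post), a VBA UDF can load one list into a hashed Scripting.Dictionary and then probe it once per item. The name CountMisses is hypothetical, and the sketch assumes both arguments are single-column ranges with more than one row.

```vba
Option Explicit

' Hypothetical sketch: count the items in lookList that do not appear in
' inList, using a hashed Dictionary instead of a linear search per item.
Public Function CountMisses(lookList As Range, inList As Range) As Long
    Dim oDict As Object
    Dim vIn As Variant, vLook As Variant
    Dim j As Long, nMiss As Long

    Set oDict = CreateObject("Scripting.Dictionary")
    vIn = inList.Value2       ' read each range once into an array
    vLook = lookList.Value2

    ' build the hash table from the list being searched
    For j = 1 To UBound(vIn, 1)
        oDict(vIn(j, 1)) = True   ' add or overwrite: duplicates are harmless
    Next j

    ' probe it once for each item being checked
    For j = 1 To UBound(vLook, 1)
        If Not oDict.Exists(vLook(j, 1)) Then nMiss = nMiss + 1
    Next j

    CountMisses = nMiss
End Function
```

It would be used as =CountMisses(D2:D500001,A2:A500001). Note that the Dictionary’s default comparison is case-sensitive for text keys.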

It’s probably possible to create an array formula that just gives the count of mismatches but I expect it would be too slow to be useful.

And I am sure someone cleverer than me can create an array formula to find case-sensitive mismatches.

Conclusion

My objective for COMPARE.LISTS was to create an easy-to-use and fast function that enabled you to quickly find and fix data mismatches. The performance has exceeded my expectations.


Pivot Table Sort is Too Clever


I created a pivot table from a list containing 3 character IDs, then used the Pivot Table field pulldown to sort it.

PivSort1

The resulting sorted list looks like this:

PivSort2

Looks like Pivot Table sort recognises 3 character abbreviations for

  • The day of the week
  • Months
  • A quarter given as the initial letter of the month

And decides to sort them into (some kind of mangled) time sequence and place them before the unrecognised 3-character IDs!

And I can’t find a way to make it do a proper sort!

Any ideas how you can make Excel do this in a not-so-clever but more sensible way?

Thanks to Debra, Alastair and Rory for telling me how to do it.



Using ENVIRON to find the XLB and QAT files


I am currently updating the FastExcel profiler to run with 64-bit Excel. This involves the rather tedious conversion of a large number of Windows API statements to use conditional compilation, VBA7 and WIN64.
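
For anyone who has not had to do this yet, here is a small sketch of the pattern, using the FindWindow API purely as an example (it is not taken from the FastExcel code). The sub name ShowExcelWindowHandle is hypothetical.

```vba
Option Explicit

' VBA7 is True in Office 2010 and later; Win64 is True only in 64-bit Office.
#If VBA7 Then
    ' PtrSafe and LongPtr are required for 64-bit and only exist in VBA7
    Private Declare PtrSafe Function FindWindow Lib "user32" Alias "FindWindowA" _
        (ByVal lpClassName As String, ByVal lpWindowName As String) As LongPtr
#Else
    ' Office 2007 and earlier: the classic 32-bit declaration
    Private Declare Function FindWindow Lib "user32" Alias "FindWindowA" _
        (ByVal lpClassName As String, ByVal lpWindowName As String) As Long
#End If

Public Sub ShowExcelWindowHandle()
    #If VBA7 Then
        Dim hWnd As LongPtr
    #Else
        Dim hWnd As Long
    #End If

    ' "XLMAIN" is the window class of the main Excel window
    hWnd = FindWindow("XLMAIN", vbNullString)

    #If Win64 Then
        Debug.Print "64-bit Excel, window handle: " & CStr(hWnd)
    #Else
        Debug.Print "32-bit Excel, window handle: " & CStr(hWnd)
    #End If
End Sub
```

Every Declare statement and every variable that holds a handle or pointer needs this treatment, which is what makes the conversion tedious.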

Whilst doing this I discovered the VBA ENVIRON function, which gives you an easy way to get some information about the, well, environment.

For example I wanted to show the size of the XLB file, which stores toolbar customisations even in Excel 2007 and later, and the QAT file which stores QAT customisations. The reason for this is that some poor coding practices cause the size of these files to balloon and become corrupt: then Excel starts crashing but does not tell you why!

You could do this by hardcoding the paths to the files in your code, but that’s a bad idea because the paths are different for different versions of Windows.
Or you could do this by using Windows API calls to find the directories, and handle the 32-bit/64-bit coding etc.

Or you can use ENVIRON, which is MUCH easier!

The path to the XLB file under Windows 7 on my system is something like:
E:\Users\your username\AppData\Roaming\Microsoft\Excel\Excelnn.xlb
Using the ENVIRON function in a Windows and Excel version-independent way it looks like this:

strGetXLBPath = Environ("AppData") & "\Microsoft\Excel\Excel" & CStr(CLng(Val(Application.Version))) & ".xlb"
kXLBSize = FileLen(strGetXLBPath)

The path to the QAT file under Windows 7 on my system is something like:
E:\Users\your username\AppData\Local\Microsoft\Office\Excel.QAT
And using ENVIRON in VBA:

strGetQATPath = Environ("LocalAppData") & "\Microsoft\Office\Excel.qat"
kQATSize = FileLen(strGetQATPath)

My XLB file is currently about 12KB and my QAT is less than 1KB, and I reckon anything over about 30KB is asking for trouble.
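
Putting the two snippets together, a minimal sketch of a sub that prints both sizes in KB to the Immediate window could look like this. FileLen returns bytes, and it raises an error if the file does not exist, so the sketch checks with Dir first. The sub names are hypothetical.

```vba
Option Explicit

' Sketch: print the size in KB of the XLB and QAT files, if they exist.
Public Sub ShowCustomisationFileSizes()
    Dim sXLB As String, sQAT As String

    sXLB = Environ$("AppData") & "\Microsoft\Excel\Excel" & _
           CStr(CLng(Val(Application.Version))) & ".xlb"
    sQAT = Environ$("LocalAppData") & "\Microsoft\Office\Excel.qat"

    PrintSizeKB sXLB
    PrintSizeKB sQAT
End Sub

Private Sub PrintSizeKB(sPath As String)
    If Len(Dir$(sPath)) > 0 Then
        ' FileLen returns the size in bytes
        Debug.Print sPath & ": " & Format$(FileLen(sPath) / 1024, "0.0") & " KB"
    Else
        Debug.Print sPath & ": not found"
    End If
End Sub
```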

If they get corrupt you can delete or rename these files and Excel will happily recreate fresh copies (But of course you lose your customisations).
The easiest way to navigate to the directories is to enter %AppData% or %LocalAppData% in the Windows search programs and files box (Windows button).

Other things I use ENVIRON for include:

  • Getting the path to the Temp Files folder using ENVIRON("TEMP")
  • Getting the number of processors using ENVIRON("NUMBER_OF_PROCESSORS")
  • Getting the computer name using ENVIRON("COMPUTERNAME")

If you want to see all the environment variables active on your system (in the VBE Immediate window) you can use this code which I found on StackOverflow


Sub EnumSEVars()
    Dim strVar As String
    Dim i As Long
    For i = 1 To 255
        strVar = Environ$(i)
        If LenB(strVar) = 0& Then Exit For
        Debug.Print strVar
    Next
End Sub

You can find 2 excellent articles on ENVIRON here (Win XP) and here (Win 7 8).

OK, so how many of you use ENVIRON? Or, like me, did you not even know it existed?


Exploring Conditional Format Performance Part 1: What’s slow, whats buggy and whats faster!


Patrick wanted to know if I had any information on Conditional Format calculation and performance, and I have not looked at it for several years, so here goes!

I have done a series of experiments, using Excel 2007, 2010 and 2013, to try and get some insight on what Excel is doing under the covers. Because there is a lot to cover I have split the post into 3 parts.

This first part covers a simple experiment to see when Conditional formats get executed.

Formatting versus Calculation.

What Excel shows you on the screen or in a printout is the formatted (rendered) version of the results of a calculation.
And because formatting/rendering is such a CPU-intensive process Excel has a lot of tricks to try and minimise the time used (and that’s why you should use Application.ScreenUpdating=False everywhere to speed up your VBA).

Conditional Formats often do both calculation and formatting, so you have got double the chance of things being slow!

Excel does not generally allow formatting to be part of the calculation chain because formatting occurs after the calculation has finished.
This is also true for conditional formatting, although it is not clear to what extent there is a separate calculation-of-conditional-formats step before the formatting step.

Excel dynamically formats (re-paints) only what you see on the screen.

To save time Excel only does final formatting for the part of the results you can see on the screen (so large screens are slower than small ones, and zooming out a long way is slower!). When you have a lot of conditional formats this can cause very noticeable delays in scrolling a page up or down.

Conditional Formats can be Super-Volatile

Because of this dynamic repainting conditional formats are often executed even when no calculation occurs (for instance when you scroll up or down). So it’s not usually a good idea to embed a heavy calculation into a Conditional Format formula!

Let’s start by looking at a very simple example that allows you to track when a conditional format gets executed. You can download the workbook FormatConditionsA.xlsb from SkyDrive. Note that it contains VBA so will not run properly in the Excel Web App.

Test workbook FormatConditionsA.xlsb

The workbook uses 3 cells and 2 VBA UDFs:

  • Cell B2 contains a formula =D21 and has two conditional format rules – colour orange if =signal1(b2) and colour green if =signal2(b2).
    Signal1 and Signal2 are VBA UDFs that increment a calculation counter and show it in the immediate window. Signal1 returns TRUE if B2 is an odd number and Signal2 returns TRUE if B2 is an even number.
  • Cell E2 contains 2 conditional format rules that directly check cell D21 for odd (orange) or even (green).
  • Cell D21 contains a number which you can change to either odd or even to see the effect on the conditional formats.
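
If you do not want to download the workbook, UDFs along these lines would behave as described. This is a reconstruction from the description above rather than the code in FormatConditionsA.xlsb, and the counter variable names are assumptions.

```vba
Option Explicit

Private mCalls1 As Long   ' number of times Signal1 has been executed
Private mCalls2 As Long   ' number of times Signal2 has been executed

' TRUE when the cell holds an odd number
Public Function Signal1(theCell As Range) As Boolean
    mCalls1 = mCalls1 + 1
    Debug.Print "Signal1 call " & mCalls1
    Signal1 = (CLng(theCell.Value2) Mod 2 <> 0)
End Function

' TRUE when the cell holds an even number
Public Function Signal2(theCell As Range) As Boolean
    mCalls2 = mCalls2 + 1
    Debug.Print "Signal2 call " & mCalls2
    Signal2 = (CLng(theCell.Value2) Mod 2 = 0)
End Function
```

The Debug.Print lines are what make each execution of a conditional format rule visible in the Immediate window.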

To run the experiments open the workbook and press Alt-F11 to open the VBE, then press Ctrl-G to view the Immediate window.
Then arrange the Excel window and the VBE window so that you can see both of them, and make sure that you can see rows 2 through 21 of the Excel window.

CFEx1_1

Experiment 1: Automatic calculation mode, User-interface driven

Switch to Automatic Calculation mode.
Clear the VBE immediate window.
Select Cell D21 and increment the number by 1.

Both cell E2 and B2 should change colour, and the Immediate Window shows how many times the UDFs have been calculated.

CfEx1_2

With Excel 2013 I get a total of 10 executions of the UDFs! (5 of each) !!! (No, I have absolutely no idea why, that’s got to be a bug.)

Excel 2007 and 2010 only do 4 executions (2 for each UDF).

Experiment 2: Manual calculation mode, User-interface driven

Now switch to Manual calculation mode, clear the immediate window and select D21.

Increment D21 by 1: the result is

  • The UDFs are not executed (nothing in the Immediate Window).
  • B2 and E2 stay the same colour.

Page Down and then Page Up (to refresh the Excel window):

CFEx1_3

  • The UDFs are executed once.
  • Cell E2 changes colour because it directly refers to cell D21 which is now Odd.
  • Cell B2 has correctly NOT changed colour because the conditional format is driven by cell B2 itself, which has not yet changed because it has not yet recalculated.

Now press F9:

The UDFs are executed once and cell B2 changes colour.

Experiment 3: The effect of refreshing the screen with Page Up and Page Down

Now increment Cell D21 again so that the status bar shows Calculate.

Press Page Up and Page Down repeatedly: the Immediate window shows that the UDFs execute each time the screen gets refreshed with Page Up.

Now Press F9 to recalculate:

  • Excel 2013 Page Down Page Up does not execute the UDFs
  • Excel 2010 and 2007 do execute the UDFs once for each Page Down Page Up, even though they do not need to.

Experiment 4: Recalculating but with conditional formats scrolled out of sight.

  • Clear the immediate window.
  • Scroll the Excel window so that row 15 is the first row showing
  • Increment cell D21 by 1
  • Press F9 to recalculate, or Ctrl/Alt/F9 to Full Calculate

The immediate window shows nothing: the conditional formats have NOT been executed and will not be until you Page Up to make them visible.

(Note: if you only scroll so that the first row is row 3 the conditional formats DO get executed: looks like Excel is using about a 12 row buffer!)

Conclusions from Experiment 1.

  • Conditional formats are executed when the cell containing the conditional format gets repainted.
  • Conditional Formats are not executed at a calculation unless they are on the visible portion of the screen.
  • Excel 2013 looks a bit over-enthusiastic in Automatic Calculation mode, but smarter in Manual Calculation mode than Excel 2007/2010.

In the next post I will explore what the performance impact of conditional formats is, and what the impact is of setting Application.ScreenUpdating=False and Worksheet.EnableFormatConditionsCalculation=False.


Exploring Conditional Format Performance Part 2: What’s slow, whats buggy and whats faster!


This is the second in a series of Posts on Conditional Formats (see part 1).

This post looks at the effects (and the resulting bugs!)  on Conditional Formats of:

  • Application.Screenupdating
  • Worksheet.EnableFormatConditionsCalculation
  • Application.Calculation
  • Whether the cells containing the conditional formats are visible or not
  • Screen Refresh
  • Excel 2007, Excel 2010 and Excel 2013

I am using the same (but slightly updated) test workbook as in Part 1: you can download it from SkyDrive.

Running the Tests

The FormatConditionsA.xlsb workbook contains 12 VBA subroutines to do the testing (Test1 through Test3C).

Cell B2 has 2 conditional format rules that use UDFs (Signal1 and Signal2) to determine whether B2 is even or odd, and the B2 formula refers to D21.

Cell E2 has 2 conditional format formulas that test directly whether D21 is even or odd.

CFEx1_1

You need to run the tests with the VBE window open and the Immediate window visible.
For a more detailed explanation of this example workbook see part 1.

If you run the tests using Excel 2007, Excel 2010 and Excel 2013 you will see that a lot of work has been done by the Excel team to minimise the number of times the conditional formats get executed. But (as always when doing optimisations) this has tended to introduce bugs.
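
To give an idea of what each test does, here is a hypothetical sketch of a parameterised test sub. It is not the code from the workbook (which has a separate sub for each test): it simply sets the three switches and then changes D21 so that the conditional formats on the active sheet need to be re-evaluated.

```vba
Option Explicit

' Hypothetical sketch of one test: set the three switches, then change D21.
' Run it with the test worksheet active and the Immediate window visible.
Public Sub TestSketch(bScreen As Boolean, bEnableCF As Boolean, _
                      lCalcMode As XlCalculation)
    Dim oSht As Worksheet

    Set oSht = ActiveSheet

    Application.ScreenUpdating = bScreen
    oSht.EnableFormatConditionsCalculation = bEnableCF
    Application.Calculation = lCalcMode

    ' flip D21 between odd and even
    oSht.Range("D21").Value2 = oSht.Range("D21").Value2 + 1

    ' switch screen updating back on; the calculation mode and the
    ' conditional-format switch are deliberately left as set so that the
    ' effect of Page Down, Page Up and F9 can be observed afterwards
    Application.ScreenUpdating = True
End Sub
```

For example, the settings of Test3A would correspond to running TestSketch False, False, xlCalculationManual.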

Test1: Screenupdating=True, Enable=True, Calc=Auto

  • Excel 2013: OK, 1 call to each UDF, large pause of a second or two before cell B2 refreshes its colour.
  • Excel 2010: OK , 2 calls to each UDF, no noticeable pause.
  • Excel 2007: OK, 4 calls to each UDF, no noticeable pause.

Test2: Screenupdating=true, Enable=true, Calc=manual

  • Excel 2007 & Excel 2010: OK
  • Excel 2013: Bug in cell B2: neither of the conditional formats is applied to B2 and neither of the UDFs is executed. Scrolling down and up to refresh the screen does not fix this, but pressing F9 does.

CFEx2_1

Test3: ScreenUpdating=False, Enable=True, Calc=Manual

  • Excel 2013: OK
  • Excel 2010: Bug in B2. Neither of the conditional formats is applied to B2 and the UDFs are not executed. Page Down Page Up does not fix it but F9 does.
  • Excel 2007: Bug in B2. Page Down Page Up fixes.

Test1A: Screen=True, EnableFormatConditionsCalculation=False, Calc=Auto

So what does setting EnableFormatConditionsCalculation to False actually do?
I am not sure, but what it does NOT do is to permanently switch off the evaluation of conditional formats!

  • Excel 2007: OK – the pause in Test 1 has disappeared!
  • Excel 2010: Bug in Cell E2. the left-most vertical border is coloured correctly but the rest of the cell is not! Page Down Page Up fixes it.
  • CFEx2_1A
  • Excel 2013: Bug in cells B2 and E2. Page Down Page Up fixes it.
  • CFEx2_1A2013

Test2A: Screen=True, EnableFormatConditionsCalculation=False, Calc=Manual

  • Excel 2007: OK
  • Excel 2010: Bug in cell E2. Page Down Page Up fixes it.
  • Excel 2013: Bug in cell B2. Neither Page Down Page Up nor F9 fix it, but Ctrl/Alt/F9 does.

Test3A: Screen=False, EnableFormatConditionsCalculation=False, Calc=Manual

  • Excel 2013: OK
  • Excel 2010: Bug in cell B2: Neither Page Down Page Up nor F9 fix it, but Ctrl/Alt/F9 does
  • Excel 2007: Bug in cell B2. Page Down Page Up fixes it.

Tests 1B to 3C: switch to another sheet, run the tests, switch back

It's magic: all these tests run correctly in all versions!

Conclusions

  • Looks like using UDFs in conditional format formulas is rather buggy: avoid.
  • EnableFormatConditionsCalculation does not look useful.
    But there were many reports of a problem importing Excel 2003 files with conditional formats into later versions that could be fixed by setting it to True: I don’t know if this problem still exists.
  • The safest way is to activate a sheet that does not contain any conditional formats.
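
The last point can be sketched in VBA as follows. This is a minimal sketch only: the sheet name "Empty" and the DoTheWork routine are assumptions made for illustration, not part of the example workbook.

```vba
Option Explicit

' Sketch only: run some work while a sheet without conditional formats is active,
' then switch back to whichever sheet was active before.
' "Empty" is an assumed sheet name; DoTheWork is a hypothetical placeholder.
Sub RunWithPlainSheetActive()
    Dim shtOriginal As Object   ' Object so that chart sheets also work

    Set shtOriginal = ActiveSheet
    On Error GoTo CleanUp

    ' activate a sheet that has no conditional formats
    Worksheets("Empty").Activate
    DoTheWork

CleanUp:
    ' always return to the original sheet, even after an error
    shtOriginal.Activate
End Sub

Sub DoTheWork()
    ' placeholder for whatever the macro actually needs to do
    Application.Calculate
End Sub
```

Note that this sketch swallows any error raised by DoTheWork; real code would probably want to report it after switching back.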

The next post will focus on the performance of conditional formats.


Exploring Conditional Format Performance Part 3: What’s slow, what’s buggy and what’s faster!


This is the third in a series of posts on Conditional Formats (see part 1 and part 2).

This post looks at the effects on the performance of Conditional Formats of:

  • Application.Screenupdating
  • Worksheet.EnableFormatConditionsCalculation
  • Application.Calculation
  • Whether the cells containing the conditional formats are visible or not
  • Screen Refresh
  • Excel 2007, Excel 2010 and Excel 2013

The workbook I am using is called (with stunning originality) FormatConditionsB.xlsb, and you can download it from my Skydrive.

It contains 1.9 million RAND() formulas in A1:Z72858, and each of these cells has 3 conditional format rules:

CFEx3_1

So that's 5.7 million conditional format rules.

There are 2 worksheets: Formats and Empty

The workbook also contains the MicroTimer API code for high resolution timing and 5 subs: Testing1 through Testing4 and TestScroll1.
The subs typically set the calculation mode, ScreenUpdating and EnableFormatConditionsCalculation, time a calculation and then time a screen update.
For example here is the code for Testing1:


Sub testing1()
 Dim osht As Worksheet
 Dim dtime As Double
 ' manual calculation, with the Formats sheet active
 Application.Calculation = xlCalculationManual
 Worksheets("Formats").Activate
 Set osht = Worksheets("Formats")
 ' the settings under test
 Application.ScreenUpdating = True
 osht.EnableFormatConditionsCalculation = True
 ' time the calculation
 dtime = MicroTimer
 Application.Calculate
 dtime = MicroTimer - dtime
 Debug.Print dtime
 ' time the screen refresh
 dtime = MicroTimer
 Application.ScreenUpdating = True
 dtime = MicroTimer - dtime
 Debug.Print dtime
End Sub
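
The MicroTimer code itself is not shown here. As a rough sketch, a commonly published form of such a timer wraps the Windows QueryPerformanceCounter API as below. Treat it as an assumption about what the workbook's version looks like: it is Windows-only, and the alias names getFrequency and getTickCount are just illustrative.

```vba
Option Explicit

' Windows high-resolution counter declarations (PtrSafe is needed for VBA7 / Excel 2010 and later)
#If VBA7 Then
    Private Declare PtrSafe Function getFrequency Lib "kernel32" _
        Alias "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare PtrSafe Function getTickCount Lib "kernel32" _
        Alias "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#Else
    Private Declare Function getFrequency Lib "kernel32" _
        Alias "QueryPerformanceFrequency" (cyFrequency As Currency) As Long
    Private Declare Function getTickCount Lib "kernel32" _
        Alias "QueryPerformanceCounter" (cyTickCount As Currency) As Long
#End If

Public Function MicroTimer() As Double
    ' returns the counter value converted to seconds
    Dim cyTicks As Currency
    Static cyFrequency As Currency

    MicroTimer = 0
    ' the counter frequency does not change, so only fetch it once
    If cyFrequency = 0 Then getFrequency cyFrequency
    ' read the current tick count
    getTickCount cyTicks
    ' both values use the same Currency scaling, so the ratio is in seconds
    If cyFrequency <> 0 Then MicroTimer = cyTicks / cyFrequency
End Function
```

The difference between two MicroTimer calls gives the elapsed time in seconds, which is how testing1 uses it.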

Timings with different sheets visible.

The workbook opens with the formats sheet visible.
If you click the Empty tab you instantly see the empty sheet.
But if you then switch back to the Formats sheet there is a noticeable delay of about a second before the screen refreshes. Similarly, pressing Page Up takes just over a second before the screen refreshes.

This is because Excel re-evaluates the conditional formats for the visible cells on the active sheet at each screen refresh.

Pressing F9 to recalculate the 1.9 million RAND() formulas with the Formats sheet visible takes 2.8 seconds, but with the Empty sheet visible it takes 0.2 seconds: again it's the evaluation of the visible conditional formats that takes the time.

Conditional Formats are not directly evaluated by a calculation.

Running the Testing Subroutines

Here are the timings in seconds for running Test1 through Test4, with the Formats sheet visible.

CF2_Timings1

The conclusions of this test are:

  • Excel 2010 and Excel 2013 are noticeably faster than Excel 2007.
  • Turning off screen updating is the big winner
  • Switching off EnableFormatConditionsCalculation is only worthwhile if ScreenUpdating is true
  • Switching off EnableFormatConditionsCalculation is much less effective than switching off ScreenUpdating
  • Although Refresh looks very fast in Excel 2013 it actually just postpones the refresh to after the VBA has finished, so in fact it's not faster.

I then repeated the tests, but with the Empty sheet visible rather than the Formats sheet:

CF2_Timings2

This completely avoids the refresh evaluation of the conditional formats and the times are comparable to the first set of tests with Screen Updating False.

I also tried repeating the tests with the Formats sheet active but hidden behind the VBE window.
The timings were virtually the same as with the Formats sheet visible.

So it's the refresh of the conditional format cells within the active sheet's window that uses the time, even if it's hidden behind some other window.

I also ran TestScroll1. This times the effect of a complete scroll of the conditional formats window.

CF2_Timings3

As you can see the scroll times are comparable to the refresh times in the first set of tests, except for Excel 2013.
But the Excel 2013 refresh timings in the first test are cheating because the refresh actually takes place after the VBA sub has ended.

Range.Calculate and Range.CalculateRowMajorOrder

If you use Range.CalculateRowMajorOrder on a single cell (or a large block of cells) it takes about 1.4 seconds – the same time as a scroll/screen refresh.
But Range.Calculate takes almost exactly twice as long: it looks like it causes 2 screen refreshes, not one!
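
A comparison like this can be reproduced with a sketch along the following lines. It assumes the Formats sheet from the test workbook and uses the built-in VBA Timer function, which is much coarser than MicroTimer but adequate for differences of a second or more.

```vba
Option Explicit

' Sketch only: time the two range calculation methods on a single cell.
' Assumes a worksheet called "Formats" holding the conditional formats.
Sub CompareRangeCalc()
    Dim rng As Range
    Dim dStart As Double

    Worksheets("Formats").Activate
    Set rng = Worksheets("Formats").Range("A1")

    ' time CalculateRowMajorOrder
    dStart = Timer
    rng.CalculateRowMajorOrder
    Debug.Print "CalculateRowMajorOrder: "; Timer - dStart

    ' time Calculate
    dStart = Timer
    rng.Calculate
    Debug.Print "Calculate: "; Timer - dStart
End Sub
```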

Seriously Slow Conditional Formats

If you want to play with a workbook containing some seriously heavyweight conditional formats you can download ConditionalFormatsC.xlsb.

This has 132K formulas =INT(RAND()*1000) in A1:V6000, and each cell has a single formatting rule to colour duplicated values in A1:V6000 orange (well, of course they all turn orange).
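
If you would rather build something similar yourself, the sketch below shows one way to do it in VBA. It is a guess at an equivalent setup, not a description of how the download was created, and it overwrites A1:V6000 on the active sheet, so try it on a blank sheet.

```vba
Option Explicit

' Sketch only: fill A1:V6000 with random integers and add one
' duplicate-values rule covering the whole block.
' WARNING: overwrites the active sheet and will be slow to refresh.
Sub BuildDuplicateFormatSheet()
    Dim rng As Range
    Dim uv As UniqueValues

    Set rng = ActiveSheet.Range("A1:V6000")
    rng.Formula = "=INT(RAND()*1000)"

    ' remove any existing rules, then add the duplicate-values rule
    rng.FormatConditions.Delete
    Set uv = rng.FormatConditions.AddUniqueValues
    uv.DupeUnique = xlDuplicate
    uv.Interior.Color = RGB(255, 165, 0)   ' orange
End Sub
```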

With the Formats sheet visible pressing F9 to recalculate takes about 40 seconds.
And it looks like evaluating the conditional formats is all single-threaded: no advantage from multiple cores!

But with the Empty sheet visible F9 takes 0.03 seconds.

Conclusions

  • Heavy conditional formatting can be slow
  • Conditional Format evaluation is single-threaded
  • EnableFormatConditionsCalculation is not very useful
  • Evaluation of conditional format rules takes place at screen refresh time rather than calculation time
  • Only the conditional format rules for cells that are shown on the active window(s) get evaluated
    (large screens will be slower than small screens and zoom out slows you down!)
  • ScreenUpdating=false works well, but the final refresh time will occur when the Sub is exited.
  • Using UDFs in conditional formats is probably not a good idea
  • The interaction of VBA and conditional formats looks buggy
  • Excel 2013 and 2010 are faster than 2007 for Conditional Formats
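
The ScreenUpdating point can be sketched as a small wrapper. This is a minimal sketch, assuming the work to be done is just a recalculation; the final refresh (and so the conditional format evaluation for the visible cells) still happens once the Sub exits.

```vba
Option Explicit

' Sketch only: do the work with screen updating off, then restore the previous setting.
Sub RecalcWithoutRefresh()
    Dim bOldScreen As Boolean

    bOldScreen = Application.ScreenUpdating
    On Error GoTo CleanUp

    Application.ScreenUpdating = False
    ' the work goes here: conditional formats are not re-evaluated while updating is off
    Application.Calculate

CleanUp:
    ' restore the setting even if the work raised an error
    Application.ScreenUpdating = bOldScreen
End Sub
```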

2 other bugs with conditional formats have been reported, but I don’t know if they have been fixed in Excel 2013:

  • Opening a file created in Excel 2003 with Excel 2007 could make the conditional formats fail to refresh unless you manually set EnableFormatConditionsCalculation=true
  • Repeated copy-pasting Conditional formats in Excel 2007 duplicated the conditional formatting rules so that large numbers of rules were created.

So what's your experience with Conditional Formats?


UNIQUES and DISTINCTS: exploring lists with LISTDISTINCTS


I just added some options to the SpeedTools LISTDISTINCTS functions that make them surprisingly powerful. You can now easily find the most frequently occurring item in a list, or find the item with the largest sum or average of a corresponding column.

But first since there is disagreement about the meaning of the terms UNIQUES and DISTINCTS I should explain what I mean:

  • A unique item in a list is one that only occurs once
  • Distinct items in a list can occur once or more than once

Creating a list of distinct items

Suppose you have a list formatted as a table:

Distincts1

Then entering the formula =LISTDISTINCTS(Table1) as a multi-cell array formula (select 14 cells in a column, enter the formula in the formula bar and press Control/Shift/Enter) gives you this:

Distincts2

Notice that the items appear in the sequence of their first occurrence in the list.
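
To make the idea of first-occurrence order concrete, here is a minimal VBA sketch of a distinct-list UDF. It is not the SpeedTools implementation: the name DistinctSketch is hypothetical, it relies on the Windows-only Scripting.Dictionary, and it only covers the default behaviour (case-insensitive, ignoring errors, blanks and empty cells).

```vba
Option Explicit

' Sketch only: return the distinct items of a range as a one-column array,
' in order of first occurrence. Not the SpeedTools LISTDISTINCTS code.
Public Function DistinctSketch(theRng As Range) As Variant
    Dim dict As Object
    Dim oCell As Range
    Dim v As Variant
    Dim vKeys As Variant
    Dim vOut() As Variant
    Dim j As Long

    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare   ' treat aa and AA as the same item

    For Each oCell In theRng.Cells
        v = oCell.Value2
        ' skip errors, empty cells and zero-length strings
        If Not IsError(v) Then
            If Len(CStr(v)) > 0 Then
                If Not dict.Exists(v) Then dict.Add v, 1
            End If
        End If
    Next oCell

    If dict.Count = 0 Then
        DistinctSketch = CVErr(xlErrNA)
        Exit Function
    End If

    ' copy the keys (held in insertion order) into a column array
    vKeys = dict.Keys
    ReDim vOut(1 To dict.Count, 1 To 1)
    For j = 0 To dict.Count - 1
        vOut(j + 1, 1) = vKeys(j)
    Next j
    DistinctSketch = vOut
End Function
```

Looping cell by cell keeps the sketch short, but reading the range into a Variant array first would be much faster on large lists.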

There are rather a lot of #N/As since Excel pads out the excess cells (the cells for which the array formula did not return anything) with #N/A.
But of course LISTDISTINCTS allows us to fix that using the PAD option in the formula =LISTDISTINCTS(Table1,,,,,1)
Pad can be 0 = pad with #N/A, 1 = pad with “”, 2 = pad with zero.

Distincts3

That looks better, but OOPS the #N/A in the list has disappeared! That's because the default option for LISTDISTINCTS is to ignore error values, blanks and empty cells, so we just need to change the Ignore option to 2.

Distincts4

And the result looks like this: (it shows #ERROR rather than #N/A so that you can distinguish it from the padding #N/A)

Distincts5

There are some more options for LISTDISTINCTS:

Distincts6

Case_Sense defaults to false, so the aa and AA in the list are treated as being the same.

If your list of items has more than one column you can either ask for a list of distinct rows (ByRows=True) or a list of all the distinct items across all the columns.

And you can sort the result list ascending (Sort=1), descending (Sort=2) or leave it unsorted (Sort=0).

Here is an example of LISTDISTINCTS sorted ascending, case-sensitive, pad with blanks, include errors, showing the difference with ByRows True and ByRows False.

Distincts7

Counting Distinct Items

There are 2 variations of LISTDISTINCTS for counting the number of distinct items: COUNTDISTINCTS and LISTDISTINCTS.COUNT

Distincts8

COUNTDISTINCTS is not an array formula and just gives you the count.
But LISTDISTINCTS.COUNT adds an extra column that gives the count of occurrences of each of the distinct items.

Finding the most frequently occurring item

You can also sort the output of LISTDISTINCTS.COUNT most frequent occurrences first (Sort=-2) or last (Sort=2).

So the formula =LISTDISTINCTS.COUNT(A21:A34,,,,-2) (not an array formula, entered in a single cell) returns AA which is the most frequently occurring item.
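
For comparison, the same result can be sketched as a plain VBA UDF. Again this is an illustration under assumptions rather than the SpeedTools code: MostFrequentSketch is a hypothetical name, Scripting.Dictionary is Windows-only, and ties go to whichever item reaches the highest count first.

```vba
Option Explicit

' Sketch only: return the most frequently occurring item in a range.
' Case-insensitive; errors, blanks and empty cells are ignored.
Public Function MostFrequentSketch(theRng As Range) As Variant
    Dim dict As Object
    Dim oCell As Range
    Dim v As Variant
    Dim vBest As Variant
    Dim nBest As Long

    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = vbTextCompare

    ' returned if the range has no usable items
    vBest = CVErr(xlErrNA)

    For Each oCell In theRng.Cells
        v = oCell.Value2
        If Not IsError(v) Then
            If Len(CStr(v)) > 0 Then
                ' a missing key is created as Empty, so this counts from 1
                dict(v) = dict(v) + 1
                If dict(v) > nBest Then
                    nBest = dict(v)
                    vBest = v
                End If
            End If
        End If
    Next oCell

    MostFrequentSketch = vBest
End Function
```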

Sums and Averages for distinct items.

As well as LISTDISTINCTS.COUNT there are LISTDISTINCTS.SUM and LISTDISTINCTS.AVG.
These take an additional column argument showing what to sum or average for each distinct item.
And, just like LISTDISTINCTS.COUNT, you can sort the output either on the distinct items or on the resulting sums or averages.

distincts10

Distinct9

Summary

Of course you can achieve similar things with Pivot Tables and PowerPivot. But there are many occasions when I find that a simple formula that automatically refreshes whenever Excel recalculates is a better solution.

And adding the option to sort ascending or descending on either the item list or the count, sum or average adds a lot of power to the functions.

So what do you use for this kind of thing: Formula, UDF, Pivot Table or PowerPivot?

