Use Subqueries to Count Distinct 0X Faster

NB: This post is in response to Use Subqueries to Count Distinct 50X Faster and my comment on Hacker News.

On Periscope’s The High-Performance SQL Blog there is an article that shows how to improve the performance of COUNT(DISTINCT.... It begins by saying that these techniques are universal but goes on to focus on the performance benefits exclusively when they’re applied to PgSQL.

By way of comparison, I’ve tried to emulate their environment and techniques with MSSQL.

It’s not clear what hardware Periscope uses for their benchmarks. I suspect it’s more powerful than my computer, because their best result is 0.7 seconds with 10 million rows in time_on_site_logs whereas my best (when forcing SQL Server to generate serial plans like theirs) for 10 million rows is just under two seconds. I’m benchmarking on my computer:

  • Windows 8.1 x64
  • SQL Server 2012
  • Intel Core i3 540
  • 4GB RAM
  • Intel SSD 520 Series

The Tables

Please forgive any inconsistencies in capitalization – I usually write my Queries LIKE 'This' but Periscope’s blog has SQL written like 'this'.

Don’t forgive any mistakes made translating Periscope’s experiment to MSSQL or in my analysis – let me know if I got anything wrong in the comments.

The structure is reverse-engineered from the queries in the blog post.

CREATE TABLE dashboards
(
    id INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    name NVARCHAR(80) NOT NULL
);

CREATE TABLE time_on_site_logs
(
    id INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    dashboard_id INT NOT NULL,
    CONSTRAINT FK_Time_On_Site_Logs_Dashboards
      FOREIGN KEY (dashboard_id)
      REFERENCES Dashboards(id),
    user_id INT NOT NULL
);

I didn’t bother with a users table and foreign key because as far as I know there’s nothing in the following queries that would benefit from it. (SQL Server can apply JOIN elision when working with trusted foreign keys … but we never JOIN to users)

The Data

Not knowing what the data is I made some guesses.

WITH cteNumbers AS
(
  SELECT 1 AS Number
  UNION ALL
  SELECT Number + 1
  FROM   cteNumbers
  WHERE  Number < 216
)
INSERT INTO dashboards(name)
SELECT CAST(cteNumbers.Number AS NVARCHAR(80))
  FROM cteNumbers
OPTION (MAXRECURSION 0);

INSERT INTO time_on_site_logs(dashboard_id, user_id)
SELECT dashboards.id, 1 + (RAND(CAST(NEWID() AS VARBINARY)) * 1000)
FROM   dashboards,
       dashboards A,
       dashboards B;

SELECT COUNT(*) FROM time_on_site_logs;

It’s slightly unrealistic because there are so many rows (10,077,696) that in practice the RAND... bits cause every user to visit every dashboard at least once, but should do for now.

The Indexes

Periscope say that assume the handy indices on user_id and dashboard_id are in place. The most handy indexes I can think of are as follows.

Note that I’m assuming that time_on_site_logs doesn’t have a natural key in dashboard_id and user_id – users might visit dashboards multiple times.

CREATE INDEX IX_Dashboard_Id
  ON time_on_site_logs (dashboard_id, user_id);

CREATE INDEX IX_User_Id
  ON time_on_site_logs (user_id, dashboard_id);

Both columns are included in the key rather than covered — the ordering helps when calculating DISTINCT.

Using covering indexes causes MSSQL to pick a different plan, but the performance is similar. The relative performance between each query is still the same … exactly the same.

I’m throwing in a unique index on dashboards.name too; if we’re going to report on dashboards by name it doesn’t make sense to have dashboards with the same name. No online retailer is looking at their sales reports and thinking that John Smith is their best customer and Amorette McFredson is their worst.

CREATE UNIQUE INDEX IX_Name
  ON dashboards (name);

The Queries

Modified for Transact-SQL and serial execution is enforced.

select 
  dashboards.name, 
  count(distinct time_on_site_logs.user_id)
from time_on_site_logs 
join dashboards on time_on_site_logs.dashboard_id = dashboards.id
group by name 
/* modified: was "order by count desc" */
order by count(distinct time_on_site_logs.user_id) desc
/* serial execution */
OPTION (MAXDOP 1);

select
  dashboards.name,
  log_counts.ct
from dashboards
join (
  select
    dashboard_id,
    count(distinct user_id) as ct
  from time_on_site_logs 
  group by dashboard_id
) as log_counts 
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc
/* serial execution */
OPTION (MAXDOP 1);

select
  dashboards.name,
  log_counts.ct
from dashboards 
join (
  select distinct_logs.dashboard_id, 
  count(1) as ct
  from (
    select distinct dashboard_id, user_id
    from time_on_site_logs
  ) as distinct_logs
  group by distinct_logs.dashboard_id
) as log_counts 
on log_counts.dashboard_id = dashboards.id
order by log_counts.ct desc
/* serial execution */
OPTION (MAXDOP 1);

The Results

le_meme

Identical execution plans across the board. Approximately two seconds when running serially, one second when running in parallel.

boring

The parallel plans look similar to the above but with some parallel operators thrown in.

Conclusion

I’m quite sure to make of this. MSSQL did exactly what I thought it would do, but it’s not clear why PgSQL couldn’t generate the best plan from the simplest form the query.

It makes me wonder if MSSQL is so good at this because both MSSQL and the Entity Framework live under the Microsoft umbrella. EF produces queries that are straightforward but more verbose than an individual would write. Every optimization that MSSQL can make will make EF look better.

There’s a discussion to be had about the cost of free vs. proprietary tools and developer productivity … but I’m sure you’ve already drawn that conclusion.

A Word About dashboards.name

I don’t know whether the query that was chosen for optimization because it’s good for demonstrating the technique on PgSQL or because it represents a query that is actually in use. I don’t like group by name without knowing whether dashboard names are unique.

With that said, if it’s not unique then the query is arguably “wrong” because it aggregates results for distinct dashboard names rather than distinct dashboards.

Removing the unique index on dashboards.name does change the execution plan of the first query – it’s slightly worse than the other two (which remain identical). “Fixing” the first query to group by dashboards.id and dashboards.name causes the better plan to be generated again.

GenerateBootstrapperTask and the Visual Studio 2010 Command Prompt

If you’re using MSBuild and want to generate a bootstrapper for your application you might be drawn to this example on MSDN. It looks straightforward enough, and I’ve reproduced it below with a few changes (to install a later version of the .NET Framework, and Windows Installer 3.1).

<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

    <ItemGroup>
        <BootstrapperFile Include="Microsoft.Net.Framework.3.5.SP1">
            <ProductName>Microsoft .NET Framework 3.5 SP1</ProductName>
        </BootstrapperFile>

        <BootstrapperFile Include="Microsoft.Windows.Installer.3.1">
            <ProductName>Windows Installer 3.1</ProductName>
        </BootstrapperFile>
    </ItemGroup>

    <Target Name="BuildBootstrapper">
        <GenerateBootstrapper
            ApplicationFile="WindowsApplication1.application"
            ApplicationName="WindowsApplication1"
            ApplicationUrl="http://mycomputer"
            BootstrapperItems="@(BootstrapperFile)"
            OutputPath="C:\output" />
    </Target>

</Project>

However, building this project from the Visual Studio 2010 Command Prompt will give you many nasty errors:

bootstrapper.proj(10,9): error MSB3147: Could not find required file 'setup.bin' in 'C:\Output\Engine'.

The problem here is that even though you launched MSBuild from the Visual Studio 2010 Command Prompt, it doesn’t know what toolset to use.

The solution is implemented easily enough, simply change the root node of the project XML:

<Project ToolsVersion="4.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

With this change the project builds successfully:

Microsoft (R) Build Engine version 4.0.30319.17929
[Microsoft .NET Framework, version 4.0.30319.17929]
Copyright (C) Microsoft Corporation. All rights reserved.

Build started 28/09/2012 4:24:11 PM.

Build succeeded.
    0 Warning(s)
    0 Error(s)

Time Elapsed 00:00:00.40

If changing the project XML is not is not suitable or convenient, you can also set the ToolsVersion from the command line.

Dynamic ODBC (RDO) Connection Strings in Crystal Reports

Apparently changing the connection string of an ODBC (RDO) connection in Crystal Reports dynamically (at runtime) is hard to do.

I had a crack at it and this is what I came up with:

private static ConnectionInfo GetConnectionInfo()
{
    // DbConnectionAttributes contains some, but not all, consts.
    var logonProperties = new DbConnectionAttributes();
    logonProperties.Collection.Set("Connection String", @"Driver={SQL Server};Server=TODD-PC\SQLEXPRESS2;Trusted_Connection=Yes;");
    logonProperties.Collection.Set("UseDSNProperties", false);

    var connectionAttributes = new DbConnectionAttributes();
    connectionAttributes.Collection.Set("Database DLL", "crdb_odbc.dll");
    connectionAttributes.Collection.Set("QE_DatabaseName", String.Empty);
    connectionAttributes.Collection.Set("QE_DatabaseType", "ODBC (RDO)");
    connectionAttributes.Collection.Set("QE_LogonProperties", logonProperties);
    connectionAttributes.Collection.Set("QE_ServerDescription", @"TODD-PC\SQLEXPRESS2");
    connectionAttributes.Collection.Set("QE_SQLDB", true);
    connectionAttributes.Collection.Set("SSO Enabled", false);

    return new ConnectionInfo
               {
                   Attributes = connectionAttributes,
                   // These don't seem necessary, but we'll include them anyway: ReportDocument.Load does
                   ServerName = @"TODD-PC\SQLEXPRESS2",
                   Type = ConnectionInfoType.CRQE
               };
}

This function imitates the ConnectionInfo object that I saw ReportDocument.Load produce.

Now we’ve got a ConnectionInfo we have to apply it. The correct way seems to be to iterate over:

  1. subreports and their tables
  2. your report and its tables

Once the ConnectionInfo has been applied you can set your parameters.

I made a small demo application to test my code. For simplicity’s sake the report has one connection and no parameters, so the code to configure the report’s connection is quite simple:

private void Form1_Load(object sender, EventArgs e)
{
    var reportDocument = new ReportDocument();
    reportDocument.Load(@"CrystalReport1.rpt");
    var table = reportDocument.Database.Tables[0];
    var logonInfo = table.LogOnInfo;
    logonInfo.ConnectionInfo = GetConnectionInfo();
    table.ApplyLogOnInfo(logonInfo);

    crystalReportViewer1.ReportSource = reportDocument;
}

Now if we take a quick look at the way the report is set up, you’ll see it has one command that asks for the server name:

The connection string in the report is configured to point to the first of two SQL Server instances on my machine. We can verify this by previewing the report:

But that’s not what you came here to see! If we run the application, we see Crystal Reports connect to the server we specified in the code behind:

Note that the retrieved server name is TODD-PC\SQLEXPRESS2.

Update

I’ve since found that ApplyLogonInfo uses the server name to determine which properties in the connection details to update. Depending on your circumstances, you may want to always generate a unique server name/description to ensure that as many properties as possible are changed.

Visual Studio 2010 SP1, Windows SDK 7.1 Install Order

Thanks to an issue between Visual Studio 2010 SP1 and the Windows SDK 7.1 installing and fully patching Visual Studio 2010 became a little bit more complicated and time consuming.

Not only is there a special installation order you should follow, but it seems that the Windows SDK 7.1 is a bit picky with the Visual C++ 2010 runtimes it will install beside.

If you happen to fully patch your computer before installing the Windows SDK 7.1 you’ll probably end up downloading KB2565063. This updates contains some of the things the Windows SDK 7.1 installer doesn’t like.

So, to improve upon the previously suggested install order:

  1. Install Visual Studio 2010 RTM
  2. Uninstall any Visual C++ 2010 runtime newer than 10.0.30319 (I’ve seen 10.0.30419 and 10.0.40219 block the installation)
  3. Install Windows SDK 7.1
  4. Install Visual Studio 2010 SP1
  5. Install Visual C++ 2010 Service Pack 1 Compiler Update

In my case, skipping step 2 resulted in the following message during Windows SDK 7.1 installation:

Installation of the “Microsoft Windows SDK for Windows 7″ product has reported the following error: Please refer to Samples\Setup\HTML\ConfigDetails.htm document for further information

And the following in the error logs:

DDSet_Error: Patch Hooks: Missing required property 'ProductFamily': Setup cannot continue.
DDSet_Warning: Setup failed while calling 'getDLLName'. System error: Cannot create a file when that file already exists.

Update

I was installing Visual Studio 2008, 2010, 2012 and SQL Server 2012 on Windows 8, only to stuff up the installation by not following my own instructions! If anybody wants to go the whole hog with three side-by-side installations of Visual Studio, here’s the install order that worked for me:

  1. Visual Studio 2008, then SP1, then later updates
  2. Visual Studio 2010 and Windows SDK 7.1 as per the instructions above
  3. SQL Server 2012 (SQL Server Management Studio 2012 is based on Visual Studio 2010, so I thought it best to install this before Visual Studio 2012)
  4. Visual Studio 2012

If you’re successful, you should have yourself a very busy Start screen:

Disabling WCF’s Explicit Element Ordering

If you consume a SOAP web service via WCF, and the service doesn’t return its XML elements in the order specified by its WSDL, you may experience “mysterious null values” in the deserialized response. This is because WCF enforces element ordering for sequences as defined by the XSD standard. If it encounters an element that’s not in the right place it simply gives up.

This is unfortunate if you are consuming a third party web service and you do not have control over the element ordering.

As a workaround you can generate proxies that serialize with XmlSerializer instead of DataContractSerializer, and then remove the element ordering information from the proxy code. Fortunately (with a little help from sed) this can be automated. You’ll need to add a Pre-Build action to your project, and populate it like so:

set proxy_tool="C:\Program Files\Microsoft SDKs\Windows\v7.1\Bin\SvcUtil.exe" /nologo /t:code /ser:XmlSerializer /UseSerializerForFaults
set sed_tool="$(ProjectDir)sed.exe" -r -i "s/,?[[:space:]]*Order=[[:digit:]]+//"

%proxy_tool%  /o:"Proxy1.cs" /n:*,Namespaces.Name1 "Proxy1.wsdl"
%sed_tool% "Proxy1.cs"

%proxy_tool%  /o:"Proxy2.cs" /n:*,Namespaces.Name2 "Proxy2.wsdl"
%sed_tool% "Proxy2.cs"

...

“The database engine could not lock table ‘…’ because it is already in use by another person or process” error in JET

When performing DML and DDL queries against an Access database you may encounter something similar to the following:

The database engine could not lock table ‘Orders’ because it is already in use by another person or process.

The error message gives some information about one likely cause of this error – somebody else is trying to modify the same table you are. However it’s also possible to get this error message even if you have exclusive access to the database. This example illustrates:

DbConnection connection = Program.OpenAceOleDbConnection("C:\\Temp\\Northwind.mdb");
DbTransaction transaction = connection.BeginTransaction();
DbCommand command = connection.CreateCommand();
command.Transaction = transaction;

command.CommandText = "DELETE FROM [Orders] WHERE [Order ID] < 4000";
command.ExecuteNonQuery();

command.CommandText = "ALTER TABLE [Orders] ADD COLUMN [Test] INT";
command.ExecuteNonQuery();

transaction.Rollback();
connection.Close();

As soon as the second query begins to execute the error is raised:

JET DDL - Locking Exception

JET DDL – Locking Exception

The error message can be produced using JET, ACE and DAO. It is not raised consistently, it appears to be affected by whether the first query modifies row data pages partially or completely. Furthermore, it is only raised when executing queries inside a transaction.

I have not found a good workaround for this problem. If possible, reorder your queries so that the DDL is performed first. Another option is to perform the queries outside of a transaction. In my case, I took advantage of JET’s file-based format and the fact that my application would always have exclusive access to that file and emulated transactions by copying the database to a temporary location, performing the queries and then copying it back.

Update

Q331594 seems to come pretty close to describing this behaviour.

“Operation must use an updateable query” error in JET

When performing update queries against a Microsoft JET database, you may come across the following error:

Operation must use an updateable query

The Microsoft Knowledge Base has an article for the most obvious cause of this error, that the database file is not writable. However, you may also receive this error when trying to run an UPDATE query which contains a subquery:

UPDATE [Orders]
   SET [Status ID] = ( SELECT TOP 1 [Status ID]
                         FROM [Orders Status]
                        ORDER BY [Status ID] );

I encountered this error when trying to write an update query that was valid JETSQL and TSQL. Unfortunately it’s not possible to write the query in a way that is valid SQL on both platforms1 – to work around this error in JET you need to change the way the query is structured:

UPDATE [Orders]
       , ( SELECT TOP 1 [Status ID]
             FROM [Orders Status]
            ORDER BY [Status ID] ) AS [tmpSubquery]
   SET [Orders].[Status ID] = [tmpSubquery].[Status ID];

The query now runs, but we’re not out of the woods yet. Let’s try writing a similar query that would achieve the same thing:

UPDATE [Orders]
       , ( SELECT MIN([Status ID]) AS [Min Status ID]
             FROM [Orders Status] ) AS [tmpSubquery]
   SET [Orders].[Status ID] = [tmpSubquery].[Min Status ID];

This query does not run – it fails with the same error. The problem is our use of an aggregate function (in this case, MIN).

To work around this we must take a departure from ANSI SQL and use the Domain Aggregate functions in Access.

UPDATE [Orders]
   SET [Status ID] = DMin("[Status ID]", "[Orders Status]")

This works, and can even be extended to emulate a correlated subquery via each domain aggregate function’s optional criteria argument.

1 JETSQL and TSQL have differing syntax for update queries. TSQL requires you to use the FROM clause when the update involves multiple tables, JETSQL does not (in line with ANSI SQL). Thus to write an update query that works on multiple platforms you need to use a subquery.

Update

Turns out there is a Knowledge Base article for this problem, for Access 2.0, and describes 3 workarounds. The post above describes the third workaround. Note that the first workaround may not work if you’re inside a transaction (it uses DDL), so I recommend looking at workarounds two and three.