Wednesday, December 21, 2016

Big Crazy Linked Server Query

I got bored and was fiddling around with a statement I'd been doing manually in several steps to try to make it as easy to use (not read) and as quick as possible. The rundown is this:

    I have a query I want to run which spans two databases.

    The relevant schema for these two servers is as follows:

-- Server 1

    -- Has about 500m rows
    create table dbo.orders
        order_id varchar(55) primary key clusterd,
        Data varchar(8000) -- just a placeholder to show that there's other stuff in here

    -- Server 2
    create table dbo.orderLog
        order_id varchar(55) primary key clustered,
        PayloadId int unique,
        OtherData varchar(8000) -- Placeholder for other columns. 
       --Not that it matters, but this data is different than the data on server 1
    -- Has about 20 million rows
    create table
        RecordId int identity(1,1) primary key clustered,
        PayloadId int,
        TriggerDate datetime,
        OtherData varchar(8000)

    exec master.dbo.sp_addlinkedserver
        @datasrc = 'Server1',
        @server = 'lnkServer1',
        @srvproduct = 'SQL Server

I need to query Server1.dbo.Orders.Data using a list of order_ids having a certain bad data point in them.The issue is that a subset of the orders have a bad data point (TriggerDate) which can't be derived from the data on Server 1, but CAN be derived from the data on server 2. The bad rows are any rows in Server2.dbo.Events where TriggerDate < '1900-00-00'.

Server1.dbo.orders has close to half a billion rows in it, and just about any way you slice it, doing a linked server join ends up doing  a scan on Server1.dbo.Orders.Data. However if I use OPENQUERY with the results of a query on server 2 hard coded in there, it compiles the SQL locally on Server1 and performs admirably. The trouble is then constructing a query to do this. Actually this is not that hard to do, but  because I'm feeling particularly masochistic, I decided to try to compact it into as few sql statements as possible. And it is here, dear reader, where the fun begins.

  1. Events has the information I need, but to trace back to a specific order, I need the PayloadID, which is not unique in EventLog
  2.  So I query OrderId from AlertLog where such a payload matches the criteria from step 1 exists in EventLog
  3.  I then use a FOR XML PATH statement to serialize all these orders so I can concatenate them into a DSQL clause
  4.  Also, apparantly OPENQUERY doesn't allow a query larger than 8000 characters. So I have to select the top n rows where n = 8000 (max field size) / the length of an order_id (55)
  5.  I use quotename() on the OrderID to wrap it in single quotes. However since this is nested in several levels of DSQL, I'll have to do more to it (more on that later)
  6.  I use stuff() to cut off the leading comma
  7.  I use replace() on all single quotes and make them two single quotes. This way, as it exists in the openquery code portion, the quotes are properly escaped. This could have been accomplished in any on several different ways. This is just the route I chose.
  8.  I can now almost do my select from OPENQUERY, but OPENQUERY doesn't allow parameterization of the code portion, or anything but a string literal
  9. So the final tactic here is to put all this in DSQL. That way the DSQL executes the openquery statement as though it were literal SQL. Within the context of the DSQL, the OPENQUERY doesn't realize it's being tricked into executing dynamic code.  It then executes the dynamic SQL using a linked server, while executing locally compiled code on the target server.
  10. The dynamic SQL is executed. Although it irks me that I have two statements here, going the exec ('/*sql code*/') route actually introduces a bunch of unnecessary complexities which I feel just aren't worth making to consolidate the query further. 
Sadly, the setup of this is so convoluted I can't really easily provide test data for this, but trust me, it works.

declare @sql nvarchar(max) = '
    select *
    from openquery(lnkServer1, ''
        select *
        from Server1.dbo.orders with (nolock)
        where order_id in
        (' + replace(stuff((select ',' + quotename(OrderId, '''')
                            from (select top (8000 / 55) OrderId
                                  from Server2.dbo.AlertLog o with (nolock)
                                  where exists (select 1
                                                from Server2.dbo.EventLog i with (nolock)
                                                where i.TriggerDate < '1900-01-01'
                                                    and o.PayloadId = i.PayloadId)) a
                            for xml path('')), 1, 1, '') , '''', '''''')
        + ')'')'

exec sp_executesql @sql

No comments:

Post a Comment