Point in Time (Nearest Neighbor Algorithms for SQL)

Kevin Meade's picture
articles: 

Been doing a lot with Historical Perspective lately. This has caused me to think a bit about the different variations of lookup that can be called a Point In Time query. My customers have found this discussion useful in understanding the details of how their report programs find specific rows in time. It is helpful to them because it provides an understanding as to why a specific row shows up on a report, and thus allows them to create more exacting definitions of what they want. Plus I find it is a good primer for newbies on staff to read so they don't make the same mistakes we made when we first started doing PIT queries. Maybe you can use it too.

First lets start with some data for which we note the following:

create table claim_history
(
 claim_id number not null
,status_date date not null
,row_eff_dt date not null
,row_exp_dt date not null
)
/

insert into claim_history values (1,trunc(sysdate-5),trunc(sysdate-5),trunc(sysdate-4)-(1/24/60/60));
insert into claim_history values (1,trunc(sysdate-4),trunc(sysdate-4),trunc(sysdate-3)-(1/24/60/60));
insert into claim_history values (1,trunc(sysdate-3),trunc(sysdate-3),trunc(sysdate-2)-(1/24/60/60));
insert into claim_history values (1,trunc(sysdate-2),trunc(sysdate-2),trunc(sysdate-1)-(1/24/60/60));
insert into claim_history values (1,trunc(sysdate-1),trunc(sysdate-1),trunc(sysdate-0)-(1/24/60/60));

commit
/

select * from claim_history
/

1) this is a historical claim table that keeps snapshots of claims over time (we only included the absolute necessary data elements for the discussion)
2) the granularity of our dates is a second
3) date ranges defined by the effective and expiration dates on our rows exhibit the collective characteristics of being contiguous and non-overlapping. This means the date ranges defined by these two date columns, follows a specific set of rules: a) that there are no missing points in time, b) each moment in time maps to exactly one row.
4) row_eff_dt and row_exp_dt are not the only dates in the table (see status_date) (hmm...)

Some Terminology

Best Fit Low: looking for the closest prior row
Best Fit High: looking for the closest next row
Inclusive: date matching may consider a row which exactly matches our target date
Exclusive: date matching may not consider a row which exactly matches our target date
Nearest Neighbor Lookup: In general any point in time query, but more specifically also queries that use combinations of high and low best fits.

Lets look at some examples for clarity:

BOB BARKER Lookup (most common best fit lookup (best fit low inclusive)).

The most common point in time lookup is by far the BEST FIT LOW INCLUSIVE lookup, meaning we want the row closest to the target date but not after. We affectionately refer to this as the Bob Barker lookup from Price is Right (whoever comes closest without going over wins...). Best fit low means we want a row prior to our target date, INCLUSIVE means we will accept a row that matches our target date. Nine times out of ten, this is the lookup you will do, and given our example above it can be written in three common ways, each of which is worth noting:

SQL> select *
  2  from claim_history
  3  where claim_id = 1
  4  and row_eff_dt <= sysdate-3 and row_exp_dt >= sysdate-3
  5  /

  CLAIM_ID STATUS_DATE          ROW_EFF_DT           ROW_EXP_DT
---------- -------------------- -------------------- --------------------
         1 06-apr-2007 00:00:00 06-apr-2007 00:00:00 06-apr-2007 23:59:59

1 row selected.

SQL> select *
  2  from claim_history
  3  where claim_id = 1
  4  and sysdate-3 between row_eff_dt and row_exp_dt
  5  /

  CLAIM_ID STATUS_DATE          ROW_EFF_DT           ROW_EXP_DT
---------- -------------------- -------------------- --------------------
         1 06-apr-2007 00:00:00 06-apr-2007 00:00:00 06-apr-2007 23:59:59

1 row selected.

SQL> select *
  2  from claim_history
  3  where claim_id = 1
  4  and row_eff_dt =
  5     (
  6      select max(row_eff_dt)
  7      from claim_history
  8      where claim_id = 1
  9      and row_eff_dt <= sysdate-3
 10     )
 11  /

  CLAIM_ID STATUS_DATE          ROW_EFF_DT           ROW_EXP_DT
---------- -------------------- -------------------- --------------------
         1 06-apr-2007 00:00:00 06-apr-2007 00:00:00 06-apr-2007 23:59:59

1 row selected.

Query#1 above is as straight forward as it gets. We are looking for the one row who's date range brackets the target date. Notice that we are using >= and <= (two inclusive tests) in this query. This is a direct reflection of the rules of our data. In some circles, the expiration date is made out to be the next row's effective date. This works too, it just means you have to alter the <= row_exp_dt to be < row_exp_dt. It also means you can not claim to have non-overlapping date ranges which opens the door for adhoc query tools to make mistakes by finding two rows for the same point in time, and also requires developers to know not to use INCLUSIVE against the expiration date column. Well, the argument that ensues from this we are going to skip.

Query#2 you will recognize as exactly the same as Query#1 except that it use the between operator to streamline the query syntax. Here is where arguments around date ranges related to query#1 (and hence this query too) get interesting. It is this query that many adhoc tools will generate by default. I suspect the reasoning being that this is what most end users will be looking for most of the time. This is reasonable, and it reaffirms the point above that you must clearly articulate the rules of your date ranges in order to reduce the possibility of error in formulating queries that match what your customers want.

Query#3 does not use the expiration date at all to lookup a row, but instead implements what might be called "old school" point in time. Why bother with it? Well, several reasons:

1) this is the query you will be writing most of the time. This is so because our example above is "best scenario". Most of the time, one of two things will be true:

a) you often will not have a table with date ranges explicitly recorded. Keeping such a date range requires extra updates to tables. The insert of a new row requires the update of an existing row. For those of us who data model we would recognize that any time you see changes to one row in a table forcing changes to other rows in the table, you have a violation of one of the normal forms (don't ask me which one, I forgot them all). So either because of data modeling practices in house, or a desire to skip extra updating for performance reasons, you will often not have an expiration date on your table, and thus no explicit date range to exploit in your query formulations. You are left with only this version of the lookup to write.

b) much of the time you won't be interested in questions relating to the effective/expiration date range of the row. Given the multitude of dates that likely exist in your data, you will usually have questions that relate to dates other than simple row effective and row expiration dates. These other dates usually have no effective/expiration set. And so again you will have to go "old school".

2) another reason this query form is valuable is that it is the generic form of all versions of BEST FIT lookups. We can write every best fit lookup by doing simple alterations of this query. Thus there is uniformity in using this version of the query. This can have side benefits like easier dynamic sql but we wont' worry about these side bennies. Instead we will simply note that it is easier to understand all best fit queries when we use this single common form.

All this said, what are the variations of lookup? They are based on the combinations of characteristics we have talked about along with the specific forms of Nearest Neighbor lookup which are combinations of High and Low lookup done together.

Best Fit HIGH vs. Best Fit LOW
Inclusive vs. Exclusive

best fit low inclusive (Bob Barker Method)
best fit low exclusive (true previous row)
best fit high inclusive
best fit high exclusive (true next row)

Nearest Neighbor Preference Low
Nearest Neighbor Preference High

Here are samples of the queries using status_date from our example:

BEST FIT LOW INCLUSIVE (Bob Barker Method) (Most Commonly Done PIT Lookup): find the row closest to the target date but not after the target date. If there is a row that exactly matches the target date, use that one.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select max(status_date)
    from claim_history
    where claim_id = 1
    and status_date <= sysdate-3
   )
/

BEST FIT LOW EXCLUSIVE: find the row closest to the target date but not after the target date. If there is a row that exactly matches the target date, DO NOT USE IT. Thus the goal is to find a truly previous row.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select max(status_date)
    from claim_history
    where claim_id = 1
    and status_date < sysdate-3
   )
/

BEST FIT HIGH INCLUSIVE: find the row closest to the target date but not before the target date. If there is a row that matches the target date, use that one.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select min(status_date)
    from claim_history
    where claim_id = 1
    and status_date >= sysdate-3
   )
/

BEST FIT HIGH EXCLUSIVE: find the row closest to the target date but not before the target date. If there is a row that matches the target date, DO NOT USE IT. Thus the goal is to find a truly next row.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select min(status_date)
    from claim_history
    where claim_id = 1
    and status_date > sysdate-3
   )
/

NEAREST NEIGHBOR PREFERENCE LOW: find the row closest to the target date but not after the target date. If there is a row that matches the target date, use that one. If there is no row that is before the target date then find the row that is closest to the target date but not before it and use that one. This lookup is thus the combination of BEST FIT LOW INCLUSIVE and BEST FIT HIGH EXCLUSIVE with preference given to LOW first.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select min(status_date)
    from (
          select max(status_date) status_date
          from claim_history
          where claim_id = 1
          and status_date <= sysdate-3
          union all
          select min(status_date)
          from claim_history
          where claim_id = 1
          and status_date > sysdate-3
         )
   )
/

NEAREST NEIGHBOR PREFERENCE HIGH: find the row closest to the target date but not before the target date. If there is a row that exactly matches the date, use that one. If there is no row that is after the target date the find the row that is before the target date and use that one. This lookup is thus a combination of BEST FIT HIGH INCLUSIVE and BEST FIT LOW EXCLUSIVE with preference given to HIGH first.

select *
from claim_history
where claim_id = 1
and status_date =
   (
    select max(status_date)
    from (
          select max(status_date) status_date
          from claim_history
          where claim_id = 1
          and status_date < sysdate-3
          union all
          select min(status_date)
          from claim_history
          where claim_id = 1
          and status_date >= sysdate-3
         )
   )
/

As you can see, these queries are all variations on a theme. Which on you use depends of course on what you want to get. Nearest Neighbor algorithms tend to be most useful when you are joining rows between tables that have dates which do not exactly line up.

The Nearest Neighbor family of queries actually contains several variations not all listed here. I'll let you do that actual query writing for these if you want them. I can think of six variations that make sense. Of particular interest are the variations of exclusive/exclusive where in a row with the target date is not wanted. You see this in situations where you have already retrieved a row from somewhere and are wanting a companion row and thus don't want to reselect the row itself.

nn low inclusive/exclusive (will take row with exact date if it exists)
nn low exclusive/inclusive (will take row with exact date if it exists but only if no previous row exists)
nn low exclusive/exclusive (will not take row with exact date)

nn high inclusive/exclusive (will take row with exact date if it exists but only if no next row exists)
nn high exclusive/inclusive (will take row with exact date if it exists)
nn high exclusive/exclusive (will not take row with exact date)

Also, we have somewhat skirted around the adjective "CLOSEST". We have not actually used closest in its real sense. We have assumed in all our PIT queries that there was some kind of demarcation or boundary line for time and hence we were specifically seeking a row that was either before or after our target date. But it is also true that we could want a Nearest Neighbor row which is actually the closest row regardless of whether it is before or after our target date. Here is a sample of such a query. There may be more efficient ways of writing this so if anybody has an idea, please post it in a reply.

CLOSEST ROW

select *
from claim_history a
where claim_id = 1
and status_date =
   (
    select status_date
    from (
          select max(b.status_date) status_date,min((sysdate-3)-b.status_date) time_difference
          from claim_history b
          where claim_id = 1
          and status_date <= sysdate-3
          union all
          select min(status_date),min(b.status_date-(sysdate-3))
          from claim_history b
          where claim_id = 1
          and status_date > sysdate-3
          order by 2
         )
    where rownum = 1
   )
/

So there it is. Best Fit and Nearest Neighbor in a nutshell. Hope Bob doesn't mind being imortalized in the IM world. I also hope someone finds this discussion useful.

Kevin

Comments

SELECT a.*
FROM   claim_history a
WHERE  claim_id = 1
AND    ABS (status_date - (SYSDATE - 3)) =
                             (SELECT MIN (ABS (status_date - (SYSDATE - 3)))
                              FROM   claim_history
                              WHERE  claim_id = 1)

Any comments?

The beauty of all of the earlier queries (except closest neighbor) is that they can use FIRST ROW INDEX RANGE SCANs. This incredibly powerful technique is discussed further here.

By adding min((sysdate-3)-b.status_date) to the sub-query, you are forcing a plain-old range-scan, which means you have to read EVERY row with claim_id=1

The same is unfortunately true of Maarten's example.

I haven't checked the syntax, but the following is the sort of thing that would work whilst retaining the FIRST ROW INDEX RANGE SCANs.

select *
from claim_history a
where claim_id = 1
and status_date =
   (
    select min(status_date) KEEP (DENSE_RANK FIRST 
                                  ORDER BY ABS((sysdate-3) - status_date)
                                 ) status_date
    from (
          select max(b.status_date) status_date
          from claim_history b
          where claim_id = 1
          and status_date <= sysdate-3
          union all
          select min(status_date)
          from claim_history b
          where claim_id = 1
          and status_date > sysdate-3
         )
   )
/

Those uncomfortable with the KEEP extension to the aggregate functions could simply nest another level deeper in the in-line view and ORDER BY / ROWNUM = 1 as Kevin did earlier.

select *
from claim_history a
where claim_id = 1
and status_date =
   (
    select status_date
    from (
        select status_date
        from (
              select max(b.status_date) status_date
              from claim_history b
              where claim_id = 1
              and status_date <= sysdate-3
              union all
              select min(status_date)
              from claim_history b
              where claim_id = 1
              and status_date > sysdate-3
             )
        order by ABS((sysdate-3) - status_date)
       )
    where rownum = 1
   )    
/

Kevin Meade's picture

I will check out some of this stuff. Also keep in mind, that when selecting CLOSEST row, it is possible that both the low row and high row are equi-distant from the target date. To account for this, any closest query construct must make certain to return only one of these two row. Arguably either is fine though there might be a preference.

Kevin