Period BladeLet


The Period Bladelet provides SQL support for managing information about fixed intervals in a timeline: real world phenomenon like the duration of a hotel stay, or the scheduled activity of some manufacturing equipment. The BladeLet implements a pair of OPAQUE TYPE User-Defined Type (UDT) instances, and a set of User-Defined Functions (UDFs) to perform the operations appropriate to these types. In addition to being a useful bundle of extensions in its own right, the Period Bladelet also provides a good example of several extensibility features:  building an OPAQUE TYPE, how to use ORDBMS the Operator Class feature with R-Trees, a User-Defined Aggregate example, and a set of statistics and selectivity facilities.

Contents


Introduction and Background.

The motivation for Period extensions to SQL  is best explained with an example. Consider a scheduling system. For dramatic effect our example discusses a railway, but the same logic applies to hotel rooms, rental cars, production scheduling, personal appointments, and any application where a shared resource is being allocated for fixed  periods of time. Temporal information is becoming an increasingly important part of many information systems. See Snodgrass, Richard T. Developing Time-Oriented Database Applications in SQL ISBN: 1-55860-436-7 for more details than there is space for here.

Figure 1, which is intended to convey a set of track usage schedules for railway trains, illustrates the basic concept of temporal intervals.

         Time:
Track  : T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17
Seg #:
  1      |--|  |--|  |-----|     |--|               |-------|

  2         |-----|     |-----|          |---|

  3            |--------------|     |-----------|       |---|

  4         |--|  |--|--|        |-----|        |-------|   |-------)

  5      |----------------------------------------------------------)

  6                                 |--|  

  7         |-----|     |-----|     |-------|       |---------------)


  8            |-----|  |--------|              |---|

  9     

 10      (-----|              |----------|   |-------|   |----------)

Figure 1: Diagrammatic Representation of Temporal Periods

Each number on the left identifies a track segment. Each of the '|---|' strings signifies that some train is scheduled onto a particular track segment for some period of time. Segment schedules all have a start date or datetime indicator, extend for some number of units of time, and then finish. Each of the single interval instances we call a period. More formally, a period is a fixed interval in the time line, which contrast periods with the SQL-92 INTERVAL type  corresponding to floating intervals in a time-line.

In certain special cases we also need to consider schedules that are "open". A period's start may be unknown, or the situation may have 'always been thus', or the time at which a period's finishes may be undetermined. Railway schedules are not open-ended in this way but in applications such as command and control systems managing "open" Periods is a necessity..

A situation like the one illustrated in Figure 1 might be accommodated in a SQL-92 RDBMS with a table that looks like  Figure 2:

CREATE TABLE Track_Schedule  (
        Train    INTEGER                 NOT NULL,
        Track    INTEGER                 NOT NULL,
        Starts   DATETIME YEAR TO SECOND NOT NULL,
        Ends     DATETIME YEAR TO SECOND NOT NULL,
        CHECK ( Starts < Ends ) CONSTRAINT Starts_must_preceed_Ends,
            FOREIGN KEY ( Train ) REFERENCES Trains ( Id ) CONSTRAINT Train_FK,
            FOREIGN KEY ( Track ) REFERENCES Tracks ( Id ) CONSTRAINT Track_FK
);

Figure 2: SQL-92 Handling of Temporal Periods

Figure 2's Track_Schedule table is a reasonable approach to the problem, but it has several deficiencies. First, querying this table's information in an application requires writing some fairly complex SQL-92 expressions.  Second, within the limits of most RDBMS products it is impossible to index queries that perform temporal operations like overlap, within, and contains.  And third, some behaviors of range data structures like Period cannot be reasonably accommodated within SQL-92. All of these problems are addressed by the Period Bladelet's functionality.

To illustrate each problem in more detail, let us examine a couple of fairly obvious business questions and see how they would be handled using the Track_Schedule table/SQL-92 approach. In each of the examples that follow we first list the question in plain language, and then present the corresponding query.

Clearly, it is rather desirable for our railway company to prevent two trains from being scheduled on the same track segment for overlapping periods. (Obviously the situation is more complex than this, but bear with us here.) Answering this question using just the standard SQL-92 predicates is more awkward than it needs to be.

"How are there any situations where two different trains are scheduled on the same track over the next week?"
SELECT T1.Track, T1.Train, T2.Train, T1.Starts, T1.When,
       T2.Start, T2.Ends
  FROM Track_Schedule T1, Track_Schedule T2
 WHERE T1.Finish >= CURRENT
   AND T1.Start  <= CURRENT + 7 UNIT DAY
   AND T2.Start  <= T1.Finish
   AND T2.Finish >= T1.Start
   AND T1.Track  = T2.Track
   AND T1.Start  <> T2.Start
   AND T1.End    <> T2.End
   AND T1.Train  <> T2.Train;

Figure 3: SQL-92 Query Performing Temporal Overlaps Predicate

While this is not a particularly complex query expressions, it takes a rather close reading of the query's logical details to figure out that its WHERE clause is computing an Overlap(). Other temporal expressions, such as Contains() and Within() are similar in form, and this similarity makes it difficult to spot errors in temporal expressions, particularly because applications that care about overlapping ranges employ these query language patterns often. And what about more complex kinds of query; ones that join tables based on overlapping intervals such as how many stays in a hotel overlap a conference or a special price? Simplifying and clarifying declarative SQL would reduce many of these problems.

In Figure 4, we present the way this same question would be answered using the facilities of the Period BladeLet. (Look ahead to Figure 6 for the re-defined Track_Schedule table used in this query). Even though it accomplishes more than the expression in Figure 3,  the SQL in Figure 4 is clearly simpler, and it is fairly obvious what it is that the query is attempting to do.

SELECT T1.Track, T1.Train, T2.Train, GetIntersect ( T1.When, T2.When)
   FROM Track_Schedule T1, Track_Schedule T2
 WHERE Overlap ( T1.When, DT_Period( CURRENT, CURRENT + 7 UNITS DAY) )
   AND Overlap ( T1.When, T2.When )
   AND T1.When  <> T2.When
   AND T1.Track  = T2.Track
   AND T1.Train  <> T2.Train;
 

Figure 4: Query Performing Temporal Overlaps Predicate with Period BladeLet

The second problem with the query in Figure 3 relates to performance. Developers familiar with SQL can tell by looking at the WHERE clause in Figure 3 that an  RDBMS will probably be unable to use an index to accelerate this query. The best that can be hoped for is that the DBA builds a compound B-Tree < Starts, Ends > or < Ends, Starts > index. But because of the different legs of the query uses a different operator (<= or >=) with different columns,  the best plan that a SQL-92 RDBMS can come up with is to scan of both indices and a hash join to identify matching rows. This is generally so expensive that the query planner is likely to punt on the question and simply scan the entire table.

By contrast, both Overlap() predicates in Figure 4, and a range of other temporal operations, are amenable to indexing. An R-Tree on the Track_Schedule.When column can be used both to limit the segments checked to only those over the next week, and also on the inner of a join between the tables. In contrast with some other approaches, there is no absolute requirement to use a resolution table. Sometimes, a HASH join is preferred to a NEST LOOP. Sometimes, it is not. Part of the Period BladeLet's functionality is a statistics and selectivity estimator to help make this choice.

Finally, some intuitively simple temporal business questions are enormously difficult to express in SQL-92. Sometimes, for example, it is desirable to determine whether or not there is some fixed interval of time that all of the Periods in some set overlap, and if they do, the extent of this overlap is. That is, is there a period of time during which all of a set of events are occuring simultaneously? Performing this operation in SQL-92 requires finding the maximum starting time and minimum finish time for which there exists no non-overlapping period in the set. Figure 5 illustrates how this operation is handled using the Period BladeLet.

"Between 10:30am and 10:35am on 10th April, 2001 there will be a scheduled reduction in power to electrical rail in Montana.  This is not a problem unless all tracks are in use at the same time. Is this the case?"

SELECT Min_Overlap( TS.When )
  FROM Track_Schedule TS, Track_Segment TR
 WHERE TS.Track = TR.Id
   AND TR.IsElectric
   AND Overlaps ( DT_Period("2001-04-10 10:30:00",
                            "2001-04-10 10:35:00" ),
                  TS.When )
   AND Within ( TR.Where, :Montana );

Figure 5: User-Defined Aggregate Query

Min_Overlap() is an example of a User-Defined Aggregate (UDA)  implemented as part of the Period BladeLet. In much the same way that MIN() takes a set of numbers or strings and returns the smallest value in the list, Min_Overlap() accepts a list of Period instances and returns a Period value -- which might or might not be one of the values ipassed in -- that is the minimum overlapping period for all of them. Actually, this aggregate is not that useful on its own. It is rather more useful when used as a building block for other, more complex business operations.

The Period BladeLet ships with complete source code so it can be modified to accommodate the specifics of other problems. In this release the BladeLet only supports Periods with DATE and DATETIME YEAR TO SECOND delimiters. Developers needing FRACTIONs or other, more exotic time units will need to change the code. The Period BladeLet deviates from some more common practices among datablade products in that it does not install using Blade Manager. Instead, you should register and unregister it using the SQL scripts that come with.

This page provides an overview of the BladeLet's functionality, and a guide to the finer points of its implementation. Currently, this work is ported as far as NT, Solaris and RedHat Linux without problems. Experience has shown that other platforms are rather easy to do, but APITB.

Contents

Design Overview.

To overcome both the query complexity and performance problems, the Period BladeLet provides a pair of new OPAQUE TYPE extensions: Period, and DT_Period. They correspond to T-SQL Period ( DATE ) and Period ( DATETIME ) objects, respectively. Both of these types come with a set of UDFs to handle common temporal operations, and to interact the built-in DATE, DATETIME and INTERVAL types. Both of them can be indexed using the R-Tree for optimal performance.
In Figure 6, we re-write the table introduced in Figure 2 to illustrate one of the new types. The idea is that instead of storing the start and finish values in separate columns,  DT_Period combines them into a single data object, stored in a single column. The CHECK() integrity constraint in Figure 2 is moved within the user-defined functions that implement the type's behaviors. Encapsulating this rule within the type's behavior simplifies its use in SPL variables and query expressions. Both the DT_Period and Period data types enforce this rule.
CREATE TABLE Track_Schedule  (
        Train    INTEGER      NOT NULL,
        Track    INTEGER      NOT NULL,
        When     DT_Period    NOT NULL,
            FOREIGN KEY ( Train ) REFERENCES Trains ( Id ) CONSTRAINT Train_FK,
            FOREIGN KEY ( Track ) REFERENCES Tracks ( Id ) CONSTRAINT Track_FK
);

Figure 6: Track Schedule Table Re-defined using Period BladeLet Types

The public format of these types is quite similar, and both Period and DT_Period handle the "start is epoch" and "finish is unbound" cases in similar fashions. Rather than relying on programmers to get the syntax right each time--which is not always easy in the case of DATETIME and INTERVAL strings--the BladeLet includes a set of "constructor" UDFs to simplify data management. Figure 7 lists several examples of both of these types.
 
"1999-10-10 12:10:10" to "1999-12-20 22:20:20" "2000-02-09 08:30:30" to "2000-03-20 08:30:30"
"2000-03-20 08:30:30" to "2000-08-07 18:40:40" "2000-08-07 18:40:40" to "2000-10-07 04:50:50"
"EPOCH" to "1999-10-20 22:20:20" "1999-12-20 22:20:20" to "FOREVER" 
"EPOCH" to "2001-03-07 01:11:10" "EPOCH" to "FOREVER"
"10/10/1999" to "12/20/1999" "2/9/2000" to "3/20/2000"
"3/20/2000" to "8/7/2000" "8/7/2000" to "10/7/2000"
"EPOCH" to "10/20/1999" "12/20/1999" to "FOREVER"
"EPOCH" to "3/7/2001" "EPOCH" to "FOREVER"
Figure 7: Examples of DT_Period and Period Data Type Instances
So far as the rest of the DBMS is concerned, the Period and DT_Period types are treated like any of the built-in types. They both fit entirely within a data row, the Period being 8 bytes long, and the DT_Period being 48 bytes long. Included in the regression tests for this Bladelet are back-up and recovery tests. Although I have not tested it, it should also be possible to use replication with these types.

A large set of temporal operations are implemented as part of the BladeLet. These deal with the intuitively obvious set of comparisons between two Period instances; Overlap, Contains, Equal, Within, Before, After, etc. Consult the section on User-Defined Functions for a detailed list. Where ever it was possible to do so, the operations have been combined using an R-Tree operator class (opclass) to allow developers to exploit the R-Tree indexing technology. In practice this is incredibly useful: without indexing the regression test suite takes almost two hours to run on an NT laptop, but only one minute with indexing.

So far as the indexing operations are concerned, EPOCH and FOREVER are treated as the smallest possible and largest possible date or datetime values. Internally, the byte level values used to store them are not valid SQL-92 internal formats. Exercise caution when writing code to modify these values.

Contents

 

User-Defined Types and User-Defined Functions in the BladeLet.

The BladeLet ships with two OPAQUE TYPE instances, and a large set of User-Defined Functions implementing behaviors for them. In addition to the DATE and DATETIME YEAR TO SECOND boundaries, the BladeLet implements two "special" values: EPOCH and FOREVER. The idea is to support periods where the open or the close is not known at the time the object is created. These words can appear in the public string, and there are user-defined functions that can get and set these values.

Tables involving these types can be both backed up using standard techniques, and because these range values can reside entirely within a single data page, they can be replicated.

List of User-Defined Types.

DT_Period is a fixed interval bound by a DATETIME Start and Finish. Internally, Start and Finish are stored using real DATETIME data structures, and to implement EPOCH and FOREVER these SAPI data structures are extended with additional macro values. All of the User-Defined Functions described in the next section work for DT_Period, although the details of INTERVAL and size vary from the Period.

List of User-Defined Functions.

This list of UDFs implemented as part of the BladeLet is broken into several sections, each containing a particular category of UDF. Different categories of UDF perform different functions: some of them are SQL-level expressions, some of them are never seen by SQL developers. Each UDF description includes the UDF's signature (its name, and the vector of its arguments), its return type, and a brief description of what it does. Note that this list can be extracted from the database's system catalogs using the query presented in Figure 8.

SELECT UPPER ( S.procname ||
              ' ( ' || S.paramtypes::LVARCHAR || ' ) -> ' ||
              ifx_ret_types ( S.procid )::LVARCHAR
             )
  FROM sysprocedures S
 WHERE S.paramtypes::LVARCHAR LIKE '%dt_period%'
    OR ifx_ret_types(S.procid)::LVARCHAR LIKE '%dt_period%';

Figure 8:  Query to Extract Information About DT_Period User-Defined Functions from Catalogs
Constructor Functions.

Constructor UDFs are the input (and output) logic used to create an instance of the new data type. In addition to functions converting the public string format into the internal data structures, other constructor functions convert a vector of data values into an instance of the new data type.



 

Code Walk through and Implementation Overview

The 'C' code implementing this BladeLet is to be found in ./src/Period.c, ./src/Period.h, and ./src/DT_Period.h. Several support functions are found in ./src/support.c. To build the shared object, the wad includes a Microsoft Developer Studio project file; ./Period.dsw, and a UNIX makefile in ./UNIX.mak.

A secondary objective of this BladeLet was to provide a generalizable framework for doing a variety of range object; ranges of INTEGER, FLOAT, VARCHAR in addition to DATE and DATETIME. Key to achieving this is the design of the type's comparison sub-system. It turns out that, given the rule that within any range object, finish >= start, comparing the start and the finish of two ranges yields only 18 logically possible outcomes, and that these can be determined entirely using only a Compare() function for the types at either end of the range. In Figure 18 below, we present the list of these outcomes from ./src/Period.h.

#define DT_CMP_ERROR 0  /* Invalid - throw exception    */
#define EQ_EQ_EQ_EQ 1   /* ( 1 -> 1 )   ( 1 -> 1 )  EQ  */
#define EQ_LT_EQ_LT 2   /* ( 1 -> 1 )   ( 1 -> 2 )  LT  */
#define LT_LT_LT_LT 3   /* ( 1 -> 1 )   ( 2 -> 2 )  LT  */
#define EQ_EQ_GT_GT 4   /* ( 1 -> 2 )   ( 1 -> 1 )  GT  */
#define EQ_LT_GT_EQ 5   /* ( 1 -> 2 )   ( 1 -> 2 )  EQ  */
#define EQ_LT_GT_LT 6   /* ( 1 -> 2 )   ( 1 -> 3 )  LT  */
#define LT_LT_EQ_EQ 7   /* ( 1 -> 2 )   ( 2 -> 2 )  LT  */
#define LT_LT_EQ_LT 8   /* ( 1 -> 2 )   ( 2 -> 3 )  LT  */
#define EQ_LT_GT_GT 9   /* ( 1 -> 3 )   ( 1 -> 2 )  GT  */
#define LT_LT_GT_GT 10  /* ( 1 -> 3 )   ( 2 -> 2 )  LT  */
#define LT_LT_GT_EQ 11  /* ( 1 -> 3 )   ( 2 -> 3 )  LT  */
#define LT_LT_GT_LT 12  /* ( 1 -> 3 )   ( 2 -> 4 )  LT  */
#define GT_GT_GT_GT 13  /* ( 2 -> 2 )   ( 1 -> 1 )  GT  */
#define GT_EQ_GT_EQ 14  /* ( 2 -> 2 )   ( 1 -> 2 )  GT  */
#define GT_LT_GT_LT 15  /* ( 2 -> 2 )   ( 1 -> 3 )  GT  */
#define GT_EQ_GT_GT 16  /* ( 2 -> 3 )   ( 1 -> 2 )  GT  */
#define GT_LT_GT_EQ 17  /* ( 2 -> 3 )   ( 1 -> 3 )  GT  */
#define GT_LT_GT_GT 18  /* ( 2 -> 4 )   ( 1 -> 3 )  GT  */

Figure 18:  Set of Logically Possible Comparisons Between Two Legal Range Objects
In Figure 18, the composition of the macro string ("GT_LT_GT_LT", for example) reflects the relationships between the four elements of the two range values; First.Start and Second.Start, First.Start and Second.Finish, First.Finish and Second.Start, and finally First.Finish and Second.Finish. Each relationship can have one of three possible values; LessThan, Equal, or GreaterThan. To clarify the relationship between the ranges, each macro is accompanied by a pair of examples where the start and finish values are taken from the range of INTEGERS, 1 through 4.

Thus, for example, in case number 9, First.Start equals Second.Start, First.Start is less than Second.Finish, First.Finish is greater than Second.Start, and First.Finish is greater than Second.Finish. This situation arises when the pair of ranges involved are of the form ( 1 -> 3 ) and ( 1 -> 2 ).

Different logical relationships between pairs of ranges--Overlap(), Within(), Equal() and so on--can be determined as a mapping from these basic possibilities. Further, sorting operations can be supported by implementing a Compare() for ranges although such a concept has little semantic value. The macro strings in Figure 18 are returned the CompareString( DT_Period, DT_Period ) User-defined function.

Statistics and Selectivity

The Period BladeLet includes statistics gathering and selectivity estimation functionality. Given the nature of the objects being indexed, standard approaches--histograms--are a bit awkward. It is not clear, for example, how one would figure out the average size of an Period in a column given histograms of the start and finish. And using histogram techniques to determine the selectivity of operations like Overlap() and Within() is problematic because the start and finish values are not independent: a necessary assumption before standard probability can be used.
 
 
 
 


Glossary

Terms and acronyms used by this tech note include:
 
Blade or BladeLet Set of semantically related extensions -- types and functions -- to the ORDBMS.
APITB A Pain in the Behind.
COLLECTION Non-first normal form object. That is, a set of data values that can be considered as a single data value for some purposes (variables). COLLECTIONS can also be thought of as small, in-memory, temporary tables for the purpose of querying.
DBA DataBase Administrator.
Iterator An iterator is a special kind of UDF that returns more than one result. Implementing Iterators raises conceptual and engineering difficulties. This Bladelet contains an example of a quite complex Iterator.
User-defined Function (UDF) Module of procedural logic that extends SQL. This Bladelet included UDFs implemented in 'C' and SPL. Through out this document I use the term Routine synonymously with UDF.
SAPI Server Application Programming Interface. This is the set of data structures and 'C' functions used to implement ORDBMS extensions.
SPL Stored Procedure Language. Simple procedural language that can be used to implement UDFs. SPL is much simpler than 'C', but it is not parallelizable, nor as run-time efficient for complex operations. 
Statement Local Variable Variable value returned by reference from a UDF and associated with a name in the query expression. When referring to this as the mechanism for returning more than one value from a UDF it is more common to refer to it as an OUT parameter. 



References

 
Contents

Last updated 31-March-2000.