Implementing and Using Fishhook Relation in Transact SQL

Zoran Horvat
13y
7.5k
0
0

Article

Introduction

One of the common requirements in programming is to implement hierarchy of entities. Anyone can list several examples from own experience: hierarchy of directories, hierarchy of company branches and departments, hierarchy of people employed in a company, etc.

Simplest and widely used method to represent such hierarchies is to assign every instance a reference to the parent instance. When this idea comes to the ground of database programming, it becomes a well-known fishhook relation, i.e. a foreign key which references the same table in which it is declared. For example, in the Employee table, with primary key EmployeeID, foreign key SuperiorEmployeeID would reference EmployeeID column in order to specify the superior person to any given employee. Value NULL would then indicate a rare and joyful case of an employee without anybody positioned above. The name fishhook derives from visual representation of such foreign keys, as shown on the following picture.

foreign keys in T-SQL

This simple design allows application to represent hierarchy of arbitrary size and depth, which is a significant power in many practical applications. Add to this the fact that application can query child entities very simply, by selecting them by parent identity, and it becomes obvious why this solution is so widely used. But now, we come to the point at which we should ask what are particular operations performed on this structure and how these operations can be implemented.

We will first identify requirements that one system operating on hierarchies should meet. Then complete solution on a simple example will be provided.

Requirements

Broadly speaking, application which operates on a hierarchy of entities should be able to construct, maintain and report the hierarchy. If hierarchy is viewed as a tree structure, as it is when fishhook relation is used, then required operations can be precisely defined like this:

Insertion - Adding new item to the tree, specifying parent item under which new one will be inserted. Optionally, child nodes in the tree might be ordered and this operation might also require a position within the child list at which new node should be inserted. In that case, positions of siblings might need to be updated.
Accessing one item - This operation may be defined in many ways. For example, by searching the tree from root towards the required item, and similar. It is up to particular application to define this operation precisely. For example, in directory tree case, a directory full path can be broken down into segments, each segment representing name of one directory in the hierarchy. Then application would traverse the tree from the root towards the child, at each level selecting a node based on corresponding directory name in the path.
Updating one item - Once item is located, it can be changed. This might as well have effect on the tree structure. Application must be aware of pitfalls like attempting to move an item under another item which is actually its own descendant. Such operation would create a loop, which is off course not allowed in the tree structure.
Deleting a node with all its descendants - This is complex operation which requires application to identify all descendants of one node before removing them from the tree. It cannot be implemented by using ON DELETE CASCADE option in database simply because cascade deletion is not allowed on fishhook relation. Consequently, application must traverse the subtree rooted at the target node, deleting all nodes visited in the bottom-up manner, i.e. in such way that any node visited is deleted only after all its descendants have already been deleted.
Reporting the subtree - This is another complex operation which requires application to traverse the subtree in a depth-first order, printing out all nodes visited, so that the output takes a familiar look like directory tree in file system browsers.

In the following sections we will propose a solution which offers all these operations on one simple example.

Proposed Solution

First issue which must be discussed here is this: What is the natural place at which functionality regarding the table with fishhook relation should be implemented? Generally, there are two candidates: one is middle tier and another one is data tier.

In first case database would only provide a crude interface to allow the application to operate on raw data. It is application's obligation in that case to guarantee data integrity, e.g. that there are no loops in the tree. In the second case, when all logic is located in the database, application has no concerns about data integrity. It is all maintained by the database itself and database would provide a high level interface, exposing only complex and well defined operations on the tree.

In this article we will stick to the second solution and here is why. Working with hierarchies is quite complex as can be seen from the list of required operations. Leaving all these complex operations to the application may be risky because data might become corrupt or inconsistent, without possibility for the database to control and prevent the damage to the data. On the other hand, if one tries to allow the database to control the storage in such way that integrity is preserved, logic would be spread both in the application and in the database, which is probably the worst possible design. Furthermore, if multiple applications would desire to access the same database, they would all have to implement this same logic. This is likely to become a serious task, especially if different languages are used to program different components.

All these reasons advocate towards placing the logic into the database. And once that is done, there come the benefits. For example, we can afford some level of modifications of database schema without modifying the applications. Further on, some operations might be implemented using different shortcuts and optimizations that might not be accessible to the application. For example, deleting complete tree could be very difficult from application because it would have to traverse the tree and delete nodes one by one. On the contrary, if this operation is implemented in the database, it could benefit from the knowledge of internal structure of the affected tables and use it for the better. Deletion would then be implemented in two simple steps: in first step parent reference would be set in all nodes to NULL, virtually dispersing the tree into a cluster of unconnected nodes; only then, second step comes and that is to delete all rows in the table. Since parent references have been removed (replaced with NULL), no error would occur in this bulk delete operation and that would be much faster than deleting the nodes one at a time.

Example Solution

Suppose that we are dealing with directory system, each directory being part of the parent directory unless it is the root of the file system. Here is the definition of the table with fishhook relation which can contain the directory structure:

CREATE TABLE Directory
(
      DirectoryID INT PRIMARY KEY IDENTITY(1, 1),
      ParentDirectoryID INT FOREIGN KEY REFERENCES Directory(DirectoryID) ON DELETE NO ACTION,
      Name NVARCHAR(1000) NOT NULL,
      Ordinal INT NOT NULL,                     -- One-based ordinal number of the directory
      Depth INT NOT NULL                              -- Zero-based depth of the directory
);

This table contains name of every directory and reference to its parent. Full path is constructed by iterating through parents up to the file system root, concatenating all directories visited (in the reverse order). Note that every directory has its own position within parent directory. This could resemble order of insertions, lexicographical order, or any other order desired by the application. Depth field represents depth of the directory within directory structure, with value zero indicating root directory.

Basic operation would be to insert new directory into this table. We can create a procedure which receives name of the new directory, reference to parent directory and optional ordinal number. If ordinal number is missing (i.e. NULL), new directory would be added at the end of the list of subdirectories of its parent. To cut the long story short, here is the procedure listing:

CREATE PROCEDURE CreateDirectory
(
      @Name NVARCHAR(1000),
      @ParentDirectoryID INT,
      @Ordinal INT                  -- if NULL, directory will be inserted as last within parent
)
AS
BEGIN
      DECLARE @MaxSiblingOrdinal INT
      DECLARE @Depth INT = 0

      SELECT @MaxSiblingOrdinal = MAX(Ordinal) FROM Directory
      WHERE (ParentDirectoryID = @ParentDirectoryID OR (ParentDirectoryID IS NULL AND @ParentDirectoryID IS NULL))

      IF @ParentDirectoryID IS NOT NULL SELECT @Depth = Depth + 1 FROM Directory WHERE DirectoryID = @ParentDirectoryID
      IF @MaxSiblingOrdinal IS NULL SET @MaxSiblingOrdinal = 0    -- New directory is the first child of its parent
      IF @Ordinal IS NULL OR @Ordinal > @MaxSiblingOrdinal + 1 SET @Ordinal = @MaxSiblingOrdinal + 1 -- @Ordinal was too large; must be set to max+1

      -- Now suppose that @Ordinal comes into the middle - ordinal positions of other siblings must be increased by one

      UPDATE Directory SET Ordinal = Ordinal + 1
      WHERE (ParentDirectoryID = @ParentDirectoryID OR (ParentDirectoryID IS NULL AND @ParentDirectoryID IS NULL)) AND Ordinal >= @Ordinal
      INSERT INTO Directory(Name, ParentDirectoryID, Ordinal, Depth) VALUES (@Name, @ParentDirectoryID, @Ordinal, @Depth)
      RETURN @@IDENTITY
END
GO

Procedure first tries to determine actual ordinal position of new directory. If input parameter is null or larger than existing maximum ordinal value, then it is set so that new directory gets ordinal value by one larger than any other sibling, effectively being pushed to the end of the list of its siblings.

Once ordinal position is determined, other directories within the same parent might require update on their own ordinal positions so that new directory perfectly fits within the uninterrupted sequence of ordinal positions. Finally, new directory is created by inserting a row into the table.

This procedure is a powerful tool to insert new directories into the tree, but it lacks simplicity – its interface is based on knowing the identity of the parent directory. Wouldn't it be better to use backslash-delimited paths instead of directory identities? We can try that approach by defining a procedure which only receives a path to the directory which should be created:

CREATE PROCEDURE CreatePath
(
      @Path NVARCHAR(1000)
)
AS
BEGIN

      DECLARE @ParentDirectoryID INT = NULL
      DECLARE @Name NVARCHAR(1000)
      DECLARE @DelimiterIndex INT
      DECLARE @Ordinal INT
      DECLARE @DirectoryID INT

      WHILE LEN(@Path) > 0
            BEGIN

                  SET @DelimiterIndex = CHARINDEX('\', @Path)
                  IF @DelimiterIndex > 0
                        BEGIN
                              SET @Name = SUBSTRING(@Path, 1, @DelimiterIndex - 1)
                              SET @Path = SUBSTRING(@Path, @DelimiterIndex + 1, LEN(@Path) - @DelimiterIndex)
                        END
                  ELSE
                        BEGIN
                              SET @Name = @Path
                              SET @Path = ''
                        END

                  -- Now check whether the directory already exists
                  SET @DirectoryID = NULL
                  SELECT @DirectoryID = DirectoryID FROM Directory
                  WHERE (ParentDirectoryID = @ParentDirectoryID OR (ParentDirectoryID IS NULL AND @ParentDirectoryID IS NULL)) AND Name = @Name

                  IF @DirectoryID IS NULL
                        BEGIN -- Directory does not exist and must be created
                              -- Now calculate ordinal position so that new directory takes position of the first directory with lexicographically larger name
                              SET @Ordinal = NULL

                              SELECT @Ordinal = MIN(Ordinal) FROM Directory
                              WHERE (ParentDirectoryID = @ParentDirectoryID OR (ParentDirectoryID IS NULL AND @ParentDirectoryID IS NULL)) AND Name >= @Name
                              EXEC @ParentDirectoryID = CreateDirectory @Name, @ParentDirectoryID, @Ordinal
                        END
                  ELSE
                        SET @ParentDirectoryID = @DirectoryID
            END
END
GO

This procedure breaks down the path into segments, each segment representing one directory, and then creates all directories that are missing, including the last one which corresponds with the full path. Approach demonstrated by this procedure is favorable in practice because application is completely relieved from thinking about the directory structure – it rather operates on directory paths and lets the database keep the record. Here is demonstration how simple it is to use such procedure:

EXEC CreatePath 'C:\Temp'
EXEC CreatePath 'C:\Windows\System'
EXEC CreatePath 'C:\Pictures'
EXEC CreatePath 'C:\Program Files\Windows Media Player'
EXEC CreatePath 'C:\Windows\System32\WindowsPowerShell'
EXEC CreatePath 'C:\Windows\System32\wbem'
EXEC CreatePath 'C:\Windows\System32\spool'

This set of calls will actually create 11 directories, as can be seen if content of the Directory table is listed:

Directory table in T-SQL

All directories along the paths have been created, all ordinal positions and depths maintained as new directories were added to the tree. Application was only supposed to perform trivial calls just passing the desired full directory path.

We can go even further and define a utility procedure which converts full directory path into corresponding directory identity (without creating the directory):

CREATE PROCEDURE GetDirectoryID
(
      @Path NVARCHAR(1000)
)
AS
BEGIN
      DECLARE @DirectoryID INT = NULL
      DECLARE @ChildDirectoryID INT
      DECLARE @Name NVARCHAR(1000)
      DECLARE @DelimiterIndex INT

      WHILE LEN(@Path) > 0
            BEGIN
                   SET @DelimiterIndex = CHARINDEX('\', @Path)
                   IF @DelimiterIndex > 0
                        BEGIN
                              SET @Name = SUBSTRING(@Path, 1, @DelimiterIndex - 1)
                              SET @Path = SUBSTRING(@Path, @DelimiterIndex + 1, LEN(@Path) - @DelimiterIndex)
                        END
                  ELSE
                        BEGIN
                              SET @Name = @Path
                              SET @Path = ''
                        END

                  -- Now check whether the directory already exists
                  SET @ChildDirectoryID = NULL

                  SELECT @ChildDirectoryID = DirectoryID FROM Directory
                  WHERE (ParentDirectoryID = @DirectoryID OR (ParentDirectoryID IS NULL AND @DirectoryID IS NULL)) AND Name = @Name

                  IF @ChildDirectoryID IS NULL
                        BEGIN
                              SET @Path = ''    -- This will cause loop to exit and procedure to fail by returning negative value
                              SET @DirectoryID = NULL
                        END
                  ELSE
                        SET @DirectoryID = @ChildDirectoryID
            END
      IF @DirectoryID IS NULL SET @DirectoryID = -1
      RETURN @DirectoryID
END
GO

Procedure GetDirectoryID operates similarly to CreatePath procedure, by splitting the path into segments and walking along the hierarchy of directories until the last child is found which corresponds with the full path. GetDirectoryID procedure may fail, if required directory does not exist, and then it would simply return negative value instead of regular directory identity. This procedure is then used as a utility to convert paths into identities, so that application may never need to remember directory identity as long as it remembers paths, which are meaningful to the user and more readable.

Now comes the real piece of work, and that is the procedure which deletes the subtree.

CREATE PROCEDURE DeleteDirectoryRecursive
(
      @Path NVARCHAR(1000)
)
AS
BEGIN
      DECLARE @stack TABLE(Position INT, DirectoryID INT)   -- Table emulating stack used to traverse child documents depth-first
      DECLARE @stackPos INT = 0                                         -- Number of elements on stack; zero indicates that stack is empty
      DECLARE @curDirectoryID INT                                       -- Identity of the directory which is currently on top of the stack
      DECLARE @curChildDirectoryID INT                            -- Identity of the child document which is currently being pushed to the stack
      DECLARE @parentDirectoryID INT                                    -- Identity of the parent document of @DocumentID
      DECLARE @ordinal INT                                              -- Ordinal number of the deleted directory, used to update ordinal numbers of siblings after directory is deleted
      DECLARE @directoryID INT

      EXEC @directoryID = GetDirectoryID @Path
      IF @directoryID > 0
      BEGIN

            SET @stackPos = @stackPos + 1
            INSERT INTO @stack(Position, DirectoryID) VALUES (@stackPos, @directoryID)
            SELECT @parentDirectoryID = ParentDirectoryID, @ordinal = Ordinal FROM Directory WHERE DirectoryID = @directoryID
      END
      WHILE (@stackPos > 0)
            BEGIN -- Repeat while stack is not empty
                  SELECT @curDirectoryID = DirectoryID FROM @stack WHERE Position = @stackPos
                  IF EXISTS (SELECT * FROM Directory WHERE ParentDirectoryID = @curDirectoryID)
                        BEGIN
                              DECLARE ChildrenCursor CURSOR FOR
                              SELECT DirectoryID FROM Directory WHERE ParentDirectoryID = @curDirectoryID
                              OPEN ChildrenCursor
                              FETCH NEXT FROM ChildrenCursor INTO @curChildDirectoryID

                              WHILE @@FETCH_STATUS = 0
                                    BEGIN -- Iterate through all child directories and push them to stack

                                       -- Now push new descendant DirectoryID to the stack
                                       SET @stackPos = @stackPos + 1
                                       INSERT INTO @stack(Position, DirectoryID) VALUES (@stackPos, @curChildDirectoryID)
                                       FETCH NEXT FROM ChildrenCursor INTO @curChildDirectoryID
                                    END
                              CLOSE ChildrenCursor
                              DEALLOCATE ChildrenCursor
                        END
                  ELSE
                        BEGIN
                               -- Directory on top of the stack does not contain children, so it can be deleted
                              DELETE FROM Directory WHERE DirectoryID = @curDirectoryID

                              DELETE FROM @stack WHERE Position = @stackPos
                              SET @stackPos = @stackPos - 1
                        END
             END
      IF @directoryID IS NOT NULL
            BEGIN -- This indicates that directory has been deleted, and now ordinal positions of its siblings must be updated

                  UPDATE Directory SET Ordinal = Ordinal - 1
                  WHERE (ParentDirectoryID= @parentDirectoryID OR (ParentDirectoryID IS NULL AND @parentDirectoryID IS NULL)) AND Ordinal > @ordinal
            END
END
GO

Please pay attention to this procedure, as it demonstrates the core functionality of the database when working with fishhook relation. We have created a table variable which serves as the stack. Then tree nodes are iterated depth-first using this stack. Whenever a directory is reached which contains children, its children are pushed to the stack. Only when directory on top of the stack has no children can it be popped and deleted from the table. This order of steps guarantees that only rows without referencing descendants will be deleted from the Directory table, which is the required condition for deletion to be successful – otherwise we would only get the error from the database.

With similar ideas behind the code we can create one interesting function:

CREATE FUNCTION GetDirectoryGlobalOrdinalRecursive
(
      @RootDirectoryID INT
)
RETURNS @retval Table(DirectoryID INT, GlobalOrdinal INT)
AS
BEGIN
      DECLARE @stack TABLE(Position INT, DirectoryID INT)   -- Stack used to iterate the documents subtree depth first
      DECLARE @stackPos INT = 0                                         -- Total number of items on stack; NULL indicates that stack is empty
      DECLARE @curDirectoryID INT                                       -- Identity of the directory which is currently being pushed to the output
      DECLARE @curGlboalOrdinal INT = 0                           -- Value of GlobalOrdinal which was assigned to last directory visited
      DECLARE @curChildDirectoryID INT                            -- Identity of the child directory which is currently being visited

      IF @RootDirectoryID IS NULL OR EXISTS (SELECT * FROM Directory WHERE DirectoryID = @RootDirectoryID)
            BEGIN

                  SET @stackPos = @stackPos + 1
                  INSERT INTO @stack(Position, DirectoryID) VALUES (@stackPos, @RootDirectoryID)

                  WHILE @stackPos > 0
                        BEGIN
                              -- Pop directory from stack
                              SELECT @curDirectoryID = DirectoryID FROM @stack WHERE Position = @stackPos
                              SET @stackPos = @stackPos - 1

                              IF @curDirectoryID IS NOT NULL
                                    BEGIN -- @curDirectoryID may be null if all directories are listed
                                          SET @curGlboalOrdinal = @curGlboalOrdinal + 1
                                          INSERT INTO @retval (DirectoryID, GlobalOrdinal) VALUES (@curDirectoryID, @curGlboalOrdinal)
                                    END

                              DECLARE ChildrenCursor CURSOR FOR
                              SELECT DirectoryID
                              FROM Directory
                              WHERE ParentDirectoryID = @curDirectoryID OR (ParentDirectoryID IS NULL AND @curDirectoryID IS NULL)
                              ORDER BY Ordinal DESC
                              OPEN ChildrenCursor

                              FETCH NEXT FROM ChildrenCursor INTO @curChildDirectoryID

                              WHILE @@FETCH_STATUS = 0
                              BEGIN -- Push all child directories to stack
                                    SET @stackPos = @stackPos + 1
                                    INSERT INTO @stack (Position, DirectoryID) VALUES (@stackPos, @curChildDirectoryID)
                                    FETCH NEXT FROM ChildrenCursor INTO @curChildDirectoryID
                              END

                              CLOSE ChildrenCursor
                              DEALLOCATE ChildrenCursor
                        END
      END
      RETURN
END
GO

This function starts from the directory and then produces a table containing identities of that directory and all its descendants, plus one more column giving global order of directories listed. This function actually sorts the subtree into a list of directories. The result might become clear when function is used. We will first define a view which uses it:

CREATE VIEW DirectoryOrderV AS
SELECT Directory.*, G.GlobalOrdinal FROM
Directory INNER JOIN
GetDirectoryGlobalOrdinalRecursive(NULL) AS G ON Directory.DirectoryID = G.DirectoryID

This view simply adds directory fields to the function's output. Columns returned by the view are actually all columns from the Directory table plus one column indicating global order among directories. Even without added column, it is a good idea to define a view which simply calls the function. This is important because all applications can read from the view, but not every application is equipped to invoke table returning functions. We can use this view to query the Directory table now:

SELECT * FROM DirectoryOrderV ORDER BY GlobalOrdinal

Output of this query is:

Query output in sql

This query shows all the power of keeping logic in the database. With only a single SQL statement application can get hold of globally sorted directory tree, a result set which it could obtain only with a price of some programming. With help of such views and procedures, applications can become very simple, yet fast and rich in features.

Conclusion

We have shown that database can contain all logic regarding any hierarchy without too much effort. Solution obtained in that way is portable in terms of application, i.e. application can be rewritten in other languages without multiplying the functionality in each implementation and without requiring database changes.

For example, only a couple of lines of C# code are required to print out complete directory structure to the console:

SqlCommand cmd = new SqlCommand("SELECT Name, Depth FROM DirectoryOrderV ORDER BY GlobalOrdinal", conn);

using (SqlDataReader reader = cmd.ExecuteReader())
{
    while (reader.Read())
    Console.WriteLine("{0}{1}", new string(' ', 2 * (int)reader["Depth"]), (string)reader["Name"]);
}

This piece of code produces nicely indented output:

C:
    Pictures
    Program Files
        Windows Media Player
    Temp
    Windows
        System
        System32
            spool
            wbem
            WindowsPowerShell

Without all logic already implemented in the database, application would spread over dozens of lines and, even worse, it might need to communicate with database many times to list the complete subtree, which is inefficient compared to stored procedures and functions which are executed in one call.

It would require quite similar programming effort to fill the ASP.NET SiteMap control or Windows Forms TreeView control. With logic left to the application, all these uses might require separate implementations of the same functionality.

Desktop and Web programmers often tend to minimize amount of logic in the data layer and to put all the pressure to the application. That is often wrong because more optimal and more natural solutions can be implemented in the database. Cause for the logic misplacement is often found in insufficient knowledge about the database concepts, rather than in some profound understanding which would imply that particular logic should be placed in application. Solution to the problem is obviously in getting more acquainted with database technologies, so that their full power can be used for the better of the application and its users.

Up Next

Ebook Download

View all

SQL Queries For Beginners

Read by 5.2k people

Download Now!

Learn

View all