Symbolic Hooks Part 4: The App Container Traverse-ty

After getting the driver in Part 3 of our blog to load and adding a DbgPrintEx statement in our hook, we managed to get all the paths that were being opened without crashing the machine. We got really excited thinking we were done. But as soon as we clicked on the Start Menu, we noticed things had gone awry – it wasn’t starting up at all, and when we launched Process Monitor from SysInternals, we could see ShellExperienceHost.exe crashing. We tried other applications, which ran fine but still, the machine was pretty much unusable. So, we relaunched our IDA and WinDbg and went hunting for more bugs.

As we were playing around, we noticed that another process that wasn’t working was the new Windows Calculator. We then launched it in the debugger, taking advantage of the fact that the WinDbg Preview on the Microsoft Store can now easily launch Application Packages (which is needed, since Calc.exe is now simply a launcher for the real Calculator.exe — which essentially just does a ShellExecute of calculator://). Unfortunately, as soon as the debugger “attached”, the process had already died. This is usually a sign of a loader issue – such as an import library not being present, failing to load, or missing some required import.

A really useful way to debug such issues is to enable “Loader Snaps”, which is a Windows debugging feature that leverages “Global Flags”. These flags are set either in the kernel (nt!NtGlobalFlag — and recently, nt!NtGlobalFlag2) or in user-space, in the Process Environment Block (PEB) of every process, as Peb->NtGlobalFlag (or again, recently, Peb->NtGlobalFlag2 as well). You can enable system-wide global flags (either kernel or user ones) as well as per-process global flags through a handy utility that ships with the Windows Debugging Tools, unoriginally also called Global Flags (Gflags.exe). In the screenshot below, you can see how we enabled this debugging feature for Calculator.exe

Loader snaps instruct the loader to print out short debugging messages (“snaps”) which trace all parts of the link-loading process: import resolution, DLL loading, manifest file parsing, SxS redirection, even down to calls of GetProcAddress. Our thought was to launch Calculator with snaps enabled, with and without our driver, and then do a simple diff between the two debugger outputs.
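As a rough sketch of what the loader's check amounts to: "Show loader snaps" is bit 0x2 (FLG_SHOW_LDR_SNAPS) of the global flags value. The helper name below is hypothetical, and it takes a copy of the flags as a parameter instead of reading a live PEB.

```c
#include <stdbool.h>
#include <stdint.h>

/* FLG_SHOW_LDR_SNAPS: the "Show loader snaps" bit of NtGlobalFlag. */
#define FLG_SHOW_LDR_SNAPS 0x00000002u

/*
 * Hypothetical helper: given a copy of Peb->NtGlobalFlag (or of a
 * per-image GlobalFlag value), report whether loader snaps are enabled.
 */
bool loader_snaps_enabled(uint32_t nt_global_flag)
{
    return (nt_global_flag & FLG_SHOW_LDR_SNAPS) != 0;
}
```

Any other bits set in the value are ignored by this check, which is why the flag can be combined freely with other global flags.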

The first thing we noticed was that when running with our driver we get this interesting error, right as soon as the process starts:

LdrpInitializeProcess - ERROR: Initializing the current directory to "C:\Program Files\WindowsApps\Microsoft.WindowsCalculator_10.1910.0.0_x64__8wekyb3d8bbwe\" failed with status 0xc0000022

This tells us that initializing the current application’s directory failed with error 0xC0000022, which is the NTSTATUS code for STATUS_ACCESS_DENIED. Normally, you’d expect an application to have access to its own local directory, so this was already unusual. We searched for this error in the rest of the log, and compared with the log of our run without our symbolic link hook:

The log shows that we see this error a few times, both when initializing the process, as well as when loading certain DLLs, whenever our driver is hooking the C: volume, but we don’t see it at all when running without our hook.
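To make the status value in that loader message concrete, here is a small sketch that splits an NTSTATUS into its fields. The function names are ours, not part of any Windows header; only the bit layout and the NT_SUCCESS-style sign test follow the documented format.

```c
#include <stdint.h>

/*
 * NTSTATUS layout: bits 31-30 are the severity, bits 27-16 the facility,
 * and bits 15-0 the code. The helper names here are illustrative only.
 */
unsigned ntstatus_severity(uint32_t status) { return status >> 30; }
unsigned ntstatus_facility(uint32_t status) { return (status >> 16) & 0x0FFFu; }
unsigned ntstatus_code(uint32_t status)     { return status & 0xFFFFu; }

/*
 * Same idea as the NT_SUCCESS() macro: success and informational values
 * have the top bit clear, i.e. they are non-negative as a signed 32-bit.
 */
int ntstatus_is_success(uint32_t status)
{
    return (status & 0x80000000u) == 0;
}
```

For 0xC0000022 this gives severity 3 (error), facility 0 and code 0x22, while a value such as STATUS_REPARSE (0x00000104) counts as a success.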

At first, we were also puzzled as to why only certain DLLs were failing to load – then we realized most of the libraries needed by Calculator are “known DLLs”. This is a special optimization Windows does wherein Smss.exe pre-maps these libraries at boot and caches their section objects in the \KnownDlls namespace of the Object Manager. The LoadLibrary API (more strictly speaking, its LdrLoadDll implementation in the loader) has special logic to always look for DLLs in this namespace first, and only try accessing the file system if it cannot find them there.
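The lookup order can be modelled with a toy sketch like the one below. This is not the loader's real code: the table contents, the type and the function names are all made up for illustration, and the real loader works with section objects, not strings.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { FROM_KNOWN_DLLS, FROM_FILE_SYSTEM } dll_source;

/* Illustrative stand-in for the contents of \KnownDlls. */
const char *const sample_known_dlls[] = { "kernel32.dll", "user32.dll" };
#define SAMPLE_KNOWN_DLL_COUNT \
    (sizeof(sample_known_dlls) / sizeof(sample_known_dlls[0]))

/* Case-insensitive comparison of two NUL-terminated names. */
static int equal_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    return *a == *b;
}

/*
 * Known DLLs are satisfied from the cached sections and never touch the
 * file system; everything else falls through to a real file open, which
 * is where an access check (and our hook) comes into play.
 */
dll_source resolve_dll(const char *name,
                       const char *const *known,
                       size_t known_count)
{
    for (size_t i = 0; i < known_count; i++) {
        if (equal_ignore_case(name, known[i])) {
            return FROM_KNOWN_DLLS;
        }
    }
    return FROM_FILE_SYSTEM;
}
```

In this model only the libraries that fall through to FROM_FILE_SYSTEM would ever see the access denied error, which matches what the logs showed.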

So, it seems like we found a hint to our problem, but what is the cause? We investigated Calculator’s token to try and notice anything that our hook might’ve affected, which could’ve led to this access denied error:

Although the token looked the same with, and without, our driver, we did notice something obvious in hindsight – this process is running under an app container (Windows 8’s new sandboxing technology) – as does the Start Menu, and each one of the other applications which were failing to execute! This made a lot of sense, as Microsoft is moving more and more applications into their new sandboxing model.

At first, we thought this would affect all Microsoft Store applications, but that assumption was broken when WinDbg Preview launched fine. Now the reason made sense: it’s a “Centennial” App, meaning it’s a Windows Desktop Bridge UWP Application – and runs with full, regular privileges.

Well then, what’s special about an app container? Among many other security restrictions, the default “traverse” checks which are performed by the I/O Manager work a little differently. You see, one of the things most people don’t think about, is that to access a path such as c:\windows\system32\spool\drivers\colors\foo.dat, the Windows ACL-based security model technically dictates that “c:”, “windows”, “system32”, “spool”, “drivers”, and “colors” should all be opened one by one, and that the current application’s token should be validated for FILE_TRAVERSE access to the directory. This is not only expensive but would also fail for privileged paths which contain user-accessible locations (such as this very example).

To solve this, Windows, by default, grants all users (even Guests!) the SeChangeNotifyPrivilege, which is a strange name for the “bypass traverse checking” privilege. As the name suggests, this causes the I/O Manager to bypass these expensive, likely failing, checks, and directly skip to checking the ACL of only the underlying file being accessed. And since app containers do run with a regular user token, even they get this privilege, as shown in the token screenshot above.
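The difference between the two behaviours can be sketched with a small user-mode model. Everything here is hypothetical (the structure, the function name, and the permissions in the sample path); it only illustrates per-directory FILE_TRAVERSE checks versus checking the final file alone.

```c
#include <stdbool.h>
#include <stddef.h>

/* One path element and what the caller's token would be granted on it. */
typedef struct {
    const char *name;
    bool grants_traverse;  /* FILE_TRAVERSE allowed on this directory */
    bool grants_access;    /* requested access allowed (last element only) */
} path_component;

/*
 * Made-up permissions for the example path from the text: one
 * intermediate directory denies traverse, the final file allows access.
 */
const path_component sample_path[] = {
    { "c:",       true,  false },
    { "windows",  true,  false },
    { "system32", true,  false },
    { "spool",    false, false },
    { "drivers",  true,  false },
    { "colors",   true,  false },
    { "foo.dat",  false, true  },
};
#define SAMPLE_PATH_COUNT (sizeof(sample_path) / sizeof(sample_path[0]))

/*
 * Without the bypass privilege every directory on the way must grant
 * traverse; with it, only the final element's ACL is consulted.
 */
bool can_open(const path_component *path, size_t count, bool bypass_traverse)
{
    if (count == 0) {
        return false;
    }
    if (!bypass_traverse) {
        for (size_t i = 0; i + 1 < count; i++) {
            if (!path[i].grants_traverse) {
                return false;
            }
        }
    }
    return path[count - 1].grants_access;
}
```

With the sample data, the open succeeds when the bypass privilege is honoured and fails as soon as the full walk is enforced, even though the file itself is accessible.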

However, this may not be the desired behaviour for other types of device objects – remember that the I/O manager doesn’t only gate access to partitions, but all sorts of other virtual devices too, such as \Device\Afd to bring up one example, which represents Sockets, or \Device\NamedPipe, which is used for their namesake. Within these devices, there are internal paths as well, such as \Device\NamedPipe\SomePipeName. Because app containers are meant to provide strong security boundaries, the I/O manager implements a function, IopDoFullTraverseCheck, which we show below, in order to enforce certain restrictions:

As you can see, for user-mode callers, a full traverse check will always be done for an app container unless the device object has the FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL flag set. This helper routine is called deep in the guts of IopParseDevice, a function which we already talked about in Part 3 of this blog series and which we had referenced the ReactOS source code for. Unfortunately, as app container-related logic is new to Windows 8, ReactOS can’t offer much help here, so we’ll have to go back to IDA. In the same if branch where the VolumeOpen checks are eventually done (which caused the crash in Part 2), we can now see some additional code, which we’ve reversed and shown below:

The check at line 307 is what calls the helper function shown earlier, which then results in a full SeAccessCheck being done by IopCreateSecurityCheck. As a side note, if you’d like to read some great research on these checks, and some of the abuses around bypassing them, James Forshaw has a great presentation at NullCon 2019 which you should read over.
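For readers without the decompilation at hand, the rule described above can be approximated as follows. This is our simplified model, not the actual body of IopDoFullTraverseCheck: the function name and parameters are hypothetical, and the fallback for callers outside an app container (honouring the bypass traverse privilege) is an assumption on our part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Device characteristic that lets app containers skip the full check. */
#define FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL 0x00020000u

/*
 * Simplified model of the decision:
 *  - kernel-mode callers never get a full traverse check;
 *  - app container callers always do, unless the device opted out;
 *  - other user-mode callers depend on the bypass traverse privilege
 *    (assumption made for this sketch).
 */
bool needs_full_traverse_check(bool user_mode,
                               bool is_app_container,
                               bool has_bypass_traverse,
                               uint32_t device_characteristics)
{
    if (!user_mode) {
        return false;
    }
    if (is_app_container &&
        (device_characteristics &
         FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL) == 0) {
        return true;
    }
    return !has_bypass_traverse;
}
```

A device created with characteristics of 0 forces the full check for every app container caller, while one carrying 0x00020000 does not.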

In our case, this failed, because the ACL for \Device\HarddiskVolume0 does not give FILE_TRAVERSE to the Calculator Package SID (or the ALL_APPLICATION_PACKAGES SID). While we could certainly add this, it would amount to a hack – the correct fix, which is what \Device\HarddiskVolume3 itself has (our original partition device object), is to add the FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL flag when we call IoCreateDevice. Note the debugger output below that compares our device with the real device:

lkd> !devobj \Device\HarddiskVolume3
Device object (ffffdd0602606b90) is for:
HarddiskVolume3 \Driver\volmgr DriverObject ffffdd05ffd25e30
Current Irp 00000000 RefCount 16457 Type 00000007 Flags 00001150
Vpb ffffdd0602854e00 SecurityDescriptor ffffc80824a024e0 DevExt ffffdd0602606ce0 DevObjExt ffffdd0602606ea8 Dope ffffdd0602854620 DevNode ffffdd0602607bd0
ExtensionFlags (0x00000800) DOE_DEFAULT_SD_PRESENT
Characteristics (0x00020000) FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL
AttachedDevice (Upper) ffffdd060285e030 \Driver\fvevol
Device queue is not busy.

lkd> !devobj \Device\HarddiskVolume0
Device object (ffffdd0602db87b0) is for:
HarddiskVolume0 \Driver\symlink DriverObject ffffdd05fd9e64e0
Current Irp 00000000 RefCount 0 Type 00000022 Flags 00000040
SecurityDescriptor ffffc808248a1aa0 DevExt 00000000 DevObjExt ffffdd0602db8900
ExtensionFlags (0x00000800) DOE_DEFAULT_SD_PRESENT
Characteristics (0000000000)
Device queue is not busy.

So, while you all had to read another six pages of ranting, the only line of code that we had to fix is our call to IoCreateDevice:

status = IoCreateDevice(DriverObject,
                        0,
                        &g_DeviceName,
                        FILE_DEVICE_UNKNOWN,
-                       0,
+                       FILE_DEVICE_ALLOW_APPCONTAINER_TRAVERSAL,
                        FALSE,
                        &g_DeviceObject);

Well, there you have it! With this small fix, our hook driver now perfectly works on all the systems we’ve tested it on (about most of you, given the ATMFD RCE we’ve been using to deploy the driver). The final version is now posted on our GitHub here. Thanks a lot for reading!

Symbolic Hooks Part 3: The Remainder Theorem

We ended the second part with, unsurprisingly, a bugcheck. We tried to redirect all access to the C: volume to our device in order to get information about all the paths that are being accessed, but the first time anyone tried opening the C: volume itself, the I/O manager threw a DRIVER_RETURNED_STATUS_REPARSE_FOR_VOLUME_OPEN blue screen at us.

Unfortunately, we can’t return any other status code than STATUS_REPARSE or the path will not be parsed properly and a lot of things will break in the system as our fake device now becomes the “file system” of this poor path. But what if we could find a way to never have to return STATUS_REPARSE for volume opens, because we never see a volume open to begin with?

First, we should probably understand what it means to have a volume open. While based on the ancient Windows Server 2003 code base, ReactOS can offer a clue here — as it contains the exact same behavior in IopParseDevice:


//
// In case we override checks, but got this on volume open, fail hard
//

if (OpenPacket->Override != FALSE)
{
    KeBugCheckEx(DRIVER_RETURNED_STATUS_REPARSE_FOR_VOLUME_OPEN,
                 (ULONG_PTR)OriginalDeviceObject,
                 (ULONG_PTR)DeviceObject,
                 (ULONG_PTR)CompleteName,
                 OpenPacket->Information);
}

We can see that Override is set at this line underneath the following if statement:

//
// Now check if we need access checks
//

if (((AccessMode != KernelMode) || (OpenPacket->Options & IO_FORCE_ACCESS_CHECK)) &&
    ((OpenPacket->RelatedFileObject == NULL) || (VolumeOpen != FALSE)) &&
    (OpenPacket->Override == FALSE))
{

This leaves us with the final question — how does VolumeOpen become TRUE? This line provides the answer:


//
// Check if this is a volume open
//

if ((OpenPacket->RelatedFileObject != NULL) &&
    (OpenPacket->RelatedFileObject->Flags & FO_VOLUME_OPEN) &&
    (RemainingName->Length == 0))
{
    //
    // It is
    //
    VolumeOpen = TRUE;
}

In other words, if a file object is being opened on top of an existing file object that represents a volume, and this new file object doesn’t have a RemainingName, then we are directly opening the volume represented by RelatedFileObject itself. This is exactly what happens when we open C:.

James Forshaw provided us with an interesting idea – what if we could make it so that our device never receives a path that’s seen as a volume open by the I/O manager? In other words, what if RemainingName would never be 0?

James’ suggestion was pretty simple. Instead of redirecting the symlink through the callback to \Device\HarddiskVolume0 (the name of our device), we’ll redirect it to \Device\HarddiskVolume0\Foo. That way, all paths reaching our device will start with \Foo, and none of them will be treated by the I/O Manager as a volume open, so returning a STATUS_REPARSE should not present any issues. We’ll just need to remove this suffix from the path and set the file name to the correct string before returning.
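A tiny user-mode model of the quoted condition shows why the extra path component helps. The structure and function names are ours; only the three-part test mirrors the ReactOS snippet, and the FO_VOLUME_OPEN value is copied from wdm.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* File object flag marking an open of the volume itself (wdm.h value). */
#define FO_VOLUME_OPEN 0x00400000u

/* Minimal stand-in for FILE_OBJECT: only the flags matter here. */
typedef struct {
    uint32_t flags;
} model_file_object;

/* A related file object that represents an opened volume. */
const model_file_object sample_volume = { FO_VOLUME_OPEN };

/*
 * Mirrors the quoted check: a related file object that is itself a
 * volume open, combined with an empty remaining name.
 */
bool is_volume_open(const model_file_object *related,
                    size_t remaining_name_bytes)
{
    return related != NULL &&
           (related->flags & FO_VOLUME_OPEN) != 0 &&
           remaining_name_bytes == 0;
}
```

Once every path that reaches the device carries \Foo (four WCHARs, so 8 bytes), the remaining name can never be empty and the third condition never holds.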

First, we define our suffix:

DECLARE_GLOBAL_CONST_UNICODE_STRING(g_TailName, L"\\Foo");

And when defining the Device Object name that we want the symbolic link callback to return, we now append this string in the DriverEntry:

RtlAppendUnicodeStringToString(&g_DeviceName, &g_TailName);

Finally, we make some changes to our IRP_MJ_CREATE handler. First, the final name buffer no longer needs space for the \Foo suffix that is present in the original file name:

//
// Allocate space for the original device name, plus the size of the
// file name, minus "\Foo", and adding space for the terminating NUL.
//
bufferLength = fileObject->FileName.Length -
               g_TailName.Length +
               g_LinkPath.Length +
               sizeof(UNICODE_NULL);

And then, we must skip past the suffix when concatenating the file name:

//
// Then add the name of the file name, minus "\Foo"
//
NT_VERIFY(NT_SUCCESS(RtlStringCbCatNW(buffer,
                                      bufferLength,
                                      fileObject->FileName.Buffer +
                                      (g_TailName.Length / sizeof(g_TailName.Buffer[0])),
                                      fileObject->FileName.Length -
                                      g_TailName.Length)));

That’s pretty much it! With these simple changes, the driver should no longer crash. However, there’s still a subtle bug here: while our symbolic link callback will guarantee that there’s always a \Foo present, there are other ways that our IRP_MJ_CREATE handler could be reached: if someone directly attempts to open \Device\HarddiskVolume0 from the kernel or with a native API. One such example is the WinObjEx64 tool from hfiref0x: when we double-clicked on our device object, we immediately crashed. So let’s be safe, and simply prohibit direct opens of our device, which would not have the required \Foo suffix, by adding one last block:

//
// If this is someone directly trying to access our device object,
// fail them, so that we do not crash the system (since we should
// not reparse direct opens).
//
if (fileObject->FileName.Length < g_TailName.Length)
{
    status = STATUS_ACCESS_DENIED;
    goto Exit;
}
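Putting the three changes together, the name rewriting can be tried out in user mode with a sketch like this one. It is a model, not the driver code: it uses NUL-terminated wchar_t strings and malloc instead of counted UNICODE_STRINGs and pool allocations, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* The marker that the symbolic link callback appends to our device name. */
#define TAIL_NAME L"\\Foo"

/*
 * Builds "<link_path><file_name minus the leading \Foo>".
 * Returns NULL for names shorter than the marker (a direct open of the
 * device) or on allocation failure. The caller frees the result.
 * Like the length check above, this does not verify that the prefix
 * really is "\Foo".
 */
wchar_t *build_reparse_name(const wchar_t *link_path, const wchar_t *file_name)
{
    size_t tail_len = wcslen(TAIL_NAME);
    size_t file_len = wcslen(file_name);
    size_t link_len = wcslen(link_path);

    if (file_len < tail_len) {
        return NULL;
    }

    /* Original device name + remaining path + terminating NUL. */
    size_t total = link_len + (file_len - tail_len) + 1;
    wchar_t *buffer = malloc(total * sizeof(wchar_t));
    if (buffer == NULL) {
        return NULL;
    }

    memcpy(buffer, link_path, link_len * sizeof(wchar_t));
    /* Copy everything after the marker, including the NUL. */
    memcpy(buffer + link_len,
           file_name + tail_len,
           (file_len - tail_len + 1) * sizeof(wchar_t));
    return buffer;
}
```

For example, \Foo\Windows\Notepad.exe with a link path of \Device\HarddiskVolume3 becomes \Device\HarddiskVolume3\Windows\Notepad.exe, while an empty name is rejected.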

We loaded our new driver, which you can now find on our GitHub repository here, and this time, we got all the paths that were accessed in the C: volume, and no machine crashes! We celebrated our victory with a drink, then another, and another. And then we noticed things on the machine didn’t work too well. Processes such as Calculator wouldn’t run, the Start Menu refused to show up, and pretty soon the machine was basically unusable. So we had another drink to handle this additional failure, and passed out.

We eventually did figure out this issue during a long flight, but that story will be told in part 4. We promise that’s the last part. It all worked afterward, which is why none of your machines are showing any symptoms.

Symbolic Hooks Part 2 : Getting the Target Name

In our last blog part, we concluded with a working callback, but no information about the path being opened. Of course, we could get it from the stack since it should be saved there somewhere, but we thought there must be a more elegant way. We also wanted to avoid writing a book on Unwind Opcodes and how they can be used to recover stack parameters efficiently.

And so, we decided to go down a different path, and come up with a way to force our own sort of parse routine to execute, in which we could get the original path, and take a decision as to whether or not to redirect the caller. Two options came to mind:

  • We could create a new object type with ObCreateObjectTypeEx, implement our own ParseRoutine, and have the symlink redirect to an object of our type so that we can have our routine return STATUS_REPARSE, with the name of the original target Device Object.
  • We could create a new Device Object with IoCreateDevice, implement our own IRP_MJ_CREATE handler, and have it use the I/O Manager’s existing reparsing logic (which it calls transmogrification) so that we can return STATUS_REPARSE and a new name for any File Object it creates, which would re-direct it to the original target Device Object.

Ultimately, creating a new object type is undocumented, monitored by Patch Guard if we make any wrong moves, and, most importantly, does not have a matching API to undo/destroy the operation. Yep, there is no way to delete an object type, thus our driver would never be able to unload.

Therefore, we decided to have our symlink callback redirect the symbolic link to a Device Object we will create, instead of returning the original string. Then, when our Device Object’s IRP_MJ_CREATE handler is called, the I/O Manager has already created a File Object, and we can get it from the IRP, retrieve its name, plus any other information about it and the creator/caller.

Thus, we first create our device – \Device\HarddiskVolume0. Next, we get the symbolic link to the C: volume the same way we showed in Part 1, and modify it to point to our callback as the LinkTarget. Then, we must make only one change: instead of passing the original link target string as a parameter in SymlinkContext, we pass in the path of our new device:

_Use_decl_annotations_
NTSTATUS
DriverEntry (
    _In_ PDRIVER_OBJECT DriverObject,
    _In_ PUNICODE_STRING RegistryPath
    )
{
    NTSTATUS status;
    HANDLE symLinkHandle;
    DECLARE_CONST_UNICODE_STRING(symlinkName, L"\\GLOBAL??\\c:");
    OBJECT_ATTRIBUTES objAttr = RTL_CONSTANT_OBJECT_ATTRIBUTES(&symlinkName,
                                                               OBJ_KERNEL_HANDLE |
                                                               OBJ_CASE_INSENSITIVE);
    UNREFERENCED_PARAMETER(RegistryPath);

    //
    // Make sure our alignment trick worked out
    //
    if (((ULONG_PTR)SymLinkCallback & 0xFFFF) != 0)
    {
        status = STATUS_CONFLICTING_ADDRESSES;
        DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                   DPFLTR_ERROR_LEVEL,
                   "Callback function not aligned correctly!\n");
        goto Exit;
    }

    //
    // Set an unload routine so we can update during testing
    //
    DriverObject->DriverUnload = DriverUnload;

    //
    // Open a handle to the symbolic link object for C: directory,
    // so we can hook it
    //
    status = ZwOpenSymbolicLinkObject(&symLinkHandle,
                                      SYMBOLIC_LINK_ALL_ACCESS,
                                      &objAttr);
    if (!NT_SUCCESS(status))
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                   DPFLTR_ERROR_LEVEL,
                   "Failed opening symbolic link with error: %lx\n",
                   status);
        goto Exit;
    }

    //
    // Get the symbolic link object and close the handle since we
    // no longer need it
    //
    status = ObReferenceObjectByHandle(symLinkHandle,
                                       SYMBOLIC_LINK_ALL_ACCESS,
                                       NULL,
                                       KernelMode,
                                       (PVOID*)&g_SymLinkObject,
                                       NULL);
    ObCloseHandle(symLinkHandle, KernelMode);
    if (!NT_SUCCESS(status))
    {
        DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                   DPFLTR_ERROR_LEVEL,
                   "Failed referencing symbolic link with error: %lx\n",
                   status);
        goto Exit;
    }

    //
    // Create our device object hook
    //
    RtlAppendUnicodeToString(&g_DeviceName, L"\\Device\\HarddiskVolume0");
    status = IoCreateDevice(DriverObject,
                            0,
                            &g_DeviceName,
                            FILE_DEVICE_UNKNOWN,
                            0,
                            FALSE,
                            &g_DeviceObject);
    if (!NT_SUCCESS(status))
    {
        //
        // Fail, and drop the symlink object reference
        //
        ObDereferenceObject(g_SymLinkObject);
        DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                   DPFLTR_ERROR_LEVEL,
                   "Failed create devobj with error: %lx\n",
                   status);
        goto Exit;
    }

    //
    // Attach our create handler
    //
    DriverObject->MajorFunction[IRP_MJ_CREATE] = SymHookCreate;

    //
    // Save the original string that the symlink points to
    // so we can change the object back when we unload
    //
    g_LinkPath = g_SymLinkObject->LinkTarget;

    //
    // Modify the symlink to point to our callback instead of the string
    // and change the flags so the union will be treated as a callback.
    // Set CallbackContext to the path of our device so the callback
    // returns it and redirects the open to our IRP_MJ_CREATE handler.
    //
    g_SymLinkObject->Callback = SymLinkCallback;
    RtlAppendUnicodeStringToString(&g_DeviceName, &g_TailName);
    g_SymLinkObject->CallbackContext = &g_DeviceName;
    MemoryBarrier();
    g_SymLinkObject->Flags |= OBJECT_SYMBOLIC_LINK_USE_CALLBACK;

Exit:
    //
    // Return the result back to the system
    //
    return status;
}

This code means that when someone tries to access the symlink, they will reach our callback and will receive the path to our Device Object (\Device\HarddiskVolume0) instead of \Device\HarddiskVolume<N>, where N is the number of the real C: partition.

Then, when this path is opened, the I/O manager will create a File Object for the remaining path, such as \Windows\Notepad.exe, and will then call our Driver Object’s IRP_MJ_CREATE handler, where we will get this name from the FILE_OBJECT structure, and replace it with a new, fully qualified path, including both the original Device Object path and the remaining path.

Replacing a FILE_OBJECT name is trickier than it sounds – the original path, allocated by the I/O Manager, has a specific pool tag, and us freeing it and allocating our own would look like a leak to various testing tools such as Driver Verifier, unless we mimic the original tag.

To fix this issue, Microsoft implemented a special API: IoReplaceFileObjectName. Not only does it use the correct internal kernel pool tag, but it also implements certain optimizations such that the length of the file name string buffer will always be “aligned” to 56, 120, or 248 bytes (unless the name is bigger, in which case the precise size is used). This avoids having to free/re-allocate the buffer in many situations, as the new name can simply override the old.
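The size rounding described here can be written down as a few lines of C. This is only our reading of the behaviour, not the real implementation: the function name is made up, and whether the comparison is inclusive or accounts for a terminator is an assumption.

```c
#include <stddef.h>

/*
 * Rounds a file name length (in bytes) up to the nearest of the three
 * bucket sizes mentioned in the text, or returns the exact size when
 * the name is larger than the biggest bucket.
 */
size_t file_name_buffer_size(size_t name_bytes)
{
    static const size_t buckets[] = { 56, 120, 248 };

    for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++) {
        if (name_bytes <= buckets[i]) {
            return buckets[i];
        }
    }
    return name_bytes;
}
```

Under this model, replacing a 40-byte name with a 50-byte one stays within the same 56-byte buffer, so no reallocation would be needed.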

Here’s what creating this new name ends up looking like:

//
// Get the FILE_OBJECT from the I/O Stack Location
//
ioStack = IoGetCurrentIrpStackLocation(Irp);
fileObject = ioStack->FileObject;

//
// Allocate space for the original device name, plus the size of the
// file name, and adding space for the terminating NUL.
//
bufferLength = fileObject->FileName.Length +
               g_LinkPath.Length +
               sizeof(UNICODE_NULL);
buffer = (PWCHAR)ExAllocatePoolWithTag(PagedPool, bufferLength, 'maNF');
if (buffer == NULL)
{
    status = STATUS_INSUFFICIENT_RESOURCES;
    goto Exit;
}

//
// Append the original device name first
//
buffer[0] = UNICODE_NULL;
NT_VERIFY(NT_SUCCESS(RtlStringCbCatNW(buffer,
                                      bufferLength,
                                      g_LinkPath.Buffer,
                                      g_LinkPath.Length)));

//
// Then add the name of the file name
//
NT_VERIFY(NT_SUCCESS(RtlStringCbCatNW(buffer,
                                      bufferLength,
                                      fileObject->FileName.Buffer,
                                      fileObject->FileName.Length)));

//
// Ask the I/O manager to free the original file name and use ours instead
//
status = IoReplaceFileObjectName(fileObject,
                                 buffer,
                                 bufferLength - sizeof(UNICODE_NULL));
if (!NT_SUCCESS(status))
{
    DbgPrintEx(DPFLTR_IHVDRIVER_ID,
               DPFLTR_ERROR_LEVEL,
               "Failed to swap file object name: %lx\n",
               status);
    ExFreePool(buffer);
    goto Exit;
}

Once we’ve replaced the File Object’s name, this code still has a problem – we can’t return STATUS_SUCCESS, since that would make us the owner Device Object for this new file, and not actually point to the original target Device Object of the partition. All future I/O will flow through our driver as IRPs, and we must now essentially implement forwarders for every operation.

We could get the correct Device Object for \Device\HarddiskVolume<N> and manually forward all IRPs to it, but then all requests will still be attached to our device. Not only does this make us a lot more visible, but it essentially turns us into a file system filter driver. We just want to get the creation request and then pass it on to the correct device and not have to handle it ever again.

To make this work correctly, we have to exercise the I/O Manager’s transmogrification logic, which is a two-step process:

  1. Return STATUS_REPARSE, to indicate that a reparse operation is needed. This causes IopParseDevice to look at the new name string in the File Object, and begin the name lookup logic all over again, based on this new name, freeing the old object and previous work done. This code is highly complex, but you can see a simpler version of it in the ReactOS sources here.
  2. Set the IRP’s Information field to IO_REPARSE, which indicates the type of reparsing operation that we are attempting. This is normally where a true hard link or symlink would be indicated by using a special reparse tag and a matching structure documented by Microsoft, such as REPARSE_DATA_BUFFER. However, IO_REPARSE is a magic/reserved value which indicates just a plain replacement of the name, and not a true reparse point.

Taking these points into consideration, our IRP_MJ_CREATE handler completes with the following logic:

    //
    // Return a reparse operation so that the I/O manager uses the new file
    // object name for its lookup, and starts over
    //
    Irp->IoStatus.Information = IO_REPARSE;
    status = STATUS_REPARSE;

Exit:
    //
    // Complete the IRP with the relevant status code
    //
    Irp->IoStatus.Status = status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return status;

So now we have a mechanism that looks something like this:

Some of you might point out that this method will work just as well without using a symlink callback at all – we could have just replaced the LinkTarget of the symbolic link with the path of our device, and would have gotten all the requests anyway – in fact, we’d only have had to change the last digit of the path. However, we felt that doing this makes us a lot more visible, as anyone inspecting the symlink object will easily see this path, as well as a change in the structure.

Another reason is that with the callback we can dynamically decide what to do. For example, if we know we are being inspected by an AV driver, we can use the callback to return the original string and not redirect the request to our device. We could even redirect to a completely different device if we wanted to, without having to constantly keep changing the path (which would result in a race condition anyway).

Excited to try things out, we load our new and improved driver and look at the results:

And get super happy, until, about 10 seconds later, we get a crash:

Shit. When our device receives an open request for the root of the C: volume itself, it can’t return STATUS_REPARSE, it’s in the rules.

So what do we do now? All will be revealed in part 3 (and 4… and possibly 5).

We have created a new branch on GitHub which implements the improved hooking mechanism introduced in this part.

“Move aside, signature scanning!” Better kernel data discovery through lookaside lists

Introduction

A while ago we did some research. That specific project might be published at some other time in the future and we won’t go into too much detail about it here. But as part of this project we wanted to gain access into an internal data structure used by some driver. Sadly, the driver’s global pointer to this data structure is not exported, and we couldn’t find a way to access it from outside the driver itself. It is stored in the pool, so we couldn’t even scan the driver address space for signs of this structure.

Of course, there is always the option of doing binary parsing on the driver based on a function signature that references the global, and/or using an array of known offsets for the global variable and adding the driver base to find it. But these methods require finding and using the correct RVA for every version of the driver, as well as all potential function signatures. Because this driver does not have exported functions, such signatures would be brittle and subject to change between releases. Therefore, although often used by malware authors, we find these techniques ugly and inconvenient to implement — we knew we could do better.

So, we reverse engineered the data structure itself and came up with an interesting idea that can give us easy access to this data structure and to many others. The data structure we were interested in is very large and contains, among other things, a few lookaside lists embedded in it. Lookaside lists are singly linked lists containing pool allocations of a fixed size. They are used by drivers for caching memory allocations instead of always requesting them from the memory manager. Let’s see what makes these interesting.

System Lookaside Lists

Here is the wdm.h definition of a GENERAL_LOOKASIDE_LAYOUT (GENERAL_LOOKASIDE is just an aligned version of GENERAL_LOOKASIDE_LAYOUT):


//
// The goal here is to end up with two structure types that are identical except
// for the fact that one (GENERAL_LOOKASIDE) is cache aligned, and the other
// (GENERAL_LOOKASIDE_POOL) is merely naturally aligned.
//
// An anonymous structure element would do the trick except that C++ can't handle
// such complex syntax, so we're stuck with this macro technique.
//
#define GENERAL_LOOKASIDE_LAYOUT                \
    union {                                     \
        SLIST_HEADER ListHead;                  \
        SINGLE_LIST_ENTRY SingleListHead;       \
    } DUMMYUNIONNAME;                           \
    USHORT Depth;                               \
    USHORT MaximumDepth;                        \
    ULONG TotalAllocates;                       \
    union {                                     \
        ULONG AllocateMisses;                   \
        ULONG AllocateHits;                     \
    } DUMMYUNIONNAME2;                          \
                                                \
    ULONG TotalFrees;                           \
    union {                                     \
        ULONG FreeMisses;                       \
        ULONG FreeHits;                         \
    } DUMMYUNIONNAME3;                          \
                                                \
    POOL_TYPE Type;                             \
    ULONG Tag;                                  \
    ULONG Size;                                 \
    union {                                     \
        PALLOCATE_FUNCTION_EX AllocateEx;       \
        PALLOCATE_FUNCTION Allocate;            \
    } DUMMYUNIONNAME4;                          \
                                                \
    union {                                     \
        PFREE_FUNCTION_EX FreeEx;               \
        PFREE_FUNCTION Free;                    \
    } DUMMYUNIONNAME5;                          \
                                                \
    LIST_ENTRY ListEntry;                       \
    ULONG LastTotalAllocates;                   \
    union {                                     \
        ULONG LastAllocateMisses;               \
        ULONG LastAllocateHits;                 \
    } DUMMYUNIONNAME6;                          \
    ULONG Future[2];

A useful fact to notice is that this structure contains a linked list (GENERAL_LOOKASIDE.ListEntry), meaning all lookaside lists do. Depending on whether the lookaside list was created with ExInitializeNPagedLookasideList or ExInitializePagedLookasideList (or, if ExInitializeLookasideListEx was used, the PoolType which was passed in), the data structure will be entered into one of two list heads. As such, if we follow the ListEntry of any lookaside list, we’ll eventually end up at either ExPagedLookasideListHead or ExNPagedLookasideListHead. Since we create our own lookaside list through these APIs, if we pick the same pool type as our target structure, we can therefore iterate through all other lookasides, and eventually reach the one contained in our target structure. In this particular use case, using our own definition of the structure, the useful CONTAINING_RECORD macro, and the knowledge that the first member of the structure is a “magic” ULONG that always contains the same value, we searched all lookaside lists using this mechanism until we reached our structure.
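To make the walk concrete, here is a rough user-mode sketch of the idea, not the driver code we actually ran. Every type, the magic value, and the tags are made-up stand-ins for the kernel structures. The tag pre-filter is an extra assumption we added so the simulation never reads outside its own objects; the text above relies on the magic value alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for LIST_ENTRY and GENERAL_LOOKASIDE.
struct ListEntry { ListEntry* Flink; ListEntry* Blink; };

struct Lookaside {
    uint32_t  Tag;      // lookaside lists carry their pool tag
    ListEntry Links;    // plays the role of GENERAL_LOOKASIDE.ListEntry
};

// Hypothetical target structure: a magic first member plus an embedded lookaside.
struct Target {
    uint32_t  Magic;
    Lookaside Cache;
};

constexpr uint32_t kMagic     = 0x1337C0DE;  // made-up magic value
constexpr uint32_t kTargetTag = 0x74677254;  // made-up tag of the embedded list

// Same idea as CONTAINING_RECORD: go from a member pointer to its parent.
#define CONTAINING(ptr, type, field) \
    (reinterpret_cast<type*>(reinterpret_cast<char*>(ptr) - offsetof(type, field)))

inline void InitHead(ListEntry* head) { head->Flink = head->Blink = head; }

inline void InsertTail(ListEntry* head, ListEntry* entry) {
    entry->Flink = head;
    entry->Blink = head->Blink;
    head->Blink->Flink = entry;
    head->Blink = entry;
}

// Start from our own lookaside and follow Flink until we come back around.
Target* FindTarget(Lookaside* ours) {
    assert(ours != nullptr);
    ListEntry* start = &ours->Links;
    for (ListEntry* e = start->Flink; e != start; e = e->Flink) {
        Lookaside* look = CONTAINING(e, Lookaside, Links);
        if (look->Tag != kTargetTag) {
            continue;   // not the kind of lookaside our target embeds
        }
        Target* candidate = CONTAINING(look, Target, Cache);
        if (candidate->Magic == kMagic) {
            return candidate;
        }
    }
    return nullptr;
}
```

In this simulation the list head is wrapped in a Lookaside-shaped dummy so that it can be inspected like any other entry. In the kernel the head is a bare LIST_ENTRY, so a real implementation has to be careful about what it dereferences around it.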

But the possibilities don’t stop there – this method gives us access to any kernel structure, exported or not, that contains a lookaside list. So what else is there?

Pool-Based Lookaside Lists

With some WinDbg magic, we can also find out valuable information about the data – whether it’s inside a driver (and which one!) or in the kernel pool, who it belongs to, the allocation size, etc. To explore the possibilities, we wrote a simple WinDbg script that iterates through all lookaside lists and uses the extremely helpful !pool extension to dump information about them. Although we could build similar functionality in a custom C driver, no Windows kernel API supplies comparable information about pool allocations, and parsing pool pages to retrieve it is a lot of work, so out of laziness we decided not to implement it in C. In fact, while we tried to implement our own C-based pool parser, we ended up realizing that nobody had described the myriad of changes in Windows 10 RS5 and above’s pool manager, so we’re busy writing a book on the topic.

Using our script, we found structures containing lookaside lists that belong to FltMgr.sys, Win32k.sys, Windows Defender drivers, various display drivers, and much more.

dx -r0 @$GeneralLookaside = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExPagedLookasideListHead, "nt!_GENERAL_LOOKASIDE", "ListEntry")
dx -r0 @$lookasideAddr = @$GeneralLookaside.Select(l => ((__int64)&l).ToDisplayString("x"))
dx -r0 @$extractBetween = ((x,y,z) => x.Substring(x.IndexOf(y) + y.Length, x.IndexOf(z) - x.IndexOf(y) - y.Length))
dx -r0 @$extractWithSize = ((x,y,z) => x.Substring(x.IndexOf(y) + y.Length, z))
dx -r2 @$poolData = @$lookasideAddr.Select(l => Debugger.Utility.Control.ExecuteCommand("!pool "+l+" 2")).Where(l => l[1].Length != 0x55 && l[1].Length != 0).Select(l => new {address = "0x" + @$extractBetween(l[1], "*", "size:"), tag = @$extractWithSize(l[1], "(Allocated) *", 4), tagDesc = l[2].Contains(",") ? @$extractBetween(l[2], ": ", ",") : l[2].Substring(l[2].IndexOf(":")+2), binary = l[2].Contains("Binary") ? l[2].Substring(l[2].IndexOf("Binary :")+9) : "unknown", size = "0x" + @$extractBetween(l[1], "size:", "previous size:").Replace(" ", "")})

[0x4a]
    address : 0xffff988679939400
    tag :     Vi10
    tagDesc : Video memory manager process heap
    binary :  dxgmms2.sys
    size :    0x70
[0x4b]
    address : 0xffff98867b647650
    tag :     DxgK
    tagDesc : Vista display driver support
    binary :  dxgkrnl.sys
    size :    0x640
[0x4c]
    address : 0xffff98867b647650
    tag :     DxgK
    tagDesc : Vista display driver support
    binary :  dxgkrnl.sys
    size :    0x640
[0x4d]
    address : 0xffff9886790f5430
    tag :     Vi17
    tagDesc : Video memory manager pool
    binary :  dxgmms2.sys
    size :    0x150
[0x4e]
    address : 0xffff98867966e230
    tag :     Usla
    tagDesc : USERTAG_LOOKASIDE
    binary :  win32k!InitLockRecordLookaside
    size :    0xa0
[0x4f]
    address : 0xffff98867966ea50
    tag :     Usla
    tagDesc : USERTAG_LOOKASIDE
    binary :  win32k!InitLockRecordLookaside
    size :    0xa0
[0x50]
    address : 0xffff98867966e690
    tag :     Gla1
    tagDesc : GDITAG_HMGR_LOOKASIDE_DC_TYPE
    binary :  win32k.sys
    size :    0xa0
[0x51]
    address : 0xffff98867966e550
    tag :     Gla4
    tagDesc : GDITAG_HMGR_LOOKASIDE_RGN_TYPE
    binary :  win32k.sys
    size :    0xa0
[0x52]
    address : 0xffff98867966ecd0
    tag :     Gla5
    tagDesc : GDITAG_HMGR_LOOKASIDE_SURF_TYPE
    binary :  win32k.sys
    size :    0xa0

There are some results in which the pool tag is unknown, making it difficult to track down the driver they belong to. A fun way to solve that is using driver verifier’s pool tracking feature. We can modify our script and replace the !pool <address> 2 command with !verifier <address> 2 and receive information about the allocating driver and the complete stack trace of the allocation. But running this command on so many addresses is extremely slow and it dumps a lot of information that is hard to sort through. So another option is going for a more manual approach – enabling driver verifier but executing the previous script as it is, and only querying specific addresses that seem interesting with verifier.

Image-Based Lookaside Lists

Initially we only searched for data in the pool because that is where the structure we were interested in was allocated. But with this trick we also get access to lookaside lists that are inside drivers, and we can use the cool new RtlPcToFileName function to find out what driver these structures are in. In this case we did choose to implement this in C code since it’s more straightforward and faster to execute:

_Use_decl_annotations_
NTSTATUS
DriverEntry (
    _In_ PDRIVER_OBJECT DriverObject,
    _In_ PUNICODE_STRING RegistryPath
    )
{
    NTSTATUS status;
    LOOKASIDE_LIST_EX lookaside;
    PLIST_ENTRY lookasideList;
    PLIST_ENTRY lookasideListHead;
    PGENERAL_LOOKASIDE generalLookaside;
    UNICODE_STRING pcName = RTL_CONSTANT_STRING(L"RtlPcToFileName");
    DECLARE_UNICODE_STRING_SIZE(driverName, 32);
    UNREFERENCED_PARAMETER(RegistryPath);

    DriverObject->DriverUnload = DriverUnload;

    auto RtlPcToFileNamePtr = (decltype(RtlPcToFileName)*)(MmGetSystemRoutineAddress(&pcName));
    NT_ASSERT(RtlPcToFileNamePtr != nullptr);

    //
    // Create our own lookaside list to use for finding other lookaside lists in the kernel.
    //
    status = ExInitializeLookasideListEx(&lookaside,
                                         nullptr,
                                         nullptr,
                                         PagedPool,
                                         0,
                                         8,
                                         'Fake',
                                         0); 
    if (!NT_SUCCESS(status))
    { 
        goto Exit;
    }

    //
    // Iterate over our lookaside list to find all the other lookaside lists
    // and print information about them
    //
    generalLookaside = nullptr;
    lookasideListHead = &lookaside.L.ListEntry;
    lookasideList = lookasideListHead->Flink;
    do
    {
        generalLookaside = CONTAINING_RECORD(lookasideList,
                                             GENERAL_LOOKASIDE,
                                             ListEntry);

        //
        // Use RtlPcToFileName to find whether the lookaside list is
        // inside a driver and if so, which one
        //
        status = RtlPcToFileNamePtr(generalLookaside, &driverName);
        if (NT_SUCCESS(status)) 
        { 
            DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                       DPFLTR_ERROR_LEVEL,
                       "Lookaside list is in driver %wZ\n",
                       &driverName);
        }
        else
        {
            DbgPrintEx(DPFLTR_IHVDRIVER_ID,
                       DPFLTR_ERROR_LEVEL,
                       "Lookaside list is not inside a driver\n");
        }

        lookasideList = lookasideList->Flink;
    } while (lookasideList != lookasideListHead);

    status = STATUS_SUCCESS;

Exit: 
    ExDeleteLookasideListEx(&lookaside);
    return status;
}

With this code we found lookaside lists inside of Ntoskrnl.exe, Ci.dll, Ntfs.sys and more. Of course, since these are embedded inside of the driver memory, our only way to know whether these are independent lookaside lists or they are part of a larger structure is to dump the addresses and reverse engineer the drivers. But we’re all nerds who like reverse engineering, or we wouldn’t be writing/reading this blog.

We can also implement the same query in WinDbg if we choose to, using the ln command which searches for the nearest symbol to an address:

dx -r0 @$GeneralLookaside = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExPagedLookasideListHead, "nt!_GENERAL_LOOKASIDE", "ListEntry")
dx -r0 @$lookasideAddr = @$GeneralLookaside.Select(l => ((__int64)&l).ToDisplayString("x"))
dx -r2 @$symData = @$lookasideAddr.Select(l => new {addr = l, sym = Debugger.Utility.Control.ExecuteCommand("ln "+l)}).Where(l => l.sym.Count() > 3).Select(l => new {addr = l.addr, sym = @$extractBetween(l.sym[3], "   ", "|")})

[0x9]
    addr        : 0xfffff8000e4eb300
    sym         : nt!AlpcpLookasides+0x100
[0xa]
    addr        : 0xfffff8000e4db180
    sym         : nt!IopSymlinkInfoLookasideList
[0xb]
    addr        : 0xfffff8000e4ef040
    sym         : nt!WmipDSChunkInfoLookaside
[0xc]
    addr        : 0xfffff8000e4eefc0
    sym         : nt!WmipGEChunkInfoLookaside
[0xd]
    addr        : 0xfffff8000e4ef140
    sym         : nt!WmipISChunkInfoLookaside
[0xe]
    addr        : 0xfffff8000e4ef0c0
    sym         : nt!WmipMRChunkInfoLookaside
[0xf]
    addr        : 0xfffff8001172a880
    sym         : FLTMGR!FltGlobals+0x340
[0x10]
    addr        : 0xfffff8001172ad00
    sym         : FLTMGR!FltGlobals+0x7c0
[0x11]
    addr        : 0xfffff8001172af00
    sym         : FLTMGR!FltGlobals+0x9c0
[0x12]
    addr        : 0xfffff8001172b080
    sym         : FLTMGR!FltGlobals+0xb40

This is a pretty cool trick, which led to all sorts of cool discoveries. And we only searched for paged lookaside lists. There is a whole world of non-paged lookaside lists that we didn’t even look at yet. We ran the same WinDbg scripts as before, and just changed our starting point from nt!ExPagedLookasideListHead to nt!ExNPagedLookasideListHead to get the non-paged lookaside lists, and got some interesting results. We looked for non-paged lookaside lists in the pool:

[0x55]
    address : 0xffff97884ba5c990
    tag :     Vkin
    tagDesc : Hyper-V VMBus KMCL driver (incoming packets)
    binary :  vmbkmcl.sys
    size :    0x2d0
[0x56]
    address : 0xffff97884bad1590
    tag :     NDnd
    tagDesc : NDIS_TAG_POOL_NDIS
    binary :  ndis.sys
    size :    0x800
[0x57]
    address : 0xffff97884bad3000
    tag :     NDrt
    tagDesc : NDIS_TAG_RST_NBL
    binary :  ndis.sys
    size :    0x800
[0x58]
    address : 0xffff97884ba19130
    tag :     Nnbf
    tagDesc : NetIO NetBufferLists
    binary :  netio.sys
    size :    0x800

And inside of drivers:

[0x14]
    addr        : 0xfffff8000e4db100
    sym         : nt!IopOplockFoExtLookasideList
[0x15]
    addr        : 0xfffff8000e4ee880
    sym         : nt!WmipRegLookaside
[0x16]
    addr        : 0xfffff80010e40bc0
    sym         : ACPI!BuildRequestLookAsideList
[0x17]
    addr        : 0xfffff80010e40dc0
    sym         : ACPI!RequestLookAsideList
[0x18]
    addr        : 0xfffff80010e40c40
    sym         : ACPI!DeviceExtensionLookAsideList
[0x19]
    addr        : 0xfffff80010e40d40
    sym         : ACPI!RequestDependencyLookAsideList
[0x1a]
    addr        : 0xfffff80010e40cc0
    sym         : ACPI!ObjectDataLookAsideList
[0x1b]
    addr        : 0xfffff80010e40f40
    sym         : ACPI!XswContextLookAsideList

Per-Processor Lookaside Lists

There’s actually one more linked list of lookaside lists that we haven’t talked about yet: ExPoolLookasideListHead. Since the first versions of Windows NT, and up until Windows 10 RS5 when it was rewritten to use the Backend Heap (again, the topic of a future book!), the pool manager leveraged a per-processor array of 32 lookaside lists, one for each indexed multiple of the pool block size. On x86, this basically meant any 8-byte aligned allocation from 8 to 256 bytes, and on x64, any 16-byte aligned allocation from 16 to 512 bytes.

Since there was both a paged and nonpaged pool, each KPRCB had two such arrays — the PPNPagedLookasideList and the PPPagedLookasideList. With Windows 8 and the introduction of the non-executable nonpaged pool, a third array was created: PPNxPagedLookasideList. All of these lookaside lists are therefore inserted into the same linked list head, and on our system, you can easily see how many processors (16) are present:

lkd> dx -r0 @$poolasides = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExPoolLookasideListHead, "nt!_GENERAL_LOOKASIDE", "ListEntry")
@$poolasides = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExPoolLookasideListHead, "nt!_GENERAL_LOOKASIDE", "ListEntry")
lkd> dx @$poolasides.Count(), d
@$poolasides.Count(), d : 1536
lkd> dx 1536 / 32 / 3
1536 / 32 / 3 : 16
lkd> dx *(int*)&nt!KeNumberProcessors
*(int*)&nt!KeNumberProcessors : 16 [Type: int]
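The arithmetic behind that count can be spelled out in a few lines. This is only a sketch: the index helper is a simplification we made up for illustration, which rounds a request up to whole blocks and ignores any pool header accounting the real allocator performed.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kListsPerArray      = 32;  // lists per per-processor array
constexpr std::size_t kArraysPerProcessor = 3;   // PPNPaged, PPPaged, PPNxPaged

// Total entries on ExPoolLookasideListHead -> number of processors.
inline std::size_t ProcessorsFromListCount(std::size_t totalLists) {
    return totalLists / kListsPerArray / kArraysPerProcessor;
}

// Simplified (hypothetical) mapping from a request size to a list index.
// blockSize is 8 on x86 and 16 on x64; returns -1 when no list covers the size.
inline int LookasideIndexForSize(std::size_t bytes, std::size_t blockSize) {
    assert(blockSize != 0);
    if (bytes == 0) {
        return -1;
    }
    const std::size_t blocks = (bytes + blockSize - 1) / blockSize;
    if (blocks > kListsPerArray) {
        return -1;
    }
    return static_cast<int>(blocks - 1);
}
```

With the 1536 entries counted above, ProcessorsFromListCount gives 16, matching KeNumberProcessors. On x64, a 512-byte request maps to the last index (31) and anything larger has no list.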

Originally, this seemed exciting, as it would imply the ability to easily locate not only structures that contain a lookaside list, but in fact, any pool structure that’s a multiple of the pool block size. Unfortunately, if we take a look at these lists on modern Windows 10 systems, we find that they’re completely unused:

lkd> dx @$poolasides.Sum(p => p.TotalAllocates + p.TotalFrees)
@$poolasides.Sum(p => p.TotalAllocates + p.TotalFrees) : 0x0

Indeed, looking at the code in ExAllocatePoolWithTag and friends, this logic was completely removed as part of the heap-related changes we’ll cover in a future research paper.

Executive Resources

The even cooler thing is that lookaside lists are not the only kernel structures that are linked to all other structures of the same type! Another example is the ERESOURCE, a structure used to implement read/write locking for drivers. Executive resources are also contained inside of many kernel structures, and can give us access to even more internal kernel information, if we know how to find them. We changed our WinDbg scripts to iterate over the linked list found in ERESOURCE.SystemResourcesList, starting from nt!ExpSystemResourcesList.

We first searched for ERESOURCE objects in the pool:

[0xb8]
    address : 0xffff97884bf9fb90
    tag :     Ntfx
    tagDesc : Unrecognized NTFS tag (update base\published\pooltag.w)
    binary :  ntfs.sys
    size :    0x170
[0xb9]
    address : 0xffff97884bf50e80
    tag :     SeTl
    tagDesc : Security Token Lock
    binary :  nt!se
    size :    0x80
[0x4c]
    address : 0xffff97884bf9ed30
    tag :     Ntfx
    tagDesc : Unrecognized NTFS tag (update base\published\pooltag.w)
    binary :  ntfs.sys
    size :    0x170

And then for ERESOURCE objects inside of drivers:

[0x3c]
    addr        : 0xfffff8001268e8e0
    sym         : Ntfs!NtfsDynamicRegistrySettingsResource
[0x3d]
    addr        : 0xfffff80011211ef0
    sym         : NDIS!SharedMemoryResource
[0x3e]
    addr        : 0xfffff80012967630
    sym         : ksecpkg!g_rgCachedPagedSslProvs+0x410
[0x3f]
    addr        : 0xfffff80011a032f8
    sym         : tcpip!FlIsolationState+0x18
[0x40]
    addr        : 0xfffff80011d482e0
    sym         : mup!MupProviderTable+0x20
[0x41]
    addr        : 0xfffff80011d48100
    sym         : mup!MupiSurrogateList+0x20
[0x42]
    addr        : 0xfffff8000f4ac370
    sym         : CI!g_IgnoreLifetimeSigningEKU+0x70
[0x43]
    addr        : 0xfffff8000f4acb80
    sym         : CI!g_GRLContextLock
[0x44]
    addr        : 0xfffff80012f081c0
    sym         : netbios!g_erGlobalLock

We found some very interesting results that are probably worth further investigation, such as pool structures related to NTFS volume objects, structures inside Ci.dll, and much, much more. On our machine we found over 400 000 executive resources:

lkd> dx -r0 @$eresource = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExpSystemResourcesList, "nt!_ERESOURCE", "SystemResourcesList")
@$eresource = Debugger.Utility.Collections.FromListEntry(*(nt!_LIST_ENTRY*)&nt!ExpSystemResourcesList, "nt!_ERESOURCE", "SystemResourcesList")
lkd> dx @$eresource.Count(),d
@$eresource.Count(),d : 400960

Because the sheer number makes analysis with LINQ unwieldy, we wanted to get pool information for some of these ERESOURCE structures using C code and start analyzing them. Unfortunately, unlike lookaside lists, ERESOURCE structures don’t have their pool tag as part of the structure, so we have to write a pool parser to get the pool information for each ERESOURCE. As we’ve mentioned before, it turns out that in RS5 and later this is not an easy task at all, as you’ll see in our upcoming research on the new backend heap-backed kernel pool.

DKOM – Now with Symbolic Links!

You might think “What can ANYONE still say about kernel callbacks? We’ve already seen every callback possible – there are process creation callbacks, object type callbacks, image load notifications, callback objects, host extensions… there can’t be any more kinds of callbacks. Right? Right…?”

Nope.

In Microsoft’s never-ending attempt to close one door for kernel hooking and open two more, Windows 10 Creators Update (RS2) added a new type of callback – this time for symbolic links.

Notice these recent changes to the OBJECT_SYMBOLIC_LINK structure:

typedef struct _OBJECT_SYMBOLIC_LINK
{
    LARGE_INTEGER CreationTime;
+   union
+   {
        UNICODE_STRING LinkTarget;
+       struct
+       {
+            POBJECT_SYMBOLIC_LINK_CALLBACK Callback;
+            PVOID CallbackContext;
+       };
+   };
    ULONG DosDeviceDriveIndex;
    ULONG Flags;
    ULONG AccessMask;
} OBJECT_SYMBOLIC_LINK, *POBJECT_SYMBOLIC_LINK;

What used to be a Unicode String containing the target of the symbolic link is now a union that contains one of our favorite keywords to see when looking at the kernel – callback.

These callbacks were added in RS2 to support Memory Partitions, which are a new type of object used to segment physical address ranges into their own instance of the memory manager. Without going into too many details, the key point is that some of the event objects in \KernelObjects, such as LowMemoryCondition are no longer global – but rather refer to the specific conditions in the Memory Partition of the current caller. However, in order not to break compatibility, their naming and location could not be changed (such as \KernelObjects\Partition2\LowMemoryCondition). As a result, they were turned into symbolic links attached to a dynamic callback, which will look at the current Memory Partition in EPROCESS and return the appropriate KEVENT Object for the caller’s partition.

Now, whenever bit 5 in the symbolic link flags is set (Flags & OBJECT_SYMBOLIC_LINK_USE_CALLBACK), the LinkTarget will not be treated as a string, but instead will be treated as a function with this prototype:

typedef
NTSTATUS
(*POBJECT_SYMBOLIC_LINK_CALLBACK) (
    _In_ POBJECT_SYMBOLIC_LINK Symlink,
    _In_ PVOID SymlinkContext,
    _Out_ PUNICODE_STRING SymlinkPath,
    _Outptr_ PVOID* Object
    );

This function will be called whenever the symbolic link is reparsed, and has to either set the SymlinkPath parameter to the target path to be parsed by the object manager, or set the Object parameter to the correct object that will be used as the target for this symbolic link.
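As a way to picture the two behaviors, here is a small user-mode model of how a reparse could choose between the string and the callback. It is only a sketch under our own assumptions: the flag value, the types, and the example callback are invented, and the kernel keeps the string and the callback in a union rather than in separate fields as we do here for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

constexpr uint32_t kUseCallback = 0x10;  // hypothetical value for the USE_CALLBACK flag

struct SimSymlink;
using SimCallback = int (*)(SimSymlink* link, void* context,
                            std::wstring* path, void** object);

struct SimSymlink {
    uint32_t     Flags;
    std::wstring LinkTarget;       // used when the flag is clear
    SimCallback  Callback;         // used when the flag is set
    void*        CallbackContext;
};

struct Resolution {
    std::wstring Path;             // target path to keep parsing, or
    void*        Object = nullptr; // the object to use directly
};

// Either hand back the static target or ask the callback for one.
bool Resolve(SimSymlink& link, Resolution& out) {
    out = Resolution{};
    if (link.Flags & kUseCallback) {
        assert(link.Callback != nullptr);
        const int status = link.Callback(&link, link.CallbackContext,
                                         &out.Path, &out.Object);
        // The callback must produce a path or an object.
        return status == 0 && (out.Object != nullptr || !out.Path.empty());
    }
    out.Path = link.LinkTarget;
    return true;
}

// Made-up callback: builds a per-partition path from an int context.
int PartitionCallback(SimSymlink*, void* context,
                      std::wstring* path, void** object) {
    *object = nullptr;
    *path = L"\\KernelObjects\\Partition" +
            std::to_wstring(*static_cast<int*>(context)) +
            L"\\LowMemoryCondition";
    return 0;
}
```

A link created with the flag set and a context of 2 resolves to \KernelObjects\Partition2\LowMemoryCondition in this model, while a link without the flag simply returns its stored target string.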

This callback is set (or not set) by the ObCreateSymbolicLink function, based on an input structure that contains flags and the target string or callback function:

#define OB_SYMLINK_TARGET_DYNAMIC 0x01
typedef struct _OB_SYMLINK_TARGET
{
    ULONG Flags;
    union
    {
        UNICODE_STRING LinkTarget;
        struct
        {
            POBJECT_SYMBOLIC_LINK_CALLBACK Callback;
            PVOID CallbackContext;
        };
    };
} OB_SYMLINK_TARGET, *POB_SYMLINK_TARGET;

Based on this parameter, the function creates a symbolic link object and sets its target:

NTSTATUS
ObCreateSymbolicLink (
    _Out_ PHANDLE LinkHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ POBJECT_ATTRIBUTES ObjectAttributes,
    _In_ POB_SYMLINK_TARGET TargetInfo,
    _In_ KPROCESSOR_MODE AccessMode
    )
{
    NTSTATUS status;
    PWCHAR linkString;
    POBJECT_SYMBOLIC_LINK symlinkObject;
    HANDLE linkHandle;

    //
    // Create the symlink object
    //
    symlinkObject = NULL;
    status = ObCreateObjectEx(AccessMode,
                              ObpSymbolicLinkObjectType,
                              ObjectAttributes,
                              AccessMode,
                              0,
                              sizeof(*symlinkObject),
                              0,
                              0,
                              &symlinkObject,
                              NULL);
    if (NT_SUCCESS(status))
    {
        KeQuerySystemTime(&symlinkObject->CreationTime);
        symlinkObject->DosDeviceDriveIndex = 0;
        symlinkObject->Flags = 0;

        //
        // If the symlink has a dynamic target, set the flags accordingly
        // and populate the callback field
        //
        if (TargetInfo->Flags & OB_SYMLINK_TARGET_DYNAMIC)
        {
            symlinkObject->Flags = OBJECT_SYMBOLIC_LINK_USE_CALLBACK;
            symlinkObject->Callback = TargetInfo->Callback;
        }
        else
        {
            //
            // If the symlink doesn't have a dynamic target, set the LinkTarget to the string
            //
            symlinkObject->LinkTarget.MaximumLength = TargetInfo->LinkTarget.MaximumLength;
            symlinkObject->LinkTarget.Length = TargetInfo->LinkTarget.Length;
            linkString = (PWCHAR)ExAllocatePoolWithTag(PagedPool,
                                                       TargetInfo->LinkTarget.MaximumLength,
                                                       'tmyS');
            symlinkObject->LinkTarget.Buffer = linkString;
            if (linkString == NULL)
            {
                status = STATUS_NO_MEMORY;
                goto Exit;
            }

            RtlCopyMemory(linkString,
                          TargetInfo->LinkTarget.Buffer,
                          TargetInfo->LinkTarget.MaximumLength);
        }

        if (RtlIsSandboxedToken(NULL, AccessMode) != FALSE)
        {
            symlinkObject->Flags |= OBJECT_SYMBOLIC_IS_SANDBOXED;
        }

        status = ObInsertObjectEx(symlinkObject,
                                  NULL,
                                  DesiredAccess,
                                  0,
                                  0,
                                  NULL,
                                  &linkHandle);
        symlinkObject = NULL;

        if (NT_SUCCESS(status))
        {
            *LinkHandle = linkHandle;
            status = STATUS_SUCCESS;
        }
    }

Exit:
    if (symlinkObject != NULL)
    {
        ObDereferenceObject(symlinkObject);
    }

    return status;
}

Unfortunately, ObCreateSymbolicLink is not exported, so we can’t call it ourselves and create a symbolic link with a callback function. It’s never that simple. The function has 2 callers – NtCreateSymbolicLinkObject and MiCreateMemoryEvent. The latter function handles the Memory Partition functionality we described earlier. It creates the various memory events as symbolic links with no target strings, and sets their callback to MiResolveMemoryEvent:

You can see these symbolic links in WinObjEx. They can be recognized by having no target string:

But MiCreateMemoryEvent is an internal function that is not very useful for us in this case. So we turn to look at NtCreateSymbolicLinkObject which gives us very little to work with:

It always sets the Flags for the OB_SYMLINK_TARGET structure to 0, meaning the target is always a string, not a function pointer. This is unfortunate, since it means we can’t create symbolic link objects containing callbacks from user mode. But we didn’t really expect that to be possible, so we weren’t devastated. Instead, we decided to try and modify an existing symbolic link – we can use this feature to hook some frequently used symlink and register our own function to be called whenever it’s used.

We chose the symbolic link for the C: volume as our target. To achieve our goal, we first needed to open the symbolic link and get its object so we could modify it:

NTSTATUS status;
HANDLE symLinkHandle = NULL;
POBJECT_SYMBOLIC_LINK symlinkObject;
UNICODE_STRING symlinkName = RTL_CONSTANT_STRING(L"\\GLOBAL??\\c:");
OBJECT_ATTRIBUTES objectAttributes =
RTL_CONSTANT_OBJECT_ATTRIBUTES(&symlinkName,
                               OBJ_KERNEL_HANDLE | OBJ_CASE_INSENSITIVE);
//
// Open a handle to the symbolic link object for C: directory,
// so we can hook it
//
status = ZwOpenSymbolicLinkObject(&symLinkHandle,
                                  SYMBOLIC_LINK_ALL_ACCESS,
                                  &objectAttributes);
if (!NT_SUCCESS(status))
{
    goto Cleanup;
}

//
// Get the symbolic link object
//
status = ObReferenceObjectByHandle(symLinkHandle,
                                   SYMBOLIC_LINK_ALL_ACCESS,
                                   NULL,
                                   KernelMode,
                                   (PVOID*)&symlinkObject,
                                   NULL);
if (!NT_SUCCESS(status))
{
    goto Cleanup;
}
//
// Keep the object in the global symlinkObj variable and save the
// original string that the symlink points to, so we can change
// the object back when we unload
//
symlinkObj = symlinkObject;
origStr = symlinkObj->LinkTarget;

After we got our requested symbolic link object, we needed to save either the target string or the device it would point to, in order to return it from our callback function. Retrieving the device is messy and can have some issues, while the target string is right there in the object itself. We stored it in a global variable, and then we had everything we needed to modify the symbolic link. We just needed to create our callback function:

NTSTATUS
SymLinkCallback (
    _In_ POBJECT_SYMBOLIC_LINK Symlink,
    _In_ PVOID SymlinkContext,
    _Out_ PUNICODE_STRING SymlinkPath,
    _Outptr_ PVOID* Object
    )
{
    UNREFERENCED_PARAMETER(Symlink);

    //
    // We need to either return the right object for this symlink
    // or the correct target string.
    // It's a lot easier to get the string, so we can set Object to Null.
    //
    *Object = NULL;
    *SymlinkPath = *(PUNICODE_STRING)(SymlinkContext); // OrigStr

    return STATUS_SUCCESS;
}

The SymLinkCallback function receives the symbolic link object, two output parameters (only one of which must be set by the function) and a SymlinkContext parameter, which is controlled by whoever is registering the function. We chose to use this context to store the original LinkTarget string, so we can set the output SymlinkPath parameter to it and send the symlink to its correct destination.

After we defined our callback function, we could go back to our main function and edit the symlink object:

//
// Modify the symlink to point to our callback instead of the string
// and change the flags so the union will be treated as a callback.
// Set CallbackContext to the original string so we can
// return it from the callback and allow the system to run normally.
//
symlinkObj->Callback = SymLinkCallback;
symlinkObj->CallbackContext = &origStr;
_MemoryBarrier();
symlinkObj->Flags |= OBJECT_SYMBOLIC_LINK_USE_CALLBACK;

Theoretically, we were done. We could load our driver and every access to the C: volume should reach our callback. But as some of you might notice, there is actually a race condition here. Since the callback function and context are part of a union which could also be a Unicode String, the data placed there can be interpreted as the wrong type, causing a type confusion and the inevitable crash.

union
{
    UNICODE_STRING LinkTarget;
    struct
    {
        PVOID Callback;
        PVOID CallbackContext;
    };
};

The type of this data is determined by OBJECT_SYMBOLIC_LINK.Flags, but the Callback field is too far from the Flags field in the OBJECT_SYMBOLIC_LINK structure. This means that we can’t change both with a single CPU instruction, unless we go into the realm of Intel’s Transactional Synchronization eXtensions (TSX), which would allow us to perform all these accesses as a single memory transaction without any races. However, outside of side-channel bugs, there doesn’t seem to be any real-world practical uses of TSX, and we’d hate to be the first ones to suggest any, lest this feature be actually uncancelled by Intel.

This means if we change the flags first, someone might try to access this symlink before we changed the LinkTarget string and the kernel will try to call a Unicode String as if it was executable memory, leading to a crash. Or if we change the string first and someone tries to use the symlink, the kernel will interpret the lower 2 bytes of our callback address as the string length and will try to read that many bytes as a string. That can come up to a huge number that will lead to unexpected results, but most likely to reading invalid memory and again, a crash.
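The second failure mode is easy to reproduce outside the kernel. The sketch below uses made-up stand-in structures that mirror the 16-byte x64 layout of the union and copies the bytes of a callback/context pair into a string header, the way a racing reader would see them. It assumes a little-endian host, which is what the "lower 2 bytes" reasoning relies on.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the x64 UNICODE_STRING layout (16 bytes).
struct FakeUnicodeString {
    uint16_t Length;
    uint16_t MaximumLength;
    uint32_t Padding;         // alignment padding before the pointer
    uint64_t Buffer;
};

// Stand-in for the callback half of the union.
struct FakeCallbackPair {
    uint64_t Callback;
    uint64_t CallbackContext;
};

static_assert(sizeof(FakeUnicodeString) == sizeof(FakeCallbackPair),
              "both views of the union must overlap exactly");

inline bool HostIsLittleEndian() {
    const uint16_t probe = 1;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Read the callback pair as if the flag still said "this is a string".
FakeUnicodeString ViewAsString(const FakeCallbackPair& pair) {
    assert(HostIsLittleEndian());
    FakeUnicodeString s;
    std::memcpy(&s, &pair, sizeof(s));
    return s;
}
```

For an arbitrary function address such as 0xfffff80012345678, the reader sees a Length of 0x5678 and a Buffer equal to whatever was stored as the context, which is exactly the bogus "string" described above.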

We have found a way to get around this issue, and we think it was a pretty clever idea. It did require about 45 minutes of fighting the linker settings, but that’s just a price you have to pay sometimes. We also realized that a “simpler” solution is possible as well: creating our own OBJECT_SYMBOLIC_LINK with the right settings and then modifying the OBJECT_DIRECTORY_ENTRY to swap the pointer of the original object with ours. Because it’s a simple pointer, we can use InterlockedCompareExchangePointer. Additionally, the OBJECT_DIRECTORY has a lock (EX_PUSH_LOCK) we could use to make the operation totally safe. But we liked our clever way better (and wanted to show off).
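For the curious, the pointer-swap alternative looks roughly like this in a user-mode model. We did not implement it in the driver, so treat it as a sketch: the types are invented stand-ins, and std::atomic's compare_exchange_strong is used here in place of InterlockedCompareExchangePointer.

```cpp
#include <atomic>
#include <cassert>

struct SwapSymlink { int Id; };        // stand-in for OBJECT_SYMBOLIC_LINK

struct FakeDirectoryEntry {            // stand-in for OBJECT_DIRECTORY_ENTRY
    std::atomic<SwapSymlink*> Object;
    explicit FakeDirectoryEntry(SwapSymlink* object) : Object(object) {}
};

// Replace the entry's object only if it still points at `expected`.
// Readers see either the old pointer or the new one, never a mix.
bool SwapObject(FakeDirectoryEntry& entry,
                SwapSymlink* expected,
                SwapSymlink* replacement) {
    assert(replacement != nullptr);
    return entry.Object.compare_exchange_strong(expected, replacement);
}
```

Because the whole object is replaced in one atomic pointer write, the flags and the callback of the new object are already consistent by the time anyone can reach it, which is why no alignment trick would be needed with this approach.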

As we mentioned – if we change the string pointer to our function pointer and someone tries to use the symlink before we change the flags, they are going to treat the first 2 bytes of the callback address as a string length and try to parse the “string” based on that. Therefore, we decided to just make sure the lower 2 bytes of the callback address are 0000, so the length is treated as 0 and no string parsing is attempted. This means we need to align our callback function to 64KB. Doing that required a lot of attempts and some linker magic, but what eventually worked was this:

  • Create a section named .call$0 and place in it a buffer sized 0xB000.
  • Fill this buffer with zeroes so it won’t be optimized out by the compiler. We picked 0xB000 because we noticed the .text segment was at 0x5000, which therefore put our section at 0xB000 + 0x5000 = 0x10000 (64KB) bytes.
  • Immediately after this, create the normal .text segment, where most of our code will be.
  • After all the other functions, create an executable section named .call$1 and place our callback function there.

This is what our driver is going to look like after making these changes:

EXTERN_C_START

__declspec(code_seg(".call$1"))
NTSTATUS
SymLinkCallback (
    _In_ POBJECT_SYMBOLIC_LINK Symlink,
    _In_ PVOID SymlinkContext,
    _Out_ PUNICODE_STRING SymlinkPath,
    _Outptr_ PVOID* Object
);

EXTERN_C_END

#pragma section(".call$0", write)
__declspec(allocate(".call$0")) UCHAR buffer[0xB000] = { 0 };

#pragma code_seg(".text")
_Use_decl_annotations_
NTSTATUS
DriverEntry (
    PDRIVER_OBJECT DriverObject,
    PUNICODE_STRING RegistryPath
    )
{
    ...
}

#pragma section(".call$1", execute)
__declspec(code_seg(".call$1"))

NTSTATUS
SymLinkCallback (
    _In_ POBJECT_SYMBOLIC_LINK Symlink,
    _In_ PVOID SymlinkContext,
    _Out_ PUNICODE_STRING SymlinkPath,
    _Outptr_ PVOID* Object
    )
{
    ...
}

We compiled our driver and opened it in IDA to see the address of our callback function:

Let’s load our driver and see what happens if we try to treat the symbolic link target as a string…

We dump the symbolic link that we modified, and we can see that when trying to treat our callback address as a Unicode String we get a string with Length == 0, which fixes our race condition.
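To make the arithmetic concrete, here is a small user-mode sketch (not part of the driver) of what a racing reader would see. The helper names and any sample addresses are made up for illustration; the only assumption is x64's little-endian layout, where the first USHORT of the 8-byte slot holds the low 16 bits of the pointer.

```c
#include <assert.h>
#include <stdint.h>

/* 64KB: the alignment we want for the callback. */
#define CALLBACK_ALIGNMENT 0x10000ULL

/*
 * Hypothetical helper: the Length a racing reader would see if it treated
 * the 8-byte slot holding our callback pointer as the start of a
 * UNICODE_STRING. On little-endian x64, the first USHORT in memory is the
 * low 16 bits of the pointer value.
 */
static uint16_t length_seen_by_racing_reader(uint64_t callback_address)
{
    return (uint16_t)(callback_address & 0xFFFF);
}

/* Returns 1 if the address sits on a 64KB boundary, 0 otherwise. */
static int is_64kb_aligned(uint64_t callback_address)
{
    assert(callback_address != 0);
    return (callback_address % CALLBACK_ALIGNMENT) == 0;
}
```

Any address for which is_64kb_aligned() returns 1 yields a Length of 0, while an unaligned callback (say, one ending in 0x5010) would be read as a 0x5010-byte string.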

Of course, if we ever want to be able to unload our driver we also need to implement an unload routine that will change the symbolic link back to its original target. We saved the object in a global symlinkObj variable, and saved the original LinkTarget in a global origStr variable, so we can change everything back when we unload:

_Use_decl_annotations_
VOID
DriverUnload (
    _In_ PDRIVER_OBJECT DriverObject
    )
{
    UNREFERENCED_PARAMETER(DriverObject);

    symlinkObj->Flags &= ~OBJECT_SYMBOLIC_LINK_SANDBOXED;
    _MemoryBarrier();
    symlinkObj->LinkTarget = origStr;

    ObDereferenceObject(symlinkObj);
}

It’s important to first change Flags and only then the LinkTarget to avoid the same race as before. That being said, you probably noticed a second interesting line of code (and unusual in security PoC code) – the call to _MemoryBarrier(). You probably already know that compilers reserve the right to re-order any memory operation performed on non-volatile variables (or members), meaning that there’s no guarantee that the way we wrote these two lines of C would actually end up matching in assembly code.

To solve this, Visual C/C++ includes inline functions such as _ReadWriteBarrier(). However, modern processors themselves can also choose to re-order memory operations at a hardware level, meaning that these two writes could also happen in a different sequence (the same thing can sometimes happen for reads too). To solve the hardware re-ordering issue, we need to use a fence instruction, which is what _MemoryBarrier() does.
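For readers who want to play with this ordering outside of a driver, below is a rough portable analog in C11. It is only a sketch: fake_symlink, FAKE_SANDBOXED_FLAG, and restore_symlink are hypothetical stand-ins (not the real kernel layout or flag value), and atomic_thread_fence(memory_order_seq_cst) plays the role that _MemoryBarrier() plays in our unload routine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_SANDBOXED_FLAG 0x10u /* placeholder value for illustration */

/* Stand-in for the two fields we care about; not the real kernel layout. */
typedef struct fake_symlink {
    _Atomic uint32_t  flags;
    _Atomic uintptr_t link_target;
} fake_symlink;

static void restore_symlink(fake_symlink *link, uintptr_t original_target)
{
    assert(link != NULL);

    /* 1. Clear the flag so readers stop treating the target as a callback. */
    atomic_fetch_and_explicit(&link->flags, ~FAKE_SANDBOXED_FLAG,
                              memory_order_relaxed);

    /*
     * 2. Full fence: neither the compiler nor the CPU may move the store
     *    below ahead of the flag update above.
     */
    atomic_thread_fence(memory_order_seq_cst);

    /* 3. Only now put the original target back. */
    atomic_store_explicit(&link->link_target, original_target,
                          memory_order_relaxed);
}
```

In a strictly portable program the reading side would need a matching fence or acquire operation as well; the sketch only illustrates the writer's half of the story.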

Now we have a working callback function that gets called whenever anyone tries to access the C: volume. This method is not visible and will not be detected unless specifically searched for. WinObjEx will show this symbolic link as having no target, which will only look suspicious to someone looking for this technique (although some existing legitimate symbolic links, such as the ones in \KernelObjects, already look like this).

You can get the source code for this simple rootkit driver from our GitHub repo here.

But there is one more thing we want to achieve – getting the full path requested by the caller every time the symbolic link is being accessed. This would let us, for example, return different files or directories depending on who the caller is, as well as monitor accesses. Our callback does not receive this information, and none of us wanted to implement stack parsing to find it. As such, we started looking for a different option… unfortunately, it seemed our last blog post was over 40 pages, and while we heard this was useful for people taking entire afternoons off at certain organizations, we’ll break things up this time and see you in a future Part 2!

R.I.P ROP: CET Internals in Windows 20H1

A very exciting thing happened recently in the 19H1 (Version 1903) release of Windows 10 – parts of the Intel “Control-flow Enforcement Technology” (CET) implementation finally began, after years of discussion. More of this implementation is being added in every Windows release, and this year’s release, 20H1 (Version 2004), completes support for the User Mode Shadow Stack capabilities of CET, which will be released in Intel Tiger Lake CPUs.

As a reminder, Intel CET is a hardware-based mitigation that addresses the two types of control-flow integrity violations commonly used by exploits: forward-edge violations (indirect CALL and JMP instructions) and backward-edge violations (RET instructions).

While the forward-edge implementation is less interesting (as it is essentially a weaker form of clang-cfi, similar to Microsoft’s Control Flow Guard), the backward-edge implementation relies on a fundamental change in the ISA: the introduction of a new stack called the “Shadow Stack”, which now replicates the return addresses that are pushed on the stack by the CALL instruction, with the RET instruction now validating both the stack and shadow stack values and generating an INT #21 (Control Flow Protection Fault) in case of mismatch.

Because operating systems and compilers must sometimes support control flow sequences other than CALL/RET (such as exception unwinding and longjmp), the “Shadow Stack Pointer” (SSP) must sometimes be manipulated at the system level to match the required behavior and, in turn, validated to prevent this manipulation itself from becoming a potential bypass. In this post, we’ll cover how Windows achieves this.

Before diving deeper into how Windows manipulates and validates the shadow stack for threads, there are 2 parts of its implementation that must be first understood.  The first is the actual location and permissions of the SSP, and the second is the mechanism used to store/restore SSP when context switching between threads, as well as how modifications can be done to SSP when needed (such as during exception unwinding).

To explain these mechanisms, we’ll have to delve into an Intel CPU feature that was originally introduced by Intel in order to support “Advanced Vector eXtensions” (AVX) Instructions and first supported by Microsoft in Windows 7. And since adding support for this feature required a massive restructuring of the CONTEXT structure into an undocumented CONTEXT_EX structure (and the addition of documented and native APIs to manipulate it), we’ll have to talk about the internals of that too!

Finally, we’ll even have to go through some compiler and PE file format internals, as well as new process information classes, to cover additional subtleties and requirements for CET functionality on Windows. We hope the Table of Contents, below, will help you navigate this thorough coverage of these capabilities. Additionally, when relevant, annotated source code for the various newly introduced functions is available by clicking the function names, based off our associated GitHub repository.

XState Internals

The x86-x64 architecture class processors originally began with a simple set of registers which most security researchers are familiar with — general purpose registers (RAX, RCX), control registers (RIP/RSP, for example), floating point registers (XMM, YMM, ZMM), and some control, debug, and test registers. As more processor capabilities were added, however, new registers had to be defined, as well as specific processor state associated with these capabilities. And since many of these features are local to a thread, they must be saved and restored during context switches.

In response, Intel defined the “eXtended State” (XState) specification, which associates various processor states with bits in a “State Mask”, and introduces instructions such as XSAVE and XRSTOR to read and write the requested states from an “XSAVE Area”. Since this area is now a critical piece of CET register storage for each thread, and most people have largely been ignoring XSAVE support due to its original focus on floating point, AVX, and “Memory Protection eXtensions” (MPX) features, we thought an overview of the functionality and memory layout would be helpful to readers.

XSAVE Area

As mentioned, the XSAVE Area was originally used to store some of the new floating point functionality like AVX that had been added to processors by Intel, and to consolidate the existing x87 FPU and SSE states that were previously stored through the FXSAVE and FXRSTOR instructions. These first two legacy states were defined as part of the “Legacy XSAVE Area”, and any further processor registers (such as AVX) were added to an “Extended XSAVE Area”. In between, an “XSAVE Area Header” is used to describe which extended features are present through a state mask called XSTATE_BV.

At the same time, a new “eXtended Control Register” (XCR0) was added, which defines which states are supported by the operating system as part of the XSAVE functionality, and the XGETBV and XSETBV instructions were added to configure XCR0 (and potentially future XCRs as well). For example, operating systems can choose to program XCR0 not to contain the feature state bits for x87 FPU and SSE, meaning that they will save this information manually with legacy FXSAVE instructions, and only store extended feature state in their XSAVE Areas.

As the number of advanced register sets and capabilities grew (such as “Memory Protection Keys” (MPK), which added a “Protection Key Register User State” (PKRU)), newer processors introduced a distinction between user state and “Supervisor State”, which can only be modified by CPL0 code using XSAVES and XRSTORS, as well as “compaction” and “optimization” versions of the instructions (XSAVEC/XSAVEOPT), to complicate matters in Intel-typical fashion. A new “Model Specific Register” (MSR), called IA32_XSS, was added to define which states are supervisor-only.

The “optimized XSAVE” mechanism exists to ensure that only processor state which has actually been modified by another thread since the last context switch (if any) will actually be written in the XSAVE Area. An internal processor register, XINUSE, exists to track this information. When XSAVEOPT is used, the XSTATE_BV mask now includes only the bits corresponding to states which were actually saved, and not simply that of all of the states requested.

The “compacted XSAVE” mechanism, on the other hand, fixed a wasteful flaw in the XState design: as more and more extended features were added — such as AVX512 and “Intel Processor Trace” (IPT) — it meant that even for threads which did not use these capabilities, a sufficiently large XSAVE Area needed to be allocated, and written into (full of zeroes) by the processor. While optimized XSAVE would avoid these writes, it still meant that any extended features following large-yet-unused states would be at large offsets away from the base XSAVE Area buffer.

With XSAVEC, this problem is solved by only using space to save the XState features that are actually enabled (and in-use, as compaction implies optimization) by the current thread, and sequentially laying out each saved state in memory, without gaps in between (but potentially with a fixed 64-byte alignment, which is provided as part of an “Alignment Mask” through CPUID). The XSAVE Area Header shown earlier is now extended with a second state mask called XCOMP_BV, which indicates which of the requested state bits might be present in the compacted area. Note that unlike XSTATE_BV, this mask does not omit the state bits that were not part of XINUSE: it includes all possible bits that could’ve been compacted, so one must still check XSTATE_BV to determine which state areas are actually present. Finally, Bit 63 is always set in XCOMP_BV when the compacted instruction was used, as an indicator for which format the XSAVE Area has.

Thus, using the compacted vs. non-compacted format determines the internal layout and size of the XSAVE Area. The compacted format will only allocate memory in the XSAVE Area for processor features used by the thread, while the non-compacted one will allocate memory for all the processor features supported by the processor, but only populate the ones used by the thread. The diagram below shows an example of what the XSAVE Area will look like for the same thread when using one format vs. the other.
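As a rough illustration of the compacted layout, the sketch below computes where a given state component would land. The function name and parameters are our own invention for this example; it assumes the standard 512-byte legacy region followed by the 64-byte header, and takes per-feature sizes and the alignment mask as inputs (on a real system these would come from CPUID).

```c
#include <assert.h>
#include <stdint.h>

#define XSAVE_LEGACY_AREA_SIZE 512u          /* x87 FPU + SSE region */
#define XSAVE_HEADER_SIZE      64u           /* XSAVE Area Header */
#define XCOMP_BV_COMPACTED     (1ULL << 63)  /* format indicator bit */

/*
 * Hypothetical helper: offset of `feature` inside a compacted XSAVE Area.
 * sizes[i] is the size of state component i, and aligned_mask has bit i set
 * when component i must start on a 64-byte boundary.
 * Returns 0 if the area is not compacted or the feature is not in xcomp_bv.
 */
static uint32_t compacted_offset(uint64_t xcomp_bv, const uint32_t sizes[64],
                                 uint64_t aligned_mask, unsigned feature)
{
    uint32_t offset = XSAVE_LEGACY_AREA_SIZE + XSAVE_HEADER_SIZE;

    assert(feature >= 2 && feature < 63);
    if ((xcomp_bv & XCOMP_BV_COMPACTED) == 0 ||
        (xcomp_bv & (1ULL << feature)) == 0)
        return 0;

    for (unsigned i = 2; i <= feature; i++) {
        if ((xcomp_bv & (1ULL << i)) == 0)
            continue;                          /* not requested: no space used */
        if (aligned_mask & (1ULL << i))
            offset = (offset + 63u) & ~63u;    /* round up to 64 bytes */
        if (i == feature)
            break;
        offset += sizes[i];                    /* skip over earlier component */
    }
    return offset;
}
```

With a 0x100-byte feature 2 and 0x40-byte features 3 and 4, feature 4 sits at 0x380 when all three are requested, but moves down to 0x280 once feature 2 is left out of the mask.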

To summarize, which states the XSAVE*/XRSTOR* family of instructions will work with is a combination of

  1. What state bits the OS claims it supports in XCR0 (set using the XSETBV instruction)
  2. What state bits the caller stores in EDX:EAX when using the XSAVE instruction (Intel calls this the “instruction mask”)
  3. If using the non-privileged instructions, which state bits are not set in IA32_XSS
  4. On processors that support “Optimized XSAVE”, which state bits are set in XINUSE, an internal register that tracks the actual XState-related registers that have been used by the current thread since the last transition

Once these bits are masked together, the final set of resulting state bits are written by the XSAVE instruction into the header of the XSAVE Area in a field called the XSTATE_BV. In the case where “Compacted XSAVE” is used, the resulting state bits omitting bullet 4 (XINUSE) are written into the header of the XSAVE Area in the XCOMP_BV field. The diagram below shows the resulting masks.
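The masking described in this summary can be modeled in a few lines of C. This is a simplified sketch of the four items above for the non-privileged instructions, not a transcription of Intel's pseudo-code; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XCOMP_BV_COMPACTED (1ULL << 63)

typedef struct xsave_masks {
    uint64_t xstate_bv; /* states actually written */
    uint64_t xcomp_bv;  /* states that could be present; 0 if not compacted */
} xsave_masks;

/*
 * Hypothetical model of a user-mode save:
 *   xcr0             - item 1, what the OS enabled
 *   instruction_mask - item 2, EDX:EAX supplied by the caller
 *   ia32_xss         - item 3, supervisor-only states
 *   xinuse           - item 4, states touched by the thread
 */
static xsave_masks user_xsave_masks(uint64_t xcr0, uint64_t instruction_mask,
                                    uint64_t ia32_xss, uint64_t xinuse,
                                    int optimized, int compacted)
{
    xsave_masks result;
    uint64_t requested;

    assert((xcr0 & XCOMP_BV_COMPACTED) == 0); /* bit 63 is not a state bit */

    requested = xcr0 & instruction_mask & ~ia32_xss;

    /* Compaction implies optimization, so both paths apply XINUSE. */
    result.xstate_bv = (optimized || compacted) ? (requested & xinuse)
                                                : requested;

    /* XCOMP_BV ignores XINUSE and carries the format bit. */
    result.xcomp_bv = compacted ? (requested | XCOMP_BV_COMPACTED) : 0;
    return result;
}
```

For example, with XCR0 = 0x1F, an all-ones instruction mask, and a thread that only touched states 0 through 2, the model yields XSTATE_BV = 0x7 while XCOMP_BV still lists all five states plus Bit 63.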

XState Configuration

Because each processor has its own set of XState-enabled features, potential sizes, capabilities, and mechanisms, Intel exposes all of this information through various CPUID classes that an operating system should query when dealing with XState. Windows performs these queries at boot, and stores the information in an XSTATE_CONFIGURATION structure, which is shown below (documented in Winnt.h):

typedef struct _XSTATE_CONFIGURATION
{
    ULONG64 EnabledFeatures;
    ULONG64 EnabledVolatileFeatures;
    ULONG Size;
    union
    {
        ULONG ControlFlags;
        struct
        {
            ULONG OptimizedSave:1;
            ULONG CompactionEnabled:1;
        };
    };
    XSTATE_FEATURE Features[MAXIMUM_XSTATE_FEATURES];
    ULONG64 EnabledSupervisorFeatures;
    ULONG64 AlignedFeatures;
    ULONG AllFeatureSize;
    ULONG AllFeatures[MAXIMUM_XSTATE_FEATURES];
    ULONG64 EnabledUserVisibleSupervisorFeatures;
} XSTATE_CONFIGURATION, *PXSTATE_CONFIGURATION;

After filling out this data, the kernel saves this information in the KUSER_SHARED_DATA structure, which can be accessed through the SharedUserData variable and is located at 0x7FFE0000 on all Windows platforms.

For example, here is the output of our test 19H1 system, which supports both optimized and compacted forms of XSAVE, and has the x87 FPU (0), SSE (1), AVX (2) and MPX (3, 4) feature bits enabled.

dx ((nt!_KUSER_SHARED_DATA*)0x7ffe0000)->XState
    [+0x000] EnabledFeatures  : 0x1f [Type: unsigned __int64]
    [+0x008] EnabledVolatileFeatures : 0xf [Type: unsigned __int64]
    [+0x010] Size         	: 0x3c0 [Type: unsigned long]
    [+0x014] ControlFlags 	: 0x3 [Type: unsigned long]
    [+0x014 ( 0: 0)] OptimizedSave	: 0x1 [Type: unsigned long]
    [+0x014 ( 1: 1)] CompactionEnabled : 0x1 [Type: unsigned long]
    [+0x018] Features     	[Type: _XSTATE_FEATURE [64]]
    [+0x218] EnabledSupervisorFeatures : 0x0 [Type: unsigned __int64]
    [+0x220] AlignedFeatures  : 0x0 [Type: unsigned __int64]
    [+0x228] AllFeatureSize   : 0x3c0 [Type: unsigned long]
    [+0x22c] AllFeatures  	[Type: unsigned long [64]]
    [+0x330] EnabledUserVisibleSupervisorFeatures : 0x0 [Type: unsigned __int64]

In the Features array, the size and offset of each of these five features can be found:

dx -r2 (((nt!_KUSER_SHARED_DATA*)0x7ffe0000)->XState)->Features.Take(5)
    [0]          	[Type: _XSTATE_FEATURE]
        [+0x000] Offset       	: 0x0 [Type: unsigned long]
        [+0x004] Size         	: 0xa0 [Type: unsigned long]
    [1]          	[Type: _XSTATE_FEATURE]
        [+0x000] Offset       	: 0xa0 [Type: unsigned long]
        [+0x004] Size         	: 0x100 [Type: unsigned long]
    [2]          	[Type: _XSTATE_FEATURE]
        [+0x000] Offset       	: 0x240 [Type: unsigned long]
        [+0x004] Size         	: 0x100 [Type: unsigned long]
    [3]          	[Type: _XSTATE_FEATURE]
        [+0x000] Offset       	: 0x340 [Type: unsigned long]
        [+0x004] Size         	: 0x40 [Type: unsigned long]
    [4]          	[Type: _XSTATE_FEATURE]
        [+0x000] Offset       	: 0x380 [Type: unsigned long]
        [+0x004] Size         	: 0x40 [Type: unsigned long]

The last feature ends at offset 0x380 + 0x40 = 0x3C0, which is the value seen above in the Size and AllFeatureSize fields (the Size values alone only add up to 0x320; the rest is the XSAVE Area Header and reserved space in the legacy region). Note, however, that since this system supports the Compacted XSAVE capability, the offsets shown here are not relevant, and only the AllFeatures field is useful to the kernel, which contains the size of every feature, but not its offset (as this will be determined based on the compaction mask used in XCOMP_BV).
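As a quick sanity check of the numbers in the dump, this sketch computes where the last feature ends (the largest Offset + Size), which is 0x3C0. The raw Size values on their own only sum to 0x320, with the remaining 0xA0 bytes being the XSAVE Area Header and reserved space in the legacy region. The structure and function names are ours; the values are copied from the Features output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct xstate_feature {
    uint32_t offset;
    uint32_t size;
} xstate_feature;

/* The five entries from the Features dump. */
static const xstate_feature dumped_features[5] = {
    { 0x000, 0x0A0 },  /* x87 FPU */
    { 0x0A0, 0x100 },  /* SSE */
    { 0x240, 0x100 },  /* AVX */
    { 0x340, 0x040 },  /* MPX (3) */
    { 0x380, 0x040 },  /* MPX (4) */
};

/* Plain sum of the Size fields. */
static uint32_t sum_of_sizes(const xstate_feature *features, size_t count)
{
    uint32_t total = 0;

    assert(features != NULL);
    for (size_t i = 0; i < count; i++)
        total += features[i].size;
    return total;
}

/* Largest Offset + Size, i.e. the end of the non-compacted area. */
static uint32_t end_of_last_feature(const xstate_feature *features, size_t count)
{
    uint32_t end = 0;

    assert(features != NULL);
    for (size_t i = 0; i < count; i++) {
        uint32_t feature_end = features[i].offset + features[i].size;
        if (feature_end > end)
            end = feature_end;
    }
    return end;
}
```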

XState Policy

Unfortunately, even though a processor might claim to support a given XState feature, it often turns out that due to various hardware errata, certain specific processors may not fully, or correctly, support the feature after all. In order to handle this eventuality, Windows uses an XState Policy, which is information stored in the resource section of a Hardware Policy Driver that is normally called HwPolicy.sys.

As the Intel x86 architecture is a combination of multiple processor vendors all competing with variants of each other’s feature sets, the kernel must parse the XState policy and compare the current processor’s Vendor String and Microcode Version as well as its Signature, Features, and Extended Features (namely, RAX, RDX, and RCX from a CPUID 01h query), looking for a match in the policy.

This work is done at boot by the KiIntersectFeaturesWithPolicy function that’s called by KiInitializeXSave, which calls KiLoadPolicyFromImage to load the appropriate XState policy, calls KiGetProcessorInformation to get the CPU data mentioned earlier, and then validates each feature bit currently enabled in the XState Configuration through calls to KiIsXSaveFeatureAllowed.

These functions work with resource 101 in the HwPolicy.sys driver, which begins with the following data structure:

typedef struct _XSAVE_POLICY
{
    ULONG Version;
    ULONG Size;
    ULONG Flags;
    ULONG MaxSaveAreaLength;
    ULONGLONG FeatureBitmask;
    ULONG NumberOfFeatures;
    XSAVE_FEATURE Features[1];
} XSAVE_POLICY, *PXSAVE_POLICY;

For example, on our 19H1 system, the contents (which we extracted with Resource Hacker), were as follows:

dx @$policy = (_XSAVE_POLICY*)0x253d0e90000
[+0x000] Version       : 0x3 [Type: unsigned long]
[+0x004] Size          : 0x2fd8 [Type: unsigned long]
[+0x008] Flags         : 0x9 [Type: unsigned long]
[+0x00c] MaxSaveAreaLength : 0x2000 [Type: unsigned long]
[+0x010] FeatureBitmask   : 0x7fffffffffffffff [Type: unsigned __int64]
[+0x018] NumberOfFeatures : 0x3f [Type: unsigned long]
[+0x020] Features      [Type: _XSAVE_FEATURE [1]]

For each XSAVE_FEATURE, an offset to an XSAVE_VENDORS structure is found, which contains an array of XSAVE_VENDOR structures, each with a CPU Vendor String (for now, each seems to be either “GenuineIntel”, “AuthenticAMD”, or “CentaurHauls”), and an offset to an XSAVE_CPU_ERRATA structure. For example, our 19H1 test system had the following information for Feature 0:

dx -r4 @$vendor = (XSAVE_VENDORS*)((int)@$policy->Features[0].Vendors + 0x253d0e90000)
[+0x000] NumberOfVendors  : 0x3 [Type: unsigned long]
[+0x008] Vendor        [Type: _XSAVE_VENDOR [1]]
    [0]           [Type: _XSAVE_VENDOR]
        [+0x000] VendorId      [Type: unsigned long [3]]
            [0]           : 0x756e6547 [Type: unsigned long]
            [1]           : 0x49656e69 [Type: unsigned long]
            [2]           : 0x6c65746e [Type: unsigned long]
[+0x010] SupportedCpu  [Type: _XSAVE_SUPPORTED_CPU]
[+0x000] CpuInfo       [Type: XSAVE_CPU_INFO]
[+0x020] CpuErrata     : 0x4c0 [Type: XSAVE_CPU_ERRATA *]
[+0x020] Unused        : 0x4c0 [Type: unsigned __int64]
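The three VendorId values in this dump are simply the vendor string packed four characters at a time, lowest byte first. A small sketch (the helper name is ours) that turns them back into text:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: unpack three 32-bit VendorId values into a
 * NUL-terminated 12-character string. Bytes are extracted with shifts,
 * so the result does not depend on the host's endianness.
 */
static void vendor_id_to_string(const uint32_t vendor_id[3], char out[13])
{
    assert(vendor_id != NULL && out != NULL);

    for (int i = 0; i < 3; i++) {
        for (int b = 0; b < 4; b++)
            out[i * 4 + b] = (char)((vendor_id[i] >> (8 * b)) & 0xFF);
    }
    out[12] = '\0';
}
```

Feeding it 0x756e6547, 0x49656e69, and 0x6c65746e from the output above produces “GenuineIntel”.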

Finally, each XSAVE_CPU_ERRATA structure contains the matching processor information data that corresponds to a known erratum which prevents the specified XState feature from being supported. For example, in our test system, the first erratum from the offset above was:

dx -r3 @$errata = (XSAVE_CPU_ERRATA*)((int)@$vendor->Vendor[0].SupportedCpu.CpuErrata + 0x253d0e90000)
    [+0x000] NumberOfErrata   : 0x1 [Type: unsigned long]
    [+0x008] Errata       	[Type: XSAVE_CPU_INFO [1]]
        [0]          	[Type: XSAVE_CPU_INFO]
            [+0x000] Processor    	: 0x0 [Type: unsigned char]
            [+0x002] Family       	: 0x6 [Type: unsigned short]
            [+0x004] Model        	: 0xf [Type: unsigned short]
            [+0x006] Stepping     	: 0xb [Type: unsigned short]
            [+0x008] ExtendedModel	: 0x0 [Type: unsigned short]
            [+0x00c] ExtendedFamily   : 0x0 [Type: unsigned long]
            [+0x010] MicrocodeVersion : 0x0 [Type: unsigned __int64]
            [+0x018] Reserved     	: 0x0 [Type: unsigned long]

A tool which dumps your system’s hardware policy for all XState features is available on our GitHub here. For now, only one erratum appears in the entire policy (the one shown above).
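To relate the errata entry to what CPUID reports, the sketch below decodes the processor signature (EAX from CPUID leaf 01h) into the same fields and compares them. The field extraction follows the standard signature layout; matches_erratum is a hypothetical exact-equality check of our own, and we make no claim that KiIsXSaveFeatureAllowed compares exactly these fields (it may, for instance, also consider the microcode version).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct cpu_signature {
    uint16_t family;
    uint16_t model;
    uint16_t stepping;
    uint16_t extended_model;
    uint32_t extended_family;
} cpu_signature;

/* Split CPUID.01h:EAX into its documented bit fields. */
static cpu_signature decode_cpuid_signature(uint32_t eax)
{
    cpu_signature s;

    s.stepping        = (uint16_t)(eax & 0xF);          /* bits 3:0   */
    s.model           = (uint16_t)((eax >> 4) & 0xF);   /* bits 7:4   */
    s.family          = (uint16_t)((eax >> 8) & 0xF);   /* bits 11:8  */
    s.extended_model  = (uint16_t)((eax >> 16) & 0xF);  /* bits 19:16 */
    s.extended_family = (eax >> 20) & 0xFF;             /* bits 27:20 */
    return s;
}

/* Hypothetical matcher: every decoded field must be identical. */
static int matches_erratum(const cpu_signature *cpu, const cpu_signature *erratum)
{
    assert(cpu != NULL && erratum != NULL);

    return cpu->family == erratum->family &&
           cpu->model == erratum->model &&
           cpu->stepping == erratum->stepping &&
           cpu->extended_model == erratum->extended_model &&
           cpu->extended_family == erratum->extended_family;
}
```

A signature of 0x6FB decodes to exactly the Family 6, Model 0xF, Stepping 0xB entry in the dump, so it would match under this model, while something like 0x906EA (Extended Model 9) would not.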

Finally, the following optional loader command line options (and respective BCD settings) can be used to further customize XState capabilities:

  1. The XSAVEPOLICY=n load option, set through the xsavepolicy BCD option, which sets KeXSavePolicyId, indicating which of the XState policies to load.
  2. The XSAVEREMOVEFEATURE=n load option, set through the xsaveremovefeature BCD option, which sets KeTestRemovedFeatureMask. This will be later parsed by KiInitializeXSave and elide the specified state bits from the support. Note that State 0 (x87 FPU) and State 1 (SSE) cannot be removed this way.
  3. The XSAVEDISABLE load option, set through the xsavedisable BCD option, which sets KeTestDisableXsave, and causes KiInitializeXSave to set all XState related configuration data to 0, disabling the whole XState feature entirely.

CET XSAVE Area Format

As part of its implementation of CET, Intel defined two new bits in the XState standard, called XSTATE_CET_U (11) and XSTATE_CET_S (12), corresponding to user and supervisor state, respectively. The first state is a 16-byte data structure which MSDN documents as XSAVE_CET_U_FORMAT containing the IA32_U_CET MSR (which is where the “Shadow Stack Enable” flag is configured) and the IA32_PL3_SSP MSR (where the “Privilege Level 3 SSP” is stored). The second, which does not yet have an MSDN definition, includes the IA32_PL0/1/2_SSP MSRs.

typedef struct _XSAVE_CET_U_FORMAT
{
    ULONG64 Ia32CetUMsr;
    ULONG64 Ia32Pl3SspMsr;
} XSAVE_CET_U_FORMAT, *PXSAVE_CET_U_FORMAT;

typedef struct _XSAVE_CET_S_FORMAT
{
    ULONG64 Ia32Pl0SspMsr;
    ULONG64 Ia32Pl1SspMsr;
    ULONG64 Ia32Pl2SspMsr;
} XSAVE_CET_S_FORMAT, *PXSAVE_CET_S_FORMAT;

As the field names suggest, CET-related “registers” are actually values stored in respective MSRs, which can normally only be accessed through RDMSR and WRMSR privileged instructions in Ring 0. However, unlike most MSRs which store processor-global data, CET can be enabled on a per-thread basis, and the shadow stack pointer is also obviously per-thread. For these reasons, CET-related data must be made part of the XState functionality such that operating systems can correctly handle thread switches.
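As a small illustration of how this state might be consumed, here is a user-mode sketch that mirrors the 16-byte XSAVE_CET_U_FORMAT layout and tests the “Shadow Stack Enable” flag. The lowercase type and helper names are ours, and the bit position (bit 0 of IA32_U_CET) is taken from Intel's documentation of the MSR rather than from any Windows header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same two 64-bit fields as XSAVE_CET_U_FORMAT. */
typedef struct xsave_cet_u_format {
    uint64_t ia32_cet_u_msr;    /* IA32_U_CET */
    uint64_t ia32_pl3_ssp_msr;  /* IA32_PL3_SSP */
} xsave_cet_u_format;

/* Assumption: SH_STK_EN is bit 0 of IA32_U_CET. */
#define CET_SH_STK_EN 0x1ULL

/* Returns 1 if the saved state says user-mode shadow stacks are on. */
static int user_shadow_stack_enabled(const xsave_cet_u_format *state)
{
    assert(state != NULL);
    return (state->ia32_cet_u_msr & CET_SH_STK_EN) != 0;
}
```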

Since CET registers are basically MSRs which can normally only be modified by kernel code, they are not accessible through the CPL3 XSAVE/XRSTOR instructions and their respective state bits are always set to 1 in the IA32_XSS MSR. However, what makes things harder is the fact that the operating system cannot completely block user-mode code from modifying SSP. User-mode code might legitimately need to update the SSP as part of exception handling, unwinding, setjmp/longjmp, or specific functionality such as Windows’ “Fiber” mechanism.

As such, operating systems need to provide a way for threads to modify CET state in XState through a system call, much like Windows provides SetThreadContext as a mechanism to update certain protected CPU registers such as CS and DR7, as long as certain rules are met. Therefore, in the next section we’ll see how the CONTEXT structure evolved into the CONTEXT_EX structure on more modern Windows versions in order to support XState-related information, and how CET-specific handling had to be added for legitimate exception-related scenarios, while also avoiding malicious control-flow attacks through corrupted CONTEXTs.

CONTEXT_EX Internals

In order to support the increasing number of registers that have to be saved on every context switch, new versions of Windows have the CONTEXT_EX structure, in addition to the legacy CONTEXT structure. This was needed due to the fact that CONTEXT is a fixed-size structure, while XSAVE has introduced the need for dynamically-sized processor state data that is dependent on the thread, processor, and even machine configuration policy.

CONTEXT_EX Structure

Unfortunately, although now used all over the kernel and user-mode exception handling functionality, the CONTEXT_EX structure is largely undocumented, save for the accidental release of some information in the Windows 7 header files and some Intel reference code (which might suggest Intel actually is responsible for defining this abomination). Simply take a look at this comment block and tell us if you can understand anything:

//
// This structure specifies an offset (from the beginning of CONTEXT_EX
// structure) and size of a single chunk of an extended context structure.
//
// N.B. Offset may be negative.
//
typedef struct _CONTEXT_CHUNK
{
    LONG Offset;
    DWORD Length;
} CONTEXT_CHUNK, *PCONTEXT_CHUNK;

//
// CONTEXT_EX structure is an extension to CONTEXT structure. It defines
// a context record as a set of disjoint variable-sized buffers (chunks)
// each containing a portion of processor state. Currently there are only
// two buffers (chunks) are defined:
//
// - Legacy, that stores traditional CONTEXT structure;
// - XState, that stores XSAVE save area buffer starting from
// XSAVE_AREA_HEADER, i.e. without the first 512 bytes.
//
// There a few assumptions exists that simplify conversion of PCONTEXT
// pointer to PCONTEXT_EX pointer.
//
// 1. APIs that work with PCONTEXT pointers assume that CONTEXT_EX is
// stored right after the CONTEXT structure. It is also assumed that
// CONTEXT_EX is present if and only if corresponding CONTEXT_XXX
// flags are set in CONTEXT.ContextFlags.
//
// 2. CONTEXT_EX.Legacy is always present if CONTEXT_EX structure is
// present. All other chunks are optional.
//
// 3. CONTEXT.ContextFlags unambigiously define which chunks are
// present. I.e. if CONTEXT_XSTATE is set CONTEXT_EX.XState is valid.
//
typedef struct _CONTEXT_EX
{
    //
    // The total length of the structure starting from the chunk with
    // the smallest offset. N.B. that the offset may be negative.
    //
    CONTEXT_CHUNK All;

    //
    // Wrapper for the traditional CONTEXT structure. N.B. the size of
    // the chunk may be less than sizeof(CONTEXT) is some cases (when
    // CONTEXT_EXTENDED_REGISTERS is not set on x86 for instance).
    //
    CONTEXT_CHUNK Legacy;

    //
    // CONTEXT_XSTATE: Extended processor state chunk. The state is
    // stored in the same format XSAVE operation strores it with
    // exception of the first 512 bytes, i.e. staring from
    // XSAVE_AREA_HEADER. The lower two bits corresponding FP and
    // SSE state must be zero.
    //
    CONTEXT_CHUNK XState;
} CONTEXT_EX, *PCONTEXT_EX;

#define CONTEXT_EX_LENGTH ALIGN_UP_BY(sizeof(CONTEXT_EX), STACK_ALIGN)

//
// These macros make context chunks manupulations easier.
//

So while these headers do attempt to explain the layout of the CONTEXT_EX structure, the text is obtuse enough (and full of English errors) that it took us several rounds of arguments and shots until we could visualize it, and felt a diagram might be helpful.

As shown in the diagram, the CONTEXT_EX structure is always at the end of the CONTEXT structure, and has 3 fields of type CONTEXT_CHUNK called All, Legacy, and XState. Each of these defines an offset and a length for the data associated with it, and various RTL_ macros exist to retrieve the appropriate data pointer.

The Legacy field refers to the beginning of the original CONTEXT structure (although the Length might be smaller on x86 if CONTEXT_EXTENDED_REGISTERS is not supplied). The All field refers to the beginning of the original CONTEXT structure as well, but its Length describes the totality of all the data, including the CONTEXT_EX itself and padding/alignment space required for the XSAVE Area. Finally, the XState field refers to the XSAVE_AREA_HEADER structure (which then defines the state mask of which state bits are enabled and thus whose data is present) and the length of the entire XSAVE Area. Due to this layout, it’s important to note that All and Legacy will have negative offsets.
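Since the negative offsets are the part that tends to trip people up, here is a self-contained mock of the layout. Everything in it is hypothetical: the lowercase structures only mimic CONTEXT_CHUNK and CONTEXT_EX, chunk_pointer is our own stand-in for the RTL_ macros, and the sizes passed in are arbitrary illustration values rather than what Windows would compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct context_chunk {
    int32_t  offset;  /* relative to the start of context_ex; may be negative */
    uint32_t length;
} context_chunk;

typedef struct context_ex {
    context_chunk all;
    context_chunk legacy;
    context_chunk xstate;
} context_ex;

/* Hypothetical helper: resolve a chunk to a pointer. */
static void *chunk_pointer(context_ex *ex, const context_chunk *chunk)
{
    assert(ex != NULL && chunk != NULL);
    return (unsigned char *)ex + chunk->offset;
}

/*
 * Builds a mock record: [legacy CONTEXT][context_ex][gap][XSAVE header...].
 * Returns the context_ex and stores the allocation base in *base_out
 * (the caller frees it). Returns NULL if the allocation fails.
 */
static context_ex *build_mock_record(uint32_t legacy_size, uint32_t gap,
                                     uint32_t xstate_size,
                                     unsigned char **base_out)
{
    uint32_t xstate_offset = (uint32_t)sizeof(context_ex) + gap;
    unsigned char *base;
    context_ex *ex;

    assert(base_out != NULL);
    assert(legacy_size % _Alignof(context_ex) == 0);

    base = calloc(1, (size_t)legacy_size + xstate_offset + xstate_size);
    if (base == NULL)
        return NULL;

    ex = (context_ex *)(base + legacy_size);
    ex->legacy.offset = -(int32_t)legacy_size;  /* points back at the CONTEXT */
    ex->legacy.length = legacy_size;
    ex->xstate.offset = (int32_t)xstate_offset; /* forward, past the padding */
    ex->xstate.length = xstate_size;
    ex->all.offset    = ex->legacy.offset;      /* same start as Legacy */
    ex->all.length    = legacy_size + xstate_offset + xstate_size;

    *base_out = base;
    return ex;
}
```

Resolving Legacy or All through chunk_pointer lands back at the start of the buffer, while XState resolves to a location after the context_ex and its padding.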

Since all of this math is hard, Ntdll.dll exports various APIs to simplify building, reading, copying, and otherwise manipulating the various data that is stored in a CONTEXT_EX (some, but not all, of these APIs are internally used by Ntoskrnl.exe, but none are exported). In turn, KernelBase.dll exports documented Win32 functions which internally use these capabilities.

Initializing a CONTEXT_EX

First, callers should figure out how much memory to allocate in order to store a CONTEXT_EX, which can be done by using the following API:

NTSYSAPI
ULONG
NTAPI
RtlGetExtendedContextLength (
    _In_ ULONG ContextFlags,
    _Out_ PULONG ContextLength
);

Callers are expected to supply the appropriate CONTEXT_XXX flags to specify which registers they intend to save (namely CONTEXT_XSTATE, as otherwise using a CONTEXT_EX does not really buy much). This API then reads SharedUserData.XState.EnabledFeatures and SharedUserData.XState.EnabledUserVisibleSupervisorFeatures and passes the union of all the bits to an extended function (also exported) shown below.

NTSYSAPI
ULONG
NTAPI
RtlGetExtendedContextLength2 (
    _In_ ULONG ContextFlags,
    _Out_ PULONG ContextLength,
    _In_ ULONG64 XStateCompactionMask
);

Note how this newer API allows manually specifying which XState states to actually save, instead of grabbing all enabled features from the XState Configuration in the Shared User Data. This results in a CONTEXT_EX structure that will be smaller and won’t contain enough space for all possible XState State Data, so future usage of this CONTEXT_EX should make sure to never leverage XState State Bits outside the specified mask.

Next, a caller would allocate memory for the CONTEXT_EX (in most cases Windows will use alloca() to avoid memory exhaustion failures in exception paths) and use one of these two APIs:

NTSYSAPI
ULONG
NTAPI
RtlInitializeExtendedContext (
    _Out_ PVOID Context,
    _In_ ULONG ContextFlags,
    _Out_ PCONTEXT_EX* ContextEx
);

NTSYSAPI
ULONG
NTAPI
RtlInitializeExtendedContext2 (
    _Out_ PVOID Context,
    _In_ ULONG ContextFlags,
    _Out_ PCONTEXT_EX* ContextEx,
    _In_ ULONG64 XStateCompactionMask
);

Just like before, the newer API allows manually specifying which XState states to save in their compacted form, otherwise all features available (based on SharedUserData) are assumed to be present. Obviously, it is expected that the caller specifies the same ContextFlags as in the call to RtlGetExtendedContextLength(2), to make sure that the context structure is of the correct size as was allocated. In return, the caller now receives a pointer to the CONTEXT_EX structure, which is expected to follow the input CONTEXT buffer.

Once a CONTEXT_EX exists, a caller would likely first be interested in obtaining the legacy CONTEXT structure back from it (without making assumptions on sizes), which can be done with this next API:

NTSYSAPI
PCONTEXT
NTAPI
RtlLocateLegacyContext (
    _In_ PCONTEXT_EX ContextEx,
    _Out_opt_ PULONG Length
);

As mentioned above, however, these are the undocumented and internal APIs that are exposed by the NT layer of Windows. Legitimate Win32 applications would instead simplify their usage of XState-compatible CONTEXT structures by using the following function(s) instead:

WINBASEAPI
BOOL
WINAPI
InitializeContext (
    _Out_writes_bytes_opt_(*ContextLength) PVOID Buffer,
    _In_ DWORD ContextFlags,
    _Out_ PCONTEXT* Context,
    _Inout_ PDWORD ContextLength
);

WINBASEAPI
BOOL
WINAPI
InitializeContext2 (
    _Out_writes_bytes_opt_(*ContextLength) PVOID Buffer,
    _In_ DWORD ContextFlags,
    _Out_ PCONTEXT* Context,
    _Inout_ PDWORD ContextLength,
    _In_ ULONG64 XStateCompactionMask
);

These two APIs behave similarly to a combination of using the undocumented APIs: when callers first pass in NULL as the Buffer and Context parameters, the function returns the required length in ContextLength, which callers should allocate from memory. On the second attempt, callers pass in the allocated pointer in Buffer, and receive a pointer to the CONTEXT structure in Context without any knowledge of the underlying CONTEXT_EX structure.

Controlling XState Feature Masks in CONTEXT_EX

In order to access XSTATE_BV (the extended feature mask), which is deeply embedded in the Mask field of the XSAVE_AREA_HEADER of the CONTEXT_EX, the system exports two APIs: one for easily checking which XState features are enabled in the CONTEXT_EX, and a corresponding one for modifying the XState mask.

Note, however, that Windows never stores x87 FPU (0) and SSE (1) states in the XSAVE Area, and instead uses the FXSAVE instruction, meaning that the XSAVE Area will never contain the Legacy Area, and immediately start with the XSAVE_AREA_HEADER. Due to this, the Get API will always mask the bottom 2 bits out. The Set API will, in addition, also make sure that the specified feature is present in the EnabledFeatures of the XState Configuration.

Keep in mind that if a hardcoded compaction mask was specified in InitializeContext2 (or the internal native APIs), the Set API should not be used other than to elide existing state bits (since adding a new bit would imply additional, non-initialized out-of-bounds state data in the CONTEXT_EX, which would’ve already been pre-allocated without this data).

NTSYSAPI
ULONG64
NTAPI
RtlGetExtendedFeaturesMask (
    _In_ PCONTEXT_EX ContextEx
);

NTSYSAPI
ULONG64
NTAPI
RtlSetExtendedFeaturesMask (
    _In_ PCONTEXT_EX ContextEx,
    _In_ ULONG64 FeatureMask
);
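The masking rules described before these prototypes can be modeled in a few lines of portable C. This is only a sketch of the described semantics, not the Ntdll.dll implementation: the function names are hypothetical, and the `enabled_features` parameter is an assumed stand-in for the EnabledFeatures value of the XState Configuration.

```c
#include <stdint.h>

/* State bits 0 (x87 FPU) and 1 (SSE) live in the Legacy CONTEXT, not in the XSAVE Area. */
#define MODEL_LEGACY_STATE_BITS 0x3ULL

/* Model of the Get API: report the header mask with the two legacy bits removed. */
uint64_t model_get_features_mask(uint64_t xsave_header_mask)
{
    return xsave_header_mask & ~MODEL_LEGACY_STATE_BITS;
}

/* Model of the Set API: same masking, plus dropping any bit the system has not enabled. */
uint64_t model_set_features_mask(uint64_t requested, uint64_t enabled_features)
{
    return requested & enabled_features & ~MODEL_LEGACY_STATE_BITS;
}
```

With a header mask of 0x7 (x87, SSE, and AVX), the Get model reports only 0x4, the AVX bit.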

The documented form of these APIs is as follows:

WINBASEAPI
BOOL
WINAPI
GetXStateFeaturesMask (
    _In_ PCONTEXT Context,
    _Out_ PDWORD64 FeatureMask
);

WINBASEAPI
BOOL
WINAPI
SetXStateFeaturesMask (
    _In_ PCONTEXT Context,
    _In_ DWORD64 FeatureMask
);

Locating XState Features in a CONTEXT_EX

Because of the complexity of the CONTEXT_EX structure, as well as the fact that XState features might be present in either compacted or non-compacted form, and that their presence is also dependent on the various state masks described earlier (especially if optimized XSAVE is supported), callers need a library function in order to quickly and easily obtain a pointer to the relevant state data in the XSAVE Area within the CONTEXT_EX.

Currently two such functions exist, shown below, with RtlLocateExtendedFeature being just a wrapper around RtlLocateExtendedFeature2, which supplies it with a pointer to the SharedUserData.XState as the Configuration parameter. As both are exported, callers can also manually specify their own custom XState Configuration in the latter API if they so choose.

NTSYSAPI
PVOID
NTAPI
RtlLocateExtendedFeature (
    _In_ PCONTEXT_EX ContextEx,
    _In_ ULONG FeatureId,
    _Out_opt_ PULONG Length
);

NTSYSAPI
PVOID
NTAPI
RtlLocateExtendedFeature2 (
    _In_ PCONTEXT_EX ContextEx,
    _In_ ULONG FeatureId,
    _In_ PXSTATE_CONFIGURATION Configuration,
    _Out_opt_ PULONG Length
);

Both of the two functions receive a CONTEXT_EX structure and an ID for a requested feature, and parse the XState Configuration data in order to return a pointer for where the feature is stored in the XSAVE Area. Note that they don’t validate or return any actual value for the specified feature, which is up to the caller.

To find the pointer, RtlLocateExtendedFeature2 does the following:

  • Makes sure that the Feature ID is at least 2 (since x87 FPU and SSE states, bits 0 and 1, are never saved through XSAVE by Windows) and below 64 (the highest possible XState feature bit)

  • Gets the XSAVE_AREA_HEADER from CONTEXT_EX + CONTEXT_EX.XState.Offset

  • Reads the Configuration->ControlFlags.CompactionEnabled flag to know if using compaction or not

  • If using the non-compacted format:

    • Reads Configuration->Features[n].Offset and .Size to learn the offset and size of the requested feature in the XSAVE Area

  • If using the compacted format:

    • Reads the CompactionMask from the XSAVE_AREA_HEADER (corresponding to XCOMP_BV) and checks if it contains the requested feature

    • Reads Configuration->AllFeatures to learn the sizes of all the enabled states whose state bit comes before the requested feature ID, and calculates the offset of the requested feature by adding up these sizes, aligning the beginning of each previous state area to 64 bytes if the corresponding bit is set in Configuration->AlignedFeatures, and then finally aligning the start of the area for the specified feature ID if needed as well

    • Reads the size of the requested feature from Configuration->AllFeatures[n]

  • Locates the feature in the XSAVE Area based on its computed offset from above and returns a pointer to it, optionally alongside its respective size in the output Length variable.

This means that to find the address of a certain feature with the non-compacted format, it’s enough to check in SharedUserData which features are supported by the processor. In the compacted format however, it’s impossible to rely on the offsets in SharedUserData, making it necessary to also check which features are enabled on the thread, and to calculate the right offset for the feature based on the sizes of all the previous features.
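The steps above can be sketched in portable C. This is a simplified model under stated assumptions, not the real RtlLocateExtendedFeature2: the `model_xstate_config` type and its field names are hypothetical stand-ins for the XState Configuration, offsets are returned relative to the start of the XSAVE header, and in the compacted case the first state is assumed to begin right after the 0x40-byte header.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_FEATURES      64
#define MODEL_XSAVE_HEADER_SIZE 0x40u   /* size of the XSAVE_AREA_HEADER */

/* Hypothetical, simplified stand-in for the XState Configuration. */
typedef struct {
    int      compaction_enabled;            /* ControlFlags.CompactionEnabled */
    uint64_t aligned_features;              /* bits whose area starts 64-byte aligned */
    uint32_t offset[MODEL_MAX_FEATURES];    /* non-compacted offsets */
    uint32_t size[MODEL_MAX_FEATURES];      /* per-feature sizes */
} model_xstate_config;

/* Returns the feature's offset from the start of the XSAVE header, or -1. */
long model_locate_feature(const model_xstate_config *cfg,
                          uint64_t compaction_mask,
                          unsigned feature_id,
                          uint32_t *length)
{
    uint32_t offset;

    /* Bits 0 and 1 (x87 FPU, SSE) are never stored in the XSAVE Area. */
    if (feature_id < 2 || feature_id >= MODEL_MAX_FEATURES)
        return -1;

    if (!cfg->compaction_enabled) {
        /* Non-compacted: the configuration already knows the offset. */
        offset = cfg->offset[feature_id];
    } else {
        /* Compacted: the feature must be present in the compaction mask. */
        if (!(compaction_mask & (1ULL << feature_id)))
            return -1;

        /* Add up the sizes of every present state that precedes the feature. */
        offset = MODEL_XSAVE_HEADER_SIZE;
        for (unsigned i = 2; i < feature_id; i++) {
            if (!(compaction_mask & (1ULL << i)))
                continue;
            if (cfg->aligned_features & (1ULL << i))
                offset = (offset + 63u) & ~63u;
            offset += cfg->size[i];
        }

        /* Finally align the requested feature itself, if required. */
        if (cfg->aligned_features & (1ULL << feature_id))
            offset = (offset + 63u) & ~63u;
    }

    if (length != NULL)
        *length = cfg->size[feature_id];
    return (long)offset;
}
```

With a 0x100-byte AVX state (bit 2) as the only preceding feature, a later feature lands at 0x40 + 0x100 = 0x140 in this model.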

In legitimate Win32 applications, a different API is used, which internally calls the native API above, but with some pre-processing. Since state bit 0 and 1 are never saved as part of the XSAVE Area in the CONTEXT_EX, the Win32 API handles these two feature bits by grabbing them from the appropriate Legacy CONTEXT fields, namely FltSave for XSTATE_LEGACY_FLOATING_POINT and Xmm0 for XSTATE_LEGACY_SSE.

WINBASEAPI
PVOID
WINAPI
LocateXStateFeature (
    _In_ PCONTEXT Context,
    _In_ DWORD FeatureId,
    _Out_opt_ PDWORD Length
);

Example Usage and Output

In order to make sense out of the XState Internals, especially when combined with the CONTEXT_EX data structure, we’ve written a simple test program, available on our GitHub here. This utility demonstrates some of the API usage as well as the various offsets, sizes, and behaviors involved. Here’s the output of the program (which uses AVX registers) on a system with AVX, MPX, and Intel PT:

Among other things, note how the Legacy CONTEXT is at a negative offset, as expected, and how even though the system supports the x87 FPU State (0) and SSE State (1), the XSAVEBV does not contain these bits as they are instead saved in the Legacy CONTEXT area (and hence, note the negative offsets of their associated state data). Following the XSAVE Header (itself at offset 0x30) which is 0x40 bytes, note that the AVX State (2) starts at offset 0x70 as the math would suggest.

CONTEXT_EX Validation

Since user-mode APIs can construct a CONTEXT_EX which eventually gets processed by the kernel and modifies privileged parts of the XSAVE area (namely, the CET state data), Windows must guard against undesirable modifications that can be done through APIs which accept a CONTEXT_EX, such as:

  • NtContinue, which is used to resume after an exception, handle longjmp CRT functionality, as well as perform stack unwinding
  • NtRaiseException, which is used to inject an exception into an existing thread
  • NtQueueUserApc, which is used to hijack execution flow of an existing thread
  • NtSetContextThread, which is used to modify the processor registers/state of an existing thread

As any of these system calls could cause the kernel to modify either the IA32_PL3_SSP or the IA32_CET_U MSRs, as well as directly modify RIP to an unexpected target, Windows must validate that the passed-in CONTEXT_EX does not violate CET guarantees.

We’ll soon cover how this is done to validate the SSP in 19H1 and the addition of the RIP validation in 20H1. First though, a small refactor had to be done to reduce the potential for misusing NtContinue: the introduction of the NtContinueEx function.

NtContinueEx and KCONTINUE_ARGUMENT

As enumerated above, the functionality of NtContinue is used in a number of situations, and for CET to be resilient in the face of an API that allows arbitrary changes to processor state, finer-grained control had to be added to the interface. This was done through the creation of a new enumeration called KCONTINUE_TYPE, which is present in a KCONTINUE_ARGUMENT data structure that must now be passed to the enhanced version of NtContinue: NtContinueEx.

This data structure also contains a new ContinueFlags field, which replaces the original TestAlert argument of NtContinue with the flag CONTINUE_FLAG_RAISE_ALERT (0x1), while also introducing a new CONTINUE_FLAG_BYPASS_CONTEXT_COPY (0x2) flag which directly delivers an APC with the new TrapFrame. This is an optimization which was previously implemented by checking if the CONTEXT record pointer was at a specific location in the user-stack, which made the function assume it was being used as part of User Mode APC delivery. Callers desiring this behavior must now explicitly set the flag in ContinueFlags instead.

Note that while the old interface continues to be supported for legacy reasons, it internally calls NtContinueEx which recognizes the input parameter as the BOOLEAN TestAlert parameter, and not a KCONTINUE_ARGUMENT. Such a case is treated as a KCONTINUE_UNWIND for purposes of the new interface.

As part of this refactor, the following four possible types exist:

  • KCONTINUE_UNWIND – This is used by legacy callers of NtContinue, such as RtlRestoreContext and LdrInitializeThunk, which is used when unwinding from exceptions.

  • KCONTINUE_RESUME – This is used by KiInitializeUserApc when building the KCONTINUE_ARGUMENT structure on the user mode stack that KiUserApcDispatcher will run on before calling NtContinueEx again.

  • KCONTINUE_LONGJUMP – This is used by RtlContinueLongJump which is called by RtlRestoreContext if the exception code in the exception record is STATUS_LONGJUMP.

  • KCONTINUE_SET – This is never passed to NtContinueEx directly, but rather used when calling KeVerifyContextIpForUserCet from within PspGetSetContextInternal in response to an NtSetContextThread API.
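To make the relationship between the legacy and the new interface more concrete, here is a small model in portable C. All `model_` names are hypothetical, the structure layout is reduced to the two fields discussed above, and the rule used to tell a legacy BOOLEAN apart from a pointer (treating the values 0 and 1 as TestAlert) is an assumption made for illustration only.

```c
#include <stdint.h>

typedef enum {
    MODEL_KCONTINUE_UNWIND,
    MODEL_KCONTINUE_RESUME,
    MODEL_KCONTINUE_LONGJUMP,
    MODEL_KCONTINUE_SET
} model_kcontinue_type;

#define MODEL_CONTINUE_FLAG_RAISE_ALERT         0x1u
#define MODEL_CONTINUE_FLAG_BYPASS_CONTEXT_COPY 0x2u

/* Reduced stand-in for KCONTINUE_ARGUMENT: only the type and the flags. */
typedef struct {
    model_kcontinue_type type;
    uint32_t             flags;
} model_kcontinue_argument;

/*
 * Legacy callers pass a BOOLEAN TestAlert where the new interface expects a
 * pointer. ASSUMPTION: raw values 0 and 1 are interpreted as that BOOLEAN.
 */
model_kcontinue_argument model_normalize_argument(uintptr_t raw)
{
    model_kcontinue_argument out;

    if (raw <= 1) {
        /* Legacy path: always an unwind, TestAlert becomes the alert flag. */
        out.type  = MODEL_KCONTINUE_UNWIND;
        out.flags = raw ? MODEL_CONTINUE_FLAG_RAISE_ALERT : 0u;
    } else {
        /* New path: the caller supplied a full argument structure. */
        out = *(const model_kcontinue_argument *)raw;
    }
    return out;
}
```

In this model, a legacy call with TestAlert set to TRUE normalizes to an unwind request carrying the raise-alert flag.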

Shadow Stack Pointer (SSP) Validation

As we mentioned, there are legitimate cases where user-mode code will need to change the shadow stack pointer, such as exception unwinding, APCs, longjmp, etc. But the operating system has to validate the new value requested for the SSP, in order to prevent CET bypasses. In 19H1 this was implemented by the new KeVerifyContextXStateCetU function. This function receives the thread whose context is being modified and the new context for the thread, and does the following:

  • If the CONTEXT_EX does not contain any XState data, or if the XState data does not contain CET registers (checked by calling RtlLocateExtendedFeature2 with the XSTATE_CET_U state bit), no validation is needed.

  • If CET is enabled on the target thread:

    • Validate that the caller is not attempting to disable CET on this thread by masking out XSTATE_MASK_CET_U from XSAVEBV. If this is happening, the function will re-enable the state bit, set MSR_IA32_CET_SHSTK_EN (which is a flag that enables the Shadow Stack feature of CET) in Ia32CetUMsr, and set the current shadow stack as Ia32Pl3SspMsr.

    • Otherwise, call KiVerifyContextXStateCetUEnabled, to validate that CET shadow stacks are enabled (MSR_IA32_CET_SHSTK_EN is enabled), that the new SSP is 8-byte aligned, and that it is between the current SSP value and the end of the shadow stack region’s VAD. Note that since stacks grow backward, the “end” of the region is actually the beginning of the stack. Therefore, when setting a new context for a thread, any SSP value is valid as long as it is inside the part of the shadow stack that has been used so far by the thread. There is no limit on how far back a thread can go inside its shadow stack.

  • If CET is disabled on the target thread and the caller is attempting to enable it by including the XSTATE_CET_U mask in the XSAVEBV of the CONTEXT_EX, only allow both MSR values to be set to 0 (no shadow stacks, and no SSP).

Any failures in the validations described will return STATUS_SET_CONTEXT_DENIED, while STATUS_SUCCESS is returned in other cases.
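The core of these checks can be modeled as follows. This is a sketch rather than the kernel's code: the function names and status values are placeholders, the region end is assumed to be exclusive, and the shadow stack bounds are passed in directly instead of being read from the VAD.

```c
#include <stdint.h>

/* Placeholder status values for this model only. */
#define MODEL_STATUS_SUCCESS            0
#define MODEL_STATUS_SET_CONTEXT_DENIED (-1)

/*
 * CET enabled: the new SSP must be 8-byte aligned and must lie between the
 * current SSP and the end of the shadow stack region, i.e. inside the part
 * of the shadow stack the thread has already used.
 */
int model_verify_ssp(uint64_t new_ssp, uint64_t current_ssp, uint64_t region_end)
{
    if (new_ssp & 7)
        return MODEL_STATUS_SET_CONTEXT_DENIED;
    if (new_ssp < current_ssp || new_ssp >= region_end)
        return MODEL_STATUS_SET_CONTEXT_DENIED;
    return MODEL_STATUS_SUCCESS;
}

/* CET disabled: the only accepted request leaves both MSR values at zero. */
int model_verify_cet_disabled(uint64_t cet_u_msr, uint64_t pl3_ssp_msr)
{
    if (cet_u_msr != 0 || pl3_ssp_msr != 0)
        return MODEL_STATUS_SET_CONTEXT_DENIED;
    return MODEL_STATUS_SUCCESS;
}
```

Note how the model never limits how far toward the region end the new SSP may move, which mirrors the observation that a thread can go back arbitrarily far inside its own shadow stack.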

Enabling CET also implicitly enables Check Stack Extents, originally implemented in Windows 8.1 together with CFG. This is visible through the CheckStackExtents bit in the ProcessFlags field of KPROCESS. This means that whenever the target SSP is being validated, KeVerifyContextRecord will also be called, and will verify that the target RSP is within either the current thread’s TEB user stack limits or the TEB32’s user stack limits (if this is a WOW64 process). These checks, implemented by RtlGuardIsValidStackPointer (and RtlGuardIsValidWow64StackPointer), have previously been documented (and shown as being insufficient) by researchers at both Tenable and enSilo.

Instruction Pointer (RIP) Validation

In 19030 another feature using Intel CET appeared: verifying that the new RIP that a caller is attempting to set for the process is a valid one. Just like SSP validation, this mitigation can only be enabled if CET is enabled for the thread. However, RIP validation is not enabled by default and must be enabled for the process (which is indicated by the UserCetSetContextIpValidation bit in the MitigationFlags2Values field of EPROCESS).

That being said, for the current builds, it appears that when calling CreateProcess and using the PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY attribute, if the PROCESS_CREATION_MITIGATION_POLICY2_CET_USER_SHADOW_STACKS_ALWAYS_ON flag is enabled, the option will be set. (Note that calling the SetProcessMitigationPolicy API with the ProcessUserShadowStackPolicy value is not valid, as CET can only be enabled at process creation time).

Interestingly, however, a new mitigation option was added to the mitigation map, PS_MITIGATION_OPTION_USER_CET_SET_CONTEXT_IP_VALIDATION (32). Toggling this (undocumented) mitigation option has the effect of enabling the AuditUserCetSetContextIpValidation bit in the MitigationFlags2Values field instead, which will be described shortly. Additionally, because this is now the 33rd mitigation option (option indices start at 0, and each option takes up 4 bits for DEFERRED/OFF/ON/RESERVED), there are now 132 mitigation bits needed, and the PS_MITIGATION_OPTIONS_MAP has expanded to 3 64-bit array elements in the Map field (which has follow-on effects to the size of the PS_SYSTEM_DLL_INIT_BLOCK).
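The arithmetic behind the map growth is easy to check: option index 32 means 33 four-bit slots, or 132 bits, which no longer fit in two 64-bit elements. The helper names below are hypothetical and only illustrate the calculation.

```c
#include <stddef.h>

#define MODEL_BITS_PER_OPTION   4u    /* DEFERRED/OFF/ON/RESERVED */
#define MODEL_BITS_PER_ELEMENT 64u    /* each Map element is 64 bits wide */

/* Number of 64-bit Map elements needed when the highest option index is given. */
size_t model_map_elements(unsigned highest_option_index)
{
    unsigned total_bits = (highest_option_index + 1u) * MODEL_BITS_PER_OPTION;
    return (total_bits + MODEL_BITS_PER_ELEMENT - 1u) / MODEL_BITS_PER_ELEMENT;
}

/* Which Map element holds an option, and at which bit position it starts. */
void model_option_position(unsigned option_index, size_t *element, unsigned *shift)
{
    unsigned first_bit = option_index * MODEL_BITS_PER_OPTION;
    *element = first_bit / MODEL_BITS_PER_ELEMENT;
    *shift   = first_bit % MODEL_BITS_PER_ELEMENT;
}
```

For option 32, the position helper yields element 2 at shift 0, meaning the new option is the first one stored in the third array element.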

The new KeVerifyContextIpForUserCet function will be called whenever a thread’s context is about to be changed. It will check that both CET and the RIP mitigation are enabled for the thread, and also check whether the CONTEXT_CONTROL flag is set in the context parameter, meaning that RIP will be changed by this new context. If all these checks pass, it calls the internal KiVerifyContextIpForUserCet function. The purpose of this function is to validate that the target RIP is a valid value, and not one used by an exploit to run arbitrary code.

First it checks that the target RIP address is not a kernel address, and also not an address in the lower 0x10000 bytes, which should not be mapped. Then it retrieves the base trap frame and checks if the target RIP is the RIP of that trap frame. This is meant to allow cases where the target RIP is the previous address in user mode. This will usually happen when this is the first time NtSetContextThread is called for this thread, and the RIP is being set to the initial start address for the thread, but can also happen in other, less common cases.

The function receives the KCONTINUE_TYPE and based on its value, it handles the target RIP in different ways. In most cases it will iterate over the shadow stack and search for the target RIP. If it doesn’t find it, it will keep running until it hits an exception and gets to its exception handler. The exception handler will check if the KCONTINUE_TYPE supplied is KCONTINUE_UNWIND, and if it is, call RtlVerifyUserUnwindTarget with the KCONTINUE_UNWIND flag. This function will try to verify RIP again, this time using more complex checks which we describe in the next section.

In any other case, it will return STATUS_SET_CONTEXT_DENIED, which will make KeVerifyContextIpForUserCet call the KiLogUserCetSetContextIpValidationAudit function in order to audit the failure if the AuditUserCetSetContextIpValidation flag is set in the EPROCESS. This “auditing” is quite interesting, as instead of being done over the usual process mitigation ETW channel, it is done by directly raising a fast fail exception through the Windows Error Reporting (WER) service (i.e.: sending a 0xC0000409 exception with the information set to FAST_FAIL_SET_CONTEXT_DENIED). In order to avoid spamming WER, another EPROCESS bit, AuditUserCetSetContextIpValidationLogged, is used.

There is one case where the function will stop iterating over the shadow stack before finding the target RIP – if the thread is terminating and the current shadow stack address is page-aligned. This means that for terminating threads, the function will try to verify the target RIP only in the current page of the shadow stack as a “best effort”, but will not go any further than that. If it doesn’t find the target RIP in that page it will return STATUS_THREAD_IS_TERMINATING.
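The scanning behavior described so far can be modeled with an array standing in for the shadow stack. Everything here is illustrative: the names and result codes are hypothetical, running off the end of the array stands in for the exception the kernel would hit, and the page-boundary check is assumed to apply only after the scan has advanced past its starting address.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 0x1000u

enum {
    MODEL_RIP_FOUND,               /* target RIP is on the shadow stack */
    MODEL_RIP_NOT_FOUND,           /* scan ran past the end of the stack */
    MODEL_THREAD_IS_TERMINATING    /* scan gave up at a page boundary */
};

/*
 * stack[i] models the 8-byte value stored at address ssp + i * 8.
 * The scan walks from the current SSP toward the base of the stack.
 */
int model_scan_shadow_stack(const uint64_t *stack, size_t count,
                            uint64_t ssp, uint64_t target_rip,
                            int thread_terminating)
{
    for (size_t i = 0; i < count; i++) {
        uint64_t address = ssp + i * sizeof(uint64_t);

        /* Best effort for terminating threads: stay inside the current page. */
        if (thread_terminating && i > 0 && (address % MODEL_PAGE_SIZE) == 0)
            return MODEL_THREAD_IS_TERMINATING;

        if (stack[i] == target_rip)
            return MODEL_RIP_FOUND;
    }
    return MODEL_RIP_NOT_FOUND;
}
```

In the real flow, a miss is not the end of the story for KCONTINUE_UNWIND, since the exception handler then falls back to RtlVerifyUserUnwindTarget.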

The other case in this function is when KCONTINUE_TYPE is KCONTINUE_LONGJUMP. Then the target RIP will not be validated against the shadow stack, but RtlVerifyUserUnwindTarget will be called instead with the KCONTINUE_LONGJUMP flag to verify RIP in the PE Image Load Configuration Directory’s longjmp table. We’ll describe this table and these checks in the next section of this blog post.

KeVerifyContextIpForUserCet is called by one of these 2 functions:

  • PspGetSetContextInternal – called in response to an NtSetContextThread API.
  • KiVerifyContextRecord – called in response to NtContinueEx, NtRaiseException, and in some cases NtSetContextThread APIs. Before calling KeVerifyContextIpForUserCet (only if its received ContinueArgument is not NULL), this function checks if the caller is trying to modify the CS register, and whether the new value is valid: non-WOW64 processes are only allowed to set CS to KGDT64_R3_CODE, unless they’re pico processes, in which case they can set CS to KGDT64_R3_CODE or KGDT64_R3_CMCODE. Any other value will make KiVerifyContextRecord force the new CS value to KGDT64_R3_CODE. KiVerifyContextRecord is either called by KiContinuePreviousModeUser or by KeVerifyContextRecord. In the second case, the function validates that RSP is inside one of the process stacks (native or WOW64), and that 64-bit processes will only ever set CS to KGDT64_R3_CODE.

All paths that call KeVerifyContextIpForUserCet to validate the target RIP first call KeVerifyContextXStateCetU to validate the target SSP and only perform the RIP checks if the SSP is determined to be valid.

Exception unwinding and longjmp Validation

As shown above, the handling for KCONTINUE_SET and KCONTINUE_RESUME is concerned with validating that the target RIP is part of the Shadow Stack, but the other scenarios (KCONTINUE_UNWIND and KCONTINUE_LONGJUMP) require extended validation through RtlVerifyUserUnwindTarget. This second validation path contains a number of interesting complexities that required changes to the PE file format (and compiler support) as well as a new OS-level information class added to NtSetInformationProcess for JIT compiler support.

Already added due to enhancements to Control Flow Guard (CFG) support, the Image Load Configuration Directory inside of the PE file now includes information for valid branch targets used as part of a setjmp/longjmp pair, which a modern compiler is supposed to identify and pass onto the linker. With CET, this existing data is re-used, but yet another table and size are added for exception handler continuation support. While Visual Studio 2017 produces the longjmp table, only Visual Studio 2019 produces this newer table.

In this last section, we’ll look at the format of these tables, and how the kernel is able to authorize the last two types of KCONTINUE_TYPE control flows.

PE Metadata Tables

In addition to the standard GFIDS Table that is present in Control Flow Guard images, Windows 10 also added support for validation of longjmp targets through the inclusion of a Long Jump Target Table typically located in a PE section called .gljmp, whose RVA is stored in the GuardLongJumpTargetTable field of the Image Load Configuration Directory.

Whenever a call to setjmp is made in code, the RVA of the return address (which is where longjmp will branch to) is added to this table. The presence of this table is determined by the IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT flag in the GuardFlags of the Image Load Configuration Directory, and it contains as many entries as indicated by the GuardLongJumpTargetCount field.

Each entry is a 4-byte RVA, plus n bytes of metadata, where n is taken from the result of (GuardFlags & IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_MASK) >> IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_SHIFT. For this table, no metadata is defined, so the metadata bytes are always expected to be zero. Interestingly, because this calculation is the same as the one used for the GFIDS Table (which does potentially have metadata if export suppression is enabled), suppressing at least one CFG target will result in 1 byte of empty metadata being added to every entry in the Long Jump Target Table.
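The entry size calculation can be written out directly. The two constants below mirror the values of the corresponding definitions in the SDK headers (redefined here with a MODEL_ prefix so that the snippet stands alone), while the helper function names are hypothetical.

```c
#include <stdint.h>

/* Same values as IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_MASK and _SHIFT. */
#define MODEL_GUARD_CF_FUNCTION_TABLE_SIZE_MASK  0xF0000000u
#define MODEL_GUARD_CF_FUNCTION_TABLE_SIZE_SHIFT 28

/* Size of one table entry: a 4-byte RVA plus n metadata bytes. */
uint32_t model_guard_entry_size(uint32_t guard_flags)
{
    uint32_t metadata_bytes =
        (guard_flags & MODEL_GUARD_CF_FUNCTION_TABLE_SIZE_MASK)
            >> MODEL_GUARD_CF_FUNCTION_TABLE_SIZE_SHIFT;
    return 4u + metadata_bytes;
}

/* Reads the little-endian RVA of entry `index`, skipping over metadata bytes. */
uint32_t model_guard_entry_rva(const uint8_t *table, uint32_t index,
                               uint32_t guard_flags)
{
    const uint8_t *entry = table + (uint64_t)index * model_guard_entry_size(guard_flags);
    return (uint32_t)entry[0]
         | ((uint32_t)entry[1] << 8)
         | ((uint32_t)entry[2] << 16)
         | ((uint32_t)entry[3] << 24);
}
```

A GuardFlags value whose upper nibble is 1 therefore yields 5-byte entries, which is why walking such a table with a 4-byte stride would misread every entry after the first.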

For example, here’s a PE file with two longjmp targets:

Note the value 1 in the upper nibble of GuardFlags (which corresponds to IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_MASK) due to the fact this image also uses CFG Export Suppression. This tells us that one extra byte of metadata will be present in the Long Jump Target Table, which you can see below:

On Windows 10 20H1, this type of metadata is now included in one additional situation: when exception handler continuation targets are present as part of a binary’s control flow. Two new fields, GuardEHContinuationTable and GuardEHContinuationCount, are added to the end of the Image Load Configuration Directory, and an IMAGE_GUARD_EH_CONTINUATION_TABLE_PRESENT flag is now part of the GuardFlags. The layout of this table is identical to the one shown for the Long Jump Target Table, including the addition of metadata bytes based on the upper nibble of GuardFlags.

Unfortunately, not even the current preview versions of Visual Studio 2019 generate this data, so we cannot currently show you an example — this analysis is based on reverse engineering the validation code that we describe later, as well as the Ntimage.h header file in the 20H1 SDK.

User Inverted Function Table

Now that we know that control flow changes might occur in order to branch to either a longjmp target or an exception handler continuation target, the question becomes — how do we get these two tables based on the RIP address present in a CONTEXT_EX as part of a NtContinueEx call? As these operations might happen frequently in the context of certain program executions, the kernel needs an efficient way to solve this problem.

You may already be familiar with the concept of the Inverted Function Table. Such a table is used by Ntdll.dll (LdrpInvertedFunctionTable), for finding the unwind opcodes and exception data during user-mode exception handling (to wit, by locating the .pdata section). Another table is present in Ntoskrnl.exe (PsInvertedFunctionTable) and is used during kernel-mode exception handling, as well as part of PatchGuard’s checks.

In short, the Inverted Function Table is an array containing all the loaded user / kernel modules, their size, and a pointer to the PE Exception Directory, sorted by virtual address. It was originally created as an optimization, since searching this array is a lot faster than parsing the PE header and then searching the loaded modules linked list: a binary search on an inverted function table will quickly locate any virtual address in its respective module in only log(n) lookups. Ken Johnson and Matt Miller, now of Microsoft fame, previously published a thorough overview as part of their article on kernel-mode hooking techniques in the Uninformed Magazine.
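A minimal model of such a lookup is shown below. The entry type is a hypothetical reduction of the real table entry to just the image base and size, and the function name is made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for an inverted function table entry. */
typedef struct {
    uint64_t image_base;
    uint32_t size_of_image;
} model_inverted_entry;

/*
 * Binary search over entries sorted by image_base. Returns the entry whose
 * [image_base, image_base + size_of_image) range contains the address,
 * or NULL when no loaded module covers it.
 */
const model_inverted_entry *
model_lookup_inverted(const model_inverted_entry *table, size_t count,
                      uint64_t address)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const model_inverted_entry *entry = &table[mid];

        if (address < entry->image_base)
            hi = mid;                          /* look in the lower half */
        else if (address - entry->image_base >= entry->size_of_image)
            lo = mid + 1;                      /* look in the upper half */
        else
            return entry;                      /* address is inside this image */
    }
    return NULL;
}
```

An address that falls in a gap between two images returns NULL, which corresponds to the fallback case discussed later where another lookup mechanism has to be used.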

Previously, however, Ntdll.dll only scanned its table for user-mode exceptions, and Ntoskrnl.exe only scanned its counterpart for kernel-mode exceptions — what 20H1 changes is that the kernel will now have to scan the user table too — as part of the new logic required to handle longjmp and exception continuations. To support this, a new RtlpLookupUserFunctionTableInverted function is added, which scans the KeUserInvertedFunctionTable variable, mapping to the now exported LdrpInvertedFunctionTable symbol in Ntdll.dll.

This is an exciting forensic capability, as it means that you now have an easy way, from the kernel, to locate the user-mode modules that are loaded within the current process, without having to parse the PEB’s loader data or enumerating VADs. For example, here’s how you can see the current loaded images in Csrss.exe:

dx @$cursession.Processes.Where(p => p.Name == "csrss.exe").First().SwitchTo()

dx -r0 @$table = *(nt!_INVERTED_FUNCTION_TABLE**)&nt!KeUserInvertedFunctionTable

dx -g @$table->TableEntry.Take(@$table->CurrentSize)

That being said, there does exist, however remote, the possibility that an image does not contain an exception directory, especially on x86 systems where unwind opcodes do not exist, and .pdata is only created if /SAFESEH is used and there’s at least one exception handler.

In those situations, RtlpLookupUserFunctionTableInverted can fail, and MmGetImageBase must be used instead. Unsurprisingly, this looks up any VAD that maps the region corresponding to the input RIP, and, if it’s an Image VAD, returns the base address and size of the region (which should correspond to that of the module).

Dynamic Exception Handler Continuation Targets

One final hurdle exists in the handling of KCONTINUE_UNWIND requests — although regular processes have static exception handler continuation targets based on the __try/__except/__finally clauses in their code, Windows allows JIT engines to not only dynamically create executable code on the fly, but also to register exception handlers (and unwind opcodes) for it at runtime, such as through the RtlAddFunctionTable API. While these exception handlers were previously only needed for user-mode stack walking and exception unwinding, now the continuation handlers become legitimate control flow targets that the kernel must understand as potentially valid values for RIP. It’s this last possibility that RtlpFindDynamicEHContinuationTarget handles.

As part of the CET support and introduction of NtContinueEx, the EPROCESS structure was enhanced with two new fields called DynamicEHContinuationTargetsLock and DynamicEHContinuationTargetsTree, the first of which is an EX_PUSH_LOCK and the latter an RTL_RB_TREE, which contains all the valid exception handler addresses. This tree is managed through a call to NtSetInformationProcess with a new process information class, ProcessDynamicEHContinuationTargets, which is accompanied by a data structure of type PROCESS_DYNAMIC_EH_CONTINUATION_TARGETS_INFORMATION, containing in turn an array of PROCESS_DYNAMIC_EH_CONTINUATION_TARGET entries, that will be validated before modifying the DynamicEHContinuationTargetsTree. To make things easier to follow, see the definitions below for these structures and flags:

#define DYNAMIC_EH_CONTINUATION_TARGET_ADD          0x01
#define DYNAMIC_EH_CONTINUATION_TARGET_PROCESSED    0x02

typedef struct
_PROCESS_DYNAMIC_EH_CONTINUATION_TARGET
{
    ULONG_PTR TargetAddress;
    ULONGLONG Flags;
} PROCESS_DYNAMIC_EH_CONTINUATION_TARGET, *PPROCESS_DYNAMIC_EH_CONTINUATION_TARGET;

typedef struct
_PROCESS_DYNAMIC_EH_CONTINUATION_TARGETS_INFORMATION
{
    USHORT NumberOfTargets;
    USHORT Reserved;
    ULONG Reserved2;
    PPROCESS_DYNAMIC_EH_CONTINUATION_TARGET Targets;
} PROCESS_DYNAMIC_EH_CONTINUATION_TARGETS_INFORMATION, *PPROCESS_DYNAMIC_EH_CONTINUATION_TARGETS_INFORMATION;

The PspProcessDynamicEHContinuationTargets function is called to iterate over this data, at which point RtlAddDynamicEHContinuationTarget is called for any entry containing the DYNAMIC_EH_CONTINUATION_TARGET_ADD flag set, which allocates a data structure storing the target address, and linking its RTL_BALANCED_NODE link with the RTL_RB_TREE in EPROCESS. Conversely, if the flag is missing, then the target is looked up, and if it indeed exists, is removed and its node freed. As each entry is processed, the DYNAMIC_EH_CONTINUATION_TARGET_PROCESSED flag is OR’ed into the original input buffer, so that callers can know which entries worked and which didn’t.
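The add/remove loop can be sketched as follows. To keep the example short, a small fixed-size array replaces the RTL_RB_TREE, and all `model_` names are hypothetical. Two behaviors are assumptions rather than facts from the analysis above: adding an address that is already present counts as processed, and removing an address that is absent is left unprocessed.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_TARGET_ADD       0x01   /* DYNAMIC_EH_CONTINUATION_TARGET_ADD */
#define MODEL_TARGET_PROCESSED 0x02   /* DYNAMIC_EH_CONTINUATION_TARGET_PROCESSED */
#define MODEL_MAX_TARGETS      16

typedef struct {
    uintptr_t target_address;
    uint64_t  flags;
} model_eh_target;

/* Stand-in for the per-process tree of valid continuation targets. */
typedef struct {
    uintptr_t addresses[MODEL_MAX_TARGETS];
    size_t    count;
} model_target_set;

static long model_find_target(const model_target_set *set, uintptr_t address)
{
    for (size_t i = 0; i < set->count; i++)
        if (set->addresses[i] == address)
            return (long)i;
    return -1;
}

void model_process_targets(model_target_set *set, model_eh_target *targets, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        long index = model_find_target(set, targets[i].target_address);

        if (targets[i].flags & MODEL_TARGET_ADD) {
            if (index < 0) {
                if (set->count == MODEL_MAX_TARGETS)
                    continue;               /* models an allocation failure */
                set->addresses[set->count++] = targets[i].target_address;
            }
        } else {
            if (index < 0)
                continue;                   /* nothing to remove */
            set->addresses[index] = set->addresses[--set->count];
        }

        /* Tell the caller that this entry was handled successfully. */
        targets[i].flags |= MODEL_TARGET_PROCESSED;
    }
}
```

After the call, a caller can scan its input array for entries that lack the processed flag to learn which requests did not take effect.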

Obviously, it would appear that the existence of this capability is a universal bypass of any CET/CFG-like capability, as every possible ROP gadget could simply be added as a ‘dynamic continuation target’. However, since Microsoft now only legitimately supports out-of-process JIT compilation for browsers and Flash, it’s critical to note that this API only works for remote processes. In fact, calling it on the current process will always fail with STATUS_ACCESS_DENIED.

Target Validation

Bringing all of this knowledge together, the RtlVerifyUserUnwindTarget function becomes quite easy to explain.

  1. Look up the loaded PE module associated with the target RIP in the CONTEXT_EX structure. First, try using RtlpLookupUserFunctionTableInverted and if that fails, switch to using MmGetImageBase instead, making sure that the module is < 4GB.

  2. If a module was found, call the LdrImageDirectoryEntryToLoadConfig function to get its Image Load Configuration Directory. Then, make sure it’s large enough to contain either the Long Jump or Dynamic Exception Handler Continuation Target Table and that the guard flags contain IMAGE_GUARD_CF_LONGJUMP_TABLE_PRESENT or IMAGE_GUARD_EH_CONTINUATION_TABLE_PRESENT. If the directory is missing, too small, or the matching table is simply not present, then return STATUS_SUCCESS for compatibility reasons.

  3. Get either GuardLongJumpTargetTable or GuardEHContinuationTable from the Image Load Configuration Directory, and validate the GuardLongJumpTargetCount or GuardEHContinuationCount. If there are more than 4 billion entries, return STATUS_INTEGER_OVERFLOW. If there are more than 0 entries, then do a binary search using bsearch_s (passing in RtlpTargetCompare as the comparator) through the table to locate the target RIP after converting it to an RVA. If it is found, return STATUS_SUCCESS.

  4. If the target RIP was not found (or if the table contained 0 entries to begin with), or if a loaded module was not found at the target RIP in the first place, then return STATUS_SET_CONTEXT_DENIED for longjmp validations (KCONTINUE_LONGJUMP).

  5. Otherwise, for exception unwinding validations (KCONTINUE_UNWIND), call RtlpFindDynamicEHContinuationTarget to check if this was a registered dynamic exception handler continuation target. If yes, return STATUS_SUCCESS, otherwise return STATUS_SET_CONTEXT_DENIED.
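The five steps can be condensed into a short model. This is a sketch, not the kernel routine: the `model_module` type bundles the results of steps 1 and 2 into hypothetical fields, the dynamic target lookup is reduced to a boolean parameter, the status values are placeholders, the count overflow check is omitted, and the standard `bsearch` is used in place of bsearch_s.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_STATUS_SUCCESS            0
#define MODEL_STATUS_SET_CONTEXT_DENIED (-1)

/* Hypothetical summary of a located module and its relevant guard table. */
typedef struct {
    uint64_t       image_base;
    const uint8_t *table;        /* NULL: no load config or no matching table */
    uint32_t       count;        /* number of entries in the table */
    uint32_t       entry_size;   /* 4-byte RVA plus metadata bytes */
} model_module;

/* Compares the searched RVA with the little-endian RVA that starts an entry. */
static int model_target_compare(const void *key, const void *entry)
{
    const uint8_t *p = (const uint8_t *)entry;
    uint32_t wanted = *(const uint32_t *)key;
    uint32_t rva = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
                 | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (wanted > rva) - (wanted < rva);
}

int model_verify_unwind_target(const model_module *module, uint64_t target_rip,
                               int is_longjmp, int is_dynamic_target)
{
    if (module != NULL) {
        /* Step 2: missing directory or table succeeds for compatibility. */
        if (module->table == NULL)
            return MODEL_STATUS_SUCCESS;

        /* Step 3: binary search for the target, expressed as an RVA. */
        uint64_t delta = target_rip - module->image_base;
        if (module->count > 0 && delta <= UINT32_MAX) {
            uint32_t rva = (uint32_t)delta;
            if (bsearch(&rva, module->table, module->count,
                        module->entry_size, model_target_compare) != NULL)
                return MODEL_STATUS_SUCCESS;
        }
    }

    /* Step 4: longjmp targets have no further fallback. */
    if (is_longjmp)
        return MODEL_STATUS_SET_CONTEXT_DENIED;

    /* Step 5: unwinds may still match a registered dynamic target. */
    return is_dynamic_target ? MODEL_STATUS_SUCCESS
                             : MODEL_STATUS_SET_CONTEXT_DENIED;
}
```

Because the comparator only reads the first four bytes of each entry, the same search works for both 4-byte entries and entries that carry extra metadata bytes.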

Conclusion

The implementation of CET and its related mitigations is a major step towards eliminating the use of ROP and other control flow hijacking techniques. Control flow integrity is obviously a complicated topic, which will probably get even more complex as additional mitigations are added to it in the future. Further compatibility concerns and one-off scenarios will likely result in more and more cases being discovered that need specific handling. That said, such a big step in mitigation technology, especially one that includes so much new functionality, is bound to have gaps and issues, and we are sure that as more research is done in this area, interesting things will be discovered there in the future.