Understanding a New Mitigation: Module Tampering Protection

A few months ago, I spoke at Paranoia conference about obscure and undocumented mitigations. Following the talk, a few people asked how I found out about these mitigations and how I figured out what they did and how they worked. So I thought I’d try to focus on one of those mitigations and show the full research process, as well as how the ideas behind it can be used for other purposes.

To do that I chose module tampering protection. I’ll start by explaining what it is and what it does for those of you who are only interested in the bottom line, and then show the whole process for those who would like to reproduce this work or learn some RE techniques.

TL;DR: What’s Module Tampering Protection?

Module tampering protection is a mitigation that protects against early modifications of the process main image, such as IAT hooking or process hollowing. It uses a total of three APIs: NtQueryVirtualMemory, NtQueryInformationProcess and NtMapViewOfSection. If enabled, the loader will check for changes in the main image headers and the IAT page before calling the entry point. It does that by calling NtQueryVirtualMemory with the information class MemoryWorkingSetExInformation. The returned structure contains information about the sharing status of the page, as well as whether it was modified from its original view. If the headers or the IAT have been modified from their original mappings (for example, if the main image has been unmapped and another image has been mapped in its place), the loader will call NtQueryInformationProcess with the class ProcessImageSection to get a handle to the main image section, and will then remap it using NtMapViewOfSection. From that point the new section will be used and the tampered copy of the image will be ignored.

This mitigation is available since RS3 and can be enabled on process creation using PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_MASK.
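
For anyone who wants to try it out, here’s a minimal sketch (error handling trimmed) of creating a child process with the mitigation set to ALWAYS_ON through the documented PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY attribute – the second DWORD64 in the attribute array carries the POLICY2 flags:

#include <windows.h>

// Minimal sketch: launch a child process with module tampering protection
// set to ALWAYS_ON. The second element of the array holds the POLICY2 flags.
BOOL LaunchWithModuleTamperingProtection(LPWSTR commandLine)
{
    STARTUPINFOEXW si = { 0 };
    PROCESS_INFORMATION pi = { 0 };
    SIZE_T attrSize = 0;
    DWORD64 policy[2] = { 0, PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_ALWAYS_ON };

    si.StartupInfo.cb = sizeof(si);
    InitializeProcThreadAttributeList(NULL, 1, 0, &attrSize);
    si.lpAttributeList = (LPPROC_THREAD_ATTRIBUTE_LIST)HeapAlloc(GetProcessHeap(), 0, attrSize);
    InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &attrSize);
    UpdateProcThreadAttribute(si.lpAttributeList, 0,
                              PROC_THREAD_ATTRIBUTE_MITIGATION_POLICY,
                              policy, sizeof(policy), NULL, NULL);

    BOOL ok = CreateProcessW(NULL, commandLine, NULL, NULL, FALSE,
                             EXTENDED_STARTUPINFO_PRESENT, NULL, NULL,
                             &si.StartupInfo, &pi);

    DeleteProcThreadAttributeList(si.lpAttributeList);
    HeapFree(GetProcessHeap(), 0, si.lpAttributeList);
    if (ok)
    {
        CloseHandle(pi.hThread);
        CloseHandle(pi.hProcess);
    }
    return ok;
}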

The Full Analysis

For those of you interested in the full path from knowing nothing about this mitigation to knowing everything about it, let’s start.

Discovering the Mitigation

One question I get occasionally is how people can even discover the existence of these types of mitigations when Microsoft never announces or documents them. One good place to look is the various MitigationFlags fields in the EPROCESS structure. There are currently three of them (MitigationFlags, MitigationFlags2, MitigationFlags3), each containing 32 bits. In the first two, all 32 bits are already in use, so MitigationFlags3 was recently added; it currently contains three mitigations, and I’m sure more will be added soon. These flags represent the mitigations enabled for the process. For example, we can use WinDbg to print EPROCESS.MitigationFlags for the current process:

dx @$curprocess.KernelObject.MitigationFlagsValues
@$curprocess.KernelObject.MitigationFlagsValues
    [+0x000 ( 0: 0)] ControlFlowGuardEnabled : 0x1 [Type: unsigned long]
    [+0x000 ( 1: 1)] ControlFlowGuardExportSuppressionEnabled : 0x0 [Type: unsigned long]
    [+0x000 ( 2: 2)] ControlFlowGuardStrict : 0x0 [Type: unsigned long]
    [+0x000 ( 3: 3)] DisallowStrippedImages : 0x0 [Type: unsigned long]
    [+0x000 ( 4: 4)] ForceRelocateImages : 0x0 [Type: unsigned long]
    [+0x000 ( 5: 5)] HighEntropyASLREnabled : 0x1 [Type: unsigned long]
    [+0x000 ( 6: 6)] StackRandomizationDisabled : 0x0 [Type: unsigned long]
    [+0x000 ( 7: 7)] ExtensionPointDisable : 0x0 [Type: unsigned long]
    [+0x000 ( 8: 8)] DisableDynamicCode : 0x0 [Type: unsigned long]
    [+0x000 ( 9: 9)] DisableDynamicCodeAllowOptOut : 0x0 [Type: unsigned long]
    [+0x000 (10:10)] DisableDynamicCodeAllowRemoteDowngrade : 0x0 [Type: unsigned long]
    [+0x000 (11:11)] AuditDisableDynamicCode : 0x0 [Type: unsigned long]
    [+0x000 (12:12)] DisallowWin32kSystemCalls : 0x0 [Type: unsigned long]
    [+0x000 (13:13)] AuditDisallowWin32kSystemCalls : 0x0 [Type: unsigned long]
    [+0x000 (14:14)] EnableFilteredWin32kAPIs : 0x0 [Type: unsigned long]
    [+0x000 (15:15)] AuditFilteredWin32kAPIs : 0x0 [Type: unsigned long]
    [+0x000 (16:16)] DisableNonSystemFonts : 0x0 [Type: unsigned long]
    [+0x000 (17:17)] AuditNonSystemFontLoading : 0x0 [Type: unsigned long]
    [+0x000 (18:18)] PreferSystem32Images : 0x0 [Type: unsigned long]
    [+0x000 (19:19)] ProhibitRemoteImageMap : 0x0 [Type: unsigned long]
    [+0x000 (20:20)] AuditProhibitRemoteImageMap : 0x0 [Type: unsigned long]
    [+0x000 (21:21)] ProhibitLowILImageMap : 0x0 [Type: unsigned long]
    [+0x000 (22:22)] AuditProhibitLowILImageMap : 0x0 [Type: unsigned long]
    [+0x000 (23:23)] SignatureMitigationOptIn : 0x0 [Type: unsigned long]
    [+0x000 (24:24)] AuditBlockNonMicrosoftBinaries : 0x0 [Type: unsigned long]
    [+0x000 (25:25)] AuditBlockNonMicrosoftBinariesAllowStore : 0x0 [Type: unsigned long]
    [+0x000 (26:26)] LoaderIntegrityContinuityEnabled : 0x0 [Type: unsigned long]
    [+0x000 (27:27)] AuditLoaderIntegrityContinuity : 0x0 [Type: unsigned long]
    [+0x000 (28:28)] EnableModuleTamperingProtection : 0x0 [Type: unsigned long]
    [+0x000 (29:29)] EnableModuleTamperingProtectionNoInherit : 0x0 [Type: unsigned long]
    [+0x000 (30:30)] RestrictIndirectBranchPrediction : 0x0 [Type: unsigned long]
    [+0x000 (31:31)] IsolateSecurityDomain : 0x0 [Type: unsigned long]

Towards the end, in bits 28 and 29, we can see the values EnableModuleTamperingProtection and EnableModuleTamperingProtectionNoInherit. Unfortunately, searching for these names doesn’t yield any great results. There are a couple of websites that just show the structure with no explanation, one vague Stack Overflow answer that briefly mentions EnableModuleTamperingProtectionNoInherit with no added details, and this tweet:

Unsurprisingly, the most detailed explanation is a tweet from Alex Ionescu from 2017. This isn’t exactly full documentation, but it’s a start. If you already know and understand the concepts that make up this mitigation, this series of tweets is probably very clear and explains all there is to know about the feature. If you’re not familiar with the underlying concepts, this probably raises more questions than answers. But don’t worry, we’ll take it apart piece-by-piece.

Where Do We Look?

The first question to answer is: where is this mitigation implemented? Alex gives us some direction with the function names, but if he hadn’t, or if things had changed since 2017 (or you chose not to believe him), where would you start?

The first place to start searching for the implementation of process mitigations is often the kernel: ntoskrnl.exe. However, this is a huge binary that’s not easy to search through. There are no function names that seem at all relevant to this mitigation, so there’s no obvious place to start.

Instead, you could try a different approach and look for references to the MitigationFlags field of the EPROCESS with access to one of those two flags. But unless you have access to the Windows source code, there’s no easy way to do that. What you can do, however, is take advantage of the fact that EPROCESS is a large structure and that MitigationFlags sits towards the end of it, at offset 0x9D0. One very inelegant but effective way to go is to use the IDA search function and search for all references to 9D0h:

This will be very slow because it’s a large binary, and some results will have nothing to do with the EPROCESS structure so you’d have to search through the results manually. Also, just finding references to the field is not enough – MitigationFlags contains 32 bits, and only two of them are relevant in the current context. So, you’d have to search through all the results for occurrences where:

  1. 0x9D0 is used as an offset into an EPROCESS structure – you’d have to use some intuition here since there is no guaranteed way to know the type of structure used by each case, though for larger offsets there are only a handful of options that could be relevant and it can mostly be guessed by the function name and context.
  2. The MitigationFlags field is being compared or set to either 0x10000000 (EnableModuleTamperingProtection) or 0x20000000 (EnableModuleTamperingProtectionNoInherit). Or bits 28 or 29 are tested or set by bit number through assembly instructions such as bt or bts.

After running the search, the results look something like this:

You can now walk through the results and get a feel for which mitigation flags are used by the kernel and in which cases. And then I’ll let you know that this effort was completely useless, since EnableModuleTamperingProtection is referenced in exactly one place in the kernel: PspApplyMitigationOptions, called when a new process is created:

So, the kernel keeps track of whether this mitigation is enabled, but never tests it. This means the mitigation itself is implemented elsewhere. This search might have been useless for this specific mitigation, but it’s one of several ways to find out where a mitigation is implemented and can be useful for other process mitigations, so I wanted to mention it even if it’s silly and unimpressive.

But back to module tampering protection – a second location where process mitigations are sometimes implemented is ntdll.dll, the first user-mode image to be loaded in every process. This DLL contains the loader, system call stubs, and many other basic components needed by all processes. It makes sense for this mitigation to be implemented here, since the name suggests it’s related to module loads, which happen through the loader in ntdll.dll. Additionally, this is the module that contains the functions Alex mentioned in his tweet.

Even if we didn’t have this tweet, just opening ntdll and searching for “tampering” quickly finds us exactly one result: the function LdrpCheckPagesForTampering. Looking for callers to this function we see that it’s called from a single place, LdrpGetImportDescriptorForSnap:

In the first line in the screenshot, we can see two checks: the first one validates that the current entry being processed is the main image, i.e. that the module being loaded is the main image module. The second check is for two bits in LdrSystemDllInitBlock.MitigationOptionsMap.Map[1]. We can see the exact field being checked here only because I applied the correct type to LdrSystemDllInitBlock – if you look at this function without applying the correct type, you’ll see some random, unnamed memory address being referenced instead. LdrSystemDllInitBlock is a data structure containing all the global information needed by the loader, such as the process mitigation options. It’s undocumented, but it has the type PS_SYSTEM_DLL_INIT_BLOCK, which is available in the symbols, so we can use it here (notice that this structure isn’t available in the NTDLL symbols – you’d find it in the symbols of ole32.dll and combase.dll instead). The MitigationOptionsMap field is just an array of three ULONG64s containing bits that mark the mitigation options set for this process. We can find the values for all the mitigation flags in WinBase.h. Here are the values for module tampering protection:

//
// Define the module tampering mitigation policy options.
//

#define PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_MASK       (0x00000003ui64 << 12)
#define PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_DEFER      (0x00000000ui64 << 12)
#define PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_ALWAYS_ON  (0x00000001ui64 << 12)
#define PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_ALWAYS_OFF (0x00000002ui64 << 12)
#define PROCESS_CREATION_MITIGATION_POLICY2_MODULE_TAMPERING_PROTECTION_NOINHERIT  (0x00000003ui64 << 12)

These values are relative to the top DWORD of Map[1], so the module tampering protection bit is actually at bit 44 of Map[1] – the same one being checked in the Hex Rays screenshot (and in PspApplyMitigationOptions, shown earlier).
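
If you want to take a look at this block in a live process, a WinDbg command along these lines should work – applying the type from combase’s symbols (assuming combase.dll is loaded in the target and its symbols are available) to the symbol exported by ntdll:

dt combase!_PS_SYSTEM_DLL_INIT_BLOCK ntdll!LdrSystemDllInitBlock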

Now we know where this mitigation is applied and checked, so we can start looking at the implementation and understanding what it does.

Implementation Details

Looking again at LdrpGetImportDescriptorForSnap: after the two checks that we already saw, the function fetches the NT headers for the main image and calls LdrpCheckPagesForTampering twice. The first time, the address passed in is imageNtHeaders->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT] – the image’s import table – with a size of 8 bytes. The second time, the function is called with the address and size of the NT headers themselves. If either of these pages is deemed to have been tampered with, LdrpMapCleanModuleView gets called to (judging by the name) map a clean view of the main image module.

Let’s look inside LdrpCheckPagesForTampering to see how NTDLL decides if a page is tampered:

First, this function calculates the number of pages within the requested range of bytes (in both cases we’ve seen here, that number is 1). Then it allocates memory and calls ZwQueryVirtualMemory with MemoryInformationClass == 4 (MemoryWorkingSetExInformation). This system call and information class are ones security people might not see very often – the working set is a way to manage and prioritize physical memory pages based on their current status, which usually isn’t very interesting from a security perspective. However, the working set does carry some attributes that could interest us – specifically, the “shared” flags.

I won’t go into the details of mapped and shared memory here, since they’re explained in plenty of other places. But in short, the system tries not to duplicate memory, since physical memory would otherwise quickly fill up with duplicated pages, mostly ones belonging to images and DLLs – system DLLs like ntdll.dll or kernel32.dll are mapped into most (if not all) of the processes in the system, so having a separate copy in physical memory for each process would simply be wasteful. So, these image pages are shared between all processes – unless the images are modified in any way. Image pages use a special protection called copy-on-write, which allows the pages to be writable but creates a fresh copy in physical memory if a page is written to. This means any changes made to a local mapping of a DLL (for example, writing user-mode hooks or modifying data) will only affect the DLL in the current process.

These settings are saved as flags that can be queried through NtQueryVirtualMemory, with the information class used here: MemoryWorkingSetExInformation. It’ll return data about the queried pages in a MEMORY_WORKING_SET_EX_INFORMATION structure:

typedef struct _MEMORY_WORKING_SET_EX_BLOCK
{
    union
    {
        struct
        {
            ULONG64 Valid : 1;
            ULONG64 ShareCount : 3;
            ULONG64 Win32Protection : 11;
            ULONG64 Shared : 1;
            ULONG64 Node : 6;
            ULONG64 Locked : 1;
            ULONG64 LargePage : 1;
            ULONG64 Priority : 3;
            ULONG64 Reserved : 3;
            ULONG64 SharedOriginal : 1;
            ULONG64 Bad : 1;
            ULONG64 Win32GraphicsProtection : 4;
            ULONG64 ReservedUlong : 28;
        };
        struct
        {
            struct
            {
                ULONG64 Valid : 1;
                ULONG64 Reserved0 : 14;
                ULONG64 Shared : 1;
                ULONG64 Reserved1 : 5;
                ULONG64 PageTable : 1;
                ULONG64 Location : 2;
                ULONG64 Priority : 3;
                ULONG64 ModifiedList : 1;
                ULONG64 Reserved2 : 2;
                ULONG64 SharedOriginal : 1;
                ULONG64 Bad : 1;
                ULONG64 ReservedUlong : 32;
            };
        } Invalid;
    };
} MEMORY_WORKING_SET_EX_BLOCK, *PMEMORY_WORKING_SET_EX_BLOCK;

typedef struct _MEMORY_WORKING_SET_EX_INFORMATION
{
    PVOID VirtualAddress;
    union
    {
        union
        {
            MEMORY_WORKING_SET_EX_BLOCK VirtualAttributes;
            ULONG64 Long;
        };
    } u1;
} MEMORY_WORKING_SET_EX_INFORMATION, *PMEMORY_WORKING_SET_EX_INFORMATION;

This structure gives you the virtual address that was queried, and bits containing information about the state of the page, such as its validity, its protection, whether it’s a large page, and its sharing status. There are a few different bits related to the sharing status of a page:

  1. Shared – is the page shareable? That doesn’t necessarily mean that the page is currently shared with any other processes, but, for example, private memory will not be shared unless specifically requested by the process.
  2. ShareCount – this field tells you how many mappings exist for this page. For a page not currently shared with any other process, this will be 1. For pages shared with other processes, this will normally be higher.
  3. SharedOriginal – this flag indicates whether this is the original mapping of this page. So, if a page was modified, which led to creating a fresh copy in physical memory, this will be set to zero as this isn’t the original mapping of the page.

This SharedOriginal bit is the one checked by LdrpCheckPagesForTampering to tell if this page is the original copy or a fresh copy created due to changes. If this isn’t the original copy, this means that the page was tampered with in some way so the function will return TRUE. LdrpCheckPagesForTampering runs this check for every page that’s being queried and will return TRUE if any of them have been tampered with.
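
If you want to mimic this check from user mode, a rough sketch using the structures above could look something like this (NtQueryVirtualMemory is resolved from ntdll at runtime, and 4 is MemoryWorkingSetExInformation, as seen in the decompiled code):

#include <windows.h>
#include <winternl.h>

// Rough user-mode sketch of the check done by LdrpCheckPagesForTampering,
// using the MEMORY_WORKING_SET_EX_INFORMATION structure defined above.
typedef NTSTATUS (NTAPI *NtQueryVirtualMemory_t)(
    HANDLE ProcessHandle, PVOID BaseAddress, DWORD MemoryInformationClass,
    PVOID MemoryInformation, SIZE_T MemoryInformationLength, PSIZE_T ReturnLength);

BOOL IsPageTampered(PVOID address)
{
    MEMORY_WORKING_SET_EX_INFORMATION info = { 0 };
    NtQueryVirtualMemory_t pNtQueryVirtualMemory =
        (NtQueryVirtualMemory_t)GetProcAddress(GetModuleHandleW(L"ntdll.dll"),
                                               "NtQueryVirtualMemory");

    info.VirtualAddress = address;
    if (!pNtQueryVirtualMemory ||
        pNtQueryVirtualMemory(GetCurrentProcess(), address,
                              4 /* MemoryWorkingSetExInformation */,
                              &info, sizeof(info), NULL) < 0)
    {
        return FALSE;
    }

    // For a valid page, a cleared SharedOriginal bit means this is no longer
    // the original shared mapping - a private copy was created by a write.
    return info.u1.VirtualAttributes.Valid &&
           !info.u1.VirtualAttributes.SharedOriginal;
}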

If the function returned TRUE for any of the checked ranges, LdrpMapCleanModuleView gets called:

This function is short and simple: it calls NtQueryInformationProcess with InformationClass == 89 (ProcessImageSection) to fetch the section handle for the main image, then re-maps it using NtMapViewOfSection and closes the handle. It writes the address of the new mapping to DataTableEntry->SwitchBackContext, to be used instead of the original tampered mapping.

Why does this feature choose to check specifically these two ranges for tampering – the import table and the NT headers?

That’s because these are two places that will often be targeted by an attacker trying to hollow the process. If the main image is unmapped and replaced by a malicious image, the NT headers will be different and be considered tampered. Process hollowing can also tamper with the import table, to point to different functions than the ones the process expects. So, this is mostly an anti-hollowing feature, targeted to spotting tampering attempts in the main image, and replacing it with a fresh copy of the image that hasn’t been tampered with.

Limitations

Unfortunately, this feature is relatively limited. You can enable or disable it, and that’s about it. The functions implementing the mitigation are internal and can’t be called externally, so, for example, extending the mitigation to other modules is not possible unless you write the code for it yourself (and map the modules manually, since the section handles for those aren’t conveniently stored anywhere). Additionally, this mitigation contains no logging or ETW events. When the mitigation notices tampering in the main image it’ll silently map and use a new copy, leaving no trace for security products or teams to find. The only hint will be that NtMapViewOfSection will be called again for the main image and generate an ETW event and kernel callback. But this is likely to go unnoticed, as it doesn’t necessarily mean something bad happened and will probably not lead to any alerts or significant investigation of what might be a real attack.

On the bright side, this mitigation is extremely simple and useful, and very easy to mimic if you want to implement it for other use cases, such as detecting hooks placed on your process and mapping a fresh, unhooked copy of the page to use. You can do that instead of using direct system calls!
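
For example, here’s a hedged sketch of the “map a clean copy” idea using only documented APIs (rather than the internal ProcessImageSection path): map a fresh SEC_IMAGE view of the module from disk and read the original, unhooked bytes from it.

// Hedged sketch: map a clean SEC_IMAGE view of a module from its file on disk,
// e.g. to compare against the loaded (possibly hooked) copy in the process.
PVOID MapCleanCopyOfModule(LPCWSTR modulePath)
{
    HANDLE file = CreateFileW(modulePath, GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (file == INVALID_HANDLE_VALUE)
    {
        return NULL;
    }

    // SEC_IMAGE maps the file with the same layout the loader would use.
    HANDLE section = CreateFileMappingW(file, NULL, PAGE_READONLY | SEC_IMAGE,
                                        0, 0, NULL);
    CloseHandle(file);
    if (!section)
    {
        return NULL;
    }

    PVOID view = MapViewOfFile(section, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(section);   // the view keeps the section alive
    return view;            // caller unmaps with UnmapViewOfFile
}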

Who Uses This?

Running a query in WinDbg, I find no results for any process enabling module tampering protection. After a bit of probing around I managed to find only one process that enables it: SystemSettingsAdminFlows.exe. This process is executed when you open Apps->Optional Features in the Windows Settings menu. I don’t know why this specific process uses the mitigation, or why it’s the only one that does, but so far it’s the only one I’ve managed to find.
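
A query along these lines (from a kernel debugging session, using the same MitigationFlagsValues view shown earlier) does the job:

dx @$cursession.Processes.Where(p => p.KernelObject.MitigationFlagsValues.EnableModuleTamperingProtection != 0).Select(p => p.Name)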

Conclusion

I tried to use this post to show a bit more of the work involved in analyzing an unknown feature and to demonstrate some of the steps I take to scope and learn about a new piece of code. I hope this has been helpful and gave some of you useful tips on how to approach a new research topic!

One I/O Ring to Rule Them All: A Full Read/Write Exploit Primitive on Windows 11

This blog post will cover the post-exploitation technique I presented at TyphoonCon 2022. For anyone interested in the talk itself, I’ll link the recording here when it becomes available.
This technique is a post-exploitation primitive unique to Windows 11 22H2+ – there are no 0-days here. Instead, there’s a method to turn an arbitrary write (or even an arbitrary increment) bug in the Windows kernel into a full read/write of kernel memory.

Background

Kernel exploitation (and exploitation in general) on Windows is becoming harder with every new version. Driver Signature Enforcement made it harder for an attacker to load unsigned drivers, and later HVCI made it entirely impossible – with the added difficulty of a driver block list, preventing attackers from loading signed vulnerable drivers. SMEP and KCFG mitigate against code redirection through function pointer overwrites, and KCET makes ROP impossible as well. Other VBS features such as KDP protect kernel data, so common targets such as g_CiOptions can no longer be modified by an attacker. And on top of those, there are Patch Guard and Secure Kernel Patch Guard which validate the integrity of the kernel and many of its components.

With all the existing mitigations, just finding a user->kernel bug no longer guarantees successful exploitation. In Windows 11 with all mitigations enabled, it’s nearly impossible to achieve Ring 0 code execution. However, data-based attacks are still a viable option.

A known technique for a data-only attack is to create a fake kernel-mode structure in user mode, then trick the kernel into using it through a write-what-where bug (or any other bug type that can achieve that). The kernel will treat this structure like valid kernel data, allowing the attacker to achieve privilege escalation by manipulating the data in the structure, and thus the kernel actions that are based on it. There are numerous examples of this technique being used in different ways. For example, this blog post by J00ru demonstrates using a fake token table to turn an off-by-one bug into an arbitrary write, and later using that to run shellcode in ring 0. Many other examples take advantage of different Win32k objects to achieve arbitrary read, write or both. Some of these techniques have already been mitigated by Microsoft, others are already known and hunted for by security products, and others are still usable and most likely used in the wild.

In this post I’d like to add one more technique to the pile – using I/O ring preregistered buffers to create a read/write primitive, using 1-2 arbitrary kernel writes (or increments). This technique uses a new object type that currently has very limited visibility to security products and is likely to be ignored for a while. The method is very simple to use – once you understand the underlying mechanism of I/O ring.

I/O Ring

I already wrote several blog posts (and a talk) about I/O rings so I’ll just present the basic idea and the parts relevant to this technique. Anyone interested in learning more about it can read the previous posts on the topic or watch the talk from P99 Conf.

In short, I/O ring is a new asynchronous I/O mechanism that allows an application to queue as many as 0x10000 I/O operations and submit them all at once, using a single API call. The mechanism was modeled after the Linux io_uring, so the design of the two is very similar. For now, I/O rings don’t support every possible I/O operation yet. The available operations in Windows 11 22H2 are read, write, flush and cancel. The requested operations are written into a Submission Queue, and then submitted all together. The kernel processes the requests and writes the status codes into a Completion Queue – both queues are in a shared memory region accessible to both user mode and kernel mode, allowing sharing of data without the overhead of multiple system calls.

In addition to the available I/O operations, the application can queue two more types of operations unique to I/O ring: preregister buffers and preregister files. These options allow an application to open all the file handles or create all the input/output buffers ahead of time, register them and later reference them by index in I/O operations queued through the I/O ring. When the kernel processes an entry that uses a preregistered file handle or buffer, it fetches the requested handle/buffer from the preregistered array and passes it on to the I/O manager where it is handled normally.
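
As a quick example of the documented side of this, pre-registering a couple of buffers from user mode should look roughly like this (BuildIoRingRegisterBuffers and IORING_BUFFER_INFO are from ioringapi.h; error handling omitted):

// Hedged sketch: pre-register two buffers with an existing I/O ring so that
// later queue entries can reference them by index instead of by pointer.
HRESULT RegisterTwoBuffers(HIORING ioring, void* buf0, UINT32 len0,
                           void* buf1, UINT32 len1)
{
    IORING_BUFFER_INFO buffers[2];
    buffers[0].Address = buf0;
    buffers[0].Length = len0;
    buffers[1].Address = buf1;
    buffers[1].Length = len1;

    // The registration is queued like any other submission queue entry and
    // takes effect when the ring is submitted with SubmitIoRing.
    return BuildIoRingRegisterBuffers(ioring, ARRAYSIZE(buffers), buffers, 0);
}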

For the visual learners, here’s an example of a queue entry using a preregistered file handle and buffer:

A submission queue that’s ready to be submitted to the kernel could look something like this:

The exploitation technique discussed here takes advantage of the preregistered buffers array, so let’s go into a bit more detail there:

Registered Buffers

As I mentioned, one of the operations an application can do is allocate all the buffers for its future I/O operations, then register them with the I/O ring. The preregistered buffers are referenced through the I/O ring object:

typedef struct _IORING_OBJECT
{
    USHORT Type;
    USHORT Size;
    NT_IORING_INFO UserInfo;
    PVOID Section;
    PNT_IORING_SUBMISSION_QUEUE SubmissionQueue;
    PMDL CompletionQueueMdl;
    PNT_IORING_COMPLETION_QUEUE CompletionQueue;
    ULONG64 ViewSize;
    ULONG InSubmit;
    ULONG64 CompletionLock;
    ULONG64 SubmitCount;
    ULONG64 CompletionCount;
    ULONG64 CompletionWaitUntil;
    KEVENT CompletionEvent;
    UCHAR SignalCompletionEvent;
    PKEVENT CompletionUserEvent;
    ULONG RegBuffersCount;
    PVOID RegBuffers;
    ULONG RegFilesCount;
    PVOID* RegFiles;
} IORING_OBJECT, *PIORING_OBJECT;

When the request gets processed, the following things happen:

  1. IoRing->RegBuffers and IoRing->RegBuffersCount get set to zero.
  2. The kernel validates that Sqe->RegisterBuffers.Buffers and Sqe->RegisterBuffers.Count are both not zero.
  3. If the request came from user mode, the array is probed to validate that it’s fully in the user-mode address space. The count is a ULONG, so the array can contain up to MAXULONG entries.
  4. If the ring previously had a preregistered buffers array and the size of the new buffer is the same as the size of the old buffer, the old buffer array is placed back in the ring and the new buffer is ignored.
  5. If the previous checks pass and the new buffer array is to be used, a new paged pool allocation is made – this will be used to copy the data from the user mode array and will be pointed to by IoRing->RegBuffers.
  6. If there’s previously been a registered buffers array pointed to by the I/O ring, it gets copied into the new kernel array. Any new buffers will be added in the same allocation, after the old buffers.
  7. Every entry in the array sent from user mode is probed to validate that the requested buffer is fully in user mode, then gets copied to the kernel array.
  8. The old kernel array (if one existed) is freed, and the operation is completed.

This whole process is safe – the data is only read from user mode once, probed and validated correctly to avoid overflows and accidental reads or writes of kernel addresses. Any future use of these buffers will fetch them from the kernel buffer.

But what if we already have an arbitrary kernel write bug?

In that case, we can overwrite a single pointer – IoRing->RegBuffers – to point it to a fake buffer array that is fully under our control. We can populate it with kernel-mode addresses and use those as buffers in I/O operations. When the buffers are referenced by index they don’t get probed – the kernel assumes that if the buffers were safe when they were registered, then copied to a kernel allocation, they’re still safe when they’re referenced as part of an operation.

This means that with a single arbitrary write and a fake buffer array we can get full control of the kernel address space through read and write operations.

The Primitive

Once IoRing->RegBuffers points to the fake, user controlled array, we can use normal I/O ring operations to generate kernel reads and writes into whichever addresses we want by specifying an index into our fake array to use as a buffer:

  1. Read operation + kernel address: The kernel will “read” from a file of our choice into the specified kernel address, leading to arbitrary write.
  2. Write operation + kernel address: The kernel will “write” the data in the specified address into a file of our choice, leading to arbitrary read.

Initially my primitive relied on files to read and write to, but Alex suggested the use of named pipes instead which is way cooler and a lot less visible, leaving no traces on disk. So, the rest of the post + the exploit code will be using named pipes.

As you can see, the technique itself is pretty simple. So simple, in fact, that it doesn’t even require the use of any (well, almost any) undocumented APIs or secret data structures. It uses Win32 APIs and structures that are available in the public symbols of ntoskrnl.exe. The exploit primitive involves the following steps:

  1. Create two named pipes with CreateNamedPipe: one will be used for input for arbitrary kernel writes and the other for output for arbitrary kernel reads. At least the pipe that’ll be used as input should be created with flag PIPE_ACCESS_DUPLEX to allow both reading and writing. I chose to create both with PIPE_ACCESS_DUPLEX for convenience.
  2. Open client handles for both pipes with CreateFile, both with read and write permissions.
  3. Create an I/O ring: this can be done through CreateIoRing API.
  4. Allocate a fake buffers array in the heap: In 22H2, the registered buffers array is a flat array, each entry containing a buffer address and length, so this is easy to allocate and set up.
  5. Find the address of the newly created I/O ring object: since I/O rings use a new object type, IORING_OBJECT, we can leak its address through a well-known KASLR bypass technique. NtQuerySystemInformation with SystemHandleInformation leaks the kernel addresses of objects, including our new I/O ring object. Fortunately, the internal structure of IORING_OBJECT is in the public symbols so there’s no need to reverse engineer the structure to find the offset of RegBuffers. We add the two together to get the target for our arbitrary write.
    Unfortunately, this API as well as many other KASLR bypasses can only be used by processes with Medium IL or higher, so Low IL processes, sandboxed processes and browsers can’t use it and will have to find a different method.
  6. Use your preferred arbitrary write bug to overwrite IoRing->RegBuffers with the address of the fake user-mode array. Notice that if you haven’t previously registered a valid buffers array you’ll also have to overwrite IoRing->RegBuffersCount to have a non-zero value.
  7. Populate the fake buffers array with kernel pointers to read or write to: to do this you might need other KASLR bypasses in order to find your target addresses. You could use NtQuerySystemInformation with SystemModuleInformation class to find the base addresses of kernel modules, use the same technique as earlier to find kernel addresses of objects, or use the pointers available inside the I/O ring itself, which point to data structures in the paged pool.
  8. Queue read and write operations in the I/O ring through BuildIoRingReadFile and BuildIoRingWriteFile.

With this method, arbitrary reads and writes aren’t limited to a pointer size, like many other methods, but can be as large as the buffer’s ULONG length field allows, reading or writing many pages of kernel data simultaneously.
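
To make steps 4 and 7 from the list above a bit more concrete, here’s a hedged sketch of the fake array for the 22H2 flat layout – the kernel target address would come from whatever KASLR bypass you’re using:

// Hedged sketch of steps 4 and 7: build a fake registered-buffers array in
// user mode. On 22H2 the kernel-side array is a flat array of address+length
// pairs, matching the layout of IORING_BUFFER_INFO.
IORING_BUFFER_INFO* BuildFakeBufferArray(PVOID kernelTarget, UINT32 length)
{
    IORING_BUFFER_INFO* fakeBuffers =
        (IORING_BUFFER_INFO*)VirtualAlloc(NULL, sizeof(IORING_BUFFER_INFO),
                                          MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (fakeBuffers)
    {
        // Entry 0 points at the kernel address we want to read from or write to.
        // After overwriting IoRing->RegBuffers with the address of fakeBuffers
        // (and RegBuffersCount with at least 1), buffer index 0 can be used in
        // BuildIoRingReadFile / BuildIoRingWriteFile operations.
        fakeBuffers[0].Address = kernelTarget;
        fakeBuffers[0].Length = length;
    }
    return fakeBuffers;
}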

Cleanup

This technique requires minimal cleanup: all that’s required is to set IoRing->RegBuffers to zero before closing the handle to the I/O ring object. As long as the pointer is zero, the kernel won’t try to free anything even if IoRing->RegBuffersCount is non-zero.

Cleanup gets slightly more complicated if you choose to first register a valid buffer array and then overwrite the existing pointer in the I/O ring object – in that case there is already an allocated kernel buffer, which also holds a reference count on the EPROCESS object. That reference count will need to be decremented before the process exits to avoid leaving a stale process around. Luckily, that’s easy to do with one more arbitrary read + write using our existing technique.

Arbitrary Increment

A couple years ago I published a series of blogs discussing CVE-2020-1034 – an arbitrary increment vulnerability in EtwpNotifyGuid. Back then, I focused on the challenges of exploiting this bug and used it to increment the process’ token privileges – a very well known privilege escalation technique. This method works, though it’s possible to detect in real time or retroactively using different tools. Security vendors are well aware of this technique and many already detect it.

That project made me interested in other ways to exploit that specific bug class – an arbitrary increment of a kernel address – so I was very happy to find a post-exploitation technique that finally fit. With the method presented here, you can use an arbitrary increment to bump IoRing->RegBuffers from 0 to a user-mode address such as 0x1000000 (no need for 0x1000000 increments – just increment the byte at offset 3 by one) and increment IoRing->RegBuffersCount from 0 to 1 or 0x100 (or more). This does require you to trigger the bug twice to set up the technique, but I recommend doing that anyway to avoid the extra cleanup required when overwriting an existing pointer.

Forensics and Detection

This post-exploitation technique has very little visibility and leaves few forensic traces: I/O rings have nearly no ETW visibility except on creation, and the technique leaves nothing behind in memory. The only part of it that is visible to security products is the named pipe operations, which can be seen by products that use a file system filter driver (and most do). However, these pipes are local and aren’t used for anything that looks too suspicious – they read and write small amounts of data with no specific format, so they’re unlikely to raise any alerts.

Portable Features = Portable Exploits?

I/O rings on Windows were modeled after the Linux io_uring and share many of the same features, and this one is no different. The Linux io_uring also allows registering buffers or file handles, and the registered buffers are handled very similarly and stored in the user_bufs field of the ring. This means that the same exploitation technique should also work on Linux (though I haven’t personally tested it).

The main difference between the two systems in this case is mitigation: while on Windows it’s difficult to mitigate against this technique, Linux has a mitigation that makes blocking this technique (at least in its current form) trivial: SMAP. This mitigation prevents access to user-mode addresses with kernel-mode privileges, blocking any exploitation technique that involves faking a kernel structure in user-mode. Unfortunately due to the basic design of the Windows system it’s unlikely SMAP will ever be a usable mitigation there, but it’s been available and used on Linux since 2012.

Of course there are still ways to bypass SMAP, such as shaping a kernel pool allocation to be used as the fake buffers array instead of a user-mode address or editing the PTE of the user-mode page that contains the fake array, but the basic exploitation primitive won’t work on systems that support SMAP.

23H2 Changes

The preview builds for 23H2 have a change that affects this technique, but only slightly. Since Windows 11 build 22610 the buffer array in the kernel is no longer a flat array of addresses and lengths, but instead an array of pointers to a new data structure: IOP_MC_BUFFER_ENTRY:

typedef struct _IOP_MC_BUFFER_ENTRY
{
    USHORT Type;
    USHORT Reserved;
    ULONG Size;
    ULONG ReferenceCount;
    ULONG Flags;
    LIST_ENTRY GlobalDataLink;
    PVOID Address;
    ULONG Length;
    CHAR AccessMode;
    ULONG MdlRef;
    PMDL Mdl;
    KEVENT MdlRundownEvent;
    PULONG64 PfnArray;
    IOP_MC_BE_PAGE_NODE PageNodes[1];
} IOP_MC_BUFFER_ENTRY, *PIOP_MC_BUFFER_ENTRY;

This data structure is used as part of the MDL cache capability that was added in the same build. It looks complex and scary, but in our use-case most of these fields are never used and can be ignored. We still have the same Address and Length fields that we need for our technique to work, and to be compatible with the requirements of the new feature we also need to hardcode a few values in the fields Type, Size, AccessMode and ReferenceCount.

To adapt our technique to this new addition, here are the changes needed in our code:

  1. Allocate a fake buffers array, sized sizeof(PVOID) * NumberOfEntries.
  2. Allocate a IOP_MC_BUFFER_ENTRY structure for each fake buffer and place the pointer into the fake buffers array. Zero out the structure, then set the following fields:

    mcBufferEntry->Address = TargetAddress;
    mcBufferEntry->Length = Length;
    mcBufferEntry->Type = 0xc02;
    mcBufferEntry->Size = 0x80; // 0x20 * (numberOfPagesInBuffer + 3)
    mcBufferEntry->AccessMode = 1;
    mcBufferEntry->ReferenceCount = 1;
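
Putting those two steps together, a hedged sketch of building the 23H2-style fake array (using the IOP_MC_BUFFER_ENTRY definition above and the hardcoded values from the snippet) could look like this:

// Hedged sketch: the 23H2-style fake array is an array of pointers, each
// pointing to an IOP_MC_BUFFER_ENTRY with only the checked fields filled in.
PVOID* BuildFakeMcBufferArray(PVOID kernelTarget, ULONG length)
{
    PVOID* fakeArray =
        (PVOID*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, sizeof(PVOID));
    IOP_MC_BUFFER_ENTRY* mcBufferEntry =
        (IOP_MC_BUFFER_ENTRY*)HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
                                        sizeof(IOP_MC_BUFFER_ENTRY));
    if (!fakeArray || !mcBufferEntry)
    {
        return NULL;
    }

    mcBufferEntry->Address = kernelTarget;
    mcBufferEntry->Length = length;
    mcBufferEntry->Type = 0xc02;
    mcBufferEntry->Size = 0x80;       // 0x20 * (numberOfPagesInBuffer + 3)
    mcBufferEntry->AccessMode = 1;    // UserMode
    mcBufferEntry->ReferenceCount = 1;

    fakeArray[0] = mcBufferEntry;
    return fakeArray;
}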

The PoC

I uploaded my PoC here. It works starting with 22H2 (the minimum supported version – before this build, I/O rings didn’t yet support write operations) and up to the latest Windows Preview build (25415 as of today). For my arbitrary write/increment bugs I used the HEVD driver, recompiled to support arbitrary increments. The PoC supports both options, but if you use the latest HEVD release only the arbitrary write option will work.

For the arbitrary read target, I used a page from the ntoskrnl.exe data section – the offset of the section is hardcoded due to laziness, so it might break spontaneously when that offset changes.

One Year to I/O Ring: What Changed?

It’s been just over a year since the first version of I/O ring was introduced into Windows. The initial version was introduced in Windows 21H2 and I did my best to document it here, with a comparison to the Linux io_uring here. Microsoft also documented the Win32 functions. Since that initial version this feature progressed and received pretty significant changes and updates, so it deserves a follow-up post documenting all of them and explaining them in more detail.

New Supported Operations

Looking at the changes, the first and most obvious thing we can see is that two new operations are now supported – write and flush:

These allow using the I/O ring to perform write and flush operations. These new operations are processed and handled similarly to the read operation that’s been supported since the first version of I/O rings and forwarded to the appropriate I/O functions. New wrapper functions were added to KernelBase.dll to queue requests for these operations: BuildIoRingWriteFile and BuildIoRingFlushFile, and their definitions can be found in the ioringapi.h header file (available in the preview SDK):

STDAPI
BuildIoRingWriteFile (
    _In_ HIORING ioRing,
    IORING_HANDLE_REF fileRef,
    IORING_BUFFER_REF bufferRef,
    UINT32 numberOfBytesToWrite,
    UINT64 fileOffset,
    FILE_WRITE_FLAGS writeFlags,
    UINT_PTR userData,
    IORING_SQE_FLAGS sqeFlags
);

STDAPI
BuildIoRingFlushFile (
    _In_ HIORING ioRing,
    IORING_HANDLE_REF fileRef,
    FILE_FLUSH_MODE flushMode,
    UINT_PTR userData,
    IORING_SQE_FLAGS sqeFlags
);

Similarly to BuildIoRingReadFile, both of these build the submission queue entry with the requested OpCode and add it to the submission queue. Obviously, there are different flags and options needed for the new operations, such as the flushMode for flush operations or writeFlags for writes. To handle that, the NT_IORING_SQE structure now contains a union for the input data that gets interpreted according to the requested OpCode – the new structure is available in the public symbols and also at the end of this post.
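
As a quick usage sketch, queueing a write through the new wrapper should look roughly like this (the handle/buffer ref helpers and the FILE_WRITE_FLAGS_NONE / IOSQE_FLAGS_NONE values are the ones exposed by ioringapi.h; the entry is only processed once the ring is submitted):

// Hedged usage sketch: queue a write of a user buffer to an open file handle
// through an existing I/O ring.
HRESULT QueueWrite(HIORING ioring, HANDLE file, void* buffer, UINT32 size)
{
    IORING_HANDLE_REF fileRef = IoRingHandleRefFromHandle(file);
    IORING_BUFFER_REF bufferRef = IoRingBufferRefFromPointer(buffer);

    return BuildIoRingWriteFile(ioring, fileRef, bufferRef, size,
                                0,                      // file offset
                                FILE_WRITE_FLAGS_NONE,
                                0,                      // userData
                                IOSQE_FLAGS_NONE);
}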

One small kernel change that was added to support write operations can be seen in IopIoRingReferenceFileObject:

There are a few new arguments and an additional call to ObReferenceFileObjectForWrite. Probing of the different buffers across the various functions also changed depending on the operation type.

User Completion Event

Another interesting change is the ability to register a user event that gets notified for every newly completed operation. Unlike the I/O ring’s CompletionEvent, which only gets signaled when all operations are complete, the new optional user event is signaled for every newly completed operation, allowing the application to process the results as they are being written to the completion queue.

To support this new functionality, another system call was created: NtSetInformationIoRing:

NTSTATUS
NtSetInformationIoRing (
    HANDLE IoRingHandle,
    ULONG IoRingInformationClass,
    ULONG InformationLength,
    PVOID Information
);

Like other NtSetInformation* routines, this function receives a handle to the IoRing object, an information class, a length and the data. Only one information class is currently valid: 1. The information class type is unfortunately not in the public symbols so we can’t know its official name, but I’ll call it IoRingRegisterUserCompletionEventClass. Even though only one class is currently supported, more might be added in the future. One interesting thing here is that the function uses a global array, IopIoRingSetOperationLength, to retrieve the expected information length for each information class:

The array currently only has two entries: 0, which isn’t actually a valid class and returns a length of 0, and entry 1 which returns an expected size of 8. This length matches the function’s expectation to receive an event handle (HANDLEs are 8 bytes on x64). This could be a hint that more information classes are planned in the future, or just a different coding choice.

After the necessary input checks, the function references the I/O ring whose handle was sent to it. Then, if the information class is IoRingRegisterUserCompletionEventClass, it calls IopIoRingUpdateCompletionUserEvent with the supplied event handle. IopIoRingUpdateCompletionUserEvent will reference the event and place the pointer in IoRingObject->CompletionUserEvent. If no event handle is supplied, the CompletionUserEvent field is cleared:

The RE Corner

On a side note, this function might look rather large and mildly threatening, but most of it is simply synchronization code to guarantee that only one thread can edit the CompletionUserEvent field of the I/O ring at any point and prevent race conditions. In fact, the compiler makes the function look larger than it actually is since it expands macros, so if we try to reconstruct the source code, the function looks much cleaner:

NTSTATUS
IopIoRingUpdateCompletionUserEvent (
    PIORING_OBJECT IoRingObject,
    PHANDLE EventHandle,
    KPROCESSOR_MODE PreviousMode
    )
{
    PKEVENT completionUserEvent;
    HANDLE eventHandle;
    NTSTATUS status;
    PKEVENT oldCompletionEvent;
    PKEVENT eventObj;

    completionUserEvent = 0;
    eventHandle = *EventHandle;
    if (!eventHandle ||
        (eventObj = 0,
        status = ObReferenceObjectByHandle(
                 eventHandle, PAGE_READONLY, ExEventObjectType, PreviousMode, &eventObj, 0),
        completionUserEvent = eventObj,
        NT_SUCCESS(status)))
    {
        KeAcquireSpinLockRaiseToDpc(&IoRingObject->CompletionLock);
        oldCompletionEvent = IoRingObject->CompletionUserEvent;
        IoRingObject->CompletionUserEvent = completionUserEvent;
        KeReleaseSpinLock(&IoRingObject->CompletionLock);
        if (oldCompletionEvent)
        {
            ObDereferenceObjectWithTag(oldCompletionEvent, 'tlfD');
        }
        return STATUS_SUCCESS;
    }
    return status;
}

That’s it, around six lines of actual code. But, that is not the point of this post, so let’s get back to the topic at hand: the new CompletionUserEvent.

Back to the User Completion Event

The next time we run into CompletionUserEvent is when an IoRing entry is completed, in IopCompleteIoRingEntry:

While the normal I/O ring completion event is only signaled once all operations are complete, the CompletionUserEvent is signaled under different conditions. Looking at the code, we see the following check:

Every time an I/O ring operation is complete and written into the completion queue, the CompletionQueue->Tail field gets incremented by one (referenced here as newTail). The CompletionQueue->Head field contains the index of the next completion entry to be processed by the application, and gets incremented every time the application processes another entry (if you use PopIoRingCompletion it’ll do that internally, otherwise you need to increment it yourself). So, (newTail - Head) % CompletionQueueSize calculates the number of completed entries that have not yet been processed by the application. If that amount is one, the application has processed all completed entries except the latest one, which is being completed right now. In that case, the function will reference the CompletionUserEvent and then call KeSetEvent to signal it.

This behavior allows the application to follow along with the completion of all its submitted operations by creating a thread whose purpose is to wait on the user event and process every newly completed entry as soon as it’s completed. This keeps the Head and Tail of the completion queue in sync, so the next entry to be completed will signal the event, the thread will process the entry, and so on. This way the main thread of the application can keep doing other work, but the I/O operations all get processed as soon as possible by the worker thread.

Of course, this is not mandatory. An application might choose to not register a user event and simply wait for the completion of all events. But the two events allow different applications to choose the option that works best for them, creating an I/O completion mechanism that can be adjusted to suit different needs.

There is a function in KernelBase.dll to register the user completion event: SetIoRingCompletionEvent. We can find its signature in ioringapi.h:

STDAPI
SetIoRingCompletionEvent (
    _In_ HIORING ioRing,
    _In_ HANDLE hEvent
);

Using this new API and knowing how this new event operates, we can build a demo application that would look something like this:

HANDLE g_event;

DWORD
WaitOnEvent (
    LPVOID lpThreadParameter
    )
{
    HRESULT result;
    IORING_CQE cqe;

    WaitForSingleObject(g_event, INFINITE);
    while (TRUE)
    {
        //
        // lpThreadParameter is the handle to the ioring
        //
        result = PopIoRingCompletion((HIORING)lpThreadParameter, &cqe);
        if (result == S_OK)
        {
            /* do things */
        }
        else
        {
            WaitForSingleObject(g_event, INFINITE);
            ResetEvent(g_event);
        }
    }
    return 0;
}

int
main ()
{
    HRESULT result;
    HIORING ioring = NULL;
    IORING_CREATE_FLAGS flags;
    HANDLE thread;
    DWORD threadId;
    UINT32 submittedEntries = 0;

    flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
    flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;
    result = CreateIoRing(IORING_VERSION_3, flags, 0x10000, 0x20000, &ioring);

    /* Queue operations to ioring... */

    //
    // Create user completion event, register it to the ioring
    // and create a thread to wait on it and process completed operations.
    // The ioring handle is sent as an argument to the thread.
    //
    g_event = CreateEvent(NULL, FALSE, FALSE, NULL);
    result = SetIoRingCompletionEvent(ioring, g_event);
    thread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)WaitOnEvent, ioring, 0, &threadId);
    result = SubmitIoRing(ioring, 0, 0, &submittedEntries);

    /* Clean up... */

    return 0;
}

Drain Preceding Operations

The user completion event is a very cool addition, but it’s not the only waiting-related improvement to I/O rings. Another one can be found by looking at the NT_IORING_SQE_FLAGS enum:

typedef enum _NT_IORING_SQE_FLAGS
{
    NT_IORING_SQE_FLAG_NONE = 0x0,
    NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS = 0x1,
} NT_IORING_SQE_FLAGS, *PNT_IORING_SQE_FLAGS;

Looking through the code, we can find a check for NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS right in the beginning of IopProcessIoRingEntry:

This check happens before any processing is done, to check if the submission queue entry contains the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag. If so, IopIoRingSetupCompletionWait is called to set up the wait parameters. The function signature looks something like this:

NTSTATUS
IopIoRingSetupCompletionWait (
    _In_ PIORING_OBJECT IoRingObject,
    _In_ ULONG SubmittedEntries,
    _In_ ULONG WaitOperations,
    _In_ BOOL SetupCompletionWait,
    _Out_ PBYTE CompletionWait
);

Inside the function there are a lot of checks and calculations that are both very technical and very boring, so I’ll spare myself the need to explain them and you the need to read through the exhausting explanation and skip to the good parts. Essentially, if the function receives -1 as the WaitOperations, it will ignore the SetupCompletionWait argument and calculate the number of operations that have already been submitted and processed but not yet completed. That number gets placed in IoRingObject->CompletionWaitUntil. It also sets IoRingObject->SignalCompletionEvent to TRUE and returns TRUE in the output argument CompletionWait.

If the function succeeded, IopProcessIoRingEntry will then call IopIoRingWaitForCompletionEvent, which will wait until IoRingObject->CompletionEvent is signaled. Now is the time to go back to the check we’ve seen earlier in IopCompleteIoRingEntry:

If SignalCompletionEvent is set (which it is, because IopIoRingSetupCompletionWait set it) and the number of completed events is equal to IoRingObject->CompletionWaitUntil, IoRingObject->CompletionEvent will get signaled to mark that the pending events are all completed. SignalCompletionEvent also gets cleared to avoid signaling the event again when it’s not requested.

When called from IopProcessIoRingEntry, IopIoRingWaitForCompletionEvent receives a timeout of NULL, meaning that it’ll wait indefinitely. This is something that should be taken under consideration when using the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag.

So to recap, setting the NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS flag in a submission queue entry will make sure all preceding operations are completed before this entry gets processed. This might be needed in certain cases where one I/O operation relies on an earlier one.
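
A hedged usage sketch, assuming the public IORING_SQE_FLAGS enum exposes a matching IOSQE_FLAGS_DRAIN_PRECEDING_OPS value for the kernel-side flag shown above:

// Hedged sketch: queue a read that won't be processed until every operation
// queued before it has completed. The user-mode flag name here is an
// assumption mirroring the kernel-side NT_IORING_SQE_FLAG_DRAIN_PRECEDING_OPS.
HRESULT QueueDrainingRead(HIORING ioring, HANDLE file, void* buffer, UINT32 size)
{
    IORING_HANDLE_REF fileRef = IoRingHandleRefFromHandle(file);
    IORING_BUFFER_REF bufferRef = IoRingBufferRefFromPointer(buffer);

    return BuildIoRingReadFile(ioring, fileRef, bufferRef, size,
                               0,                       // file offset
                               0,                       // userData
                               IOSQE_FLAGS_DRAIN_PRECEDING_OPS);
}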

But waiting on pending operations happens in one more case: When submitting an I/O ring. In my first post about I/O rings last year, I defined the NtSubmitIoRing signature like this:

NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_ ULONG EntryCount,
    _In_ PLARGE_INTEGER Timeout
    );

My definition ended up not being entirely accurate. The more correct name for the third argument would be WaitOperations, so the accurate signature is:

NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_opt_ ULONG WaitOperations,
    _In_opt_ PLARGE_INTEGER Timeout
    );

Why does this matter? Because the number you pass in WaitOperations isn’t used to process the ring entries (they are processed entirely based on SubmissionQueue->Head and SubmissionQueue->Tail), but to request the number of operations to wait on. So, if WaitOperations is not 0, NtSubmitIoRing will call IopIoRingSetupCompletionWait before doing any processing:

However, it calls the function with SetupCompletionWait=FALSE, so the function won’t actually set up any of the wait parameters, but will only perform the sanity checks to see if the number of wait operations is valid. For example, the number of wait operations can’t be higher than the number of operations that were submitted. If the checks fail, NtSubmitIoRing won’t process any of the entries and will return an error, usually STATUS_INVALID_PARAMETER_3.

Later, we see both functions again after operations have been processed:

IopIoRingSetupCompletionWait is called again to recalculate the number of operations that need to be waited on, taking into consideration any operations that might have already been completed (or waited on already if any of the SQEs had the flag mentioned earlier). Then IopIoRingWaitForCompletionEvent is called to wait on IoRingObject->CompletionEvent until all requested events have been completed.
In most cases applications will choose to either send 0 as the WaitOperations argument or set it to the total number of submitted operations, but there may be cases where an application could want to only wait on part of the submitted operations, so it can choose a lower number to wait on.
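
Through the Win32 wrapper this is simply the second parameter of SubmitIoRing. For example, to wait for only some of the queued operations with a timeout (assuming ioring is an open HIORING with entries already queued):

// Submit the ring, wait for up to 4 of the queued operations to complete,
// and give up after one second (values are illustrative).
UINT32 submittedEntries = 0;
HRESULT result = SubmitIoRing(ioring, 4, 1000, &submittedEntries);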

Looking at Bugs

Comparing the same piece of code in different builds is a fun way of finding bugs that were fixed. Sometimes these are security vulnerabilities that got patched, sometimes just regular old bugs that can affect the stability or reliability of the code. The I/O ring code in the kernel received a lot of modifications over the past year, so this seems like a good chance to go hunting for old bugs.

One bug that I’d like to focus on here is pretty easy to spot and understand, but is a fun example of the way different parts of the system that seem entirely unrelated can clash in unexpected ways. This is a functional (not security) bug that prevented WoW64 processes from using some of the I/O ring features.

We can find evidence of this bug when looking at IopIoRingDispatchRegisterBuffers and IopIoRingDispatchRegisterFiles. When looking at the new build we can see a piece of code that wasn’t there in earlier versions:

This is checking whether the process that is registering the buffers or files is a WoW64 process – a 32-bit process running on top of a 64-bit system. Since Windows now supports ARM64, this WoW64 process can be either an x86 application or an ARM32 one.

Looking further ahead can show us why this information matters here. Later on, we see two cases where isWow64 is checked:

This first case is when the array size is being calculated to check for invalid sizes if the caller is UserMode.

This second case happens when iterating over the input buffer to register the buffers in the array that will be stored in the I/O ring object. In this case it’s slightly harder to understand what we’re looking at because of the way the structures are handled here, but if we look at the disassembly it might become a bit clearer:

The block on the left is the WoW64 case and the block on the right is the native case. Here we can see the difference in the offset that is being accessed in the bufferInfo variable (r8 in the disassembly). To get some context, bufferInfo is read from the submission queue entry:

bufferInfo = Sqe->RegisterBuffers.Buffers;

When registering a buffer, the SQE will contain a NT_IORING_OP_REGISTER_BUFFERS structure:

typedef struct _NT_IORING_OP_REGISTER_BUFFERS
{
    /* 0x0000 */ NT_IORING_OP_FLAGS CommonOpFlags;
    /* 0x0004 */ NT_IORING_REG_BUFFERS_FLAGS Flags;
    /* 0x000c */ ULONG Count;
    /* 0x0010 */ PIORING_BUFFER_INFO Buffers;
} NT_IORING_OP_REGISTER_BUFFERS, *PNT_IORING_OP_REGISTER_BUFFERS;

The sub-structures are all in the public symbols so I won’t put them all here, but the one to focus on in this case is IORING_BUFFER_INFO:

typedef struct _IORING_BUFFER_INFO
{
    /* 0x0000 */ PVOID Address;
    /* 0x0008 */ ULONG Length;
} IORING_BUFFER_INFO, *PIORING_BUFFER_INFO; /* size: 0x0010 */

This structure contains an address and a length. The address is of type PVOID, and this is where the bug lies. A PVOID doesn’t have a fixed size across all systems. It is a pointer, and therefore its size depends on the size of a pointer on the system. On 64-bit systems that’s 8 bytes, and on 32-bit systems that’s 4 bytes. However, WoW64 processes aren’t fully aware that they are running on a 64-bit system. There is a whole mechanism put in place to emulate a 32-bit system for the process to allow 32-bit applications to execute normally on 64-bit hardware. That means that when the application calls BuildIoRingRegisterBuffers to create the array of buffers, it calls the 32-bit version of the function, which uses 32-bit structures and 32-bit types. So instead of using an 8-byte pointer, it’ll use a 4-byte pointer, creating an IORING_BUFFER_INFO structure that looks like this:

typedef struct _IORING_BUFFER_INFO
{
    /* 0x0000 */ PVOID Address;
    /* 0x0004 */ ULONG Length;
} IORING_BUFFER_INFO, *PIORING_BUFFER_INFO; /* size: 0x0008 */

This is, of course, not the only case where the kernel receives pointer-sized arguments from a user-mode caller, and there is a mechanism meant to handle these cases. Since the kernel doesn’t support 32-bit execution, the WoW64 emulation layer is in charge of translating system call input arguments from the 32-bit sizes and types to the 64-bit types expected by the kernel. However, in this case the buffer array is not sent as an input argument to a system call. It is written into the shared section of the I/O ring that is read directly by the kernel, never going through the WoW64 translation DLLs. This means no argument translation is done on the array, and the kernel directly reads an array that was meant for a 32-bit kernel, where the Length argument is not at the expected offset. In the early versions of I/O ring this meant that the kernel always skipped the buffer length and interpreted the next entry’s address as the last entry’s length, leading to bugs and errors.

In newer builds, the kernel is aware of the differently shaped structure used by WoW64 processes, and interprets it correctly: It assumes that the size of each entry is 8 bytes instead of 0x10, and reads only the first 4 bytes as the address and the next 4 bytes as the length.
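
To make the difference concrete, here is a minimal sketch of how the kernel might walk the array in both cases, using the 64-bit IORING_BUFFER_INFO definition shown earlier. This is my own illustration of the logic described above, not the actual kernel code, and the IORING_BUFFER_INFO32 name is mine:

// The 32-bit layout written by a WoW64 caller; the name is only used here
// to tell the two layouts apart.
typedef struct _IORING_BUFFER_INFO32
{
    ULONG Address;   // 4-byte pointer
    ULONG Length;
} IORING_BUFFER_INFO32;              /* size: 0x0008 */

// Sketch: read entry 'i' of the caller-supplied array, accounting for WoW64.
static void ReadBufferEntry(const void* bufferInfo, ULONG i, BOOLEAN isWow64,
                            PVOID* address, ULONG* length)
{
    if (isWow64)
    {
        const IORING_BUFFER_INFO32* entry = (const IORING_BUFFER_INFO32*)bufferInfo + i;
        *address = (PVOID)(ULONG_PTR)entry->Address;   // zero-extend the 32-bit pointer
        *length  = entry->Length;
    }
    else
    {
        const IORING_BUFFER_INFO* entry = (const IORING_BUFFER_INFO*)bufferInfo + i;
        *address = entry->Address;
        *length  = entry->Length;
    }
}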

The same issue existed when pre-registering file handles, since a HANDLE is also the size of a pointer. IopIoRingDispatchRegisterFiles now has the same checks and processing to allow WoW64 processes to successfully register file handles as well.

Other Changes

There are a couple of smaller changes that aren’t large or significant enough to receive their own section of this post but still deserve an honorable mention:

  • The successful creation of a new I/O ring object will generate an ETW event containing all the initialization information in the I/O ring.
  • IoringObject->CompletionEvent received a promotion from NotificationEvent type to SynchronizationEvent.
  • The current I/O ring version is 3, so new rings created on recent builds should use this version.
  • Since different versions of I/O ring support different capabilities and operations, KernelBase.dll exports a new function: IsIoRingOpSupported. It receives the HIORING handle and the operation number, and returns a boolean indicating whether the operation is supported by this version (a short usage sketch follows this list).
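
For example, an application that wants to use write operations (which older ring versions don’t support) could check for them up front. This is a minimal sketch assuming the declarations from ioringapi.h in a recent SDK; queue sizes and error handling are kept minimal:

#include <windows.h>
#include <ioringapi.h>

// Returns TRUE if the created I/O ring supports IORING_OP_WRITE.
BOOL CanUseIoRingWrites(void)
{
    IORING_CREATE_FLAGS flags = { IORING_CREATE_REQUIRED_FLAGS_NONE,
                                  IORING_CREATE_ADVISORY_FLAGS_NONE };
    HIORING ioRing = NULL;
    BOOL supported = FALSE;

    if (SUCCEEDED(CreateIoRing(IORING_VERSION_3, flags, 16, 32, &ioRing)))
    {
        supported = IsIoRingOpSupported(ioRing, IORING_OP_WRITE);
        CloseIoRing(ioRing);
    }

    return supported;
}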

Data Structures

One more exciting thing happened in Windows 11 22H2 (build 22577): nearly all the internal I/O ring structures are available in the public symbols! This means there is no longer a need to painfully reverse engineer the structures and try to guess the field names and their purposes. Some of the structures received major changes since 21H2, so not having to reverse engineer them all over again is great.

Since the structures are in the symbols there is no real need to add them here. However, structures from the public symbols aren’t always easy to find through a simple Google search – I highly recommend trying GitHub search instead, or just directly using ntdiff. At some point people will inevitably search for some of these data structures, find the REd structures in my old post, which are no longer accurate, and complain that they are out of date. To avoid that at least temporarily, I’ll only post here the updated versions of the structures that I had in the old post but will highly encourage you to get the up-to-date structures from the symbols – the ones here are bound to change soon enough (edit: one build later, some of them already did). So, here are some of the structures from Windows 11 build 22598:

typedef struct _NT_IORING_INFO
{
    IORING_VERSION IoRingVersion;
    NT_IORING_CREATE_FLAGS Flags;
    ULONG SubmissionQueueSize;
    ULONG SubmissionQueueRingMask;
    ULONG CompletionQueueSize;
    ULONG CompletionQueueRingMask;
    PNT_IORING_SUBMISSION_QUEUE SubmissionQueue;
    PNT_IORING_COMPLETION_QUEUE CompletionQueue;
} NT_IORING_INFO, *PNT_IORING_INFO;

typedef struct _NT_IORING_SUBMISSION_QUEUE
{
    ULONG Head;
    ULONG Tail;
    NT_IORING_SQ_FLAGS Flags;
    NT_IORING_SQE Entries[1];
} NT_IORING_SUBMISSION_QUEUE, *PNT_IORING_SUBMISSION_QUEUE;

typedef struct _NT_IORING_SQE
{
    enum IORING_OP_CODE OpCode;
    enum NT_IORING_SQE_FLAGS Flags;
    union
    {
        ULONG64 UserData;
        ULONG64 PaddingUserDataForWow;
    };
    union
    {
        NT_IORING_OP_READ Read;
        NT_IORING_OP_REGISTER_FILES RegisterFiles;
        NT_IORING_OP_REGISTER_BUFFERS RegisterBuffers;
        NT_IORING_OP_CANCEL Cancel;
        NT_IORING_OP_WRITE Write;
        NT_IORING_OP_FLUSH Flush;
        NT_IORING_OP_RESERVED ReservedMaxSizePadding;
    };
} NT_IORING_SQE, *PNT_IORING_SQE;

typedef struct _IORING_OBJECT
{
    USHORT Type;
    USHORT Size;
    NT_IORING_INFO UserInfo;
    PVOID Section;
    PNT_IORING_SUBMISSION_QUEUE SubmissionQueue;
    PMDL CompletionQueueMdl;
    PNT_IORING_COMPLETION_QUEUE CompletionQueue;
    ULONG64 ViewSize;
    BYTE InSubmit;
    ULONG64 CompletionLock;
    ULONG64 SubmitCount;
    ULONG64 CompletionCount;
    ULONG64 CompletionWaitUntil;
    KEVENT CompletionEvent;
    BYTE SignalCompletionEvent;
    PKEVENT CompletionUserEvent;
    ULONG RegBuffersCount;
    PIORING_BUFFER_INFO RegBuffers;
    ULONG RegFilesCount;
    PVOID* RegFiles;
} IORING_OBJECT, *PIORING_OBJECT;

One structure that isn’t in the symbols is the HIORING structure that represents the ioring handle in KernelBase. That one changed slightly since 21H2, so here is the reverse engineered 22H2 version:

typedef struct _HIORING
{
    HANDLE handle;
    NT_IORING_INFO Info;
    ULONG IoRingKernelAcceptedVersion;
    PVOID RegBufferArray;
    ULONG BufferArraySize;
    PVOID FileHandleArray;
    ULONG FileHandlesCount;
    ULONG SubQueueHead;
    ULONG SubQueueTail;
} HIORING, *PHIORING;

Conclusion

This feature only shipped a few months ago, but it’s already receiving some very interesting additions and improvements, aiming to make it more attractive to I/O-heavy applications. It’s already at version 3, and it’s likely we’ll see a few more versions coming in the near future, possibly supporting new operation types or extended functionality. Still, no applications seem to use this mechanism yet, at least on Desktop systems.

This is one of the more interesting additions to Windows 11, and just like any new piece of code it still has some bugs, like the one I showed in this post. It’s worth keeping an eye on I/O rings to see how they get used (or maybe abused?) as Windows 11 becomes more widely adopted and applications begin using all the new capabilities it offers.

HyperGuard Part 3 – More SKPG Extents

Hi all! And welcome to part 3 of the HyperGuard chronicles!

In the previous blog post I introduced SKPG extents – the data structures that describe the memory ranges and system components that should be monitored by HyperGuard. So far, I only covered the initialization extent and various types of memory extents, but those are just the beginning. In this post I will cover the rest of the extent types and show how they are used by HyperGuard to protect other areas of the system.

The next extent group to look into is MSR and Control Register extents:

MSR and Control Register Extents

This group contains the following extent types:

  • 0x1003: SkpgExtentMsr
  • 0x1006: SkpgExtentControlRegister
  • 0x100C: SkpgExtentExtendedControlRegister

These extent types are received from the normal kernel, but they are never added to the array at the end of the SKPG_CONTEXT, nor are they validated during the runtime checks that I’ll describe in one of the next posts. Instead, they are used in yet another part of SKPG initialization.

After initializing the SKPG_CONTEXT in SkpgInitializeContext, SkpgConnect performs an IPI (Inter-Processor Interrupt). It does this by calling SkeGenericIpiCall with a target function and input data; SkeGenericIpiCall then calls the target function on every processor, passing it the requested data. In this case, the target function is SkpgxInstallIntercepts and the input data contains the number of input extents and the matching array:

I will go over intercepts in a lot more detail in a future blog post, but to give some necessary context: SKPG can ask the hypervisor to intercept certain actions in the system, like memory access, register access or instructions. HyperGuard uses that ability to intercept access to certain MSRs and Control Registers (and other things, which I will talk about later) to prevent malicious modifications. HyperGuard uses the input extents to choose which MSRs and Control Registers to intercept, out of the list of accepted options.

Since each processor has its own set of MSRs and registers, HyperGuard needs to intercept the requested ones on all processors. Therefore, SkpgxInstallIntercepts is called through an IPI, to make sure it’s called in the context of each processor.

Once in SkpgxInstallIntercepts, the function iterates over the array of input extents and handles the three types included in this group based on the data supplied in them. If you remember, each extent contains 0x18 bytes of type-specific data. For this group, this data contains the number of the MSR/register to be intercepted as well as the processor number that it should be intercepted on. This means that there might be more than one input extent for each MSR or control register, each for a different processor number. Alternatively, MSRs and control registers might only be intercepted on certain processors but not on others, if that is what the normal kernel requested. The data structure in the input extent for MSR and control register extents looks something like this:

typedef struct _MSR_CR_DATA
{
    ULONG64 Mask;
    ULONG64 Value;
    ULONG RegisterNumber;
    ULONG ProcessorNumber;
} MSR_CR_DATA, *PMSR_CR_DATA;

While iterating over the extents, the function checks whether the extent type is one of the three in this group and whether the processor number in the extent matches the current processor. If both match, it checks whether the number of the MSR or control register matches one of the accepted ones. If it does, a mask is fetched from an array in the SKPRCB – this array contains the needed masks for all accepted MSRs and control registers so the hypervisor can be asked to intercept them. All masks are collected, and once all extents have been examined the final mask is sent to ShvlSetRegisterInterceptMasks to be installed. The mask used to install the intercepts is the union HV_REGISTER_CR_INTERCEPT_CONTROL. It is documented and can be found here.
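
To summarize the flow, here is a pseudocode reconstruction of what this loop appears to do. Apart from the extent type values, the MSR_CR_DATA layout above and the ShvlSetRegisterInterceptMasks name, everything here (including the SKPRCB lookup helper) is invented for illustration, and the SKPG_EXTENT layout is the one reverse engineered in the previous post:

// Pseudocode reconstruction of the SkpgxInstallIntercepts loop; not real code.
#define SkpgExtentMsr                     0x1003
#define SkpgExtentControlRegister         0x1006
#define SkpgExtentExtendedControlRegister 0x100C

VOID SketchInstallIntercepts(PSKPG_EXTENT Extents, ULONG Count, ULONG CurrentProcessor)
{
    ULONG64 interceptControl = 0;   // accumulated HV_REGISTER_CR_INTERCEPT_CONTROL bits

    for (ULONG i = 0; i < Count; i++)
    {
        PMSR_CR_DATA data = (PMSR_CR_DATA)Extents[i].TypeSpecificData;

        if ((Extents[i].Type != SkpgExtentMsr &&
             Extents[i].Type != SkpgExtentControlRegister &&
             Extents[i].Type != SkpgExtentExtendedControlRegister) ||
            data->ProcessorNumber != CurrentProcessor)
        {
            continue;
        }

        // If the register is one of the accepted ones, fetch its pre-computed
        // mask from the array in the SKPRCB (placeholder helper) and collect it.
        interceptControl |= LookupAcceptedRegisterMask(Extents[i].Type,
                                                       data->RegisterNumber);
    }

    // Install the final mask (the real call also passes the expected CR0, CR4
    // and IA32_MISC_ENABLE values; more on that below).
    ShvlSetRegisterInterceptMasks(interceptControl);
}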

Now that the general process is covered, we can look into the accepted MSRs and control registers and understand why HyperGuard might want to protect them from modifications, starting with the MSRs:

SkpgExtentMsr

Patching certain MSRs is a popular operation for exploits and rootkits, allowing them to do things such as hooking system calls or disabling security features. Some of those MSRs are already periodically monitored by PatchGuard, but there are benefits to intercepting them through HyperGuard that I will cover later. The list of MSRs that can be intercepted keeps growing over time and receives new additions as new features and registers get added to CPUs, such as the implementation of CET which added multiple MSRs that might be a target for attackers. As of Windows 11 build 22598, the MSRs that can be intercepted by SKPG are:

  1. IA32_EFER (0xC0000080) – among other things, this MSR contains the NX bit, enforcing a mitigation that doesn’t allow code execution in addresses that aren’t specifically marked as executable. It also contains flags related to virtualization support.
  2. IA32_STAR (0xC0000081) – contains the address of the x86 system call handler.
  3. IA32_LSTAR (0xC0000082) – contains the address of the x64 system call handler – should normally be pointing to nt!KiSystemCall64.
  4. IA32_CSTAR (0xC0000083) – contains the address of the system call handler on x64 when running in compatibility mode – should normally be pointing to nt!KiSystemCall32.
  5. IA32_SFMASK (0xC0000084) – system call flags mask. Any bit set here when a system call is executed will be cleared from EFLAGS.
  6. IA32_TSC_AUX (0xC0000103) – usage depends on the operating system, but this MSR is generally used to store a signature, to be read together with a time stamp.
  7. IA32_APIC_BASE (0x1B) – contains the APIC base address.
  8. IA32_SYSENTER_CS (0x174) – contains the CS value for ring 0 code when performing system calls with SYSENTER.
  9. IA32_SYSENTER_ESP (0x175) – contains the stack pointer for the kernel stack when performing system calls with SYSENTER.
  10. IA32_SYSENTER_EIP (0x176) – contains the EIP value for ring 0 entry when performing system calls with SYSENTER.
  11. IA32_MISC_ENABLE (0x1A0) – controls multiple processor features, such as Fast Strings disable, performance monitoring and disable of the XD (no-execute) bit.
  12. MSR_IA32_S_CET (0x6A2) – controls kernel mode CET setting.
  13. IA32_PL0_SSP (0x6A4) – contains the ring 0 shadow stack pointer.
  14. IA32_PL1_SSP (0x6A5) – contains the ring 1 shadow stack pointer.
  15. IA32_PL2_SSP (0x6A6) – contains the ring 2 shadow stack pointer.
  16. IA32_INTERRUPT_SSP_TABLE_ADDR (0x6A8) – contains a pointer to the interrupt shadow stack table.
  17. IA32_XSS (0xDA0) – contains a mask to be used when the XSAVES and XRSTORS instructions are called in kernel-mode. For example, it controls the saving and loading of the registers used by Intel Processor Trace (IPT).

SkpgExtentControlRegister

By modifying certain control registers an attacker can disable security features or gain control of execution. Currently SKPG supports intercepts of two control registers:

  1. CR0 – controls certain hardware configuration such as paging, protected mode and write protect.
  2. CR4 – controls the configuration of various hardware features. For example, the SMEP and UMIP bits control security features, making CR4 an interesting target for attackers using an arbitrary write exploit.

SkpgExtentExtendedControlRegister

Currently only one extended control register exists – XCR0. It’s used to toggle storing or loading of extended registers such as AVX, ZMM and CET registers, and can be intercepted and protected by SKPG.

Installing the Intercepts

Now that we know which registers can be intercepted and why, we can go back and look at the installation of the intercepts through ShvlSetRegisterInterceptMasks. The function receives a HV_REGISTER_CR_INTERCEPT_CONTROL mask to know which intercepts to install, as well as the values for a few of the intercepted registers – CR0, CR4 and the IA32_MISC_ENABLE MSR. These are all placed in a structure that is passed into the function, which looks like this:

typedef struct _REGISTER_INTERCEPT_INFORMATION
{
    HV_REGISTER_CR_INTERCEPT_CONTROL InterceptControl;
    ULONG64 Cr0Value;
    ULONG64 Cr4Value;
    ULONG64 Ia32MiscEnableValue;
} REGISTER_INTERCEPT_INFORMATION, *PREGISTER_INTERCEPT_INFORMATION;

The InterceptControl mask is built while iterating over the input extents, and the values for CR0, CR4 and IA32_MISC_ENABLE are read from the SKPRCB (their values, together with the values for all other potentially-intercepted registers, are placed there in SkeInitSystem, triggered from a secure call with code SECURESERVICE_PHASE3_INIT).

This structure is sent to ShvlSetRegisterInterceptMasks, which in turn calls ShvlSetVpRegister on each of the four values in the input structure to register an intercept. Setting the register values is done by initiating a fast hypercall with a code of HvCallSetVpRegisters (0x51), sending four arguments (for anyone interested, all hypercall values are documented here). The last two arguments are of types HV_REGISTER_NAME and HV_REGISTER_VALUE – these types are documented so it’s easy to see what registers are being set:

Looking at the function, we see that it’s setting the required values for CR0, CR4 and IA32_MISC_ENABLE, and finally setting the mask for intercept control, so from this point all requested registers are intercepted by the hypervisor and any access to them will be forwarded to the SKPG intercept routine.

Secure VA Translation Extents

In the previous post I introduced the secure extents – extents indicating VTL1 memory or data structures to be protected. I also covered memory extents, including the secure memory extents. Here is another kind of secure extent – these are initialized internally in the secure kernel, without using input extents from VTL0. They are called Secure VA Translation Extents and are initialized inside SkpgCreateSecureVaTranslationExtents. These extents are used to protect Virtual->Physical address translations for different pages or memory regions that are a common target for attack:

  • 0x100B: SkpgExtentProcessorMode
  • 0x100E: SkpgExtentLoadedModule
  • 0x100F: SkpgExtentProcessorState
  • 0x1010: SkpgExtentKernelCfgBitmap
  • 0x1011: SkpgExtentZeroPage
  • 0x1012: SkpgExtentAlternateInvertedFunctionTable
  • 0x1015: SkpgExtentSecureExtensionTable
  • 0x1017: SkpgExtentKernelVAProtection
  • 0x1019: SkpgExtentSecurePool

Though they are called secure extents, the data they protect is mostly VTL0 data, such as the VTL0 mapping of the KCFG bitmap or the inverted function table. The exact validations done differ between the types: for example, the zero page should never be mapped so a successful virtual->physical address translation of the zero page should not be acceptable, while the kernel CFG bitmap should have valid translations but the VTL0 mapping of those pages should always be read-only.

Looking at SkpgCreateSecureVaTranslationExtents, we can see that the extents are initialized with no input data or memory ranges:

This is because all of these extents correspond to specific data structures that are initialized elsewhere, so the data doesn’t need to be part of the extent itself and the type is the only part that needs to be set. We can also see that some of these extents are only initialized when KCFG is enabled, since without it they are not needed. I will cover the checks done for each of these extents in a later blog post, which will describe SKPG extent verification.

Finally, if HotPatching is enabled, two more extents are added, both with type SkpgExtentExtensionTable:

These extents protect the SkpgSecureExtension and SkpgNtExtension variables, which keep track of HotPatching data.

Per-Processor Extents

There are two more extents that are processor-specific, since the data they protect exists separately in each processor. However, unlike the MSR and Control Register extents, no intercepts need to be installed and no function needs to be executed on all processors (for now). These extents are also received from the normal kernel and added to the array of extents in the SKPG_CONTEXT structure. The data received for each of these two extents includes base address, limit and a processor number, so multiple entries might exist for these extent types, with different processor numbers:

  • 0x1004: SkpgExtentIdt
  • 0x1005: SkpgExtentGdt

These extents contain the memory range for the GDT and IDT tables on each processor, so HyperGuard will protect them from malicious modifications.

Unused Extents

Extent types 0x1007, 0x1008, 0x1013 and 0x1018 never get initialized anywhere in SecureKernel.exe and don’t seem to be used anywhere. They may be deprecated or not fully implemented yet.

An Exercise in Dynamic Analysis

Analyzing the PayloadRestrictions.dll Export Address Filtering

This post is a bit different from my usual ones. It won’t cover any new security features or techniques and won’t share any novel security research. Instead, it will guide you through the process of analyzing an unknown mitigation through a real-life example in Windows Defender Exploit Guard (formerly EMET). Because the goal here is to show a step-by-step, real life research process, the post will be a bit disorganized and will follow a more organic and messy train of thought.

A brief explanation of Windows Defender Exploit Guard: formerly known as EMET, this is a DLL that gets injected on demand and implements several security mitigations such as Export Address Filtering, Import Address Filtering, Stack Integrity Validations, and more. These are all disabled by default and need to be manually enabled in the Windows security settings, either for a specific process or for the whole system. Since EMET was folded into Windows, these mitigations are implemented in PayloadRestrictions.dll, which can be found in C:\Windows\System32.

This post will follow one of these mitigations, named Export Address Filtering (or EAF). It will walk through a step-by-step analysis of this mitigation, using both dynamic analysis in WinDbg and static analysis in IDA and Hex Rays. I’ll try to highlight the things that should be focused on when analyzing a mitigation and show that even with partial information we can reach useful conclusions and learn about this feature.

First, we’ll enable EAF in calc.exe in the Windows Security settings:

We don’t know anything about this mitigation yet other than the one-line description in the security settings, so we’ll start by running calc.exe under a debugger to see what happens. Immediately we can see PayloadRestrictions.dll get loaded into the process:

And almost right away we get a guard page violation:

What is in this mysterious address and why does accessing it throw a guard page violation?

To start answering the first question, we can run !address to get a few more details about the address causing the exception:

!address 00007ffe`3da6416c
 
Usage:                  Image
Base Address:           00007ffe`3d8b9000
End Address:            00007ffe`3da7a000
Region Size:            00000000`001c1000 (   1.754 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000002          PAGE_READONLY
Type:                   01000000          MEM_IMAGE
Allocation Base:        00007ffe`3d730000
Allocation Protect:     00000080          PAGE_EXECUTE_WRITECOPY
Image Path:             C:\WINDOWS\System32\kernelbase.dll
Module Name:            kernelbase
Loaded Image Name:
Mapped Image Name:
More info:              lmv m kernelbase
More info:              !lmi kernelbase
More info:              ln 0x7ffe3da6416c
More info:              !dh 0x7ffe3d730000
 
 
Content source: 1 (target), length: 15e94

Now we know that this address is in a read-only page inside KernelBase.dll. But we don’t have any information that will help us understand what this page is and why it’s guarded. Let’s follow the suggestion of the command output and run !dh to dump the headers of KernelBase.dll to get some more information (showing partial output here since full output is very long):

!dh 0x7ffe3d730000

File Type: DLL
FILE HEADER VALUES
8664 machine (X64)
7 number of sections
FE317FB0 time date stamp Sat Feb 21 05:53:36 2105

0 file pointer to symbol table
0 number of symbols
F0 size of optional header
2022 characteristics
Executable
App can handle >2gb addresses
DLL

OPTIONAL HEADER VALUES
20B magic #
14.30 linker version
188000 size of code
211000 size of initialized data
0 size of uninitialized data
89FE0 address of entry point
1000 base of code
----- new -----
00007ffe3d730000 image base
1000 section alignment
1000 file alignment
3 subsystem (Windows CUI)
10.00 operating system version
10.00 image version
10.00 subsystem version
39A000 size of image
1000 size of headers
3A8E61 checksum
0000000000040000 size of stack reserve
0000000000001000 size of stack commit
0000000000100000 size of heap reserve
0000000000001000 size of heap commit
4160 DLL characteristics
High entropy VA supported
Dynamic base
NX compatible
Guard
334150 [ F884] address [size] of Export Directory
3439D4 [ 50] address [size] of Import Directory
369000 [ 548] address [size] of Resource Directory
34F000 [ 18828] address [size] of Exception Directory
397000 [ 92D0] address [size] of Security Directory
36A000 [ 2F568] address [size] of Base Relocation Directory
29B8C4 [ 70] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
255C20 [ 28] address [size] of Thread Storage Directory
1FB6D0 [ 140] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
2569D8 [ 16E0] address [size] of Import Address Table Directory
331280 [ 620] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory

Our faulting address is 0x7ffe3da6416c, which is at offset 0x33416c inside KernelBase.dll. Looking for the closest match in the output of !dh we can find the export directory at offset 0x334150:

334150 [    F884] address [size] of Export Directory

So the faulting code is trying to access an entry in the KernelBase export table. That shouldn’t happen under normal circumstances – if you debug another process (one that doesn’t have EAF enabled) you will not see any exceptions being thrown when accessing the export table. So we can guess that PayloadRestrictions.dll is causing this, and we’ll soon see how and why it does it.

One thing to note about guard page violations is this, quoted from this MSDN page:

If a program attempts to access an address within a guard page, the system raises a STATUS_GUARD_PAGE_VIOLATION (0x80000001) exception. The system also clears the PAGE_GUARD modifier, removing the memory page’s guard page status. The system will not stop the next attempt to access the memory page with a STATUS_GUARD_PAGE_VIOLATION exception.

So this guard page violation should only happen once and then get removed and never happen again. However, if we continue the execution of calc.exe, we’ll soon see another page guard violation on the same address:

This means the guard page somehow came back and is set on the KernelBase export table again.

The best guess in this case would probably be that someone registered an exception handler which gets called every time a guard page violation happens and immediately sets the PAGE_GUARD flag again, so that the same exception happens the next time anything accesses the export table. Unfortunately, there is no good way to view registered exception handlers in WinDbg (unless you enable exception logging in gflags, which enables the !exrlog extension, but I won’t be doing that now). However, we know that the DLL registering the suspected exception handler is most likely PayloadRestrictions.dll, so we’ll open it in IDA and take a look.

When looking for calls to RtlAddVectoredExceptionHandler, the function used to register exception handlers, we only see two results:

Both register the same exception handler — MitLibExceptionHandler:

(on a side note – I don’t often choose to use the IDA disassembler instead of the Hex Rays decompiler, but PayloadRestrictions.dll uses some things that the decompiler doesn’t handle too well, so I’ll be switching between the disassembler and decompiler code in this post)

We can set a breakpoint on this exception handler and see that it gets called from the same address that threw the page guard violation exception earlier (ntdll!LdrpSnapModule+0x23b):

Looking at the exception handler itself we can see it’s quite simple:

It only handles two exception codes:

  1. STATUS_GUARD_PAGE_VIOLATION
  2. STATUS_SINGLE_STEP

When a guard page violation happens, we can see MitLibValidateAccessToProtectedPage get called. Looking at this function, we can tell that a lot of it is dedicated to checks related to Import Address Filtering. We can guess that based on the address comparisons to the global IatShadowPtr variable and calls to various IAF functions:

Some of the code here is relevant for EAF, but for simplicity we’ll skip most of it (for now). Just by quickly scanning through this function and all the ones called by it, it doesn’t look like anything here is resetting the PAGE_GUARD modifier on the export table page.

What might give us a hint is to go back to WinDbg and continue program execution:

We immediately hit another exception at the next instruction, this time a single step exception. A single step exception is normally triggered by debuggers when requesting a single step, such as when walking a function instruction by instruction. But in this case I asked the debugger to continue the execution, not perform a single step, so it wasn’t WinDbg that triggered this exception.

A single step exception is triggered by setting the Trap Flag (bit 8) in the EFLAGS register inside the context record. And if we look towards the end of MitLibValidateAccessToProtectedPage we can see it doing exactly that:

So far we’ve seen PayloadRestrictions.dll do the following:

  1. Set the PAGE_GUARD modifier on the export table page.
  2. When the export table page is accessed, catch the exception with MitLibExceptionHandler and call MitLibValidateAccessToProtectedPage if this is a guard page violation.
  3. Set the Trap Flag in EFLAGS to generate a single step exception on the next instruction once execution resumes.

This matches the fact that MitLibExceptionHandler handles exactly two exception codes – guard page violations and single steps. So on the next instruction we receive the now expected single step exception and go right into MitLibHandleSingleStepException:

This is obviously a cleaned-up version of the original output. I saved you some of the work of checking what the global variables are and renaming them since this isn’t an especially interesting step – for example to check what function is pointed to by the variable I named pNtProtectVirtualMemory I simply dumped the pointer in WinDbg and saw it pointing to NtProtectVirtualMemory.

Back to the point – there are some things in this function that we’ll ignore for now and come back to later. What we can focus on is the call to NtProtectVirtualMemory, which (at least through one code path) sets the protection to PAGE_GUARD and PAGE_READONLY. Even without fully understanding everything we can make an educated guess and say that this is most likely the place where the KernelBase.dll export table guard page flag gets reset.
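
We can demonstrate this whole cycle in isolation with a small stand-alone user-mode program. This is my own illustration of the guard page plus trap flag pattern (written for x64), not code taken from PayloadRestrictions.dll:

#include <windows.h>
#include <stdio.h>

static BYTE* g_watched;                 // page we want to monitor
static const DWORD g_pageSize = 0x1000;

static LONG CALLBACK Handler(PEXCEPTION_POINTERS info)
{
    DWORD code = info->ExceptionRecord->ExceptionCode;
    DWORD oldProtect;

    if (code == STATUS_GUARD_PAGE_VIOLATION)
    {
        // The guard bit was just consumed; log the access and request a
        // single step so we can re-arm after the faulting instruction retires.
        printf("access to watched page from RIP=%p\n",
               (void*)(ULONG_PTR)info->ContextRecord->Rip);
        info->ContextRecord->EFlags |= 0x100;   // Trap Flag
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    if (code == STATUS_SINGLE_STEP)
    {
        // Re-apply PAGE_GUARD so the next access faults again.
        VirtualProtect(g_watched, g_pageSize, PAGE_READONLY | PAGE_GUARD, &oldProtect);
        return EXCEPTION_CONTINUE_EXECUTION;
    }

    return EXCEPTION_CONTINUE_SEARCH;
}

int main(void)
{
    DWORD oldProtect;
    g_watched = (BYTE*)VirtualAlloc(NULL, g_pageSize, MEM_COMMIT | MEM_RESERVE,
                                    PAGE_READWRITE);
    g_watched[0] = 0x41;

    AddVectoredExceptionHandler(1, Handler);
    VirtualProtect(g_watched, g_pageSize, PAGE_READONLY | PAGE_GUARD, &oldProtect);

    // Every read below triggers the guard-page -> single-step cycle once.
    for (int i = 0; i < 3; i++)
        printf("read %d: 0x%02x\n", i, g_watched[0]);

    return 0;
}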

Now that we know the mechanism behind the two exceptions we’re seeing, we can go back to MitLibValidateAccessToProtectedPage to go over all the parts we skipped earlier and see what happens when a guard page violation occurs. The first thing we see is a check of whether the faulting address is inside the IatShadow page. We can keep ignoring this one since it’s related to another feature (IAF) that we haven’t enabled for this process. We move on to the next section, which I titled FaultingAddressIsNotInShadowIat:

I already renamed some of the variables used here for convenience, but we’ll go over how I reached those names and titles and what this whole section does. First, we see the DLL using three global variables – g_MitLibState, a large global structure that contains all sorts of data used by PayloadRestrictions.dll, and two unnamed variables that I chose to call NumberOfModules and NumberOfProtectedRegions – we’ll soon see why I chose those names.

At a first glance, we can tell that this code is running in a loop. In each iteration it accesses some structure at g_MitLibState+0x50+index. This means there is some array at g_MitLibState+0x50, where each entry is some unknown structure. From this code, we can tell that each structure in the array is sized 0x28 bytes. Now we can either try to statically search for the function in the DLL that initializes this array and try to figure out what the structure contains, or we can go back to WinDbg and dump the already-initialized array in memory:

When dumping unknown memory it’s useful to use the dps command to check if there are any known symbols in the data. Looking at the array in memory we can see there are 3 entries. Using the output, we can see that the first field in each of the structures is the base address of a module: Ntdll, KernelBase and Kernel32. Immediately following it there is a ULONG. Based on the context and the alignment we can guess that this might be the size of the DLL. A quick WinDbg query shows that this is correct:

0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("ntdll.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("ntdll.dll")).Select(m => m.Size)                
    [0x19]           : 0x211000
0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("kernelbase.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("kernelbase.dll")).Select(m => m.Size)                
    [0x7]            : 0x39a000
0:007> dx @$curprocess.Modules.Where(m => m.Name.Contains("kernel32.dll")).Select(m => m.Size)
@$curprocess.Modules.Where(m => m.Name.Contains("kernel32.dll")).Select(m => m.Size)                
    [0xc]            : 0xc2000

Next we have a pointer to the base name of the module:

0:007> dx -r0 (wchar_t*)0x00007ffe1a4926b0
(wchar_t*)0x00007ffe1a4926b0                 : 0x7ffe1a4926b0 : "ntdll.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a7d68
(wchar_t*)0x00000218f42a7d68                 : 0x218f42a7d68 : "kernelbase.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a80c8
(wchar_t*)0x00000218f42a80c8                 : 0x218f42a80c8 : "kernel32.dll" [Type: wchar_t *]

And another pointer to the full path of the module:

0:007> dx -r0 (wchar_t*)0x00000218f42a7970
(wchar_t*)0x00000218f42a7970                 : 0x218f42a7970 : "C:\WINDOWS\SYSTEM32\ntdll.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a7d40
(wchar_t*)0x00000218f42a7d40                 : 0x218f42a7d40 : "C:\WINDOWS\System32\kernelbase.dll" [Type: wchar_t *]
0:007> dx -r0 (wchar_t*)0x00000218f42a80a0
(wchar_t*)0x00000218f42a80a0                 : 0x218f42a80a0 : "C:\WINDOWS\System32\kernel32.dll" [Type: wchar_t *]

Finally we have a ULONG that is used in this function to indicate whether or not to check this range, so I named it CheckRipInModuleRange. When put together, we can build the following structure:

typedef struct _MODULE_INFORMATION {
    PVOID ImageBase;
    ULONG ImageSize;
    PUCHAR ImageName;
    PUCHAR FullImagePath;
    ULONG CheckRipInModuleRange;
} MODULE_INFORMATION, *PMODULE_INFORMATION;

We could define this structure in IDA and get a much nicer view of the code but I’m trying to keep this post focused on analyzing this feature so I just annotated the idb with the field names.

Now that we know what this array contains we can have a better idea of what this code does – it iterates over the structures in this array and checks if the instruction pointer that accessed the guarded page is inside one of those modules. When the loop is done – or the code found that the faulting RIP is in one of those modules – it sets r8 to the index of the module (or leaves it as -1 if a module is not found) and moves on to the next checks:
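
Based on the MODULE_INFORMATION layout above, a rough reconstruction of this loop could look like the following. The function name and parameters are mine, standing in for the global array at g_MitLibState+0x50 and the unnamed module count:

// Rough reconstruction of the RIP-in-module check; not the real code.
static LONG FindModuleContainingRip(const MODULE_INFORMATION* modules,
                                    ULONG numberOfModules,
                                    ULONG_PTR faultingRip)
{
    for (ULONG i = 0; i < numberOfModules; i++)
    {
        ULONG_PTR base = (ULONG_PTR)modules[i].ImageBase;

        if (modules[i].CheckRipInModuleRange &&
            faultingRip >= base &&
            faultingRip <  base + modules[i].ImageSize)
        {
            return (LONG)i;
        }
    }

    return -1;  // matches the -1 left in r8 when no module matches
}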

Here we have another loop, this time iterating over an array in g_MitLibState+0x5D0, where each structure is sized 0x18, and comparing it to the address that triggered the exception (in our case, the address inside the KernelBase export table). Now we already know what to do so we’ll go and dump that array in memory:

We have here three entries, each containing what looks like a start address, end address and some flag. Let’s see what each of these ranges are:

  1. First range starts at the base address of NTDLL and spans 0x160 bytes, so pretty much covers the NTDLL headers.
  2. Second range is one we’ve been looking at since the beginning of the post – this is the KernelBase.dll export table.
  3. Third range is the Kernel32.dll export table (I won’t show how we can find this out because we’ve done this for KernelBase earlier in the post).

It’s safe to assume these are the memory regions that PayloadRestrictions.dll protects and that this check is meant to verify that the guard page violation was triggered for one of its protected ranges and not some other guarded page in the process.
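
Based on the 0x18-byte entries we just dumped (a start address, an end address and a flag), the array at g_MitLibState+0x5D0 presumably holds structures that look something like this; the names are mine:

// Reconstructed from the dumped array at g_MitLibState+0x5D0; field names are mine.
typedef struct _PROTECTED_REGION
{
    PVOID RegionStart;   // e.g. start of the KernelBase.dll export table
    PVOID RegionEnd;
    ULONG Flags;         // purpose not analyzed in this post
} PROTECTED_REGION, *PPROTECTED_REGION;  /* size: 0x18 */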

I won’t go into as many details for the other checks in this function because that would mostly involve repeating the same steps over and over and this post is pretty long as it is. Instead we’ll look a bit further ahead at this part of the function:

This code path is taken if the instruction pointer is found in one of the registered modules. Even without looking inside any of the functions that are called here we can guess that MitLibMemReaderGadgetCheck looks at the instruction that accessed the guarded page and compares it to the expected instructions, and that MitLibReportAddressFilterViolation is called to report unexpected behavior if the instruction is considered “bad”.

A different path is taken if the saved RIP is not in one of the known modules, which involves two final checks. The first checks if the saved RSP is inside the stack, and if it isn’t, MitLibReportAddressFilterViolation is called to report potential exploitation:

The second calls RtlPcToFileHeader to get the base address of the module that the saved RIP is in and reports a violation if one is not found since that means the guarded page was accessed from within dynamic code and not an image:
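
Put together, these last two checks boil down to something like the sketch below. Reading the stack bounds from the TEB is my assumption about how the range is obtained; RtlPcToFileHeader is the documented API mentioned above:

// Sketch of the final checks for the "saved RIP not in a known module" path.
// Returns TRUE if this access should be reported as a violation.
static BOOL LooksLikeFilterViolation(ULONG_PTR savedRip, ULONG_PTR savedRsp)
{
    NT_TIB* tib = (NT_TIB*)NtCurrentTeb();
    PVOID imageBase = NULL;

    // 1. The saved RSP should point inside the current thread's stack;
    //    anything else looks like a pivoted stack.
    if (savedRsp >= (ULONG_PTR)tib->StackBase ||
        savedRsp <  (ULONG_PTR)tib->StackLimit)
    {
        return TRUE;
    }

    // 2. The saved RIP should belong to a loaded image; dynamically
    //    allocated code fails this lookup.
    if (RtlPcToFileHeader((PVOID)savedRip, &imageBase) == NULL)
        return TRUE;

    return FALSE;
}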

All cases where MitLibReportAddressFilterViolation is called will eventually lead to a call to MitLibTriggerFailFast:

This ends up terminating the process, therefore blocking the potential exploit. If no violation is found, the function enables a single step exception for the next instruction that’ll run and the whole cycle begins again.

Of course we can keep digging into the DLL to learn about the initialization of this feature, the gadgets being searched for, or what happens when a violation is reported, but I’ll leave those as assignments for someone else. For now, we’ve gained a good understanding of what EAF is and how it works, which will allow us to further analyze it or search for potential bypasses, as well as some tools for analyzing similar mechanisms in PayloadRestrictions.dll or other security products.

HyperGuard – Secure Kernel Patch Guard: Part 2 – SKPG Extents

Welcome to Part 2 of the series about Secure Kernel Patch Guard, also known as HyperGuard. This part will start describing the data structure and components of SKPG, and more specifically the way it’s activated. If you missed Part 1, you can find it right here.

Inside HyperGuard Activation

In Part 1 of the series I introduced HyperGuard and described its different initialization paths. Whichever path we take, we end up reaching SkpgConnect once the normal kernel has finished its initialization. This is when all important data structures in the kernel have already been initialized and can start being monitored and protected by PatchGuard and HyperGuard.

After a couple of standard input validations, SkpgConnect acquires SkpgConnectionLock and checks the SkpgInitialized global variable to tell if HyperGuard has already been initialized. If the variable is set, the function will return STATUS_ACCESS_DENIED or STATUS_SUCCESS, depending on the information received. In either of those cases, it will do nothing else.

If SKPG has not been initialized yet, SkpgConnect will start initializing it. First it calculates and saves multiple random values to be used in several different checks later on. Then it allocates and initializes a context structure, saved in the global SkpgContext. Before we move on to other SKPG areas, it’s worth spending a bit of time talking about the SKPG context.

SKPG Context

This SKPG context structure is allocated and initialized in SkpgConnect and will be used in all SKPG checks. It contains all the data needed for HyperGuard to monitor and protect the system, such as the NT PTE information, encryption algorithms, KCFG ranges, and more, as well as another timer and callback, separate from the ones we saw in the first part of the series. Unfortunately, like the rest of HyperGuard, this structure, which I’ll call SKPG_CONTEXT, is not documented and so we need to do our best to figure out what it contains and how it’s used.

First, the context needs to be allocated. This context has a dynamic size that depends on the data received from the normal kernel. Therefore, it is calculated at runtime using the function SkpgComputeContextSize. The minimal size of the structure is 0x378 bytes (this number tends to increase every few Windows builds as the context structure gains new fields) and to that will be added a dynamic size, based on the data sent from the normal kernel.

That input data, which is only sent when SKPG is initialized through the PatchGuard code paths, is an array of structures named Extents. These extents describe different memory regions, data structures and other system components to be protected by HyperGuard. I will cover all of these in more detail later in the post, but a few examples include the GDT and IDT, data sections in certain protected modules and MSRs with security implications.

After the required size is calculated, the SKPG_CONTEXT structure is allocated and some initial fields are set in SkpgAllocateContext. A couple of these fields include another secure timer and a related callback, whose functions are set to SkpgHyperguardTimerRoutine and SkpgHyperguardRuntime. It also sets fields related to PTE addresses and other paging-related properties, since a lot of the HyperGuard checks validate correct Virtual->Physical page translations.

Afterwards, SkpgInitializeContext is called to finish initializing the context using the extents provided by the normal kernel. This basically means iterating over the input array, using the data to initialize internal extent structures, that I’ll call SKPG_EXTENT, and sticking them at the end of the SKPG_CONTEXT structure, with a field I chose to call ExtentOffset pointing to the beginning of the extent array (notice that none of these structures are documented, so all structure and field names are made up):

SKPG Extents

There are many different types of extents, and each SKPG_EXTENT structure has a Type field indicating its type. Each extent also has a hash, used in some cases to validate that no changes were done to the monitored memory region. Then there are fields for the base address of the monitored memory and the number of bytes, and finally a union that contains data unique to each extent type. For reference, here is the reverse engineered SKPG_EXTENT structure:

typedef struct _SKPG_EXTENT
{
    USHORT Type;
    USHORT Flags;
    ULONG Size;
    PVOID Base;
    ULONG64 Hash;
    UCHAR TypeSpecificData[0x18];
} SKPG_EXTENT, *PSKPG_EXTENT;

I mentioned that the input extents used by HyperGuard were provided by the PatchGuard initializer function in the normal kernel. But SKPG initializes another kind of extents as well – secure extents. To initialize those, SkpgInitializeContext calls into SkpgCreateSecureKernelExtents, providing the SKPG_CONTEXT structure and the address where the current extent array ends – so the secure extents can be placed there. Secure extents use the same SKPG_EXTENT structure as regular extents and protect data in the secure kernel, such as modules loaded into the secure kernel and secure kernel memory ranges.
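
Since the context has a dynamic size and the extents are appended at its end, reaching them presumably looks something like this (all names here are made up, following the conventions used in this post, and SKPG_EXTENT is the reverse engineered structure shown above):

// Reaching the extent array at the end of the variably-sized context.
PSKPG_EXTENT extents = (PSKPG_EXTENT)((PUCHAR)SkpgContext + SkpgContext->ExtentOffset);

// The secure extents created by SkpgCreateSecureKernelExtents are appended
// right after the extents that came from the normal kernel.
PSKPG_EXTENT secureExtents = extents + NumberOfNormalExtents;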

Extent Types

Like I mentioned, there are many different types of extents, each used by HyperGuard to protect a different part of the system. However, we can split them into a few groups that share similar traits and are handled in a similar way. For clarity and to separate normal extents from secure extents, I will use the naming convention SkpgExtent for normal extent types and SkpgExtentSecure for secure extent types.

The first extent that I’d like to cover is a pretty simple one that always gets sent to SkpgInitializeContext regardless of other input:

Initialization Extent

There is one extent that doesn’t belong in any of the groups since it is not involved in any of the HyperGuard validations. This is extent 0x1000: SkpgExtentInit – this extent is not copied to the array in the context structure. Instead, this extent type is created by SkpgConnect and sent into SkpgInitializeContext to set some fields in the context structure itself that were previously unpopulated. These fields have additional hashes and information related to hotpatching, such as whether it is enabled and the addresses of the retpoline code pages. It also sets some flags in the context structure to reflect some configuration options in the machine.

Memory and Module Extents

This group includes the following extent types:

  • 0x1001: SkpgExtentMemory
  • 0x1002: SkpgExtentImagePage
  • 0x1009: SkpgExtentUnknownMemoryType
  • 0x100A: SkpgExtentOverlayMemory
  • 0x100D: SkpgExtentSecureMemory
  • 0x1014: SkpgExtentPartialMemory
  • 0x1016: SkpgExtentSecureModule

The thing all these extent types have in common is that they all indicate some memory range to be protected by HyperGuard. Most of these contain memory ranges in the normal kernel, however SkpgExtentSecureMemory and SkpgExtentSecureModule have VTL1 memory ranges and modules. Still, all these extent types are handled in a similar way regardless of the memory type or VTL so I grouped them together.

When normal memory extents are being added to the SKPG Context, all normal kernel address ranges get validated to ensure that the pages have a valid mapping for SKPG protection. For a normal kernel page to be valid for SKPG protection, the page can’t be writable. SKPG will monitor all requested pages for changes, so a writable page, whose contents can change at any time, is not a valid “candidate” for this kind of protection. Therefore, SKPG can only monitor pages whose protection is either “read” or “execute”. Obviously, only valid pages (as indicated by the Valid bit in the PTE) can be protected. There are slight differences to some of the memory extents when HVCI is enabled as SKPG can’t handle certain page types in those conditions.

Once mapped and verified, each memory page that should be protected gets hashed, and the hash gets saved into the SKPG_EXTENT structure where it will be used in future HyperGuard checks to validate that the page wasn’t modified.

Some memory extents describe a generic memory range, and some, like SkpgExtentImagePage, describe a specific memory type that needs to be treated slightly differently. This extent type mentions a specific image in the normal kernel, but HyperGuard should not be protecting the whole image, only a part of it. So the input extent has the image base, the page offset inside the image where the protection should start and the requested size. Here too the memory region to be protected will be hashed and the hash will be saved into the SKPG_EXTENT to be used in future validations.

But the SKPG_EXTENT structures that get written into the SKPG Context normally only describe a single memory page, while the system might want to protect a much larger area in an image. It is simply easier for HyperGuard to handle memory validations one page at a time, to make for more predictable processing time and avoid taking up too much time while hashing large memory ranges, for example. So, when receiving an input extent where the requested size is larger than a page (0x1000 bytes), SkpgInitializeContext iterates over all the pages in the requested range and creates a new SKPG_EXTENT for each of them. Only the first extent, describing the first page in the range, keeps the original type (SkpgExtentImagePage in this example). All the other ones that describe the following pages receive a different type, 0x1014, which I chose to call SkpgExtentPartialMemory, and the original extent type is placed in the first 2 bytes of the type-specific data inside the SKPG_EXTENT structure.
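
A rough reconstruction of that splitting logic, using the SKPG_EXTENT layout shown earlier in this post; the function, its parameters and HashPage are all made up for illustration:

// Rough reconstruction of how a multi-page input extent is split into
// per-page SKPG_EXTENT entries; HashPage stands in for SKPG's internal hashing.
#define SkpgExtentPartialMemory 0x1014

static PSKPG_EXTENT AddExtentsForRange(PSKPG_EXTENT extent, USHORT originalType,
                                       PVOID base, ULONG size)
{
    for (ULONG offset = 0; offset < size; offset += 0x1000, extent++)
    {
        extent->Base = (PUCHAR)base + offset;
        extent->Size = min(0x1000, size - offset);
        extent->Hash = HashPage(extent->Base, extent->Size);  // checked again at runtime

        if (offset == 0)
        {
            // The first page keeps the original extent type.
            extent->Type = originalType;
        }
        else
        {
            // Every following page becomes SkpgExtentPartialMemory, with the
            // original type stashed in the first two bytes of the type-specific data.
            extent->Type = SkpgExtentPartialMemory;
            *(USHORT*)extent->TypeSpecificData = originalType;
        }
    }

    return extent;   // first free slot after the newly added extents
}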

Every extent in the array can be marked by different flags. One of these is the Protected flag, which can only be applied to normal kernel extents, meaning that the specified address range should be protected from changes by SKPG. In this case, SkpgInitializeContext will call SkmmPinNormalKernelAddressRange on the requested address range to pin it and prevent it from being freed by VTL0 code:

The secure memory extents essentially behave very similarly to the normal memory extents, with the main differences being that they are initialized by the secure kernel itself and the details of what they are protecting.

Extents of type SkpgExtentSecureModule are generated to monitor all images loaded into the secure kernel space. This is done by iterating the SkLoadedModuleList global list, which, like the normal kernel’s PsLoadedModuleList, is a linked list of KLDR_DATA_TABLE_ENTRY structures representing all loaded modules. For each one of those modules, SkpgCreateSecureModuleExtents is called to generate the extents.

To do so, SkpgCreateSecureModuleExtents receives a KLDR_DATA_TABLE_ENTRY for one loaded DLL at a time, validates that it exists in PsInvertedFunctionTable (a table containing basic information for all loaded DLLs, mostly used for quick lookups of exception handlers) and then enumerates all the sections in the module. Most sections in a secure module are monitored using an SKPG_EXTENT but are not protected from modifications. Only one section is protected, the TABLERO section:

The TABLERO section is a data section that exists in only a handful of binaries. In the normal kernel it exists in Win32k.sys, where it contains the win32k system service table. In the secure kernel a TABLERO section exists in securekernel.exe, where it contains global variables such as SkiSecureServiceTable, SkiSecureArgumentTable, SkpgContext, SkmiNtPteBase, and others:

When SkpgCreateSecureModuleExtents encounters a TABLERO section, it calls SkmmProtectKernelImageSubsection to change the PTE for the section pages from the default read-write to read only.

Then for each section, regardless of its type, an extent with type SkpgExtentSecureModule is created. Each memory region gets hashed, and a flag in the extent marks whether the section is executable. The number of extents generated per section can vary: if HotPatching is enabled on the machine, a separate extent will be generated for every page in the protected image ranges. Otherwise, every protected section generates one extent that might cover multiple pages, all of them with type SkpgExtentSecureModule:

If HotPatching is enabled, one last secure module extent gets created for each secure module. The variable SkmiHotPatchAddressReservePages will indicate how many pages are reserved for HotPatch use at the end of the module, and an extent gets created for each of those pages. Similar to the way described earlier for normal kernel module extents, each extent describes a single page, the extent type is SkpgExtentPartialMemory and the type SkpgExtentSecureModule is placed in one of the type-specific fields of the extent.

Another secure extent type is SkpgExtentSecureMemory. This is a generic extent type used to indicate any memory range in the secure kernel. However, for now it is only used to monitor the GDT pointed to by the secure kernel processor block – the SKPRCB. This is an internal structure that is similar in its purpose to the normal kernel’s KPRCB (and similarly, an array of them exists in SkeProcessorBlock). There will be one extent of this type for each processor in the system. Additionally, the function sets a bit in the Type field of each KGDTENTRY64 structure to indicate that this entry has been accessed and prevent it from being modified later on – but the entry for the TSS at offset 0x40 gets skipped:

This pretty much covers the initialization and uses of the memory extents. But this is just the first group of extents, and there are many others that monitor various different parts of the system. In the next post I’ll talk about more of these other extent types, which interact with system components like MSRs, control registers, the KCFG bitmap and more!

HyperGuard – Secure Kernel Patch Guard: Part 1 – SKPG Initialization

This will be a multi-part series of posts describing the internal mechanisms and purpose of Secure Kernel Patch Guard, also known as HyperGuard. This first part will focus on what SKPG is and how it’s being initialized.

Overview

In the world of Windows security, PatchGuard is uniquely undocumented, with hardly any “unofficial” documentation either. Thus, there are conflicting opinions and rumors about the way it operates, and the various “PatchGuard bypasses” that get published aren’t very reliable. Still, every few years some helpful PG analysis gets published, shedding some light on this mysterious feature. This blog post is not about PatchGuard so we won’t go into much detail about it, but it discusses a similar and related feature, so some basic knowledge of PatchGuard is needed. Here are a couple of things needed to understand the rest of the post:

  • The purpose of PatchGuard is to monitor the system for changes in kernel space that should not happen on a normal system and crash it when those are detected. This doesn’t mean any unusual data change – PatchGuard monitors a pre-determined list of data structures that are common targets for kernel exploitation or rootkits, such as modifications to HalDispatchTable or callback arrays, or changes to control registers or MSRs to disable security features. The full list of monitored structures and pointers is not documented and the information that does get published by Microsoft is left vague on purpose.
  • PatchGuard doesn’t monitor everything, all the time. It runs periodically, checking for certain changes every time it runs – it won’t necessarily crash the system right when a malicious change is done and a system might run for a long time with such changes. There is no guarantee that PatchGuard will ever detect and crash the system. This also means it is hard to validate potential bypasses.

The main weakness of PatchGuard and the reason for all the obscurity around its implementation is the fact that it monitors Ring 0 code and data – from code that runs in Ring 0. There is nothing preventing a rootkit that already gained Ring 0 code execution privileges from patching the code for PatchGuard itself and disabling or bypassing it. The only thing stopping this scenario is PatchGuard’s obscurity and the fact that its code is hard to find and uses a range of obfuscation techniques to make itself hard to analyze and disable.

There is a lot more to say about PatchGuard but, like I mentioned, this is not the topic of the post. So, I’ll skip right to discussing PatchGuard’s newer sibling – HyperGuard, also known as Secure Kernel Patch Guard, or SKPG. This new feature leverages the existence of Hyper-V and VBS to create a new monitoring and protection capability that is similar to PatchGuard but not susceptible to the same weaknesses since it is not running as normal Ring 0 code and cannot be tampered by normal rootkits.

Finding HyperGuard

HyperGuard takes advantage of VBS – Virtualization Based Security. This capability that was added in the past few years is made possible by the creation of Hyper-V and Virtual Trust Levels (VTLs). The hypervisor allows creating a system where most things run in VTL0, but some, more privileged things, run in higher VTLs (currently the only one implemented is VTL1) where they are not accessible to normal processes regardless of their privilege level – including VTL0 kernel code. Put simply, no VTL0 code can interact with memory in VTL1 in any way.

Having memory that cannot be tampered with even from normal kernel code allows for many new security features, some of which I’ve written about in the past and others are documented in other blogs, conference talks and official Microsoft documentation. A few examples include KCFG, HVCI and KDP.

This is also what allows Microsoft to implement HyperGuard – a feature similar to PatchGuard that can’t be tampered with even by malicious code that managed to elevate itself to run in the kernel. For this reason, HyperGuard doesn’t need to hide or obfuscate itself in any way, which makes it much easier to analyze using static analysis tools.

The VTL1 kernel, also known as the secure kernel, is managed through SecureKernel.exe. This is also the binary where HyperGuard is implemented. If we open securekernel.exe in IDA we can easily find all the code implementing HyperGuard, which all uses the prefix Skpg:

This series will cover some of those functions, starting with the first one called during boot, SkpgInitSystem:

HyperGuard Initialization

HyperGuard initialization mostly happens during the normal kernel’s Phase 1 initialization, but requires multiple steps. The first step starts with a secure call where SKSERVICE=SECURESERVICE_PHASE3_INIT. This leads to SkInitSystem which will initialize SKCI (Secure Kernel Code Integrity) and call into SkpgInitSystem. This function sets up the basic components of SKPG – its callback, timer, extension table and intercept functions, all of which I’ll discuss in more detail later in this series. At this point SKPG is not fully initialized – that only happens later in response to another request from the normal kernel. For now, only a few SKPG globals are being set:

Some interesting components to notice at this stage are:

  • SkpgPatchGuardCallback – a callback which is going to be called every time HyperGuard checks need to run and will invoke the target function SkpgPatchGuardCallbackRoutine.
  • SkpgPatchGuardTimer – a secure kernel timer object that is going to control the execution of some HyperGuard checks. It gets set to run at a random time so checks will happen at different intervals, making the periodic checks harder to avoid. The function sets its callback to SkpgPatchGuardTimerRoutine.
  • Intercept function pointers – other than the periodic checks controlled by the timer, HyperGuard also has a few intercept functions, which execute every time a certain operation is being intercepted by the Hypervisor. The operation being intercepted is pretty clear from the function names, but I’ll cover them in more detail later anyway. The global variables for these are:
    • ShvlpHandleMsrIntercept – points to SkpgxInterceptMsr
    • ShvlpHandleRegisterIntercept – points to SkpgxInterceptRegister
    • ShvlpHandleRepHypercallIntercept – points to SkpgInterceptRepHypercall
  • Optional variables – there are a few other global variables that did not fit in the screenshot and get initialized based on the flags received as part of the input argument, or other optional configuration:
    • SkpgInhibitKernelVaProtection
    • SkpgNtKvaShadow
    • SkpgSecureExtension

After initializing all the global variables, the function returns and the rest of the secure kernel initialization continues. For now, the timer is not scheduled and HyperGuard is effectively “dormant”. HyperGuard is only fully “activated” later – through a call to SkpgConnect.

There are three ways to call SkpgConnect and all start from a call by the normal kernel:

HyperGuard Activation

Connect Software Interrupt – the PatchGuard Path

The most interesting HyperGuard activation path is through PatchGuard. This SKPG activation path, like all others, begins with a secure call. This secure call, with SKSERVICE=SECURESERVICE_CONNECT_SW_INTERRUPT, originates from the normal kernel function VslConnectSwInterrupt. This leads, as usual, to the secure kernel handler which calls into IumpConnectSwInterrupt and from there to SkpgConnect, passing it all the data that was sent by the normal kernel.

When we search for calls to VslConnectSwInterrupt we see two calls – one from PsNotifyCoreDriversInitialized that I’ll cover soon and a second one from KiConnectSwInterrupt:

KiConnectSwInterrupt has only one caller – an anonymous function in ntoskrnl.exe that has no name in the public symbols. This is an extremely large function that calls into other anonymous functions and has a lot of weird and seemingly unrelated functionality. It is one of the PatchGuard initialization routines, which does the "real" activation of HyperGuard, supplying the secure kernel with memory protection ranges and targets that I will discuss later when talking about SKPG extents.

I encourage you to follow the call stack yourselves and get a bit of insight into the mysteries of PatchGuard initialization, but if I start covering PatchGuard details this series will quickly become a book so I will skip the details here. Let’s just trust me when I say that this all also happens in the context of Phase 1 initialization and is the first point where HyperGuard is activated.

Once HyperGuard is fully activated, a global variable SkpgInitialized is set to TRUE. This variable is checked every time SkpgConnect is called, and if set the function will return immediately and not make any changes to any SKPG initialization data. This means that the two other activation paths that will be described here will only activate HyperGuard if PatchGuard is not running and will result in less thorough protection of the machine. If PatchGuard is active, then the other two activation paths will return without doing anything.

Connect Software Interrupt – Phase1 Initialization

The second code path into VslConnectSwInterrupt goes through PsNotifyCoreDriversInitialized. This is also happening as part of Phase 1 initialization, but later than the PatchGuard path:

As we can see here, the call to VslConnectSwInterrupt is done with empty input variables, meaning no memory ranges or extra data is sent to HyperGuard and it will only use its basic functionality. If PatchGuard is running, then at this point SKPG should already be initialized and the call will return with no changes to SKPG, so this path is only needed if PatchGuard is not active.

Phase3 Initialization

The last case where HyperGuard is activated happens during Phase 3 initialization. This happens in response to a secure call with SKSERVICE=SECURESERVICE_REGISTER_SYSTEM_DLLS. It will also call into SkpgConnect with no input data, simply to initialize it if nothing else has already.

On the normal kernel side: In PspInitPhase3 the system checks the VslVsmEnabled global variable to learn whether Hyper-V is running and VSM is enabled. If it is, the system calls VslpEnterIumSecureMode – a common function to generate a secure call with a given service code and arguments packed into an MDL. The system enters secure mode with service code SECURESERVICE_REGISTER_SYSTEM_DLLS:

Once a secure call reaches the secure kernel it is handled by IumInvokeSecureService, which is pretty much just a big switch statement, calling the correct function or functions for each service code. In the case of code SECURESERVICE_REGISTER_SYSTEM_DLLS, it calls SkpgConnect and then uses the data passed in by the kernel to register system DLLs:

As I mentioned, this is the last time SkpgConnect is called, right at the end of system initialization. This is done in case SKPG hasn’t been initialized at an earlier stage already. In this case, SkpgConnect is called with almost no input data, to only initialize the most basic SKPG functionality. If SKPG has already been initialized earlier, this call will return without changing anything.

HyperGuard Activation – Diagram

This is it for part 1 of this series. So far, we only covered the general idea of what HyperGuard is and its initialization paths. Next time we will dive into SkpgConnect to see what happens during SKPG activation and learn more about the types of data SKPG protects and how.

IoRing vs. io_uring: a comparison of Windows and Linux implementations

A few months ago I wrote this post about the introduction of I/O Rings in Windows. After publishing it a few people asked for a comparison of the Windows I/O Ring and the Linux io_uring, so I decided to do just that. The short answer – the Windows implementation is almost identical to the Linux one, especially when using the wrapper functions provided by the helper libraries. The long answer is what I'll be covering in the rest of this post.
The information about the io_uring implementation was gathered mostly from here – a paper documenting the internal implementation and usage of io_uring on Linux and explaining some of the reasons for its existence and the way it was built.
As I said, the basic implementation of both mechanisms is very similar – both are built around a submission queue and a completion queue that have shared views in both user and kernel address spaces. The application writes the requested operation data into the submission queue and submits it to the kernel, which processes the requested number of entries and writes the results into the completion queue. In both cases there is a maximum number of allowed entries per ring, and the completion queue can have up to twice the number of entries of the submission queue. However, there are some differences in the internal structures as well as in the way the application is expected to interact with the I/O ring.

Initialization and Memory Mapping

One such difference is the initialization stage and mapping of the queues into user space: on Windows the kernel fully initializes the new ring, including the creation of both queues and creating a shared view in the application’s user-mode address space, using an MDL. However, in the Linux io_uring implementation, the system creates the requested ring and the queues but does not map them into user space. The application is expected to call mmap(2) using the appropriate file descriptors to map both queues into its address space, as well as the SQE array, which is separate from the main queue.
This is another difference worth noticing – on Linux the completion ring (or queue) directly contains the array of CQEs, but the submission ring does not. Instead, the sqes field in the submission ring is a pointer to another memory region containing the array of SQEs, which has to be mapped separately. To index this array, the sqring has an additional array field containing indices into the SQE array. Not being a Linux expert, I won't try to explain the reasoning behind this design and will simply quote the reasoning given in the paper mentioned above:

This might initially seem odd and confusing, but there’s some reasoning behind it. Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation. That in turns allows for easier conversion of said applications to the io_uring interface.
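To make the Linux side a bit more concrete, this is roughly what that three-part mapping looks like when using the raw system calls directly – a minimal sketch with no error handling, which liburing normally hides completely:

#define _GNU_SOURCE
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int setup_rings (unsigned int entries)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    int ring_fd = syscall(__NR_io_uring_setup, entries, &p);

    /* submission ring: head, tail, mask, flags and the index array */
    void *sq_ring = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(unsigned int),
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_SQ_RING);

    /* the SQE array lives in its own mapping, separate from the ring itself */
    struct io_uring_sqe *sqes = mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
                                     PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                                     ring_fd, IORING_OFF_SQES);

    /* completion ring: header fields followed directly by the CQE array */
    void *cq_ring = mmap(NULL, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
                         ring_fd, IORING_OFF_CQ_RING);

    (void)sq_ring; (void)sqes; (void)cq_ring;
    return ring_fd;
}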

On Windows there are only two important regions since the SQEs are part of the submission ring. In fact both rings are allocated by the system in the same memory region so there is only one shared view between the user and kernel space, containing two separate rings.
One more difference exists when creating a new I/O ring: on Linux the number of entries in a submission ring can be between 1 and 0x1000 (4096) while on Windows it can be between 1 and 0x10000, but at least 8 entries will always be allocated. In both cases the completion queue will have twice the number of entries as the submission queue. There is one small difference regarding the exact number of entries requested for the ring: For technical reasons the number of entries in both rings has to be a power of two. On Windows, the system takes the requested ring size and aligns it to the nearest power of two to receive the actual size that will be used to allocate the ring memory. On Linux the system does not do that, and the application is expected to request a size that is a power of two.
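As a quick illustration of the Windows behavior, a helper along these lines (an illustration only, assuming the requested size is rounded up rather than down) turns a request for 10 submission entries into a 16-entry ring:

// Illustration only – assumed rounding behavior, not the actual kernel implementation
ULONG RoundUpToPowerOfTwo (ULONG Size)
{
    ULONG rounded = 1;
    while (rounded < Size)
    {
        rounded <<= 1;
    }
    return rounded;    // RoundUpToPowerOfTwo(10) == 16
}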

Versioning

Windows puts far more focus on compatibility than Linux does, and invests a lot of effort into making sure that when a new feature ships, applications using it will keep working properly across different Windows builds even as the feature changes. For that reason, Windows implements versioning for its structures and features, and Linux does not. Windows also implements I/O rings in phases, marked by those versions, where the first version only implements read operations, the next version will implement write and flush operations, and so on. When creating an I/O ring the caller needs to pass in a version to indicate which version of I/O rings it wants to use.
On Linux, however, the feature was implemented fully from the beginning and does not require versioning. Also, Linux doesn’t put as much focus on compatibility and users of io_uring are expected to use and support the latest features.

Waiting for Operation Completion

On both Windows and Linux the caller can choose to not wait on the completion of events in the I/O ring and simply get notified when all operations are complete, making this feature fully asynchronous. In both systems the caller can also choose to wait on all events in a fully synchronous way, specifying a timeout in case processing the events takes too long. Everything in between is the area where the systems differ.
On Linux, a caller can request a wait on the completion of a specific number of operations in the ring, a capability Windows doesn't offer. This allows applications to start processing results after a certain number of operations have completed, instead of waiting for all of them. In newer builds Windows did add a similar yet slightly more limited option – registering a notification event that gets set when the first entry in the ring completes, signaling to the waiting application that it's safe to start processing the results.

Helper Libraries

In both systems it is possible for an application to manage its rings itself through system calls. This is an option that's accepted on Linux and highly discouraged on Windows, where the NT API is undocumented and officially should not be used by non-Microsoft code. However, in both systems most applications have no need to manage the rings themselves, and a lot of the generic ring management code can be abstracted away and handled by a separate component. This is done through helper libraries – KernelBase.dll on Windows and liburing on Linux.
Both libraries export generic functionality like creating, initializing and deleting an I/O ring, creating submission queue entries, submitting a ring and getting a result from the completion queue.
Both libraries use very similar functions and data structures, making the task of porting code from one platform to the other much easier.
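To show just how similar the two helper libraries ended up being, here is a minimal liburing sketch of a single read, with error handling omitted – conceptually the same create ring → build read entry → submit → consume completion sequence used by the Windows helpers:

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int read_with_liburing (const char *path)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    char buffer[0x200];

    io_uring_queue_init(8, &ring, 0);                         /* ~ CreateIoRing */

    int fd = open(path, O_RDONLY);

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);   /* ~ BuildIoRingReadFile */
    io_uring_sqe_set_data(sqe, (void *)0x1234);               /* ~ UserData */

    io_uring_submit_and_wait(&ring, 1);                       /* ~ SubmitIoRing */

    io_uring_wait_cqe(&ring, &cqe);
    printf("result: %d, user data: %p\n", cqe->res, io_uring_cqe_get_data(cqe));
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

The second argument to io_uring_submit_and_wait is also the "wait for N completions" capability mentioned earlier, which the Windows API currently has no direct equivalent for.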

Conclusion

The implementation of I/O rings on Windows is so similar to the Linux io_uring that it looks like some headers were almost copied from the io_uring implementation. There are some differences between the two features, mostly due to philosophical differences between the two systems and the role and responsibilities they give the user. The Linux io_uring was added a couple of years ago, making it a more mature feature than the new Windows implementation, though it is still a relatively young one and not without issues. It will be interesting to see where these two features go in the future and what parity will exist between them in a few years.

I/O Rings – When One I/O Operation is Not Enough

Introduction

I usually write about security features or techniques on Windows. But today’s blog is not directly related to any security topics, other than the usual added risk that any new system call introduces. However, it’s an interesting addition to the I/O world in Windows that could be useful for developers and I thought it would be interesting to look into and write about. All this is to say – if you’re looking for a new exploit or EDR bypass technique, you should save yourselves the time and look at the other posts on this website instead.

For the three of you who are still reading, let’s talk about I/O rings!

I/O ring is a new feature on Windows preview builds. This is the Windows implementation of a ring buffer – a circular buffer, in this case used to queue multiple I/O operations simultaneously, to allow user-mode applications performing a lot of I/O operations to do so in one action instead of transitioning from user to kernel and back for every individual request.

This new feature adds a lot of new functions and internal data structures, so to avoid constantly breaking the flow of the blog with new data structures I will not put them as part of the post, but their definitions exist in the code sample at the end. I will only show a few internal data structures that aren’t used in the code sample.

I/O Ring Usage

The current implementation of I/O rings only supports read operations and allows queuing up to 0x10000 operations at a time. For every operation the caller will need to supply a handle to the target file, an output buffer, an offset into the file and amount of memory to be read. This is all done in multiple new data structures that will be discussed later. But first, the caller needs to initialize its I/O ring.

Create and Initialize an I/O Ring

To do that, the system supplies a new system call – NtCreateIoRing. This function creates an instance of a new IoRing object type, described here as IORING_OBJECT:

typedef struct _IORING_OBJECT
{
  USHORT Type;
  USHORT Size;
  NT_IORING_INFO Info;
  PSECTION SectionObject;
  PVOID KernelMappedBase;
  PMDL Mdl;
  PVOID MdlMappedBase;
  ULONG_PTR ViewSize;
  ULONG SubmitInProgress;
  PVOID IoRingEntryLock;
  PVOID EntriesCompleted;
  PVOID EntriesSubmitted;
  KEVENT RingEvent;
  PVOID EntriesPending;
  ULONG BuffersRegistered;
  PIORING_BUFFER_INFO BufferArray;
  ULONG FilesRegistered;
  PHANDLE FileHandleArray;
} IORING_OBJECT, *PIORING_OBJECT;

NtCreateIoRing receives one new structure as an input argument – IO_RING_STRUCTV1. This structure contains the version (which currently can only be 1), required and advisory flags (neither currently supports any value other than 0) and the requested sizes of the submission queue and completion queue.

The function receives this information and does the following things:

  1. Validates all the input and output arguments – their addresses, size alignment, etc.
  2. Checks the requested submission queue size and calculates the amount of memory needed for the submission queue based on the requested number of entries.
    1. If SubmissionQueueSize is over 0x10000 a new error status STATUS_IORING_SUBMISSION_QUEUE_TOO_BIG gets returned.
  3. Checks the completion queue size and calculates the amount of memory needed for it.
    1. The completion queue is limited to 0x20000 entries and error code STATUS_IORING_COMPLETION_QUEUE_TOO_BIG is returned if a larger number is requested.
  4. Creates a new object of type IoRingObjectType and initializes all fields that can be initialized at this point – flags, submission queue size and mask, event, etc.
  5. Creates a section for the queues, maps it in system space and creates an MDL to back it. Then maps the same section into user space. This section will contain the submission queue and the completion queue and will be used by the application to communicate the parameters of all requested I/O operations to the kernel and to receive their status codes.
  6. Initializes the output structure with the submission queue address and other data to be returned to the caller.

After NtCreateIoRing returns successfully, the caller can write its data into the supplied submission queue. The queue will have a queue head, followed by an array of NT_IORING_SQE structures, each representing one requested I/O operation. The header describes which entries should be processed at this time:

The queue header describes which entries should be processed using the Head and Tail fields. Head specifies the index of the last unprocessed entry, and Tail specifies the index to stop processing at. Tail - Head has to be lower than the total number of entries, as well as equal to or higher than the number of entries that will be requested in the call to NtSubmitIoRing.

Each queue entry contains data about the requested operation: file handle, file offset, output buffer base, offset and amount of data to be read.  It also contains an OpCode field to specify the requested operation.
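As a rough illustration, picking the next free submission queue entry looks something like the sketch below. It uses the reverse-engineered NT_IORING_INFO, IORING_QUEUE_HEAD and NT_IORING_SQE definitions from the code sample at the end of this post, and assumes Tail is masked with SubQueueSizeMask; that masking behavior is my assumption, not something confirmed from the binary:

// Hedged sketch – the masking behavior is an assumption, not confirmed from the binary
PNT_IORING_SQE GetNextSqe (PNT_IORING_INFO Info)
{
    PNT_IORING_SQE sqeArray = (PNT_IORING_SQE)((ULONG64)Info->SubQueueBase +
                                               sizeof(IORING_QUEUE_HEAD));
    ULONG index = Info->SubQueueBase->Tail & Info->SubQueueSizeMask;

    Info->SubQueueBase->Tail++;
    return &sqeArray[index];
}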

I/O Ring Operation Codes

There are four possible operation types that can be requested by the caller:

  1. IORING_OP_READ: requests that the system reads data from a file into an output buffer. The file handle will be read from the FileRef field in the submission queue entry. This will either be interpreted as a file handle or as an index into a pre-registered array of file handles, depending on whether the IORING_SQE_PREREGISTERED_FILE flag (1) is set in the queue entry Flags field. The output will be written into an output buffer supplied in the Buffer field of the entry. Similar to FileRef, this field can instead contain an index into a pre-registered array of output buffers if the IORING_SQE_PREREGISTERED_BUFFER flag (2) is set.
  2. IORING_OP_REGISTERED_FILES: requests pre-registration of file handles to be processed later. In this case the Buffer field of the queue entry points to an array of file handles. The requested file handles will get duplicated and placed in a new array, in the FileHandleArray field of the IoRing object. The FilesRegistered field will contain the number of file handles.
  3. IORING_OP_REGISTERED_BUFFERS: requests pre-registration of output buffers for file data to be read into. In this case, the Buffer field in the entry should contain an array of IORING_BUFFER_INFO structures, describing addresses and sizes of buffers into which file data will be read:

    typedef struct _IORING_BUFFER_INFO
    {
        PVOID Address;
        ULONG Length;
    } IORING_BUFFER_INFO, *PIORING_BUFFER_INFO;

    The buffers' addresses and sizes will be copied into a new array and placed in the BufferArray field of the IoRing object. The BuffersRegistered field will contain the number of buffers.

  4. IORING_OP_CANCEL: requests the cancellation of a pending operation for a file. Just like in IORING_OP_READ, FileRef can be a handle or an index into the file handle array, depending on the flags. In this case the Buffer field points to the IO_STATUS_BLOCK to be canceled for the file.

All these options can be a bit confusing so here are illustrations for the 4 different reading scenarios, based on the requested flags:

Flags are 0, using the FileRef field as a file handle and the Buffer field as a pointer to the output buffer:

Flag IORING_SQE_PREREGISTERED_FILE (1) is requested, so FileRef is treated as an index into an array of pre-registered file handles and Buffer is a pointer to the output buffer:

Flag IORING_SQE_PREREGISTERED_BUFFER (2) is requested, so FileRef is a handle to a file and Buffer is treated as an index into an array of pre-registered output buffers:

Both IORING_SQE_PREREGISTERED_FILE and IORING_SQE_PREREGISTERED_BUFFER flags are set, so FileRef is treated as an index into a pre-registered file handle array and Buffer is treated as index into a pre-registered buffers array:
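To make that last scenario concrete, here is a hedged sketch of how such an entry could be filled in, using the reverse-engineered NT_IORING_SQE layout from the code sample at the end of the post and the opcode and flag values quoted above. The indices assume the file handles and buffers were registered by earlier IORING_OP_REGISTERED_FILES and IORING_OP_REGISTERED_BUFFERS entries:

// Flag values as described above – these definitions are not in any public header yet
#define IORING_SQE_PREREGISTERED_FILE   1
#define IORING_SQE_PREREGISTERED_BUFFER 2

VOID BuildPreregisteredReadSqe (PNT_IORING_SQE Sqe, ULONG FileIndex, ULONG BufferIndex, ULONG SizeToRead)
{
    Sqe->Opcode = 1;                                 // IORING_OP_READ
    Sqe->Flags = IORING_SQE_PREREGISTERED_FILE | IORING_SQE_PREREGISTERED_BUFFER;
    Sqe->FileRef = (HANDLE)(ULONG_PTR)FileIndex;     // index into the registered file handle array
    Sqe->Buffer = (PVOID)(ULONG_PTR)BufferIndex;     // index into the registered buffer array
    Sqe->FileOffset.QuadPart = 0;
    Sqe->BufferOffset = 0;
    Sqe->BufferSize = SizeToRead;
    Sqe->UserData = nullptr;
}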

Submitting and Processing I/O Ring

Once the caller has set up all its submission queue entries, it can call NtSubmitIoRing to submit its requests to the kernel to be processed according to the requested parameters. Internally, NtSubmitIoRing iterates over all the entries and calls IopProcessIoRingEntry, passing it the IoRing object and the current queue entry. The entry gets processed according to the specified OpCode, and then IopIoRingDispatchComplete is called to fill in the completion queue. The completion queue, much like the submission queue, begins with a header containing a Head and a Tail, followed by an array of entries. Each entry is an IORING_CQE structure – it has the UserData value from the submission queue entry and the Status and Information from the IO_STATUS_BLOCK for the operation:

typedef struct _IORING_CQE
{
    UINT_PTR UserData;
    HRESULT ResultCode;
    ULONG_PTR Information;
} IORING_CQE, *PIORING_CQE;

Once all requested entries are completed the system sets the event in IoRingObject->RingEvent. If not all entries are complete yet, the system will wait on the event using the Timeout received from the caller, waking up either when all requests are completed and the event is signaled, or when the timeout expires.

Since multiple entries can be processed, the status returned to the caller will either be an error status indicating a failure to process the entries or the return value of KeWaitForSingleObject. Status codes for individual operations can be found in the completion queue – so don’t confuse receiving a STATUS_SUCCESS code from NtSubmitIoRing with successful read operations!
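For example, a caller could drain its completion queue with something like the sketch below. It assumes the completion ring starts with the same Head/Tail header as the submission ring and is followed directly by the IORING_CQE array – none of this layout is documented, so treat it as an illustration only. It uses the IORING_CQE structure above and the NT_IORING_INFO and IORING_QUEUE_HEAD definitions from the code sample at the end of the post:

// Hedged sketch – the completion ring layout is assumed, not confirmed
void DrainCompletionQueue (PNT_IORING_INFO Info)
{
    PIORING_QUEUE_HEAD cqHead = (PIORING_QUEUE_HEAD)Info->CompQueueBase;
    PIORING_CQE cqes = (PIORING_CQE)((ULONG64)Info->CompQueueBase +
                                     sizeof(IORING_QUEUE_HEAD));

    while (cqHead->Head != cqHead->Tail)
    {
        PIORING_CQE cqe = &cqes[cqHead->Head & Info->CompQueueSizeMask];
        printf("UserData: %p Status: 0x%x Information: 0x%llx\n",
               (PVOID)cqe->UserData, (ULONG)cqe->ResultCode, (ULONG64)cqe->Information);
        cqHead->Head++;
    }
}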

Using I/O Ring – The Official Way

Like other system calls, those new IoRing functions are not documented and not meant to be used directly. Instead, KernelBase.dll offers convenient wrapper functions that receive easy-to-use arguments and internally handle all the undocumented functions and data structures that need to be sent to the kernel. There are functions to create, query, submit and close the IoRing, as well as helper functions to build queue entries for the four different operations, which were discussed earlier.

CreateIoRing

CreateIoRing receives information about flags and queue sizes, and internally calls NtCreateIoRing and returns a handle to an IoRing instance:

HRESULT
CreateIoRing (
    _In_ IORING_VERSION IoRingVersion,
    _In_ IORING_CREATE_FLAGS Flags,
    _In_ UINT32 SubmissionQueueSize,
    _In_ UINT32 CompletionQueueSize,
    _Out_ HIORING* Handle
);

This new handle type is actually a pointer to an undocumented structure containing the structure returned from NtCreateIoRing and other data needed to manage this IoRing instance:

typedef struct _HIORING
{
    ULONG SqePending;
    ULONG SqeCount;
    HANDLE handle;
    IORING_INFO Info;
    ULONG IoRingKernelAcceptedVersion;
} HIORING, *PHIORING;

All the other IoRing functions will receive this handle as their first argument.

After creating an IoRing instance, the application needs to build queue entries for all the requested I/O operations. Since the internal structure of the queues and the queue entry structures are not documented, KernelBase.dll exports helper functions to build those using input data supplied by the caller. There are four functions for this purpose:

  1. BuildIoRingReadFile
  2. BuildIoRingRegisterBuffers
  3. BuildIoRingRegisterFileHandles
  4. BuildIoRingCancelRequest

Each function adds a new queue entry to the submission queue with the required opcode and data. Their names make their purposes pretty obvious, but let's go over them one by one just for clarity:

BuildIoRingReadFile

HRESULT
BuildIoRingReadFile (
    _In_ HIORING IoRing,
    _In_ IORING_HANDLE_REF FileRef,
    _In_ IORING_BUFFER_REF DataRef,
    _In_ ULONG NumberOfBytesToRead,
    _In_ ULONG64 FileOffset,
    _In_ ULONG_PTR UserData,
    _In_ IORING_SQE_FLAGS Flags
);

The function receives the handle returned by CreateIoRing, followed by two new data structures. Both of these structures have a Kind field, which can be either IORING_REF_RAW, indicating that the supplied value is a raw reference, or IORING_REF_REGISTERED, indicating that the value is an index into a pre-registered array. The second field is a union of a value and an index, in which the file handle or buffer will be supplied.

BuildIoRingRegisterFileHandles and BuildIoRingRegisterBuffers

HRESULT
BuildIoRingRegisterFileHandles (
    _In_ HIORING IoRing,
    _In_ ULONG Count,
    _In_ HANDLE const Handles[],
    _In_ PVOID UserData
);

HRESULT
BuildIoRingRegisterBuffers (
    _In_ HIORING IoRing,
    _In_ ULONG Count,
    _In_ IORING_BUFFER_INFO const Buffers[],
    _In_ PVOID UserData
);

These two functions create submission queue entries to pre-register file handles and output buffers. Both receive the handle returned from CreateIoRing, the count of pre-registered files/buffers in the array, an array of the handles or buffers to register and UserData.

In BuildIoRingRegisterFileHandles, Handles is a pointer to an array of file handles and in BuildIoRingRegisterBuffers, Buffers is a pointer to an array of IORING_BUFFER_INFO structures containing Buffer base and size.
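A minimal usage sketch based on the prototypes above – handle and result are the HIORING and HRESULT from the code sample at the end of the post, and hFile and buffer stand in for a valid file handle and a committed buffer of sizeToRead bytes:

// Register one file handle and one output buffer so later entries can reference them by index
HANDLE files[] = { hFile };
IORING_BUFFER_INFO buffers[] = { { buffer, sizeToRead } };

result = BuildIoRingRegisterFileHandles(handle, ARRAYSIZE(files), files, NULL);
if (SUCCEEDED(result))
{
    result = BuildIoRingRegisterBuffers(handle, ARRAYSIZE(buffers), buffers, NULL);
}

Read entries that want to use these registrations would then set the Kind field of their references to IORING_REF_REGISTERED and pass the index instead of the raw handle or pointer.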

BuildIoRingCancelRequest

HRESULT
BuildIoRingCancelRequest (
    _In_ HIORING IoRing,
    _In_ IORING_HANDLE_REF File,
    _In_ PVOID OpToCancel,
    _In_ PVOID UserData
);

Just like the other functions, BuildIoRingCancelRequest receives as its first argument the handle that was returned from CreateIoRing. The second argument is an IORING_HANDLE_REF structure that contains the handle (or index into the file handle array) of the file whose operation should be canceled. The third and fourth arguments are the operation to cancel (which goes into the Buffer field of the queue entry) and the UserData.

After all queue entries were built with those functions, the queue can be submitted:

SubmitIoRing

HRESULT
SubmitIoRing (
    _In_ HIORING IoRingHandle,
    _In_ ULONG WaitOperations,
    _In_ ULONG Milliseconds,
    _Out_ PULONG SubmittedEntries
);

The function receives as its first argument the same handle that was used to initialize the IoRing and build the submission queue. It then receives the number of entries to submit, the time in milliseconds to wait for the operations to complete, and a pointer to an output parameter that will receive the number of entries that were submitted.

GetIoRingInfo

HRESULT
GetIoRingInfo (
    _In_ HIORING IoRingHandle,
    _Out_ PIORING_INFO IoRingBasicInfo
);

This API returns information about the current state of the IoRing with a new structure:

typedef struct _IORING_INFO
{
  IORING_VERSION IoRingVersion;
  IORING_CREATE_FLAGS Flags;
  ULONG SubmissionQueueSize;
  ULONG CompletionQueueSize;
} IORING_INFO, *PIORING_INFO;

This contains the version and flags of the IoRing as well as the current size of the submission and completion queues.

Once all operations on the IoRing are done, it needs to be closed using CloseIoRing, which receives the handle as its only argument, closes the handle to the IoRing object and frees the memory used for the structure.

So far I couldn't find anything on the system that makes use of this feature, but once 21H2 is released I'd expect I/O-heavy Windows applications to start using it, probably mostly in server and Azure environments.

Conclusion

So far, no public documentation exists for this new addition to the I/O world in Windows, but hopefully when 21H2 is released later this year we will see all of this officially documented and used by both Windows and 3rd party applications. If used wisely, this could lead to significant performance improvements for applications that perform frequent read operations. Like every new feature and system call, this could also have unexpected security effects. One bug was already found by hFiref0x, who was the first to publicly mention this feature and managed to crash the system by sending an incorrect parameter to NtCreateIoRing – a bug that has since been fixed. Looking more closely into these functions will likely lead to more such discoveries and interesting side effects of this new mechanism.

Code

Here’s a small PoC showing two ways to use I/O rings – either through the official KernelBase API, or through the internal ntdll API. For the code to compile properly make sure to link it against onecoreuap.lib (for the KernelBase functions) or ntdll.lib (for the ntdll functions):

#include <ntstatus.h>
#define WIN32_NO_STATUS
#include <Windows.h>
#include <cstdio>
#include <ioringapi.h>
#include <winternl.h>

typedef struct _IO_RING_STRUCTV1
{
    ULONG IoRingVersion;
    ULONG SubmissionQueueSize;
    ULONG CompletionQueueSize;
    ULONG RequiredFlags;
    ULONG AdvisoryFlags;
} IO_RING_STRUCTV1, *PIO_RING_STRUCTV1;

typedef struct _IORING_QUEUE_HEAD
{
    ULONG Head;
    ULONG Tail;
    ULONG64 Flags;
} IORING_QUEUE_HEAD, *PIORING_QUEUE_HEAD;

typedef struct _NT_IORING_INFO
{
    ULONG Version;
    IORING_CREATE_FLAGS Flags;
    ULONG SubmissionQueueSize;
    ULONG SubQueueSizeMask;
    ULONG CompletionQueueSize;
    ULONG CompQueueSizeMask;
    PIORING_QUEUE_HEAD SubQueueBase;
    PVOID CompQueueBase;
} NT_IORING_INFO, *PNT_IORING_INFO;

typedef struct _NT_IORING_SQE
{
    ULONG Opcode;
    ULONG Flags;
    HANDLE FileRef;
    LARGE_INTEGER FileOffset;
    PVOID Buffer;
    ULONG BufferSize;
    ULONG BufferOffset;
    ULONG Key;
    PVOID Unknown;
    PVOID UserData;
    PVOID stuff1;
    PVOID stuff2;
    PVOID stuff3;
    PVOID stuff4;
} NT_IORING_SQE, *PNT_IORING_SQE;

EXTERN_C_START
NTSTATUS
NtSubmitIoRing (
    _In_ HANDLE Handle,
    _In_ IORING_CREATE_REQUIRED_FLAGS Flags,
    _In_ ULONG EntryCount,
    _In_ PLARGE_INTEGER Timeout
    );

NTSTATUS
NtCreateIoRing (
    _Out_ PHANDLE pIoRingHandle,
    _In_ ULONG CreateParametersSize,
    _In_ PIO_RING_STRUCTV1 CreateParameters,
    _In_ ULONG OutputParametersSize,
    _Out_ PNT_IORING_INFO pRingInfo
    );

NTSTATUS
NtClose (
    _In_ HANDLE Handle
    );

EXTERN_C_END

void IoRingNt ()
{
    NTSTATUS status;
    IO_RING_STRUCTV1 ioringStruct;
    NT_IORING_INFO ioringInfo;
    HANDLE handle = NULL;
    PNT_IORING_SQE sqe;
    LARGE_INTEGER timeout;
    HANDLE hFile = NULL;
    ULONG sizeToRead = 0x200;
    PVOID *buffer = NULL;
    ULONG64 endOfBuffer;

    ioringStruct.IoRingVersion = 1;
    ioringStruct.SubmissionQueueSize = 1;
    ioringStruct.CompletionQueueSize = 1;
    ioringStruct.AdvisoryFlags = IORING_CREATE_ADVISORY_FLAGS_NONE;
    ioringStruct.RequiredFlags = IORING_CREATE_REQUIRED_FLAGS_NONE;

    status = NtCreateIoRing(&handle,
                            sizeof(ioringStruct),
                            &ioringStruct,
                            sizeof(ioringInfo),
                            &ioringInfo);
    if (!NT_SUCCESS(status))
    {
        printf("Failed creating IO ring handle: 0x%x\n", status);
        goto Exit;
    }

    ioringInfo.SubQueueBase->Tail = 1;
    ioringInfo.SubQueueBase->Head = 0;
    ioringInfo.SubQueueBase->Flags = 0;

    hFile = CreateFile(L"C:\\Windows\\System32\\notepad.exe",
                       GENERIC_READ,
                       0,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);

    if (hFile == INVALID_HANDLE_VALUE)
    {
        printf("Failed opening file handle: 0x%x\n", GetLastError());
        goto Exit;
    }

    sqe = (PNT_IORING_SQE)((ULONG64)ioringInfo.SubQueueBase + sizeof(IORING_QUEUE_HEAD));
    sqe->Opcode = 1;
    sqe->Flags = 0;
    sqe->FileRef = hFile;
    sqe->FileOffset.QuadPart = 0;
    buffer = (PVOID*)VirtualAlloc(NULL, sizeToRead, MEM_COMMIT, PAGE_READWRITE);
    if (buffer == NULL)
    {
        printf("Failed allocating memory\n");
        goto Exit;
    }
    sqe->Buffer = buffer;
    sqe->BufferOffset = 0;
    sqe->BufferSize = sizeToRead;
    sqe->Key = 1234;
    sqe->UserData = nullptr;

    timeout.QuadPart = -10000;

    status = NtSubmitIoRing(handle, IORING_CREATE_REQUIRED_FLAGS_NONE, 1, &timeout);
    if (!NT_SUCCESS(status))
    {
        printf("Failed submitting IO ring: 0x%x\n", status);
        goto Exit;
    }
    printf("Data from file:\n");
    endOfBuffer = (ULONG64)buffer + sizeToRead;
    for (PVOID* current = buffer; (ULONG64)current < endOfBuffer; current++)
    {
        printf("%p ", *current);
    }
    printf("\n");

Exit:
    if (handle)
    {
        NtClose(handle);
    }
    if ((hFile != NULL) && (hFile != INVALID_HANDLE_VALUE))
    {
        NtClose(hFile);
    }
    if (buffer)
    {
        VirtualFree(buffer, NULL, MEM_RELEASE);
    }
}

void IoRingKernelBase ()
{
    HRESULT result;
    HIORING handle = NULL;
    IORING_CREATE_FLAGS flags;
    IORING_HANDLE_REF requestDataFile;
    IORING_BUFFER_REF requestDataBuffer;
    UINT32 submittedEntries;
    HANDLE hFile = NULL;
    ULONG sizeToRead = 0x200;
    PVOID *buffer = NULL;
    ULONG64 endOfBuffer;

    flags.Required = IORING_CREATE_REQUIRED_FLAGS_NONE;
    flags.Advisory = IORING_CREATE_ADVISORY_FLAGS_NONE;
    result = CreateIoRing(IORING_VERSION_1, flags, 1, 1, &handle);
    if (!SUCCEEDED(result))
    {
        printf("Failed creating IO ring handle: 0x%x\n", result);
        goto Exit;
    }

    hFile = CreateFile(L"C:\\Windows\\System32\\notepad.exe",
                       GENERIC_READ,
                       0,
                       NULL,
                       OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL,
                       NULL);
    if (hFile == INVALID_HANDLE_VALUE)
    {
        printf("Failed opening file handle: 0x%x\n", GetLastError());
        goto Exit;
    }
    requestDataFile.Kind = IORING_REF_RAW;
    requestDataFile.Handle = hFile;
    requestDataBuffer.Kind = IORING_REF_RAW;
    buffer = (PVOID*)VirtualAlloc(NULL,
                                  sizeToRead,
                                  MEM_COMMIT,
                                  PAGE_READWRITE);
    if (buffer == NULL)
    {
        printf("Failed to allocate memory\n");
        goto Exit;
    }
    requestDataBuffer.Buffer = buffer;
    result = BuildIoRingReadFile(handle,
                                 requestDataFile,
                                 requestDataBuffer,
                                 sizeToRead,
                                 0,
                                 NULL,
                                 IOSQE_FLAGS_NONE);
    if (!SUCCEEDED(result))
    {
        printf("Failed building IO ring read file structure: 0x%x\n", result);
        goto Exit;
    }

    result = SubmitIoRing(handle, 1, 10000, &submittedEntries);
    if (!SUCCEEDED(result))
    {
        printf("Failed submitting IO ring: 0x%x\n", result);
        goto Exit;
    }
    printf("Data from file:\n");
    endOfBuffer = (ULONG64)buffer + sizeToRead;
    for (PVOID* current = buffer; (ULONG64)current < endOfBuffer; current++)
    {
        printf("%p ", *current);
    }
    printf("\n");

Exit:
    if (handle != 0)
    {
        CloseIoRing(handle);
    }
    if ((hFile != NULL) && (hFile != INVALID_HANDLE_VALUE))
    {
        NtClose(hFile);
    }
    if (buffer)
    {
        VirtualFree(buffer, NULL, MEM_RELEASE);
    }
}

int main ()
{
    IoRingKernelBase();
    IoRingNt();
    ExitProcess(0);
}

Thread and Process State Change

a.k.a: EDR Hook Evasion – Method #4512

Every couple of weeks a new build of Windows Insider gets released. Some have lots of changes and introduce completely new features, some only have minor bug fixes, and some simply insist on crashing repeatedly for no good reason. A few months ago one of those builds had a few surprising changes — It introduced 2 new object types and 4 new system calls, not something that happens every day. So of course I went investigating. What I discovered is a confusingly over-engineered feature, which was added to solve a problem that could have been solved by much simpler means and which has the side effect of supplying attackers with a new way to evade EDR hooks.

Suspending and Resuming Threads – Now With 2 Extra Steps!

The problem that this feature is trying to solve is this: what happens if a process suspends a thread and then terminates before resuming it? Unless some other part of the system realizes what happened, the thread will remain suspended forever and will never resume its execution. To solve that, this new feature allows suspending and resuming threads and processes through the new object types, which will keep track of the suspension state of the threads or processes. That way, when the object is destroyed (for example, when the process that created it is terminated), the system will reset the state of the target process or thread by suspending or resuming it as needed.

This feature is pretty easy to use – the caller first needs to call NtCreateThreadStateChange (or NtCreateProcessStateChange. Both cases are almost identical but we’ll stay with the thread case for simplicity) to create a new object of type PspThreadStateChangeType. This object type is not documented, but its internal structure looks something like this:

typedef struct _THREAD_STATE_OBJECT
{
    PETHREAD Thread;
    EX_PUSH_LOCK Lock;
    ULONG ThreadSuspendCount;
} THREAD_STATE_OBJECT, *PTHREAD_STATE_OBJECT;

NtCreateThreadStateChange has the following prototype:

NTSTATUS
NtCreateThreadStateChange (
    _Out_ PHANDLE StateChangeHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ POBJECT_ATTRIBUTES ObjectAttributes,
    _In_ HANDLE ThreadHandle,
    _In_ ULONG Unused
);

The 2 arguments we are interested in are the first one, which will receive a handle to the new object, and the fourth, a handle to the thread that will be referenced by the structure. Any future suspend or resume operation done through this object can only work on the thread that was passed into this function. NtCreateThreadStateChange will create a new object instance, set the thread pointer to the requested thread, and initialize the lock and count fields to zero.

When calling NtCreateProcessStateChange to operate on a process, the thread handle will be replaced with a process handle and the object that will be created will be of type PspProcessStateChangeType. The only change in the structure is that the ETHREAD pointer is replaced with an EPROCESS pointer.

The next step is calling NtChangeThreadState (or NtChangeProcessState, if operating on a process). This function receives a handle to the thread state change object, a handle to the same thread that was passed when creating the object, and an action, which is an enum value:

typedef enum _THREAD_STATE_CHANGE_TYPE
{
    ThreadStateChangeSuspend = 0,
    ThreadStateChangeResume = 1,
    ThreadStateChangeMax = 2,
} THREAD_STATE_CHANGE_TYPE, *PTHREAD_STATE_CHANGE_TYPE;

typedef enum _PROCESS_STATE_CHANGE_TYPE
{
    ProcessStateChangeSuspend = 0,
    ProcessStateChangeResume = 1,
    ProcessStateChangeMax = 2,
} PROCESS_STATE_CHANGE_TYPE, *PPROCESS_STATE_CHANGE_TYPE;

It also receives an “Extended Information” variable and its length, both of which are unused and must be zero, and another reserved argument that must also be zero. The function will validate that the thread pointed to by the thread state change object is the same as the thread whose handle was passed into the function, and then call the appropriate function based on the requested action – PsSuspendThread or PsMultiResumeThread. Then it will increment or decrement the ThreadSuspendCount field based on the action that was performed. There are 2 limitations enforced by the suspend count:

  1. A thread cannot be resumed if the object’s ThreadSuspendCount is zero, even if the thread is currently suspended. It must be suspended and resumed using the state change API, otherwise things will start acting funny.
  2. A thread cannot be suspended if ThreadSuspendCount is 0x7FFFFFFF. This is meant to avoid overflowing the counter. However, this is a weird limitation since KeSuspendThread (the internal function called from PsSuspendThread) already enforces a suspension limit of 127 through the thread's SuspendCount field, and will return STATUS_SUSPEND_COUNT_EXCEEDED if the count exceeds that.
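Putting the two calls together, here is a minimal, hedged sketch of suspending and resuming a thread through these new APIs. The NtCreateThreadStateChange prototype mirrors the one shown above (with ObjectAttributes left as a plain pointer, as in the PoC at the end of this post); the NtChangeThreadState prototype is reconstructed from the description of its arguments, so treat it as an assumption, and hThread is assumed to be a thread handle with sufficient access:

EXTERN_C NTSTATUS NtCreateThreadStateChange (
    _Out_ PHANDLE StateChangeHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ PVOID ObjectAttributes,
    _In_ HANDLE ThreadHandle,
    _In_ ULONG Unused
    );

EXTERN_C NTSTATUS NtChangeThreadState (
    _In_ HANDLE StateChangeHandle,
    _In_ HANDLE ThreadHandle,
    _In_ ULONG Action,                    // 0 = suspend, 1 = resume (reconstructed, not documented)
    _In_ PVOID ExtendedInformation,
    _In_ SIZE_T ExtendedInformationLength,
    _In_ ULONG64 Reserved
    );

NTSTATUS SuspendAndResumeThread (HANDLE hThread)
{
    HANDLE stateChange = NULL;
    NTSTATUS status = NtCreateThreadStateChange(&stateChange,
                                                MAXIMUM_ALLOWED,
                                                NULL,
                                                hThread,
                                                0);
    if (!NT_SUCCESS(status))
    {
        return status;
    }

    NtChangeThreadState(stateChange, hThread, 0, NULL, 0, 0);   // ThreadStateChangeSuspend
    //
    // ... do something while the thread is suspended ...
    //
    NtChangeThreadState(stateChange, hThread, 1, NULL, 0, 0);   // ThreadStateChangeResume

    //
    // ThreadSuspendCount is back to zero, so destroying the object
    // when the last handle closes has nothing left to undo
    //
    NtClose(stateChange);
    return STATUS_SUCCESS;
}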

So far this works like the classic suspend and resume mechanism, just with a few extra steps. A caller still needs to make an API call to suspend a thread or process and another one to resume it.  But the benefit of having new object types is that objects can have kernel routines that get called for certain operations related to the object, such as open, close and delete:

dx (*(nt!_OBJECT_TYPE**)&nt!PspThreadStateChangeType)->TypeInfo
    (*(nt!_OBJECT_TYPE**)&nt!PspThreadStateChangeType)->TypeInfo                 [Type: _OBJECT_TYPE_INITIALIZER]
    [+0x000] Length           : 0x78 [Type: unsigned short]
    [+0x002] ObjectTypeFlags  : 0x6 [Type: unsigned short]
    [+0x002 ( 0: 0)] CaseInsensitive  : 0x0 [Type: unsigned char]
    [+0x002 ( 1: 1)] UnnamedObjectsOnly : 0x1 [Type: unsigned char]
    [+0x002 ( 2: 2)] UseDefaultObject : 0x1 [Type: unsigned char]
    [+0x002 ( 3: 3)] SecurityRequired : 0x0 [Type: unsigned char]
    [+0x002 ( 4: 4)] MaintainHandleCount : 0x0 [Type: unsigned char]
    [+0x002 ( 5: 5)] MaintainTypeList : 0x0 [Type: unsigned char]
    [+0x002 ( 6: 6)] SupportsObjectCallbacks : 0x0 [Type: unsigned char]
    [+0x002 ( 7: 7)] CacheAligned     : 0x0 [Type: unsigned char]
    [+0x003 ( 0: 0)] UseExtendedParameters : 0x0 [Type: unsigned char]
    [+0x003 ( 7: 1)] Reserved         : 0x0 [Type: unsigned char]
    [+0x004] ObjectTypeCode   : 0x0 [Type: unsigned long]
    [+0x008] InvalidAttributes : 0x92 [Type: unsigned long]
    [+0x00c] GenericMapping   [Type: _GENERIC_MAPPING]
    [+0x01c] ValidAccessMask  : 0x1f0001 [Type: unsigned long]
    [+0x020] RetainAccess     : 0x0 [Type: unsigned long]
    [+0x024] PoolType         : PagedPool (1) [Type: _POOL_TYPE]
    [+0x028] DefaultPagedPoolCharge : 0x70 [Type: unsigned long]
    [+0x02c] DefaultNonPagedPoolCharge : 0x0 [Type: unsigned long]
    [+0x030] DumpProcedure    : 0x0 [Type: void (__cdecl*)(void *,_OBJECT_DUMP_CONTROL *)]
    [+0x038] OpenProcedure    : 0x0 [Type: long (__cdecl*)(_OB_OPEN_REASON,char,_EPROCESS *,void *,unsigned long *,unsigned long)]
    [+0x040] CloseProcedure   : 0x0 [Type: void (__cdecl*)(_EPROCESS *,void *,unsigned __int64,unsigned __int64)]
    [+0x048] DeleteProcedure  : 0xfffff80265650d20 [Type: void (__cdecl*)(void *)]
    [+0x050] ParseProcedure   : 0x0 [Type: long (__cdecl*)(void *,void *,_ACCESS_STATE *,char,unsigned long,_UNICODE_STRING *,_UNICODE_STRING *,void *,_SECURITY_QUALITY_OF_SERVICE *,void * *)]
    [+0x050] ParseProcedureEx : 0x0 [Type: long (__cdecl*)(void *,void *,_ACCESS_STATE *,char,unsigned long,_UNICODE_STRING *,_UNICODE_STRING *,void *,_SECURITY_QUALITY_OF_SERVICE *,_OB_EXTENDED_PARSE_PARAMETERS *,void * *)]
    [+0x058] SecurityProcedure : 0xfffff802656bffd0 [Type: long (__cdecl*)(void *,_SECURITY_OPERATION_CODE,unsigned long *,void *,unsigned long *,void * *,_POOL_TYPE,_GENERIC_MAPPING *,char)]
    [+0x060] QueryNameProcedure : 0x0 [Type: long (__cdecl*)(void *,unsigned char,_OBJECT_NAME_INFORMATION *,unsigned long,unsigned long *,char)]
    [+0x068] OkayToCloseProcedure : 0x0 [Type: unsigned char (__cdecl*)(_EPROCESS *,void *,void *,char)]
    [+0x070] WaitObjectFlagMask : 0x0 [Type: unsigned long]
    [+0x074] WaitObjectFlagOffset : 0x0 [Type: unsigned short]
    [+0x076] WaitObjectPointerOffset : 0x0 [Type: unsigned short]

PspThreadStateChangeType has 2 registered procedures – the security procedure, which is SeDefaultObjectMethod and not too interesting to look at in this case as it is the default function, and the delete procedure, which is PspDeleteThreadStateChange. This function will get called every time a thread state change object is destroyed, and does a pretty simple thing:

If the state change object has a non-zero ThreadSuspendCount, the function will resume the target thread as many times as it was suspended through the object. As you can imagine, the process state change object also registers a delete procedure, PspDeleteProcessStateChange, which does something very similar.

New System Calls == New EDR Bypass

This is a nice, if slightly over-complicated, solution to the problem, but it has the unexpected side-effect of creating new and undocumented APIs to suspend and resume processes and threads. Since suspend and resume are very useful operations for attackers wishing to inject code, the well-known NtSuspendThread/Process and NtResumeThread/Process APIs are some of the first system calls that are hooked by security solutions, hoping to detect those attacks.

Having new APIs that perform the same operations without going through the well-known and often-monitored system calls is a great chance for attackers to avoid detection by security solutions that don’t keep up with recent changes (though I’m sure all EDR solutions have already started monitoring these new functions and have been doing so since this build was released. Right…?).

There is still a way to keep those same detections without chasing every one of Microsoft's recent code changes – even though this feature adds new system calls, the internal kernel mechanism invoked by them remains the same. And in Windows 10, this mechanism uses a feature whose sole purpose is to help security solutions gain more information about the system and move them away from relying on user-mode hooks – ETW tracing. More specifically, the Threat Intelligence ETW channel that was added specifically for security purposes. That channel notifies about events that are often interesting to security products, such as virtual memory protection changes, virtual memory writes, driver loads, and, as you probably already guessed, suspending and resuming threads and processes. EDRs that register for these ETW events and use them as part of their detection logic will not miss anything due to the new state change APIs, since the events will be received in either case. Those that don't use them yet should probably open some Jira tickets that will be forgotten until this technique is found in the wild.

1 EDR Bypass + Windows Internals = 2 EDR Bypasses

However, this feature does create another interesting EDR bypass. As I mentioned, the suspended process or thread will automatically be resumed when the state change object gets destroyed. Normally, this would happen when the process that created the object either closes the only handle to it or exits – this automatically destroys all open handles held by the process. But an object only gets destroyed when all handles to it are closed and there are no more references to it. This means that if another process has an open handle to the state change object it won’t get destroyed when the process that created it exits, and the suspended process or thread won’t be resumed until the second process exits. This shouldn’t happen under normal circumstances, but if a process duplicates its handle to a state change object into another process, it can safely exit without resuming the suspended process or thread.

But why would a process want to do that?

The ETW events that report that a process is being suspended or resumed contain the process ID of the process that performed the action – this way the EDR that consumes the event can correlate different events together and attribute them to a potentially malicious process. In this case, the PID would be the ID of the process in whose context the action happened. So let's say we create a process that suspends another process through a state change object, then duplicates the handle into a third process and exits. The process state change object doesn't get destroyed yet since there is still a running process with an open handle to it. Only when the other process exits does the duplicated handle get closed and the suspended process get resumed. But since the resume action happened in the context of the second process, which had nothing to do with the suspend action, that is the PID that will appear in the ETW event.

So, in this proposed scenario, a process will get suspended and later resumed, and ETW events will still be thrown for both actions. But these events will have happened in the context of 2 different processes so they will be difficult to link together, and it will be even more difficult to attribute the resume action to the first process without knowledge of this exact scenario. And we can be even smarter – a lot of security products ignore operations that are attributed to certain system processes. This makes sense, since those processes are not expected to be malicious but might have suspicious-looking activity, so it is easier to ignore them unless there is clear indication of code injection, to avoid false positives.

So we can even choose an innocent-looking Windows process to duplicate our handle into, to maximize the chances that the resume operation will be ignored completely. We just need to find a process that we can open a handle to and that will terminate at some point, to resume our suspended process.

Finally, Code!

In this PoC I simply create 2 notepad.exe processes. One will be suspended using a state change object, and the other will have the handle duplicated inside it. Then the PoC process exits but the suspended notepad remains suspended until the other notepad process is terminated:

#include <Windows.h>
#include <stdio.h>

EXTERN_C_START
NTSTATUS
NtCreateProcessStateChange (
    _Out_ PHANDLE StateChangeHandle,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ PVOID ObjectAttributes,
    _In_ HANDLE ProcessHandle,
    _In_ ULONG Unknown
    );

NTSTATUS
NtChangeProcessState (
    _In_ HANDLE StateChangeHandle,
    _In_ HANDLE ProcessHandle,
    _In_ ULONG Action,
    _In_ PVOID ExtendedInformation,
    _In_ SIZE_T ExtendedInformationLength,
    _In_ ULONG64 Reserved
    );
EXTERN_C_END

int main ()
{
    HANDLE stateChangeHandle;
    PROCESS_INFORMATION procInfo = { 0 };
    PROCESS_INFORMATION procInfo2 = { 0 };
    STARTUPINFOW startInfo;
    BOOL result;
    NTSTATUS status;

    stateChangeHandle = nullptr;

    ZeroMemory(&startInfo, sizeof(startInfo));
    startInfo.cb = sizeof(startInfo);
    result = CreateProcess(L"C:\\Windows\\System32\\notepad.exe",
                           NULL,
                           NULL,
                           NULL,
                           FALSE,
                           0,
                           NULL,
                           NULL,
                           &startInfo,
                           &procInfo);
    if (result == FALSE)
    {
        goto Exit;
    }
    CloseHandle(procInfo.hThread);
    result = CreateProcess(L"C:\\Windows\\System32\\notepad.exe",
                           NULL,
                           NULL,
                           NULL,
                           FALSE,
                           0,
                           NULL,
                           NULL,
                           &startInfo,
                           &procInfo2);
    if (result == FALSE)
    {
        goto Exit;
    }
    CloseHandle(procInfo2.hThread);

    status = NtCreateProcessStateChange(&stateChangeHandle,
                                        MAXIMUM_ALLOWED,
                                        NULL,
                                        procInfo.hProcess,
                                        0);
    if (!NT_SUCCESS(status))
    {
        printf("Failed creating process state change. Status: 0x%x\n", status);
        goto Exit;
    }
    //
    // Action == 0 means Suspend
    //
    status = NtChangeProcessState(stateChangeHandle,
                                  procInfo.hProcess,
                                  0, // ProcessStateChangeSuspend
                                  NULL,
                                  0,
                                  0);
    if (!NT_SUCCESS(status))
    {
        printf("Failed changing process state. Status: 0x%x\n", status);
        goto Exit;
    }

    result = DuplicateHandle(GetCurrentProcess(),
                             stateChangeHandle,
                             procInfo2.hProcess,
                             NULL,
                             NULL,
                             TRUE,
                             DUPLICATE_SAME_ACCESS);
    if (result == FALSE)
    {
        printf("Failed duplicating handle: 0x%x\n", GetLastError());
        goto Exit;
    }

Exit:
    if (procInfo.hProcess != NULL)
    {
        CloseHandle(procInfo.hProcess);
    }
    if (procInfo2.hProcess != NULL)
    {
        CloseHandle(procInfo2.hProcess);
    }
    if (stateChangeHandle != NULL)
    {
        CloseHandle(stateChangeHandle);
    }
    return 0;
}

Like a lot of other cases, this feature started out as a well-intentioned attempt to solve a minor system issue. But an over-engineered design led to multiple security concerns and whole new EDR evasion techniques which turned the relatively small issue into a much larger one.