
64 and 32bit code in the same process

This all started with a crash. A normal crash outside the debugger, in a 32bit application running on the 64bit version of Windows. I just happened to be running under the kernel debugger when it crashed, so before I tried to reproduce the crash in Visual Studio I probed around to get an idea of what caused it. The crash was inside Windows and the address was a typical system component address in the lower 2GB of the process, but when I looked at the code I was surprised to see it was within 64bit code. I wasn’t expecting 64bit code even within the OS components of a 32bit application. Was I looking at the correct process in the debugger? Yes, I had the correct process and my assumption about the 64bit code was wrong: there is lots of 64bit code in the 32bit system DLLs of every 32bit application. So if the OS can use 64bit code in a 32bit process, can I do the same? Can I do it the other way around and use 32bit code in a 64bit process?

Why would this be useful? Well, consider a 64bit application that loads plugins. These plugins have to be 64bit if the host application is 64bit, and for an application like Photoshop this causes problems. A lot of older plugins were made by companies that no longer exist, so there will never be a 64bit version, and they are closed source so compiling your own is not an option. If you have to use that old plugin you are stuck with the 32bit version of the application. That only works as long as the 32bit application exists, which might not be much longer for Adobe apps considering Premiere no longer has a 32bit version.

Wouldn’t it be easier if you could mix and match? Agreed, there are address space issues, but there is no real reason why a 64bit application can’t load a 32bit DLL and marshal the data through the bottom 4GB of address space, which the 32bit DLL can access. What if you have a JIT that compiles in 64bit but only generates 32bit code? Your only option is to rewrite the code generator to output 64bit opcodes, which isn’t impossible, but if you call other functions the entire ABI has to change. In both of these cases you have to be careful with the addresses, but there is no reason why this cannot work; other OSes manage it. The only real problem is Microsoft doesn’t want you to do this.
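As a rough sketch of the marshalling idea, a 64bit process can ask VirtualAlloc for a buffer at an explicit base below 4GB. The base 0x10000000 here is arbitrary and might already be in use, so real code would probe a few candidates:

#include <windows.h>
#include <stdio.h>

int main(void)
{
  // ask for a page at a fixed address below 4GB; this fails if the
  // address is already taken, so a real loader would probe a range
  void* shared = VirtualAlloc((void*)0x10000000, 0x1000,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
  if (!shared)
    return 1;
  // the low 32 bits are the whole pointer, so a 32bit module handed
  // this value addresses the same memory
  printf("buffer at %p, as a 32bit value %08lx\n", shared, (DWORD)(DWORD_PTR)shared);
  VirtualFree(shared, 0, MEM_RELEASE);
  return 0;
}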

Due to how the memory segmentation is set up, it is guaranteed that the bottom 32 bits of the 64bit address space map the same memory as the normal 32bit address space: a pointer that only has 32 bits will access the same memory address regardless of whether the executing code is 32 or 64bit. On 32bit hardware segmentation is part of x86 protected mode, but Windows uses a flat memory model in which the user mode segments start at 0x00000000 and end at 0xffffffff; in other words segmentation isn’t really used and all protection is done via memory page protection. In 64bit the flat memory model is enforced and cannot be disabled; segmentation cannot be used even if you want it. Segments start at 0x00000000`00000000 and end at 0xffffffff`ffffffff. This applies to all segments except FS and GS, which in 64bit are the only two segments that support any form of segmentation. These registers are used as thread local pointers and they are needed for the OS to function properly; making them flat segments would have been a drastic change for Windows and you would have lost a GPR, because something has to hold a pointer to the thread data.

Thread Information Block

The Thread Information Block, or TIB, plays an important role in how Windows executes 64bit code, so it’s worth understanding. Every thread needs a place to store working data; this includes internal OS data such as error codes and exception status as well as things like the current OpenGL context. Thread local user data is stored here too, and it’s set up either by the TlsXXX APIs within the OS or via the compiler.
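A minimal sketch of the TlsXXX route (standard Win32 calls, the stored value is arbitrary):

#include <windows.h>
#include <stdio.h>

int main(void)
{
  // reserve a TLS slot; each thread gets its own copy of the value and
  // the storage ultimately lives in that thread's TIB
  DWORD slot = TlsAlloc();
  if (slot == TLS_OUT_OF_INDEXES)
    return 1;
  TlsSetValue(slot, (LPVOID)0x1234); // arbitrary per-thread value
  printf("slot %lu holds %p\n", slot, TlsGetValue(slot));
  TlsFree(slot);
  return 0;
}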

All modern operating systems have some form of TIB; it’s usually stored in a general purpose register and documented in the ABI as not usable by application code. On x86 there aren’t enough general purpose registers to permanently remove one for the TIB pointer, so instead Windows uses a memory segment, which is a little more awkward to access but achieves the same result. In 32bit Windows (including 32bit apps running in 64bit) FS is the Thread Information Block and GS, the other spare segment register, isn’t used. In 64bit Windows running 64bit apps the GS segment is the Thread Information Block and FS isn’t used. This can cause some compatibility issues, as code that assumes the TIB always lives behind FS will break when it runs as 64bit code.

In 32bit Windows you can access the TIB with instructions such as ‘mov eax,fs:[0x0]‘. The segment override makes it tricky for the compiler to generate accesses to the TIB, so the TIB stores its own linear address (at offset 0x18) so it can be accessed by regular C code once the base address has been located. Use code such as this to get the linear address of the TIB:

void* tib;
_asm
{
  mov eax,fs:[0x18]
  mov tib,eax
}

For the thread in my test application the linear address is 0x7efdd000; accessing this address is equivalent to accessing fs:[0x0]. With the linear address you don’t need the FS segment override and can instead use normal C code to access the data:

void* win32_seh_head_ptr = ((void**)tib)[0];

This is equivalent to the following asm using the FS segment:

_asm
{
  mov eax,fs:[0x0]
  mov win32_seh_head_ptr,eax
}
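Putting the two together, a small test program (32bit MSVC build assumed, since it uses inline asm) can check that the value read through the segment override matches the one read through the linear address:

#include <windows.h>
#include <stdio.h>

int main(void)
{
  void* tib;
  void* seh_via_fs;
  _asm
  {
    mov eax,fs:[0x18]   ; the TIB stores its own linear address here
    mov tib,eax
    mov eax,fs:[0x0]    ; SEH chain head via the segment override
    mov seh_via_fs,eax
  }
  // the same field read through the linear address, no override needed
  void* seh_via_linear = ((void**)tib)[0];
  printf("TIB at %p\n", tib);
  printf("SEH head via fs %p, via linear address %p\n", seh_via_fs, seh_via_linear);
  return 0;
}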

In 64bit the TIB has moved to the GS segment register, but because you are not allowed to use inline asm you either have to use standalone 64bit assembly or the compiler intrinsics __readgsword(), __readgsdword() and __readgsqword(). To get the linear address of the 64bit TIB you have to use:

void* tib = (void*)__readgsqword(0x30);

The 64bit TIB is very similar to the 32bit version, other than the pointers are now 8 bytes.
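The NT_TIB structure declared in winnt.h overlays the start of the TIB, so a quick sketch of reading a few fields in a 64bit build looks like this (gs:[0x30] is the self pointer, the counterpart of fs:[0x18]):

#include <windows.h>
#include <stdio.h>
#include <intrin.h>

int main(void)
{
  // gs:[0x30] holds the TIB's own linear address
  NT_TIB* tib = (NT_TIB*)__readgsqword(0x30);
  printf("TIB         %p\n", (void*)tib);
  printf("Self        %p\n", (void*)tib->Self); // should match the TIB address
  printf("Stack base  %p\n", tib->StackBase);   // pointers are now 8 bytes
  printf("Stack limit %p\n", tib->StackLimit);
  return 0;
}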

 

Segment registers

Looking in the debugger at the segment registers of a 32bit application, they always contain the following:

CS = 0023 DS = 002B ES = 002B SS = 002B FS = 0053 GS = 002B

These registers don’t hold the segments themselves but instead hold an index into a table plus a few flags. These tables, called descriptor tables, are protected operating system tables and applications have little control over them. There are two tables we are concerned with, the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). The selector value breaks down as follows, decoded in the sketch after the list:

bits 0-1 contain the current protection level (00 = ring 0 aka kernel mode, 11 = ring 3 aka user mode)

bit 2 selects between the LDT and GDT (1 = LDT, 0 = GDT)

bits 3-15 contain the index into the table
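A quick decode of the selector values above using these rules (plain C, nothing Windows specific):

#include <stdio.h>

// break a selector value into the three fields described above
static void decode_selector(unsigned short sel)
{
  unsigned rpl   = sel & 0x3;        // bits 0-1: privilege level
  unsigned table = (sel >> 2) & 0x1; // bit 2: 1 = LDT, 0 = GDT
  unsigned index = sel >> 3;         // bits 3-15: index into the table
  printf("%04x -> %s index %u, ring %u\n", sel, table ? "LDT" : "GDT", index, rpl);
}

int main(void)
{
  decode_selector(0x23); // 32bit code segment: GDT index 4, ring 3
  decode_selector(0x2B); // data segments:      GDT index 5, ring 3
  decode_selector(0x53); // 32bit TIB (FS):     GDT index 10, ring 3
  return 0;
}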

NOTE: The segment registers appear to be 16 bits and all instructions that operate on segment registers treat them as 16 bits, but in reality they are much bigger. The hidden bits contain cached information that the processor read from the descriptor; it does this so it doesn’t have to continually read the descriptor tables. The hidden data is difficult to read; the only reliable way is to enter system management mode and then read from the stored context. This hidden data is not of any concern to us at the moment, but it’s worth knowing it exists and it explains why changing descriptors has to be done carefully.

In 32bit mode there are 3 descriptors used by an application: 0x23 is the code segment, 0x53 is the TIB segment and 0x2B is used for all the others. Immediately we can see that all the data selectors (DS, ES, SS and GS) access the same memory because they all use the same descriptor, GDT descriptor 5, and we can dump it with the kernel debugger:

                                    P  Si  Gr Pr Lo
Sel  Base     Limit      Type       l  ze  an es ng  Flags
---- -------- --------   ---------- -  --- -- -- --  --------
002B 00000000 ffffffff   Data RW Ac 3  Bg  Pg P  Nl  00000cf3

You can see from the descriptor that this segment starts at 0x00000000 and ends at 0xffffffff, which is the entire 32bit address space; it contains read/write data and is accessible from ring 3.

The code segment has to use a selector that is configured for code (using the data descriptor described above would generate a general protection fault) so another descriptor has to be used; CS uses GDT descriptor 4.

                                    P  Si  Gr Pr Lo
Sel  Base     Limit      Type       l  ze  an es ng  Flags
---- -------- --------   ---------- -  --- -- -- --  --------
0023 00000000 ffffffff   Code RE Ac 3  Bg  Pg P  Nl  00000cfb

You can see this segment is pretty much identical to the data segment; it spans the entire 32bit address space but it’s marked as code, readable and executable. With these two descriptors code and data can live anywhere in the 32bit address space.

Finally FS, the TIB segment, uses GDT descriptor 10.

                                    P  Si  Gr Pr Lo
Sel  Base     Limit      Type       l  ze  an es ng  Flags
---- -------- --------   ---------- -  --- -- -- --  --------
0053 7efdd000 00000fff   Data RW Ac 3  Bg  By P  Nl  000004f3

This is a 4K data descriptor that starts at 0x7efdd000, the same address you saw earlier as the linear address of the TIB. This shows that segments don’t do anything special; they are no more than an offset mechanism with some additional protections.
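In other words the address calculation is nothing more than base + offset with a limit check, something like this sketch:

#include <stdio.h>

// what the processor does with a byte granular data descriptor:
// add the base, fault if the offset is past the limit
static unsigned translate(unsigned base, unsigned limit, unsigned offset)
{
  if (offset > limit)
  {
    printf("GP fault: offset %08x is past limit %08x\n", offset, limit);
    return 0;
  }
  return base + offset;
}

int main(void)
{
  // the FS descriptor from the dump above: base 0x7efdd000, 4K limit
  printf("fs:[0x00] -> %08x\n", translate(0x7efdd000, 0xfff, 0x00));
  printf("fs:[0x18] -> %08x\n", translate(0x7efdd000, 0xfff, 0x18));
  return 0;
}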

When an application has multiple threads the FS selector is always 0x53; in the background, when Windows schedules threads, it changes the GDT so the TIB block is unique to that thread. Here is GDT descriptor 10 from the context of a different thread:

                                    P  Si  Gr Pr Lo
Sel  Base     Limit      Type       l  ze  an es ng  Flags
---- -------- --------   ---------- -  --- -- -- --  --------
0053 7efda000 00000fff   Data RW Ac 3  Bg  By P  Nl  000004f3

 

64bit segments

Looking at the segments in 64bit, things are very different.

CS = 0033 DS = 0000 ES = 0000 SS = 002B FS = 0000 GS = 0000

The first problem is the 64bit debugger in Visual Studio doesn’t even show the segment registers other than the code and stack segments. The zero selector, also known as the NULL selector, has special meaning in both 32bit and 64bit, but I wish Visual Studio would show the proper contents. The actual segment registers contain the same values as in the 32bit case, and as you’ll see later this is a huge benefit to us. Using a real debugger the segments are as follows:

CS = 0023 DS = 002B ES = 002B SS = 002B FS = 0053 GS = 002B

While in 64bit most of the descriptor is not used, because segmentation is defined by the hardware to be disabled. Therefore if you look at a data descriptor while in 64bit mode you’ll see it is the same as it was in 32bit. Here is descriptor 5 (0x2B) in 64bit:

                                                    P Si Gr Pr Lo
Sel  Base              Limit             Type       l ze an es ng Flags
---- ----------------- ----------------- ---------- - -- -- -- -- --------
002B 00000000`00000000 00000000`ffffffff Data RW Ac 3 Bg Pg P  Nl 00000cf3

Other than the 64bit addresses this descriptor is identical to the 0x2B selector in 32bit and it maps the bottom 32 bits of the address space; because the limit isn’t used in 64bit, a native 64bit application can still access the entire 64bit address space.

Even though most of the descriptor isn’t used in 64bit, the code segment still requires a descriptor that is designated as code. 64bit code descriptors have an extra bit, the L bit, previously unused, which indicates whether the segment is 64bit or 32bit:

A 64bit processor can run in true 32bit mode, such as when a 32bit operating system is installed, and in this mode it’s identical to any other 32bit x86 processor. When in 64bit mode the entire processor is in 64bit; to enable backwards compatibility a 64bit descriptor can be flagged as 32bit, which is called 32bit compatibility mode or 32e. This is what the L bit controls. The L bit pretty much toggles everything that changes across the two execution modes, including the most important things for us: the extended registers, instruction decoding, and how segments are handled.

Win64 defines 2 code segments: 0x23, which is identical to 32bit and has the L bit clear (32e), and 0x33, which has the L bit set (64bit). Given that both the 0x23 and 0x33 selectors are always available to user mode code, could it be that changing between 32bit and 64bit code execution, regardless of the type of application, is as simple as changing CS?

Another thing to consider is how Windows gets in to kernel mode. All recent processors and all 64bit compatible processors use the instruction ‘syscall‘ (or the closely related ‘sysenter‘). It works as documented if the processor is in 32bit mode, and it also works as documented in 64bit mode, but it’s an illegal instruction in 32e mode, so how do the 32bit versions of ntdll.dll and kernel32.dll get in to kernel mode?

As mentioned earlier, on Win64 the GS register holds the Thread Information Block, not FS. FS and GS are special in 64bit because unlike the other segments they still perform a base offset calculation from the descriptor base. However, if you look at the segment registers, GS is set the same as the other data descriptors (0x2B) and FS holds 0x53, which was the TIB descriptor in 32bit. What’s going on, and how does an access via GS end up in the TIB? The reality is that the hidden bits of the FS and GS segment registers map directly to Model Specific Registers, which allows the segment base to be anywhere in the 64bit address space. A new instruction, ‘swapgs‘, can be used to load the base address of GS. Given that this new instruction only exists in a GS variant (there is no equivalent for FS), it is the reason why the TIB has moved from the FS register in 32bit to the GS register. With this new instruction there is no need for the TIB descriptor within the GDT to be modified every time the OS switches a 64bit thread. The FS segment isn’t used by user mode code in 64bit applications.
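You can see the MSR-backed GS base from user mode: in a 64bit build NtCurrentTeb(), the winnt.h helper, compiles down to a GS relative read, so it should agree with reading the self pointer directly. A small sanity check sketch:

#include <windows.h>
#include <stdio.h>
#include <intrin.h>

int main(void)
{
  // both of these resolve through the GS base the OS loaded from the
  // MSR for this thread
  void* via_api = NtCurrentTeb();
  void* via_gs  = (void*)__readgsqword(0x30); // TIB self pointer
  printf("NtCurrentTeb() %p, gs:[0x30] %p\n", via_api, via_gs);
  return 0;
}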

 

Loading a 32bit process

A quick note on directories… The 32bit version of Windows puts the OS in windows\system32; the 64bit version also puts the OS components in windows\system32, but in this case these are exclusively 64bit components. This is said to be for compatibility reasons when recompiling 32bit apps to 64bit; I think it’s ridiculous and in the long run will cause more problems than it fixes. Win64 has a folder called windows\syswow64 (again badly named) and the 32bit system DLLs live here. When you load a system DLL from a 32bit application it will be fetched from the syswow64 folder.

Let’s load a simple 32bit executable in the debugger and figure out what happens. I am using the 64bit version of WinDbg in kernel mode with USB 2.0 debug support, as it’s by far the best debugger. Visual Studio will only debug the platform type the application was built for, so if you build a 32bit application there is no way to even disassemble 64bit instructions.

ModLoad: 00000000`778c0000 00000000`77a69000 ntdll.dll
ModLoad: 00000000`77aa0000 00000000`77c20000 ntdll32.dll
ModLoad: 00000000`755a0000 00000000`755df000 C:\Windows\SYSTEM32\wow64.dll
ModLoad: 00000000`75540000 00000000`7559c000 C:\Windows\SYSTEM32\wow64win.dll
ModLoad: 00000000`75530000 00000000`75538000 C:\Windows\SYSTEM32\wow64cpu.dll
ModLoad: 00000000`763c0000 00000000`764d0000 C:\Windows\syswow64\kernel32.dll
ModLoad: 00000000`76b30000 00000000`76b76000 C:\Windows\syswow64\KERNELBASE.dll

ntdll.dll, wow64.dll, wow64win.dll and wow64cpu.dll are all 64bit DLLs; looking at the kernel info for ntdll.dll we get the following, which I have abbreviated:

Loaded Module Info: [ntdll]
Module: ntdll
Base Address: 00000000778c0000
Image Name: ntdll.dll
Machine Type: 34404 (X64)
Time Stamp: 4ec4aa8e Wed Nov 16 23:32:46 2011
Size: 1a9000
CheckSum: 1ac7ee
Characteristics: 2022 perf
Debug Data Dirs: Type Size VA Pointer
CODEVIEW 22, 101258, 100658 RSDS – GUID: {15EB43E2-3B12-409C-84E3-CC7635BAF5A3}
Age: 2, Pdb: ntdll.pdb
CLSID 4, 101254, 100654 [Data not mapped]
Image Type: FILE – Image read successfully from debugger.
C:\Windows\SYSTEM32\ntdll.dll

That’s one myth down: it is possible to load a 64bit DLL in to a 32bit process; every single 32bit application does it. Let’s look at what happens when you execute a simple system call such as Sleep(100):

Kernel32.dll!_SleepStub@4
KernelBase.dll!_Sleep@4
KernelBase.dll!_SleepEx@8
ntdll.dll32!_ZwDelayExecution@8

As you can see from the call stack above, which is all 32bit code, a call to Sleep ultimately ends at _ZwDelayExecution(), which is the native API for anything that needs to delay the execution of a thread. The 32bit implementation of this function is where things get good:

_ZwDelayExecution@8:
77ABFD5C mov eax,31h
77ABFD61 mov ecx,6
77ABFD66 lea edx,[esp+4]
77ABFD6A call dword ptr fs:[C0h]
77ABFD71 add esp,4
77ABFD74 ret 8

This function is a stub that calls through some sort of dispatch table, and all the kernel entry points do something identical; here is the code for _NtOpenFile:

_NtOpenFile@24:
77ABFD44 mov eax,30h
77ABFD49 xor ecx,ecx
77ABFD4B lea edx,[esp+4]
77ABFD4F call dword ptr fs:[C0h]
77ABFD56 add esp,4
77ABFD59 ret 18h

The call to fs:[0xc0] is interesting. This is 32bit code, so FS is the Thread Information Block for the calling thread; offset 0xc0 is documented as reserved for WOW32 (16bit Windows apps running in 32bit Windows) but it is used for wow64 in a pretty much identical way. The data at fs:[0xc0] is 0x75532320, and the code at this address is a far jump:

wow64cpu!X86SwitchTo64BitMode:
75532320 jmp 0033:7553271E

Note the symbol name! This jumps to a 32bit address using selector 0x33 (the same selector used by default in 64bit apps); if you use Visual Studio it won’t go any further than this jump. If you look at the instructions in memory at the jump destination (and you can do this with Visual Studio) you will see this:

7553271E inc esp
75532720 mov eax,dword ptr [esp]
75532723 inc ebp
75532724 mov dword ptr [ebp+000000BCh],eax
7553272A inc ecx
7553272B mov dword ptr [ebp+000000C8h],esp
75532731 dec ecx
75532732 mov esp,dword ptr [esp+00001480h]
75532739 dec ecx
7553273A and dword ptr [esp+00001480h],0

This looks like code, but the inc/dec before every instruction is a giveaway that you are looking at 64bit code that has been decoded as 32bit: the inc/dec instructions are the REX prefix bytes on the 64bit instructions being incorrectly decoded. Consider the 3 byte opcode 0x48 0x8b 0xc8. In 64bit this is ‘mov rcx,rax‘ but in 32bit those same bytes are ‘dec eax‘ followed by ‘mov ecx,eax‘. If you disassemble the code at the target address as 64bit you see what is really happening.

wow64cpu!CpupReturnFromSimulatedCode:
00000000`7553271e 67448b0424 mov r8d,dword ptr [esp] ds:00000000`0018fdcc=77abfd71
00000000`75532723 458985bc000000 mov dword ptr [r13+0BCh],r8d
00000000`7553272a 4189a5c8000000 mov dword ptr [r13+0C8h],esp
00000000`75532731 498ba42480140000 mov rsp,qword ptr [r12+1480h]
00000000`75532739 4983a4248014000000 and qword ptr [r12+1480h],0
00000000`75532742 448bda mov r11d,edx

This code is in wow64cpu.dll and is a thunking layer for the system call; it ultimately calls wow64cpu!CpupSyscallStub, which executes syscall to enter kernel mode. We are now in a 64bit code segment so it is legal to use this instruction.

wow64cpu!CpupSyscallStub:
00000000`75532e00 4189adb8000000 mov dword ptr [r13+0B8h],ebp ds:00000000`0008fdd8=0018feac
00000000`75532e07 0f05 syscall
00000000`75532e09 c3 ret

The code that returns to 32bit mode from 64bit is below; the last instruction is a far jump:

00000000`75532cfa mov r9d,dword ptr [r13+0BCh] ds:00000000`0008fddc=77abfd71
00000000`75532d01 mov dword ptr [r14],r9d
00000000`75532d04 jmp fword ptr [r14]
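You can poke at this machinery from your own code. Here is a sketch that reads the dispatch pointer at fs:[0xc0] and dumps the far jump bytes. A 32bit MSVC build running under wow64 is assumed; on native 32bit Windows the slot won’t hold this jump:

#include <windows.h>
#include <stdio.h>

int main(void)
{
  BYTE* stub;
  _asm
  {
    mov eax,fs:[0xc0]  ; the wow64 dispatch pointer the system stubs call through
    mov stub,eax
  }
  // jmp ptr16:32 is 7 bytes: 0xEA, a 4 byte offset, a 2 byte selector
  printf("opcode %02x target %08lx selector %04x\n",
         stub[0], *(DWORD*)(stub + 1), *(WORD*)(stub + 5));
  return 0;
}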

Now we know how the 64bit segments work and we know everything is set up to support this: all we need to do in order to execute 64bit code from a 32bit app is a far jump/call with 0x33 as the selector! Don’t get any ideas about just changing a thread context with something like the following:

    SuspendThread(thread);
    CONTEXT context;
    context.ContextFlags = (CONTEXT_FULL);
    GetThreadContext(thread,&context);
    context.SegCs = 0x33;
    SetThreadContext(thread,&context);
    ResumeThread(thread);

This doesn’t work, and remember you can’t load the CS selector directly; ‘mov cs,ax‘ doesn’t exist as an instruction. The only reliable user mode method to load CS is via flow control instructions: jumps, calls, returns, etc. Executing 32bit code from a 64bit application is just as easy: perform a far jump or call with 0x23 as the selector and you are good to go. Obviously anything that is 32bit has to be in the bottom 4GB, and when you enter 32bit mode the selectors matter once again; fortunately they are already set up as they would be in a 32bit app, which is how we want them, the TIB is always in the lower 4GB and, most important of all, the stack is always in the lower 4GB. Do a far jump/call to an address in the lower 32 bits and the machine will stay perfectly stable.

The following C code demonstrates how to execute 64bit code in your own 32bit application:

#include <stdio.h>
#include <string.h>
#include <windows.h>
BYTE far_jump[8];
BYTE x64code[20];
int main(int argc, char *argv[])
{
  DWORD old_protect;
  // make a far call instruction that calls through segment 0x33,
  // this kicks the processor in to 64bit mode.
  far_jump[0] = 0x9a; //0xea for a jmp, 0x9a for a call
  far_jump[5] = 0x33; //selector 0x0033, the 64bit code segment
  far_jump[6] = 0x00;
  *(DWORD*)(far_jump+1) = (DWORD)x64code; //32bit target address
  far_jump[7] = 0xc3; //ret, executed after the far call returns
  // in a 32bit app we cannot easily compile 64bit instructions as none of
  // the compiler/build tools support it, so we hardcode some 64bit opcodes..
  x64code[0] = 0x48;
  x64code[1] = 0x8B; //mov rcx,rax
  x64code[2] = 0xC8;
  x64code[3] = 0xcb; //retf to pull the return segment (32bit) from the stack as well as the address
  // the buffers live in the data section, which is not executable when DEP
  // is enabled, so mark them executable before jumping in to them
  VirtualProtect(far_jump, sizeof(far_jump), PAGE_EXECUTE_READWRITE, &old_protect);
  VirtualProtect(x64code, sizeof(x64code), PAGE_EXECUTE_READWRITE, &old_protect);
  void (*func)() = (void (*)(void))(void*)far_jump;
  func();
  return 0;
}

The biggest problem in supporting this on a larger scale is compiling the code. All the build tools either generate 32bit code or 64bit code; none of them will generate mixed code. Likewise the linker won’t cross link modules, and the operating system loaders, at least at the highest level, won’t load DLLs that don’t match the process type. Windows gets around this by effectively using a global pointer to a dispatch table (it controls what is in the TIB) but I think this is a horrible hack. Wow64cpu.dll is a 64bit DLL, and in the thread attach section of its DllMain(), which is called for every thread a process creates, it writes fs:[0xc0] with the address of the far jump; this jump is the only 32bit code in the entire DLL and it’s hard coded similar to the code above. The remainder of the DLL is normal 64bit code which builds as a 64bit DLL. The 32bit components of the system, which are built as 32bit DLLs, use the hard coded fs:[0xc0] as the 64bit handler as needed. There is never any need for the 32bit and 64bit sections to communicate at the source level. A user application can do the same thing with an unused entry in the TIB, of which there are plenty.
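For example (a sketch of the idea rather than a complete scheme), the documented NT_TIB.ArbitraryUserPointer field, fs:[0x14] in a 32bit process, is a per-thread spare slot where an application could park the address of its own 64bit dispatch stub, the same way wow64cpu.dll uses fs:[0xc0]:

#include <windows.h>
#include <stdio.h>

int main(void)
{
  // the TEB begins with an NT_TIB, so the cast is safe
  NT_TIB* tib = (NT_TIB*)NtCurrentTeb();
  // stand-in value; a real scheme would store the address of a hand
  // built far jump stub here (note some debuggers also use this slot)
  tib->ArbitraryUserPointer = (void*)0x12345678;
  printf("ArbitraryUserPointer = %p\n", tib->ArbitraryUserPointer);
  return 0;
}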

For now you can use a homemade code generator to get the code in to memory without relying on the system. The big unanswered question is how you force the system to load a 64bit DLL in to a 32bit application. It’s only possible with the native API and there are a lot of gotchas, but I’ll cover this in another post.

This becomes non-trivial if the loaded components use other system components; for example, how would you load the 32bit version of kernel32.dll in an application that has already loaded the 64bit version? From how close all of this is to just working, it seems like Microsoft was originally going to support mixed mode applications in some way but gave up when working through the higher level issues.

Finally, the instruction at fs:[0xc0], which contains the far jump to the 64bit dispatch code, allows all 32bit kernel APIs to be easily hooked and modified. Win64 is notoriously hard to patch, even from kernel mode, and it will blue screen when it detects any of its system structures and tables have been modified. fs:[0xc0] is not part of the checked state and the far jump can be modified with WriteProcessMemory. More on this later…

3 thoughts on “64 and 32bit code in the same process”

  1. Myria

    > From how close all of this is to just working it seems like Microsoft originally was going to support mixed mode application in some way but they gave up when working through the higher level issues.

    This is actually untrue. WoW64 originated as the program the Itanium version of Windows used to run x86-32 applications. If you look at the design and symbol names of wow64cpu.dll, you’d see that wow64cpu.dll is designed as an x86 CPU emulator. When running an x86-32 program on Itanium Windows, wow64.dll would call the wow64cpu.dll export CpuSimulate to start executing emulated x86 instructions.

    When Microsoft made Windows for AMD64, they kept the design of WOW64 the same. Of course, instead of wow64cpu.dll emulating the x86-32 instruction set, it just jumps to it directly.

    The only reason SetThreadContext can’t set SegCs to the 64-bit CS is because SetThreadContext is simulated by wow64.dll. If you call the real 64-bit NtSetContextThread on a 32-bit thread, you can set the SegCs value in the 64-bit CONTEXT structure to switch it to 64-bit code. This is what happens when, for example, you use the 64-bit WinDbg to trace a 32-bit program’s system call through WoW64 and ntdll.dll. WinDbg is setting the segment registers as a 64-bit program.

  2. khalfan

    Thank you for this nice article. Incidentally, for an x32 program running on Win x64, although the segment register values are as shown above, it is only the FS segment that the GetThreadSelectorEntry function will translate, from
    Fs: 00000053 into VA Fs: 7EFDA000
    and the VAs of the rest are all zero.

    I also reckon the NtQueryInformationThread function with (THREADINFOCLASS) 0x6 (ThreadDescriptorTableEntry) is also broken.

    This article has helped me clear up a lot of queries which MS documentation addresses as if someone is intentionally trying to conceal some critical information.

