Table of contents

Int movs

FP Transcendental Operation Analysis (msrom-612, 0x1FFE-0x21E1)

Overview
Cross-Domain Transfer UOPs
FP Field Extract UOPs
FP Computation UOPs
Integer Operations on FP Exponents
U2 Flag Bit Analysis
Execution Flow Pattern
Special Cases Handled
Key Insight

SYSENTER/SYSEXIT

SYSENTER works. SYSEXIT does not.
A different calling convention
The null selector crash
EFLAGS are not restored
The STI timing problem
Why Intel did not fix it
The correct CPUID check

Int movs

Macro | Operation | x86 Opcode | Notes
macro_252D | NOT [mem+8] | F7 /2 | Bitwise complement
macro_2535 | NEG [mem+8] | F7 /3 | Two's complement negate
macro_2509 | MOV [mem+8], reg | 89 | Register to memory
macro_251C | MOV [mem+8], imm | C7 /0 | Immediate to memory
macro_2512 | CMOVcc synthesis or SETcc | 0F 40-4F or 0F 90-9F | Conditional with register
macro_2524 | Conditional with imm | Synthetic | No direct x86 equivalent
macro_12A5 | MOV [mem+8], 1 (?) | C7 /0 or test/bit op | Unclear, needs more context

FP Transcendental Operation Analysis (msrom-612, 0x1FFE-0x21E1)

This code block implements a floating-point transcendental function (likely FPATAN or similar) using polynomial approximation. The code demonstrates critical patterns for transferring data between TMP registers (computational domain) and ST registers (architectural FP stack).

Cross-Domain Transfer UOPs

UOP.020(source_constant, value_register, U2_flags)

Transfer value to FP stack register
Examples from code:

0x2001: UOP.020(CONST_00+024, ST0, U2.80) - prepare ST0
0x2002: ST7 = UOP.020(CONST_00+03C, ST7, U2.80) - update ST7
0x20C6: ST7 = UOP.020(CONST_00+014, TMP1, U2.80) - write TMP to ST7
0x20CD: TMP0 = UOP.020(CONST_0, ST0) - read ST0 to TMP
0x20D1: ST0 = UOP.020(CONST_00+032, TMP0, U2.80) - write TMP to ST0

CONST values likely specify conversion mode or precision control
Can operate bidirectionally: TMP to ST or ST to TMP
U2.80 flag present when writing to architectural ST registers

UOP.220(constant, ST_register, U2_flags)

Prepare FP stack register for operation
Examples from code:

0x2004: UOP.220(CONST_0, ST0) - prepare ST0
0x20D8: UOP.220(CONST_00+024, ST0, U2.80) - prepare with cross-domain flag

Appears before reading ST into TMP registers
May mark ST register as readable or lock it for operation

UOP.7EE(operand1, operand2, operation_code, U2_flags)

Complex FP operation producing ST result
Examples from code:

0x207A: ST0 = UOP.7EE(TMP5, TMP6, CONST_00+032, U2.C9) - with EOM_Fl3
0x2086: ST0 = UOP.7EE(TMP6, TMP5, CONST_00+032, U2.C9) - with EOM_Fl3

Takes two TMP operands, produces ST result
CONST_00+032 appears to be operation/mode selector
Always appears with U2.C9 flag at subroutine exit
Always marked with EOM_Fl3 (subroutine return)

UFPOP_7X8(operand1, operand2, U2_flags)

FP operation with stack pop
Examples from code:

0x2139: ST0 = UFPOP_7X8(TMP0, TMP5, U2.49) - with EOM_Fl3
0x215C: ST0 = UFPOP_7X8(TMP7, TMP5, U2.49) - with EOM_Fl3
0x219E: ST0 = UFPOP_7X8(TMP3, TMP2, U2.49) - with EOM_Fl3
0x21AE: ST0 = UFPOP_7X8(TMP4, TMP2, U2.49) - with EOM_Fl3

Takes TMP register inputs
Produces ST0 result AND pops FP stack
Always uses U2.49 flag
Always marked with EOM_Fl3 (subroutine return)

UOP.262(operand1, operand2)

Simple TMP to ST transfer or merge
Examples from code:

0x21B8: ST0 = UOP.262(TMPA, ST0) - with EOM_Fl3
0x21C2: ST0 = UOP.262(TMPA, ST0) - with EOM_Fl3

Used in special case handling (denormals, infinities)
Appears to merge or conditionally update ST0

FP Field Extract UOPs

UOP.029(ST_register, ST_register)

Extract field from FP register
Examples from code:

0x200A: TMPD = UOP.029(ST0, ST0)
0x20DE: TMPD = UOP.029(ST0, ST0)

Result immediately masked with AND (0x004 in examples)
Likely extracts FP classification bits (NaN, Inf, denormal flags)

UOP.060(FP_value, CONST_0)

Extract FP field to integer
Examples from code:

0x2019: TMPE = UOP.060(TMP0, CONST_0)
0x2020: TMPD = UOP.060(TMPA, CONST_0)
0x20ED: TMPE = UOP.060(TMP0, CONST_0)

Extracts exponent or other FP fields for range reduction

UOP.061(FP_value, CONST_0)

Extract FP field (variant of UOP.060)
Examples from code:

0x2009: TMPC = UOP.061(ST0, CONST_0)
0x2036: TMPC = UOP.061(TMP0, CONST_0)
0x20DD: TMPC = UOP.061(TMP0, CONST_0)

Result used for range comparisons
Possibly extracts biased exponent

UOP.063(FP_value, CONST_0)

Extract FP field (another variant)
Example from code:

0x2035: TMPE = UOP.063(TMP3, CONST_0)

Used after FXORS operation

UOP.064(FP_value, CONST_0)

Extract FP field (exponent?)
Examples from code:

0x2006: TMPC = UOP.064(ST0, CONST_0)
0x202E: TMPD = UOP.064(TMP0, CONST_0)
0x20DA: TMPC = UOP.064(ST0, CONST_0)

Result often shifted left by 3 or 4 bits
Likely extracts exponent for classification
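
For orientation, the fields these extract UOPs could be pulling out are those of the x87 80-bit extended format: one sign bit, a 15-bit biased exponent (bias 16383), and a 64-bit significand with an explicit integer bit. The sketch below splits a raw 80-bit value into those fields. The function name is hypothetical, and nothing here establishes which UOP returns which field; it only shows what "extract exponent" would mean arithmetically.

```python
EXP_BIAS = 16383  # exponent bias of the x87 extended-precision format

def split_extended(raw80: int) -> dict:
    """Split a raw 80-bit x87 extended value into its architectural fields."""
    sign = (raw80 >> 79) & 0x1
    biased_exp = (raw80 >> 64) & 0x7FFF
    significand = raw80 & 0xFFFF_FFFF_FFFF_FFFF
    return {
        "sign": sign,
        "biased_exp": biased_exp,
        "unbiased_exp": biased_exp - EXP_BIAS,
        "significand": significand,
        # All-ones exponent encodes Inf/NaN; zero exponent encodes zero/denormal.
        "is_inf_or_nan": biased_exp == 0x7FFF,
        "is_zero_or_denormal": biased_exp == 0,
    }

# 1.0 in extended precision: biased exponent 0x3FFF, integer bit set.
one = (0x3FFF << 64) | (1 << 63)
fields = split_extended(one)
print(hex(fields["biased_exp"]), fields["unbiased_exp"])
```

A range check such as "exponent < CONSTROM.03D" would then be an ordinary integer comparison on the biased_exp field.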

FP Computation UOPs

UOP.0A1(CONST_0, FP_value)

FP operation on value
Example from code:

0x2034: TMP4 = UOP.0A1(CONST_0, TMP3)

Purpose unclear - possibly absolute value or normalize

UOP.223(operand1, operand2)

FP arithmetic operation
Example from code:

0x201C: TMPA = UOP.223(TMP2, TMP0)

Used in range reduction sequence

UOP.227(CONST_0, operand)

FP operation
Example from code:

0x201A: TMP0 = UOP.227(CONST_0, TMP0)

Part of argument reduction

UOP.228(operand1, operand2)

FP operation
Example from code:

0x2018: TMP0 = UOP.228(TMP0, TMP1)

Used before range reduction

UOP.267(operand1, operand2)

FP operation
Example from code:

0x2024: TMP0 = UOP.267(TMP2, TMP1)

Part of computation sequence

Integer Operations on FP Exponents

UOP.124(operand1, operand2)

Integer operation on FP exponent/sign
Examples from code:

0x2090: TMPB = UOP.124(TMPB, TMPD)
0x20BE: TMPB = UOP.124(CONST_0, TMPD)
0x2134: TMPB = UOP.124(TMPB, TMPE)

Result used with FXORS to apply sign changes
Likely constructs sign/exponent bits for result
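
A minimal sketch of how an XOR can apply a sign change, assuming FXORS is a bitwise XOR on FP bit patterns (the name suggests this, but the source does not confirm it). An IEEE 754 double stands in for the x87 internal format, and xor_sign is a hypothetical helper name.

```python
import struct

SIGN_MASK = 1 << 63  # sign bit of an IEEE 754 double (stand-in for the x87 sign bit)

def xor_sign(value: float, sign_word: int) -> float:
    """XOR the sign bit of sign_word into value; flips the sign only if that bit is set."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits ^= sign_word & SIGN_MASK
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

print(xor_sign(1.5, SIGN_MASK))  # sign flipped
print(xor_sign(1.5, 0))          # unchanged
```

Building the sign word with integer operations first and applying it with one XOR at the end lets the polynomial run on a magnitude and have its sign restored afterwards, which would fit the role the text assigns to TMPB.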

U2 Flag Bit Analysis

Based on observed patterns in code:

U2.08 - Read architectural state flag

0x2005: TMP0 = FXORS(ST0, ST0, U2.08)
0x20D9: TMP0 = FXORS(ST0, ST0, U2.08)
Allows reading ST registers in computational domain

U2.20 - Write preparation or intermediate result flag

0x2072: TMP9 = ADD.DSZ32(EIP_30, REG.31, U2.20)
0x2191: TMP9 = ADD.DSZ32(EIP_30, REG.31, U2.20)
Marks operations that prepare for architectural commit

U2.49 - FP stack pop with result commit

Always used with UFPOP_7X8
Combines result commit with stack management
Bit pattern: 0100 1001

U2.4A - Exception or special completion flag

0x20D5: UOP.120(CONST_16+004, CONST_16+004, U2.4A) - with EOM_Fl3
0x21C6: UOP.120(CONST_16+004, CONST_16+004, U2.4A) - with EOM_Fl3
Used at error/exception exits

U2.4B - Normal setup/initialization flag

0x2000: UOP.120(CONST_0, CONST_0, U2.4B)
0x20D6: UOP.120(CONST_0, CONST_0, U2.4B)
Appears at start of computation sequences

U2.50 - Precision or mode control flag

0x201D: UOP.120(CONST_16+010, CONST_16+010, U2.50)
0x20CC: UOP.120(CONST_16+010, CONST_16+010, U2.50)
Used before final result computation

U2.80 - Cross-domain write enable (architectural commit)

Used extensively with UOP.020 when writing to ST registers
This is the key "make visible" flag
Allows computational results to affect architectural state
Bit pattern: 1000 0000

U2.C9 - Complex cross-domain operation with commit

Always used with UOP.7EE at subroutine returns
Bit pattern: 1100 1001 (includes U2.80 bit)
Indicates full architectural state update
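
The bit patterns above can be checked mechanically. This sketch lists the individual bits set in each observed U2 value; the flag list is taken from this section, and set_bits is a hypothetical helper name.

```python
FLAGS = [0x08, 0x20, 0x49, 0x4A, 0x4B, 0x50, 0x80, 0xC9]

def set_bits(value: int) -> list[int]:
    """Return the single-bit masks set in an 8-bit flag value, highest bit first."""
    return [1 << b for b in range(7, -1, -1) if value & (1 << b)]

for flag in FLAGS:
    masks = ", ".join(f"0x{m:02X}" for m in set_bits(flag))
    print(f"U2.{flag:02X} = {flag:08b} -> {masks}")
```

Two things fall out of the decomposition: U2.C9 is exactly U2.80 combined with U2.49, and U2.49 on its own does not contain the 0x80 bit.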

Execution Flow Pattern

Step 1: Read ST registers into TMP domain

UOP.220 prepares ST register
FXORS (with the U2.08 flag), UOP.064, and UOP.029 extract fields
Exponent and special case checks

Step 2: Range reduction and argument preparation

UOP.061 extracts exponent
Compare against ROM constants (CONSTROM.03D, CONSTROM.047, CONSTROM.03E)
Branch to special case handlers if needed

Step 3: Polynomial approximation in TMP domain

Multiple FREADROM to load coefficients
UOP.6E9 (FP multiply) and UOP.768 (FP add) for Horner's method
All computation stays in TMP registers (invisible to architecture)

Step 4: Result finalization

UOP.120 operations with various U2 flags for mode setup
FXORS with TMPB to apply final sign

Step 5: Commit to architectural state

UOP.0D8 updates next instruction pointer
UOP.0D4 synchronizes pipeline
UOP.7EE or UFPOP_7X8 writes result to ST0 with U2.C9 or U2.49
EOM_Fl3 marks subroutine return
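
Step 3's Horner evaluation can be sketched as follows. Each loop iteration is one multiply and one add, the roles the text assigns to UOP.6E9 and UOP.768. The coefficients are textbook Taylor terms for arctangent, chosen only because FPATAN is the suspected instruction; they are not the values loaded by FREADROM, and both function names are hypothetical.

```python
import math

def horner(coeffs, x):
    """Evaluate a polynomial whose coefficients are given highest degree first."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c  # one FP multiply and one FP add per coefficient
    return acc

def atan_small(x):
    """Illustrative only: atan(x) ~ x * (1 - t/3 + t^2/5) with t = x^2, valid near 0."""
    t = x * x
    return x * horner([1 / 5, -1 / 3, 1.0], t)

print(atan_small(0.1), math.atan(0.1))
```

Horner's form needs one multiply-add pair per coefficient and no explicit powers, which is why a microcode routine with only TMP registers to work in would favor it.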

Special Cases Handled

Denormals: Check UOP.029 result & 0x004 at 0x200D, 0x20E1
Underflow range: Compare exponent < CONSTROM.03D at 0x200E, 0x20E2
Overflow range: Compare exponent >= CONSTROM.047 at 0x2012, 0x20E6
Large arguments: Compare exponent >= CONSTROM.03E at 0x2038, 0x2110
Each case has dedicated exit path with appropriate result handling
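
The order of these checks can be sketched as a dispatch function. The threshold values below are hypothetical placeholders, since the actual contents of CONSTROM.03D, CONSTROM.047 and CONSTROM.03E are not given in the analysis; only the order of the tests follows the addresses listed above.

```python
# Hypothetical placeholder thresholds; the real CONSTROM values are unknown here.
EXP_UNDERFLOW = 0x3FC0  # stands in for CONSTROM.03D
EXP_LARGE_ARG = 0x403E  # stands in for CONSTROM.03E
EXP_OVERFLOW = 0x4040   # stands in for CONSTROM.047
DENORMAL_BIT = 0x004    # mask applied to the UOP.029 result

def classify(biased_exp: int, class_bits: int) -> str:
    """Pick an exit path in the order the microcode appears to test the cases."""
    if class_bits & DENORMAL_BIT:        # 0x200D / 0x20E1
        return "denormal"
    if biased_exp < EXP_UNDERFLOW:       # 0x200E / 0x20E2
        return "underflow"
    if biased_exp >= EXP_OVERFLOW:       # 0x2012 / 0x20E6
        return "overflow"
    if biased_exp >= EXP_LARGE_ARG:      # 0x2038 / 0x2110
        return "large_argument"
    return "main_path"

print(classify(0x3FFF, 0))
```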

Key Insight

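The U2.80 bit appears to be the "architectural visibility" flag. Operations without this bit appear to execute in a shadow computational domain where:

TMP registers can be freely modified
FP operations compute intermediate results
No architectural state is changed
Exceptions are presumably not raised (computation is speculative)

In the code analyzed here, operations with U2.80 (or composite flags like U2.C9 containing it) are the ones that:

Write to architectural ST registers
Update FP status flags
Trigger FP exceptions
Make results visible to subsequent instructions

The exit-path writes are exceptions to this rule: UFPOP_7X8 carries U2.49 (0100 1001, no 0x80 bit) and UOP.262 carries no U2 flag at all, yet both write ST0 under EOM_Fl3. The 0x80 bit may therefore be required only for writes made before the subroutine exit.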

SYSENTER/SYSEXIT

The Pentium Pro SYSENTER/SYSEXIT Bug: A Microcode Analysis

The Pentium Pro implemented SYSENTER and SYSEXIT instructions that Intel quietly left undocumented at launch. When Linux 2.6 later enabled these instructions based on the documented Pentium II behavior, Pentium Pro systems crashed. The reason has now been confirmed through direct analysis of the processor's microcode.

SYSENTER works. SYSEXIT does not.

SYSENTER on the Pentium Pro behaves correctly and is functionally equivalent to the Pentium II version. It reads the kernel entry point and stack from the SYSENTER MSRs, switches to ring 0, and clears the interrupt flag before transferring control. A kernel using SYSENTER for the call half and IRET for the return would have worked fine on Pentium Pro all along, as was suspected by Linux developers at the time.

SYSEXIT, however, is a completely different implementation from what Intel later documented for the Pentium II.

A different calling convention

The most fundamental problem is that SYSEXIT on the Pentium Pro reads its inputs from different registers than the Pentium II. The documented Pentium II SYSEXIT takes the return address from EDX and the user stack pointer from ECX, and derives the user-mode code and stack segment selectors automatically from the SYSENTER_CS MSR (adding fixed offsets to produce the ring-3 CS and SS). This is the design that operating systems implemented against.
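
The documented derivation is simple arithmetic on the MSR value. As a sketch, following the operation described in Intel's instruction reference for 32-bit SYSEXIT (the function name is hypothetical, and 0x60 is just an example MSR value):

```python
def sysexit_selectors(sysenter_cs: int) -> tuple[int, int]:
    """Return (CS, SS) as the documented 32-bit SYSEXIT derives them from SYSENTER_CS."""
    cs = ((sysenter_cs + 16) | 3) & 0xFFFF  # user code segment, RPL forced to 3
    ss = ((sysenter_cs + 24) | 3) & 0xFFFF  # user stack segment, RPL forced to 3
    return cs, ss

cs, ss = sysexit_selectors(0x60)
print(hex(cs), hex(ss))
```

Because both selectors are computed from one MSR that the kernel sets once at boot, no general-purpose register contents can influence them.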

The Pentium Pro SYSEXIT works differently. It still takes the stack pointer from ECX, but it reads the new instruction pointer from ESI. More critically, it reads the user-mode code segment selector directly from the DI register and the stack segment selector from BX, rather than computing them from the MSR. This was apparently intended to give the operating system explicit control over the user-mode segment descriptors, enabling non-flat memory models. In practice it was fatal.

The null selector crash

In normal kernel code, DI and BX frequently contain zero or arbitrary values left over from system call argument handling. When DI is zero, SYSEXIT loads the null descriptor into CS. The null descriptor (GDT entry 0) is architecturally reserved and must never be loaded into a code segment register. The Pentium Pro microcode checks that the SYSENTER_CS MSR is nonzero, but performs no validity check on the value in DI.
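
For illustration, the missing check is a one-line test. A selector is null when its index and table-indicator bits are all zero, which covers the values 0 through 3. The function names are hypothetical, and the guard shows what a validity check on DI might look like, not anything present in the microcode.

```python
def is_null_selector(selector: int) -> bool:
    """True when the index and TI bits are zero, ignoring the two RPL bits."""
    return (selector & 0xFFFC) == 0

def checked_user_cs(di: int) -> int:
    """Hypothetical guard: refuse a null selector before it reaches CS."""
    if is_null_selector(di):
        raise ValueError("#GP(0): null selector cannot be loaded into CS")
    return di & 0xFFFF

print(is_null_selector(0x0000), is_null_selector(0x0073))
```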

The result: SYSEXIT completes without error, CPL is set to 3, and the CPU begins executing in user mode with a null CS. The very first instruction fetch causes a General Protection Fault. The fault handler tries to report the error and return via IRET, but the error frame on the stack contains the null CS selector that caused the fault in the first place. IRET restores that null CS, causing an immediate second General Protection Fault. Two consecutive General Protection Faults produce a Double Fault, which is what Linux users observed.

This is also the exact behavior described by Intel's own erratum for the Pentium Pro: "SYSENTER/SYSEXIT instructions can implicitly load null segment selector to SS and CS registers." Intel published the erratum, apparently without fully acknowledging that SYSEXIT was the culprit and that DI was the vector.

EFLAGS are not restored

Neither version of SYSEXIT restores EFLAGS. SYSENTER clears the interrupt flag on entry, and Intel's documentation for SYSEXIT lists no flags as affected, so the kernel is responsible for re-enabling interrupts itself. The Pentium Pro SYSEXIT likewise changes nothing. A kernel that returned through it without first re-enabling interrupts would reach user mode with interrupts still disabled, and the system would gradually freeze as timer and device interrupts went unserviced.

The STI timing problem

The Linux kernel, like many operating systems, uses a STI instruction immediately before SYSEXIT to re-enable interrupts before returning to user mode. The x86 architecture guarantees that an interrupt enabled by STI will not be taken until after the following instruction completes. On Pentium II, SYSEXIT honors this guarantee cleanly because the microcode contains an explicit pipeline checkpoint partway through the instruction, after all segment registers and the privilege level have been committed to a consistent state. If an interrupt arrives, it is held until that safe point.

The Pentium Pro SYSEXIT contains no such checkpoint. An interrupt that arrives during the execution of SYSEXIT may find the processor in a partially-updated state: the privilege level may already be set to ring 3 while the stack still points to a kernel address, or the code segment may be committed while the stack segment is not. Interrupt delivery in this half-updated state produces a stack fault or a protection fault, both of which escalate to a Double Fault.

Why Intel did not fix it

The Pentium Pro was already in production when this problem was identified. The fix Intel implemented for the Pentium II was architectural: rather than reading CS and SS from general-purpose registers, SYSEXIT computes them automatically from the SYSENTER_CS MSR, eliminating the possibility of a null selector being specified and removing the dependence on register state that kernel code cannot reliably control. This change made the instruction safe to document and use.

Intel's decision not to fix later Pentium Pro steppings was likely a cost/timeline judgment. No software used SYSEXIT at the time, so there was no pressure to patch a product that was already shipping. The workaround, leaving the instruction undocumented, was cheap. The consequence surfaced years later when Linux began relying on the Pentium II behavior and discovered the hard way that Pentium Pros behaved differently.

The correct CPUID check

Intel's documented check for SYSEXIT support (unsupported if "family 6, model less than 3, stepping less than 3") excludes only the earliest Pentium Pro steppings. Later Pentium Pros such as model 1, stepping 9 pass the check and are incorrectly identified as supporting the Pentium II SYSEXIT behavior. The corrected check compares a single combined model-stepping value, so that a high stepping cannot mask a low model, and excludes all Pentium Pro processors. The discrepancy between these two checks is what made the Linux 2.6 crashes dependent on the specific CPU stepping and caused confusion about which processors were actually affected.
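
The difference between the two checks is easiest to see side by side. In this sketch the 0x633 threshold mirrors the comparison used in the Linux kernel's Intel CPU setup code; treat it as an assumption here, since the text above does not state the constant. Both function names are hypothetical.

```python
def documented_unsupported(family: int, model: int, stepping: int) -> bool:
    """Intel's documented test: each field is compared independently."""
    return family == 6 and model < 3 and stepping < 3

def combined_unsupported(family: int, model: int, stepping: int) -> bool:
    """Corrected test: pack the fields into one value and compare once."""
    return ((family << 8) | (model << 4) | stepping) < 0x633

# Family 6, model 1, stepping 9: a later Pentium Pro.
print(documented_unsupported(6, 1, 9), combined_unsupported(6, 1, 9))
```

For model 1, stepping 9 the documented test reports the instructions as usable because the stepping comparison fails on its own, while the combined test still rejects the part because 0x619 is below 0x633.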

This project is an independent, unofficial work based on publicly available information and reverse-engineering research, and is not affiliated with, endorsed by, sponsored by, or associated with Intel Corporation or its affiliates. It is provided "as is", without warranty of any kind. The author assumes no responsibility or liability for any use, misuse, damage, data loss, hardware failure, or other consequences arising from its use. Intel, Pentium, Core and related trademarks are the property of their respective owners and are used solely for identification and informational purposes.