This issue’s crash dump analysis was done using a friend’s laptop computer that had crashed; I had gleefully set it up in hopes that this would happen and I would get an interesting dump from a production single user system. Having actually crashed while I was using it I had an opportunity to analyze the dump.
The entire debugging process is now just amazing. The laptop wasn’t really set up for debugging, so I downloaded the latest debugger. No need to fumble around and find the correct symbols for the version of the OS in use (including hot fixes). I could just point it at the symbol server and it just worked. My hat is off (again) to the Microsoft WinDBG team for making the debugging experience truly pleasant. At this rate maybe I’ll have my mother debugging her computer when it crashes.
I started off by using the !analyze –v command and it gave me a garden variety IRLQ_NOT_LESS_OR_EQUAL bug check (0xA):
kd> !analyze -v
* Bugcheck Analysis
An attempt was made to access a pagable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arg1: 8b000054, memory referenced
Arg2: 00000002, IRQL
Arg3: 00000001, value 0 = read operation, 1 = write operation
Arg4: 804ec962, address which referenced memory
The stack was nothing less than stupendous, since it was not a stack trace I think I’ve ever seen before. For more detailed information, the stack trace using the kv command is below:
ChildEBP RetAddr Args to Child
80541d28 804dce53 0000000a 8b000054 00000002 nt!KeBugCheckEx+0x19 (FPO: [Non-Fpo])
80541d28 804ec962 0000000a 8b000054 00000002 nt!KiTrap0E+0x2ad (FPO: [0,0] TrapFrame @ 80541d44)
80541db4 804ec78a f65d6e80 f65d6e88 00000100 nt!KiUnlinkThread (FPO: [0,0,1])
80541dc8 804ecf28 00000000 80541de4 f65d6ec0 nt!KiUnwaitThread+0x10 (FPO: [Non-Fpo])
80541df4 804ec1dc 471470a0 0000000b 80541f1c nt!KiWaitTest+0xac (FPO: [Non-Fpo])
80541f00 804ec448 8054a900 8054a6a0 ffdff000 nt!KiTimerListExpire+0x78 (FPO: [Non-Fpo])
80541f2c 804f1e02 8054ace0 00000000 0000d60a nt!KiTimerExpiration+0xb0 (FPO: [Non-Fpo])
80541f48 804f1d67 00000000 804f1d7b 00000000 nt!KiRetireDpcList+0x46 (FPO: [0,0,0])
80541f50 804f1d7b 00000000 0000000e 00000000 nt!KiIdleLoop+0x12 (FPO: [0,0,0])
A crash in KiUnlinkThread is certainly most unusual. From my basic knowledge of the OS, I guessed that this routine must remove a thread from some queue (brilliant analysis of the name, no?). The real hint here was the call list in front: KiTimerExpiration calling KiTimerListExpire calling KiWaitTest calling KiUnwaitThread which then called KiUnlinkThread.
So, this was a satisfied wait condition and hence the unlink of the thread would be from a dispatcher object – in this case a timer. Whatever that timer was, it probably expired and removed a thread that was blocked waiting for it. Unfortunately, it looks like either the thread or the wait block (a KWAIT_BLOCK structure, which is used for all blocked/waiting objects in Windows) must not have been valid. I decided to go hunting for the wait block with a little memory groveling. The key here was to find the timer object itself, since that seemed most likely to lead back to the driver.
My big guess? That KiUnlinkThread was a KWAIT_BLOCK. It seemed difficult for me to think that it could be almost anything else – after all, what else would be using a routine called KiUnlinkThread and involve timer expiration? Of course, if I was right, the wait block would lead me to the invalid thread. Otherwise it would give me an invalid value. So I guessed that the first parameter to KiUnlinkThread was the dispatcher object (timer) which would start with a DISPATCHER_HEADER:
kd> dt nt!_DISPATCHER_HEADER f65d6e80
+0x000 Type : 0x8 ''
+0x001 Absolute : 0 ''
+0x002 Size : 0xa ''
+0x003 Inserted : 0 ''
+0x004 SignalState : 1
+0x008 WaitListHead : _LIST_ENTRY [ 0xf65f6e88 - 0xf65d6e88 ]
The wait blocks are threaded on the dispatcher header list:
kd> dt nt!_KWAIT_BLOCK f65f6e88
+0x000 WaitListEntry : _LIST_ENTRY [ 0x8bf66107 - 0x66e853d8 ]
+0x008 Thread : 0x8b000000
+0x00c Object : 0x74f685f0
+0x010 NextWaitBlock : 0x0c558b46
+0x014 WaitKey : 0x9689
+0x016 WaitType : 0xc8
I knew at this point that I was on the right track because the Thread field (0x8b000000) points to the right address range as the first parameter of the bug check code (0x8b000054). In fact, when I look at the register state from the trap frame I see this address in the ECX register:
kd> .trap 80541d44
ErrCode = 00000002
eax=80541de4 ebx=f65d6e88 ecx=8b000000 edx=00000100 esi=8b000000 edi=f65d6ec0
eip=804ec962 esp=80541db8 ebp=80541dc8 iopl=0 nv up ei pl nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202
This made me curious as to what would be at offset 0x54 in the KTHREAD structure:
kd> dt nt!_KTHREAD
+0x000 Header : _DISPATCHER_HEADER
+0x010 MutantListHead : _LIST_ENTRY
+0x018 InitialStack : Ptr32 Void
+0x01c StackLimit : Ptr32 Void
+0x020 Teb : Ptr32 Void
+0x024 TlsArray : Ptr32 Void
+0x028 KernelStack : Ptr32 Void
+0x02c DebugActive : UChar
+0x02d State : UChar
+0x02e Alerted :  UChar
+0x030 Iopl : UChar
+0x031 NpxState : UChar
+0x032 Saturation : Char
+0x033 Priority : Char
+0x034 ApcState : _KAPC_STATE
+0x04c ContextSwitches : Uint4B
+0x050 IdleSwapBlock : UChar
+0x051 Spare0 :  UChar
+0x054 WaitStatus : Int4B
+0x058 WaitIrql : UChar
+0x059 WaitMode : Char
+0x05a WaitNext : UChar
+0x05b WaitReason : UChar
+0x05c WaitBlockList : Ptr32 _KWAIT_BLOCK
+0x060 WaitListEntry : _LIST_ENTRY
+0x060 SwapListEntry : _SINGLE_LIST_ENTRY
+0x068 WaitTime : Uint4B
+0x06c BasePriority : Char
+0x06d DecrementCount : UChar
+0x06e PriorityDecrement : Char
+0x06f Quantum : Char
+0x070 WaitBlock :  _KWAIT_BLOCK
+0x0d0 LegoData : Ptr32 Void
+0x0d4 KernelApcDisable : Uint4B
+0x0d8 UserAffinity : Uint4B
+0x0dc SystemAffinityActive : UChar
+0x0dd PowerState : UChar
+0x0de NpxIrql : UChar
+0x0df InitialNode : UChar
+0x0e0 ServiceTable : Ptr32 Void
+0x0e4 Queue : Ptr32 _KQUEUE
+0x0e8 ApcQueueLock : Uint4B
+0x0f0 Timer : _KTIMER
+0x118 QueueListEntry : _LIST_ENTRY
+0x120 SoftAffinity : Uint4B
+0x124 Affinity : Uint4B
+0x128 Preempted : UChar
+0x129 ProcessReadyQueue : UChar
+0x12a KernelStackResident : UChar
+0x12b NextProcessor : UChar
+0x12c CallbackStack : Ptr32 Void
+0x130 Win32Thread : Ptr32 Void
+0x134 TrapFrame : Ptr32 _KTRAP_FRAME
+0x138 ApcStatePointer :  Ptr32 _KAPC_STATE
+0x140 PreviousMode : Char
+0x141 EnableStackSwap : UChar
+0x142 LargeStack : UChar
+0x143 ResourceIndex : UChar
+0x144 KernelTime : Uint4B
+0x148 UserTime : Uint4B
+0x14c SavedApcState : _KAPC_STATE
+0x164 Alertable : UChar
+0x165 ApcStateIndex : UChar
+0x166 ApcQueueable : UChar
+0x167 AutoAlignment : UChar
+0x168 StackBase : Ptr32 Void
+0x16c SuspendApc : _KAPC
+0x19c SuspendSemaphore : _KSEMAPHORE
+0x1b0 ThreadListEntry : _LIST_ENTRY
+0x1b8 FreezeCount : Char
+0x1b9 SuspendCount : Char
+0x1ba IdealProcessor : UChar
+0x1bb DisableBoost : UChar
So it appears to be setting the WaitStatus field itself. When looking at the other fields from the KWAIT_BLOCK it is clear that the block itself is invalid (note that the Object field (0x74f685f0)does not point back to the timer object (f65d6e80).
Thus the next question was “to what driver does this timer object belong?” This led me to look at the location of the original timer, shown in Figure 7.
kd> ln f65d6e80
(f65d6e80) rdbss!s_RxTimer | (f65d7858) rdbss!s_RxRecurrentWorkItemsList
Interestingly enough, the invalid memory block also appears to reside inside a driver (shown in Figure 8).
kd> ln f65f6e88
(f65f6e44) netbt!GetTracker+0x40 | (f65fc31e) netbt!SendDgramCompletion
kd> * This appears to be a timer belonging to netbt (NetBIOS over TCP/IP)
It seemed a bit odd that this memory block appears in the midst of a network driver.
At this point I decided to stop analyzing – I had identified the driver that owned the timer object and I had found that the KWAIT_BLOCK pointed into the middle of another driver. My theory (not proven from the dump) was that the wireless network card was experiencing some sort of problem. As it turns out, the laptop was actually suffering from a bad CD-ROM drive and I suspect the device removal/arrival notifications may have been perturbing the Plug and Play system – which showed up elsewhere. Regardless, the analysis is indicative of the usual kind of failure we see around drivers – an inability to deal with “extraordinary” circumstances.
So my friend isn’t any happier (he tried to get Dell to fix the CD-ROM drive, but they won’t do that until he runs the diagnostics on it – which, of course, come on a CD-ROM). So, my next task is to go fix his CD-ROM drive, or perhaps argue convincingly with the tech support guy that I really do know what I’m talking about.
Until next time, Ciao!