One of the toughest problems in driver development is knowing when your driver works “correctly.” For most of us, just getting the damn driver to work more or less predictably with the hardware is usually evidence enough that our driver is operating properly. Maybe we run some long-running usage tests to try to stress our driver a bit, and if there’s no system crash, we figure everything must be fine.
Driver testing is, admittedly, somewhat of a mysterious art. What tests need to be run? How do we know when we’ve done enough testing? What testing options are available? These are some of the questions this article will attempt to answer.
Before we start telling you what to test, however, it’s important to identify exactly what it is we’re attempting to determine when we’re testing a driver. The two main categories of things one can test in a driver are:
1. Functional Correctness
2. Implementation Correctness
Testing for Functional Correctness means determining if the driver achieves its goal of operating properly. In other words, does the driver appear to provide the required functionality? If the driver is a device driver, does the device appear to work properly under the conditions that are specified for its use? We’ll only briefly discuss testing for Functional Correctness in this article.
What this article focuses on is how to test for Implementation Correctness. Implementation Correctness attempts to determine if the driver was implemented properly, and thus behaves in a stable and predictable manner when interacting with the greater operating system environment. It attempts to determine if there are latent programming errors in the driver that do not (directly or obviously) impact functionality but may eventually, in specific and perhaps rare cases, impact overall system stability.
Testing for Implementation Correctness is typically much more difficult than testing for Functional Correctness. For example, after pounding on a device driver for a while, it's usually pretty clear whether or not it does what it's supposed to do. On the other hand, it's not always so clear whether the methods used to achieve that functionality are appropriate. That's why testing for Implementation Correctness will be our focus.
Testing for Functional Correctness
While we’re not going to focus on Functional Correctness in this article, at least a few words on this topic seem to be in order.
Functional testing for drivers of standard Win2K devices can be validated using the Microsoft Hardware Compatibility Tests (HCTs). These tests have been developed by the Windows Hardware Quality Labs (WHQL, pronounced “wick-le” which rhymes with “pickle”) and are the tests (perhaps among others) that a driver and hardware group must pass before receiving the “Designed for Windows 2000” logo. But are the HCTs enough and are they the complete list of tests that you need to run to have confidence in the functional correctness of your driver?
And what if you’re implementing a driver for a unique type of device? Let’s say you’ve got a satellite telemetry device in your backyard that you’re writing a driver for in your spare time. How do you go about testing such a device?
The key to functional testing is making an exhaustive tour of the specified functionality provided by the driver. Of course, being a disciplined engineering organization, you’ve developed a set of functional requirements and probably even an interface specification for your driver. That being the case, the best sort of tests that you can create entail having a developer not involved in the driver’s implementation write a series of tests that exercise every option of every function specified in your interface specification.
Using an independent developer to write these tests avoids having them (unwittingly, of course) contaminated by knowledge of the implementation. Tests should test both positive and negative functionality. Where error return codes are specified, tests should be developed to specifically attempt to provoke these errors.
There’s a lot more to testing Functional Correctness, of course, but those are at least the basics.
But Does It Do It RIGHT?
Testing for Implementation Correctness doesn’t just test whether or not your driver does the right thing, it tests whether or not your driver does what it does correctly. As we’ve already said, this isn’t easy. But by methodically testing your driver for correctness before you release it, I guarantee you that you’ll reduce customer bug reports and increase customer satisfaction. After all, when something goes awry in your driver, the stability of the entire operating system is affected. And, gee whiz, for some reason customers really seem to hate it when their systems crash.
So now I’ve got to give you the bad news: Since software development is complicated (duh!) there is no magic bullet that’ll bless your driver as being implemented 100% correctly. The best we can do is run a series of tests that demonstrate (under the conditions in which they’ve been examined) that the driver appears to be correctly implemented.
So what can you test in terms of Implementation Correctness? Let’s start at the beginning.
Free And Checked Win2K
When you develop a driver, it is absolutely vital that you exercise its functionality as thoroughly as possible, on both the Free (Retail) and Checked (Debug) versions of the operating system. I cannot emphasize this strongly enough.
The Checked build of Win2K is supplied in your MSDN distribution and/or is available for download from the Microsoft web site. It is used specifically for system-level debugging, including attempts to identify difficult problems at customer sites.
The Checked build of the operating system has built into it a series of basic “reasonableness checks” that are designed to catch both gross driver offenses and many unusual or suspect practices. Most of these checks are implemented in the form of ASSERT(…) statements that are built into the NTOS and HAL source code by the Win2K development team. If you do something that appears to be wrong, a message indicating that an ASSERT failed will be displayed in the debugger, like that shown in Figure 1.
*** Assertion failed: Irp->IoStatus.Status != 0xffffffff
*** Source File: D:\nt\private\ntos\io\iosubs.c, line 3305
0:Break, Ignore, Terminate Process or Terminate Thread (bipt)? b
0:Execute '!cxr BD94B918' to dump context
Break instruction exception - code 80000003 (first chance)
804a3ce4 cc int 3
Figure 1 – Your Garden-Variety Assert Shown in WinDbg…You Mean You Don’t Have the Code at Line 3305 in IOSUBS.C?
To me, there is very little that is sloppier or more unprofessional looking than to have your released driver cause ASSERTS in the Checked build. I mean, the Checked build contains very basic tests for correctness. If you can’t pass those, it implies that you haven’t even done the basics in terms of testing. And, lest you think everyone already tests with the Checked build, take note: I’ve seen lots of supposedly stable systems where production drivers start throwing ASSERT()s as soon as the checked build is installed. It’s enough to make me retch.
On the other hand, it is possible for the Checked build to sometimes give you an ASSERT() that isn’t really an error. It’s important to keep in mind that the Checked build checks your driver’s actions and the state of the system to see if they appear to be “reasonable.” In some cases, you can do something that’s a bit unusual, but perfectly valid, and the Checked build will complain. This is rare, but when it happens it’s important for you to fully investigate and reconcile the ASSERT() with your driver’s actions. For the benefit of your customers, you probably want to document that the ASSERT() is expected in your driver’s release notes.
Please don’t mistake my advice to mean that you should only test with the Checked build, though. The road is littered with the bodies of developers who develop using only the Checked build, and then throw their drivers “over the wall” to the testing group… who do a set of tests that rapidly fail using the Free build. It’s important to exercise the Free build of your driver thoroughly on both the Free and Checked builds of the operating system.
Always Use Pool Tagging
Allocating data structures from pool and forgetting to return them is a problem that many driver writers are probably familiar with. The solution to this problem is very easy. When writing your driver, always (and I do mean always) use pool tagging.
Pool tagging allows your driver to associate a four-character alphanumeric “tag” with each block of pool that you allocate. You supply this tag when you allocate the pool block: You call ExAllocatePoolWithTag(), in place of plain old ExAllocatePool(). Win2K (and NT V4) keeps a list of all the allocation and free operations for each pool tag. The number of allocations and deallocations that have occurred, the number of individual allocations, and the amount of memory presently allocated per tag can be displayed using the console-mode “poolmon” utility in the DDK’s \BIN directory, or by using the free OSR utility PoolTag (go to http://www.osr.com, click the “Resources” tab, and then select the “Software To Download” section).
By providing different tags for each type of data structure you allocate (and also perhaps using a different tag for each place a data structure is allocated), you can quickly and easily identify those data structures that never get returned. Exercise your driver for an extended period of time, and watch the pool allocations. When your driver exits, check to see if any pool allocations remain. Using pool tagging, you’ll quickly know if your driver is leaking any pool.
By default, pool tagging support is enabled in the Checked build of the operating system only. If you want support in the Free build, all you need to do is set the appropriate Win2K Global Flag (“Enable Pool Tagging”) using the Gflags utility from the Win2K DDK’s \BIN directory.
Interestingly enough, when you build the checked build of your driver, you always use pool tagging. Check ntddk.h or wdm.h – calls made to ExAllocatePool() in the checked build of your driver (when the symbol DBG is defined) turn into calls to ExAllocatePoolWithTag(). The default tag used is “DDK”.
Driver Verifier (discussed below) also implements extensive support for debugging pool leaks and errant pool write operations. However, using Driver Verifier most certainly does not eliminate the need for your driver to use pool tagging.
Test On A Multiprocessor System
Multiprocessor systems are widely available. By default, Win2K Professional supports up to two CPUs; various versions of Win2K Server now support up to 32. In case it’s not obvious, the drivers you develop must be fully re-entrant and multiprocessor safe.
In my experience, I have never met a human being who can implement truly MP-safe code without testing it. It therefore logically follows that you must test your code rigorously to ensure MP safety. This is one of the most important things you can do to ensure your driver’s Implementation Correctness.
The problems we’re most hoping to catch here are deadlocks. These subtle little timing problems can be really hard to find, and when you do find them they can be almost impossible to reproduce. Also, in doing MP testing, you’re looking for incorrect or insufficient locking of shared data structures. Luckily, these problems are typically easier to provoke than deadlocks.
How can you best test your code for MP safety? Unfortunately, this is one category where Win2K (and even Driver Verifier) doesn’t yet give us much help. Aside from using OSR DDK (on NT V4, see below) your best bet is to push your driver really hard on an MP system with as many CPUs as you can talk your manager into buying. Be sure to initiate as many parallel I/O operations in your driver as you can (assuming your driver supports this). Exercise as much varied functionality, in parallel, as possible. Even at this, you’re still working in “random walk” mode. That is, you’re just hoping to stumble upon an error by luck.
Performing MP testing on systems with lots of CPUs is invaluable, and will speed up the process of finding problems immensely. Testing on a two-CPU system is barely acceptable. It seems to me that testing on a four-CPU system is about an order of magnitude more effective than testing on a dual-CPU system. And testing on an eight-processor system (gee, if I could ever get one) is probably an order of magnitude better still than testing on a quad.
Another very helpful trick in MP testing is to write private macros that track and “value add” the KeAcquireSpinLock() and KeReleaseSpinLock() functions. How to do this is, however, quite a bit beyond the scope of this article.
Driver Verifier
The big gun of driver testing is Driver Verifier (see Figure 2). This utility enables code in the operating system itself that performs a special set of checks on drivers. The checks range from the simple and previously available (such as checking to ensure that you don’t try to call KeRaiseIrql() to lower your IRQL, which has always caused a bug check in the Checked build) to the new, interesting, and complex (such as catching an incorrectly copied I/O Stack Location passed to IoCallDriver()). No matter: the checks performed are all useful.
Figure 2 – Driver Verifier
Driver Verifier can help to pinpoint problems that can be very difficult to locate otherwise. For example, Driver Verifier can be set to notice if you attempt to reference pageable memory at IRQL >= DISPATCH_LEVEL. And it will notice even if that memory was resident (i.e. not paged out) when you last referenced it.
The Driver Verifier manager executable image “verifier.exe” is supplied in the DDK’s \TOOLS directory. But it’s also supplied in the %SystemRoot%\System32\ directory on the regular Free (Retail) distribution kit. So, if you don’t run it, it’s OK: Your customers will.
Note that you can run Driver Verifier on either the Free or Checked build of the system, and with either the Free or Checked build of your driver. However, it’s typically more useful to run Driver Verifier on the Checked build of the operating system, with a kernel debugger running. For almost all of the problems that Driver Verifier identifies, when a problem is detected an explanatory message is displayed (if the Checked build of the system is running and a kernel debugger is connected), and then the system crashes. That’s right. Driver Verifier doesn’t give you a second chance – When it finds a bug, it calls KeBugCheckEx() and causes a Blue Screen.
When you run “verifier.exe” you’re greeted with a list of drivers that can be verified, and a set of choices of verification types. You can select one or more drivers to verify. Then, you select the type of verification that you want to perform. See section 3.1 in the DDK (“Driver Verifier”) for all the details. I’ll just list a few of the things that Driver Verifier checks.
If you don’t select any of the verification type check boxes, you get what Driver Verifier calls Automatic Checks. The list of checks performed is provided in the DDK. The types of checks performed are things the Checked build normally checks. However, there are several additions: For example, when you return a block of pool a check is made to see if that block contains an unused timer block. Driver Verifier also checks to ensure that requests to allocate and free pool blocks are performed at appropriate IRQLs. Again, these checks are fundamental, but they are useful nonetheless.
The really good stuff starts when one of the verification type check boxes is selected. For example, checking special pool causes drivers that are being verified to have their pool allocations treated exceptionally. Each request to allocate pool (either paged or non-paged) results in the allocation being rounded up to the next full page size. The returned block of pool is then aligned at the end of the last page of the allocated region, and the virtual page (PTE) following the block to be returned is set to “no access” in the memory manager. The result of this is an immediate trap (and bugcheck) the moment your driver attempts to write past the end of the allocated block. A memory scribble, caught in the act!
Assuming you don’t try to write past the end of your allocation, when you return the block of pool, the virtual addresses that were used to map the block are also set to “no access.” As a result, any access by your driver to a pool block that has already been returned to pool is immediately caught.
Checking the Force IRQL Checking check box in Driver Verifier causes all pageable code and data in both your driver and the system (and this includes paged pool) to be marked as “trimmed” from the working set every time you raise your IRQL to DISPATCH_LEVEL or higher (such as when you acquire a spin lock). As a result, any attempt to reference one of the pageable pages will result in a page fault at elevated IRQL. And an ensuing system crash, of course.
The Pool Tracking check box in Driver Verifier simply enables pool tag tracking, even in the Free build of Win2K where it is normally turned off. This allows you to do pool tag tracking using PoolTag, PoolMon, or even the Driver Verifier manager in either the Free or the Checked build of Win2K.
To ensure that all your interactions with IRPs are as proper as they should be, check the I/O Verification check box. This checking enables all sorts of tests for creating, passing along, and completing IRPs. Level 1 includes the basics, such as ensuring that when you call IoCompleteRequest(), you’re really calling it with a pointer to an IRP. It also ensures that your dispatch routine returns at the same IRQL at which it was called (which wouldn’t be the case, for example, if you leave your dispatch routine inadvertently holding a spin lock).
Level 2 testing enables a whole series of more aggressive tests, including tests for specific WDM IRPs. For example, these tests will catch your driver if you call IoCompleteRequest() on a PNP IRP with STATUS_NOT_SUPPORTED, such as if you fail to pass along to an underlying bus driver an IRP that your driver doesn’t understand. Another cool problem that’s caught at this level is when your driver has a completion routine, but doesn’t mark an IRP pending that it has sent to a lower driver using IoCallDriver(), when the lower driver has returned STATUS_PENDING.
The tests for I/O Verification Level 2 are complex, obscure, and very useful. Like the Checked build, they also occasionally complain about unusual, but perfectly valid, behavior in a driver. As a result, not every Level 2 failure results in a call to KeBugCheck(). Some Level 2 test failures just print messages in the debugger.
Driver Verifier is clearly one of the best ways to test the correctness of your driver. You might not think you need this type of testing until you actually enable it and put your driver through its paces. It’s amazing the sorts of things it turns up. It’s best when you use this level of testing as a development tool, and not just leave it to the testing folks to let it loose on your code. Driver Verifier can turn up errors that can only be found by careful checks made from within the operating system.
There are some downsides to Driver Verifier, too. It’s being expanded, so just when you think you’ve passed all the tests in the universe, it adds yet more checks. It’s not yet available for NDIS drivers. However, sources tell us that NDIS support may be available by the time Whistler ships. Also, like any test tool, Driver Verifier can sometimes be a little too strict. It catches as errors a whole group of things that fall into the category of “you shouldn’t NORMALLY be doing this… well, unless you know what you’re doing.” One example of this is trying to release a spin lock that was acquired by calling KeAcquireSpinLockRaiseToSynch(). OK, it’s not a supported function…we understand that. But that doesn’t mean there aren’t legitimate reasons to call it (witness the fact that it exists at all!). And when you try to return that spin lock, Driver Verifier blue screens the system.
Third-Party Tools
Our review of methods of testing Implementation Correctness wouldn’t be complete without a brief discussion of the two third-party tools that are available: OSR DDK (by OSR, obviously), and BoundsChecker for Drivers (by Compuware).
Both of these tools replicate, to some extent, the features that Driver Verifier provides. However, they both also provide additional features, such as automatic driver tracing. BoundsChecker also fully supports validation of NDIS drivers.
OSR DDK, especially, provides a level of checking that neither Driver Verifier nor BoundsChecker for Drivers provides. It does things like spin lock tracking, to automatically detect improper enforcement of the locking hierarchy (which will potentially cause deadlocks). OSR DDK also knows in detail about driver data structures, and cross checks all the “little things” in your driver. It does this by learning about your driver as your driver executes. So, for example, let’s say you have a PCI device. OSR DDK uses that little bit of knowledge to cross-check everything your driver does. When you call IoConnectInterrupt(), for example, OSR DDK will make sure that you specify your interrupt is both LevelSensitive and Shareable, because that’s how PCI devices should work.
OSR DDK is certainly cool, but it presently only supports NT V4 standard kernel mode drivers. And OSR DDK does not support NDIS drivers.
In addition to all of the above, the Win2K DDK contains some additional tools (not surprisingly, in the DDK \TOOLS directory) that can be used for correctness testing. Check out disabler, a tool that sends tons of Start/Query Remove/Remove sequences of PNP IRPs.
When you put all the testing we’ve discussed together, you have a very strong set of tests that can help verify Implementation Correctness. While we can never be sure our drivers are written right, it’s also not smart to release a driver to a customer that’s barely been tested. As I mentioned previously, it’s always better to use the tests and tools I’ve described in this article while developing your driver. This is because it’s much easier to build correctness into your driver than to test it in after the fact.
The old days of “run it for a while, and if the system doesn’t crash… ship it!” are gone forever. Use the tests and tools I’ve described before letting your driver loose on the world. Your time and effort will undoubtedly be repaid handsomely by fewer bug reports and a more solid code base.