simple question about UNICODE_STRING

hello

i have recently started learning to develop drivers for windows, and there is a simple question about the WDK structure UNICODE_STRING that i could not find a direct answer to.
can i assume that every UNICODE_STRING in the NT system consists of an array of wide characters, and every single character is exactly 2 bytes long - regardless of the letter/symbol/whatever it represents, and i can safely index into the array and use ++ and -- on the pointers without having to worry about encoding and other things?
i asked the question because i was thinking of writing some minor string processing functions for when i want to process strings at IRQL > PASSIVE_LEVEL (as far as i have seen, every unicode string function requires that level)

thank you

> can i assume that every UNICODE_STRING in the NT system consists of an array of wide characters,
> and every single character is exactly 2 bytes long - regardless of the letter/symbol/whatever it
> represents, and i can safely index into the array and use ++ and -- on the pointers without having to
> worry about encoding and other things?

Yes.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

thanks

typedef struct _UNICODE_STRING {
USHORT Length;
USHORT MaximumLength;
PWSTR Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

Length
The length, in bytes, of the string stored in Buffer.

MaximumLength
The length, in bytes, of Buffer.

Buffer
Pointer to a buffer used to contain a string of wide characters.

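WCHAR& CUString::operator[](int idx) {
    // accesses an individual WCHAR in the CUString buffer
    // MaximumLength is in bytes, so divide by sizeof(WCHAR) to get the element count
    if (idx >= 0 && idx < (int)(uStr.MaximumLength / sizeof(WCHAR)))
        return uStr.Buffer[idx];
    else
        return uStr.Buffer[0]; // out of range: a reference must still be returned
}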

If the string is null-terminated, Length does not include the trailing null character.

The MaximumLength is used to indicate the length of Buffer so that if the string is passed to a conversion routine such as RtlAnsiStringToUnicodeString the returned string does not exceed the buffer size.
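
As a minimal sketch of what the two byte counts mean in practice, the helpers below convert Length and MaximumLength into WCHAR counts. The typedefs are user-mode stand-ins for the WDK definitions so that the fragment compiles on its own, and the helper names UsCharCount and UsCapacity are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the WDK types, so this compiles outside the kernel. */
typedef unsigned short USHORT;
typedef unsigned short WCHAR;       /* one 16-bit UTF-16 code unit */
typedef struct _UNICODE_STRING {
    USHORT Length;                  /* bytes in use, excluding any terminator */
    USHORT MaximumLength;           /* bytes allocated for Buffer */
    WCHAR *Buffer;
} UNICODE_STRING;

/* Hypothetical helper: number of WCHARs currently stored in the string. */
static size_t UsCharCount(const UNICODE_STRING *s)
{
    assert(s != NULL);
    return s->Length / sizeof(WCHAR);
}

/* Hypothetical helper: number of WCHARs the buffer has room for. */
static size_t UsCapacity(const UNICODE_STRING *s)
{
    assert(s != NULL);
    return s->MaximumLength / sizeof(WCHAR);
}
```

Any loop over Buffer should be bounded by the first value, not by Length itself and not by a search for a terminating null, since the string need not be null-terminated.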

regards
pradish

i understand how to access each individual element, i just wanted to make sure there is no variable-length encoding there such as this: http://en.wikipedia.org/wiki/UTF-8#Description
meaning that if i have a character index into the UNICODE_STRING (an index into the WCHAR buffer, not a BYTE offset), i don't have to do some parsing from the beginning, and can just read starting from that character under any circumstance

Windows is using UTF-16, which means in some cases a single printable character may occupy two 16-bit words.

ok, that contradicts what was stated earlier…
now which one is true?

Do you actually care about printable characters (code points) in the kernel? What are you doing with the string, and what does your index actually mean?

printable characters are not really important for me, they are processed internally anyway.

for your next question consider the following:

DECLARE_CONST_UNICODE_STRING(String,L"\\Device\\HarddiskVolume1"); //internal macro in WDK, declares a unicode string and initializes it

//now let's say i would like to find the first backslash:

DWORD i= 0;
while(i< String.Length)
{
    if(L"\\"== String.Buffer[i])
    {
        break;
    }
    i++;
}

a reason why someone would want to write their own string functions and not use the RtlUnicodeString* ones may be that those can only be called at PASSIVE_LEVEL and you don't want to run a worker item

http://en.wikipedia.org/wiki/UTF-16

It is a variable-length wchar string. Earlier implementations in Windows
were not variable length (UCS-2). You should not be parsing strings in the
kernel if you can possibly avoid it.

Mark Roddy

On Fri, Feb 1, 2013 at 9:42 AM, wrote:

> ok, that contradicts what was stated earlier…
> now which one is true?
>
> —
> NTDEV is sponsored by OSR
>
> OSR is HIRING!! See http://www.osr.com/careers
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer
>

UTF-16 (and UTF-8) have a nice property: code units 0 through 127 can never be part of a multi-unit sequence. When you are looking for such characters, you can safely treat the string as if it had no multi-(W)CHAR sequences.
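
A small sketch of why that property holds for UTF-16: the two halves of a surrogate pair always fall in the range 0xD800..0xDFFF, which does not overlap 0..127. The WCHAR typedef is a user-mode stand-in and the function names are invented for this illustration.

```c
#include <assert.h>

typedef unsigned short WCHAR;   /* stand-in for the WDK type */

/* Returns 1 if the code unit is one half of a surrogate pair. */
static int IsSurrogateUnit(WCHAR c)
{
    return c >= 0xD800 && c <= 0xDFFF;
}

/* Walks the whole ASCII range and checks that none of it is a surrogate. */
static void CheckAsciiNeverSurrogate(void)
{
    WCHAR c;
    for (c = 0; c <= 0x7F; c++) {
        assert(!IsSurrogateUnit(c));
    }
}
```

So a WCHAR that compares equal to an ASCII character such as backslash really is that character, wherever it appears in the buffer.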

xxxxx@gmail.com wrote:

> i have recently started learning to develop drivers for windows, and there is a simple question about the WDK structure UNICODE_STRING that i could not find a direct answer to.
> can i assume that every UNICODE_STRING in the NT system consists of an array of wide characters, and every single character is exactly 2 bytes long - regardless of the letter/symbol/whatever it represents, and i can safely index into the array and use ++ and -- on the pointers without having to worry about encoding and other things?

Almost, but not quite. There are now more than 65,536 code points in
Unicode, so the UTF-16 encoding used by Windows is not enough. It is
possible for a character’s encoding to require two wchar_t. That’s
called a “surrogate pair”.

However, the characters beyond the first plane (that is, beyond U+FFFF)
are quite obscure.

> i asked the question because i was thinking of writing some minor string processing functions for when i want to process strings at IRQL > PASSIVE_LEVEL (as far as i have seen, every unicode string function requires that level)

What kind of processing? You can do comparisons without worrying about
the surrogate pairs. There isn’t any good way to sort characters in
different planes, so it’s not clear there is any sensible processing to
do other than a byte-wise comparison.

> //now let's say i would like to find the first backslash:

As Alex pointed out, the ASCII characters (like backslash) are
guaranteed not to occur in a surrogate pair, so you don’t need any
special processing for this.
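
The comparison case mentioned above can be sketched as follows, assuming an exact, case-sensitive match is what is wanted. The types are user-mode stand-ins for the WDK ones and UsEqualExact is a hypothetical name; in a driver, RtlEqualMemory would be the usual substitute for memcmp, and the buffers would have to be resident memory if this runs at elevated IRQL.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the WDK types, so this compiles outside the kernel. */
typedef unsigned short USHORT;
typedef unsigned short WCHAR;
typedef struct _UNICODE_STRING {
    USHORT Length;          /* bytes in use */
    USHORT MaximumLength;   /* bytes allocated */
    WCHAR *Buffer;
} UNICODE_STRING;

/* Hypothetical helper: exact equality of two counted strings. */
static int UsEqualExact(const UNICODE_STRING *a, const UNICODE_STRING *b)
{
    assert(a != NULL && b != NULL);
    if (a->Length != b->Length)
        return 0;
    if (a->Length == 0)
        return 1;           /* two empty strings; do not touch Buffer */
    /* Length is already a byte count, so it can be passed to memcmp as is. */
    return memcmp(a->Buffer, b->Buffer, a->Length) == 0;
}
```

Surrogate pairs need no special handling here because two strings are equal exactly when their code units are equal.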


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

But don’t forget Code Red. It worked by encoding a character using a
Unicode character not in 0..127 that converted to a dot, allowing a
relative path to be inserted in the sequence. Note also that looking for
\ is often insufficient because / is legitimate in a number of contexts.
joe


> printable characters are not really important for me, they are processed
> internally anyway.
>
> for your next question consider the following:
>
> DECLARE_CONST_UNICODE_STRING(String,L"\\Device\\HarddiskVolume1");
> //internal macro in WDK, declares a unicode string and initializes it
>
> //now let's say i would like to find the first backslash:
>
> DWORD i= 0;
> while(i< String.Length)
> {
> if(L"\\"== String.Buffer[i])

That would be
if(L'\\' ==

The code given should not even compile.

> {
> break;
> }
> i++;
> }
>
> a reason why someone would want to write their own string functions and not use
> the RtlUnicodeString* ones may be that those can only be called at
> PASSIVE_LEVEL and you don't want to run a worker item



> printable characters are not really important for me, they are processed
> internally anyway.
>
> for your next question consider the following:
>
> DECLARE_CONST_UNICODE_STRING(String,L"\\Device\\HarddiskVolume1");
> //internal macro in WDK, declares a unicode string and initializes it
>
> //now let's say i would like to find the first backslash:
>
> DWORD i= 0;
> while(i< String.Length)

String.Length is the count in bytes, so this code is nonsense as written.

> {
> if(L"\\"== String.Buffer[i])

if(L'\\' == String.Buffer[i / sizeof(WCHAR)])

> {
> break;
> }
> i++;

That would be
i+= sizeof(WCHAR)

> }
>
> a reason why someone would want to write their own string functions and not use
> the RtlUnicodeString* ones may be that those can only be called at
> PASSIVE_LEVEL and you don't want to run a worker item
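
Putting the corrections from this thread together, a search for the first backslash might look like the sketch below: a character literal instead of a string literal, and a loop bound counted in WCHARs instead of bytes. The types are user-mode stand-ins for the WDK ones and UsFindFirst is a hypothetical name. The sketch assumes the buffer is resident (nonpaged) memory if it is touched at elevated IRQL.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the WDK types, so this compiles outside the kernel. */
typedef unsigned short USHORT;
typedef unsigned short WCHAR;
typedef struct _UNICODE_STRING {
    USHORT Length;          /* bytes in use */
    USHORT MaximumLength;   /* bytes allocated */
    WCHAR *Buffer;
} UNICODE_STRING;

/* Hypothetical helper: index (in WCHARs) of the first ch, or -1 if absent. */
static long UsFindFirst(const UNICODE_STRING *s, WCHAR ch)
{
    size_t count, i;

    assert(s != NULL);
    count = s->Length / sizeof(WCHAR);   /* Length is a byte count */
    for (i = 0; i < count; i++) {
        if (s->Buffer[i] == ch)
            return (long)i;
    }
    return -1;
}
```

In a driver the call would be something like UsFindFirst(&String, L'\\'). As noted earlier in the thread, a path parser may also need to look for / in some contexts.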



> Unicode, so the UTF-16 encoding used by Windows is not enough. It is
> possible for a character’s encoding to require two wchar_t. That’s
> called a “surrogate pair”.

Starting with what version does Windows support such characters?


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

sorry, i now see that i forgot to write String.Length/2, but the example was only meant to show what i mean by indexing

> Unicode, so the UTF-16 encoding used by Windows is not enough. It is
> possible for a character’s encoding to require two wchar_t. That’s
> called a “surrogate pair”.

according to this post: http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx windows uses wide characters, which can be considered ucs-2, and the pair is called supplementary pair

Maxim S. Shatskih wrote:

> Unicode, so the UTF-16 encoding used by Windows is not enough. It is
> possible for a character’s encoding to require two wchar_t. That’s
> called a “surrogate pair”.
> Starting with what version does Windows support such characters?

Windows 2000.
http://msdn.microsoft.com/en-us/library/windows/desktop/dd374069.aspx


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@gmail.com wrote:

> sorry, i now see that i forgot to write String.Length/2, but the example was only meant to show what i mean by indexing
>
>> Unicode, so the UTF-16 encoding used by Windows is not enough. It is
>> possible for a character’s encoding to require two wchar_t. That’s
>> called a “surrogate pair”.
>
> according to this post: http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx windows uses wide characters, which can be considered ucs-2, and the pair is called supplementary pair

Windows in the 20th Century used UCS-2. All 21st century Windows
systems use UTF-16. The difference is only relevant in a user
interface. In most cases, the difference can be ignored.

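On the terminology: the characters beyond U+FFFF are “supplementary
characters”, and the two 16-bit code units that encode one are still a
“surrogate pair”. Each half of such a pair is a “surrogate code point”.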
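
For the rare case where the difference does matter, here is a rough sketch of how one character is read from a UTF-16 buffer, combining two code units when they form a pair. WCHAR is a user-mode stand-in and NextCodePoint is a hypothetical name. Returning an unpaired half unchanged is a choice made for this sketch, not something the thread prescribes.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short WCHAR;   /* stand-in for the WDK type */

/*
 * Hypothetical helper: decodes the character starting at buf[*i] and
 * advances *i by one or two code units. An unpaired half is returned as is.
 */
static unsigned long NextCodePoint(const WCHAR *buf, size_t count, size_t *i)
{
    WCHAR hi;

    assert(buf != NULL && i != NULL && *i < count);
    hi = buf[(*i)++];
    if (hi >= 0xD800 && hi <= 0xDBFF && *i < count) {
        WCHAR lo = buf[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*i)++;
            /* 10 bits from each half, offset by 0x10000 */
            return 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                             + (unsigned long)(lo - 0xDC00);
        }
    }
    return hi;
}
```

Code that only copies, compares, or searches for ASCII characters never needs this step, which is why the difference can usually be ignored in a driver.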


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.