BizTalk Utilities CV ,   Jobs ,   Code library
 
Home Page


Add/Edit your code items
Search the code library
Browse for the code library


SQL XML
How to use the ADO Stream object to retrieve SQL2000 XML
How to transfer XML into a plain text file?
Convert XDR to XML
Sort a table using the header of the table
Applying a XSLT stylesheet using the DBMS_XMLQuery package
N:N relationships with DBMS_XMLQuery Package
DBMS_XMLQuery package bind values to produce more than one xml file
Download data from a database in XML
Inserting XML file into a table using DBMS_XMLSave Package
Insert XML file into a table using ADO
How to call out to a COM component's method through XSLT
How to select DISTINCT items from XML via XSLT
How to convert Word (RTF) documents to XML for auto publication
Creating SQL Statements with XSLT
SQL straight to XML w/ transform
History Of XML
History Of XML And What Is XML.
Order Automation
You enjoy while you are away and let your machine do the job for you.
SQL Server 2000 User Defined Functions - A Powerful concept


 
 

<< SEOSystem.XML >>


By Pedro Gil
First Posted 08/10/2001
Times viewed 4579

Character encoding, a few words on the subject


This post contains attachments
v20010810181946.zip 

Summary I will try on a few words explain what is unicode and what are the traps to watch out when you are using a non US-ASCII character.

What is Unicode?

I've seen lots of confusion, and even the documentation sometimes brings a misleading belief of what in fact is Unicode.

Unicode is a worldwide character-encoding standard, published by the Unicode Consortium. Computers store numbers that represent a character, Unicode provides a unique number for every character. Unicode is compatible with the ISO/IEC 10646.

Now come the need to encode such characters and that made available several ways of doing just that, there are several ways of representing the same characters

So what the heck mean UCS-* and UTF-*

UCS is just the acronym of Universal Character Set specified by the ISO/IEC 10646. The number that accompanies this acronym indicates the number of octets that the character set is coded. UCS-2 (two octets) and UCS-4 (four octets)

UTF is the acronym for Unicode Transformation Format

In order to be compatible with older systems that didn't support Unicode, the Unicode Consortium, defined what they call Encoding Forms, that is the representation of the character in bits, the number that accompanies this acronym indicates the encoding form that is to be used. So UTF-8 (8-bit encoding form), UTF-16 (16-bit encoding form). Be aware that this doesn't mean that in the case of the UTF-8 it will always use only 8 bits.

UTF use bit-shifting techniques to encode Unicode characters, UTF-8 encodes each Unicode character as variable number of bytes (usually 3), a particularity of UTF-8 is that character below the 127 range in ASCII have the same byte values in ASCII and UTF-8. UTF-16 always encodes each Unicode character as two bytes.

Little-Endian and Big-Endian

Some microprocessors differ in the use of the least significant byte, So you have a different BOM (see below) to indicate this byte order. Sometimes you will see the UTF-16LE and UTF-16BE denomination to indicate this byte order.

What is a BOM?

BOM (Byte Order Mark), this is a encoding signature for the file. A particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.

UTF-8 - EF BB BF - 

UTF-16LE - FF FE - ÿþ

UTF-16BE - FE FF - þÿ

This encoding signature is not to be displayed, any tool that support Unicode will understand this and will not show this to you nor consider it to be part of the text file. So you also have a problem that is how can I find out what BOM is being used. Well checking the Hexadecimal of the file or opening the file in a non-unicode text editor will give you those characters presented in the above table.

How are character represented

It’s important to understand how characters are encoded, all my samples use the character í. These characters along with other accentuated characters are very common on my natural language. I’ve opted for this one because it’s easy for me to write it, and it’s above the 127 range in ASCII, making him two bytes in UTF-8 (this will be important to show up some problems).

I've made this simple table to try and explain all this, hopefully you will understand it a little bit better…

Chr dec hex UCS dec UCS hex UTF8 dec UTF8 hex one by one Unicode Name

Í 146 92 237 00ED 195.173 C3AD Ã- Latin Small Letter I with Acute

For more information take a look at the following links.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

http://www.unicode.org/unicode/standard/principles.html

XML and character encoding

The W3C specify that all XML processor must read the UTF-8 and UTF-16 encoding. This text explain that to differentiate between UTF-8 and UTF-16 a BOM must be present, and that the BOM must be used by the parser as encoding signature. Other encoding may be supported, but no parser is required to have support for all of them, or one in particular, besides the UTF-8 and UTF-16. Well if this text is not enough to convince you that you should be using Unicode for all your XML files I don't known what else can I say. Well you should really give a good enough though every time you think using anything else besides Unicode (UTF-8 or UTF-16).

With IE and MSXML, there are two really common errors that happen when something is not correct with the three steps defined below.

An invalid character was found in text content (see my sample file error1.xml)

A parser found a character on your file that is not according the encoding declaration or the BOM specified for that file.

In the sample I've the character encoded with ISO-8859-1 and then speficy the UTF-8 on the encoding declaration, this will issue the error.

The reason that it doesn't issue the error below is that since the character is encoded with a one byte long encoding (ISO-8859-1), the parser at the beginning find this a valid character, finding later that can't understand the character that was used.

Switch from current encoding to specified encoding not supported

In a basic thinking this error is almost identical to the previous, the only thing is that the parser understands that the real encoding on the file is diferent from the one in the encoding declaration.

What the error is trying to tell you is that it can’t make the switch from the file encoding to the one you specify on the encoding declaration.

In error2.xml I've a file encoded as UTF-8 and a text encoding of UTF-16, since UTF-16 must always be two bytes, the parser known forehand that something is wrong with the encoding, the error3.xml is the same problem the other way around.

Declarations

XML and Text Declaration

[23] XMLDecl ::= ’<?xml’ VersionInfo [p.12] EncodingDecl

[p.38] ? SDDecl [p.15] ? S [p.8] ? ’?>’

[77] TextDecl ::= ’<?xml’ VersionInfo [p.12] ? EncodingDecl

[p.38] S [p.8] ? ’?>’

sample: <?xml version=1.0 encoding ='UTF-8'?>

The encoding declaration is part of the text declaration, in this form:

[80] EncodingDecl ::= S [p.8] ’encoding’ Eq

[p.12] (’’ EncName [p.38]

’’ | ’ EncName [p.38]

’ )

Sample: encoding ='UTF-8'

Auto-Detecting the encoding

The W3C defined a non-normative (Appendix F).

There are 3 steps rules to check this.

  1. If there is a BOM, then this defines the file encoding.
  2. If there isn't then look at the Encoding Declaration.
  3. If none of the above meet then assume UTF-8.

Since all XML files must start with a text declaration and that start with <? You can look at those characters to try and find out which encoding is used.

So what about character entities?

Character entity permits you to write a character without actually having to write the character. The drawback is that you don't want to actually write a character entity for each of those accentuated characters, this maybe nice feature for a once-in-time character like the trademark sign, but not for writing actual language.

You can specify your entities in three ways:

By name &iacute;

By decimal number &#237;

By hexadecimal number &#x00ed;

See the sample file entities.xml

For more information take a look at the following links.

http://www.w3.org/TR/REC-xml#charencoding

http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

XSLT and the Output

Since XSLT is used to transform XML into other files formats (including XML), it's important to be able to specify the encoding that the developer of the XSLT wants on the destination.

The W3Cdefines a <xsl:output> element for this propose, and a encoding attribute to specify the pretended encoding of the output.

<!-- Category: top-level-element -->

<xsl:output

method = xml | html | text | qname-but-not-ncname

version = nmtoken

encoding = string

omit-xml-declaration = yes | no

standalone = yes | no

doctype-public = string

doctype-system = string

cdata-section-elements = qnames

indent = yes | no

media-type = string

/>

A XSLT Processor may perform the serialization of the result tree, but it's not required to. In the situation where the XSLT Processor does make the serialization of the result tree, it should, but it's not required to, to respect the definition in the <xsl:output> element.

For more information take a look at the following links.

http://www.w3.org/TR/1999/REC-xslt-19991116#output

Other tools, how do they handle this?

The Unicode Consortium defines a set of rules for a Unicode enabled application, every application that claim to be unicode compliant should observe this set of rules.

The problem is that exist a large number of tools, applications and operating systems that still aren't Unicode compliant, so whenever you are trying to handle unicode files, certify yourself that you are using a unicode compliant application.

One of my first problems concerning Unicode and XML was because I was using a text editor that didn't support Unicode. One of these editors is the windows 95/98 notepad, this tool doesn't support Unicode so it will always write the file as ASCII based on the codepage that you have on selected on your Windows environment. The notepad that ships with Windows 2000 already has support for Unicode, you can define the encoding choosing one of the options in the encoding combo box, options are (ANSI, Unicode, Unicode Big-Endian and UTF-8).

Beware of the interpretation of MS of the word Unicode, almost every time that MS refers to Unicode they mean UTF-16.

This is referring to the part of saving files, when you are opening them this too can bring problems. Windows 2000 Notepad will read the BOM, and if present will respect it, but if no BOM is present, then we will scan the file trying to guest is encoding. If it decide this is an UTF-8 encoded, when you save it will put a BOM in the file.

Even if the tool support unicode, it can have different implementation, on the grey areas, for instance the editor I'm using, if no BOM is present and you open a UTF-8 encoded file, it will not scan the file to see if it's UTF-8 or not, like notepad 2000 does. So if during the edition you write a character like the í it will encode that according Windows codepage in a single byte encoding form.

So it's important that you check out your editor specification how it handles unicode, you can save yourself a lot of problem if you understand how it does it.

Objects/Interfaces encoding

Another frequent problem with encoding, are the objects/interfaces, and the way they handle character encoding. Well we could extend this article to more objects/interfaces, but since that would take forever. I recommend you to read careful the documentation of those and check out how they handle this situation, and even so, you will find that sometimes documentation is wrong, but that is maybe just a small part of it, even so you really must try it yourself.

MSXML 3sp1 DOMDocument
  1. Load/ LoadXml

    Those two methods serve to load a xml into the DOMDocument.

    Carefull that the MSXML 3 SDK documentation, mention there that the LoadXml, only handles a UCS-2 or UTF-16 encoded string, but if you look at the KB Article Q275883 http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP you will noticed that is says that this restriction is removes starting on MSXML 3.0.

    Sub DOMLoadXml()

    Dim docOut As MSXML2.DOMDocument

    Dim stm As ADODB.stream

    Dim strPath As String

    Dim bolRet As Boolean

    strPath = GetPath(CurrentDb.Name)

    Set stm = New ADODB.stream

    stm.Open

    stm.Charset = UTF-8

    stm.LoadFromFile strPath & encUTF8.xml

    'Juts to check it out

    Debug.Print stm.Charset

    stm.Position = 0

    Set docOut = New MSXML2.DOMDocument

    bolRet = docOut.loadXML(stm.ReadText)

    If Not bolRet Then

    MsgBox The xml string was not loaded

    End If

    docOut.Save strPath & test.xml

    End Sub

    If you check the test.xml file you will see that the character í is encoded as 0xC3AD, meaning that it's been read from UTF-8 and saved in UTF-8 too.

  2. Save

    The save method bases the decision of the character encoding in the encoding declaration. I've seen that in the previous snippet the file was saved in UTF-8, let see if it does is job with other encoding. If no encoding declaration is found then it will be encoded as UTF-8.

    Private Sub DOMSave()

    Dim docOut As MSXML2.DOMDocument

    Dim domPI As MSXML2.IXMLDOMProcessingInstruction

    Dim eleRoot As MSXML2.IXMLDOMElement

    Dim eleChild As MSXML2.IXMLDOMElement

    Dim strPath As String

    Set docOut = New MSXML2.DOMDocument

    Set domPI = docOut.createProcessingInstruction(xml, version='1.0' encoding='UTF-16')

    docOut.appendChild domPI

    Set eleRoot = docOut.createElement(Character)

    docOut.appendChild eleRoot

    Set eleChild = eleRoot.appendChild(docOut.createElement(chr))

    eleChild.Text = í

    strPath = GetPath(CurrentDb.Name)

    docOut.Save strPath & test.xml

    End Sub

    If you check the test.xml file you will see that the file as a BOM (0xFFFE) and that the character í is encoded as 0xED00, meaning that it's been saved as UTF-16.

  3. xml

    The xml read-only property retrieves the node and all is child’s.

    The part that concerns this article, this property will remove the encoding declaration and any BOM, of any file that is not UTF-16 encoded, this is handy to pass to the LoadXml.

    Private Sub DOMxml()

    Dim docIn As MSXML2.DOMDocument

    Dim strPath As String

    Dim lpBuffer() As Byte

    strPath = GetPath(CurrentDb.Name)

    Set docIn = New MSXML2.DOMDocument

    docIn.async = False

    docIn.Load strPath & encUTF16.xml

    Debug.Print docIn.xml

    lpBuffer = docIn.xml

    Debug.Print 0x & Right$(00 & Hex(lpBuffer(0)), 2) _

    & Right$(00 & Hex(lpBuffer(1)), 2) _

    & Right$(00 & Hex(lpBuffer(2)), 2) _

    & Right$(00 & Hex(lpBuffer(3)), 2)

    End Sub

    Looking at the first 4 bytes of this string show us that indeed it’s encoding was changed to UTF-16.

  4. transformNode

    The transformNode Method takes a node and is children’s and transforms them according to the supplied XSLT stylesheet given.

    This will always return a string encoded as UTF-16 and will add the Encoding declaration with UTF-16 too, that way we doesn’t observe what was specified in the xsl:output element of the XSLT stylesheet.

    Sub DOMTransformNode()

    Dim docXML As MSXML2.DOMDocument

    Dim docXSL As MSXML2.DOMDocument

    Dim docOut As MSXML2.DOMDocument

    Dim strPath As String

    strPath = GetPath(CurrentDb.Name)

    Set docXML = New MSXML2.DOMDocument

    docXML.async = False

    docXML.Load strPath & encUTF8.xml

    Set docXSL = New MSXML2.DOMDocument

    docXSL.async = False

    docXSL.Load strPath & output_UTF8.xsl

    MsgBox docXML.transformNode(docXSL)

    End Sub

  5. transformNodeToObject

This does method does almost the same thing as transformNode, but it redirects the output to a object that is passed as the second parameter, in the topic of this text, I’ve to say that it’s much better, since transformNodeToObject, does respect the xsl:output element and it’s encoding attribute. So you can pass a UTF-8 file and convert it to UTF-16 and vice-versa.

Sub DOMTransformNodeToObject()

Dim docXML As MSXML2.DOMDocument

Dim docXSL As MSXML2.DOMDocument

Dim docOut As MSXML2.DOMDocument

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set docXML = New MSXML2.DOMDocument

docXML.async = False

docXML.Load strPath & encUTF16.xml

Set docXSL = New MSXML2.DOMDocument

docXSL.async = False

docXSL.Load strPath & output_UTF8.xsl

Set docOut = New MSXML2.DOMDocument

Call docXML.transformNodeToObject(docXSL, docOut)

docOut.Save strPath & test.xml

End Sub

IXSLProcessor
  1. transform / output

Another way to transform a xml file with a xslt stylesheet is using the IXSLProcessor. The transform method it’s the way to start the transformation, so what we are really interrested in here is how the output property works. This property can lead to different results depending in the way that you read it’s content, MSXML 3 SDK gives a brief note where they state that if you read sequentially from the output property then the output generated is always UTF-16 encoded and the xsl:output element is ignored.

Sub TranfWithIXSLProcessor()

Dim xtp As MSXML2.XSLTemplate

Dim xpr As MSXML2.IXSLProcessor

Dim ftdXSL As MSXML2.FreeThreadedDOMDocument

Dim docXML As MSXML2.DOMDocument

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set docXML = New MSXML2.DOMDocument

docXML.async = False

docXML.Load strPath & encUTF8.xml

Set ftdXSL = New MSXML2.FreeThreadedDOMDocument

ftdXSL.async = False

ftdXSL.Load strPath & Output_UTF8.xsl

Set xtp = New MSXML2.XSLTemplate

Set xtp.stylesheet = ftdXSL

Set xpr = xtp.createProcessor()

xpr.input = docXML

xpr.transform

MsgBox xpr.output

End Sub

So the result will be UTF-16 encoded, and a encoding declaration is added too.

BUT, you can redirect your output to a object, for instance a ADOStream and that way the transformation will respect the xsl:output element and is encoding attribute.

Sub TranfWithIXSLProcessorToStream_v2()

Dim xtp As MSXML2.XSLTemplate

Dim xpr As MSXML2.IXSLProcessor

Dim ftdXSL As MSXML2.FreeThreadedDOMDocument

Dim docXML As MSXML2.DOMDocument

Dim stmOut As ADODB.stream

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set docXML = New MSXML2.DOMDocument

docXML.async = False

docXML.Load strPath & encUTF16.xml

Set ftdXSL = New MSXML2.FreeThreadedDOMDocument

ftdXSL.async = False

ftdXSL.Load strPath & Output_UTF8.xsl

Set xtp = New MSXML2.XSLTemplate

Set xtp.stylesheet = ftdXSL

Set xpr = xtp.createProcessor()

xpr.input = docXML

Set stmOut = New ADODB.stream

stmOut.Open

stmOut.Charset = UTF-8

xpr.output = stmOut

xpr.transform

stmOut.Position = 0

MsgBox stmOut.ReadText

End Sub

MXXMLWriter and SAXXMLReader

MXXMLWriter was a nice variety of properties that let you control how the output is going to be, beware that this large flexibility, can bring you unusable files.

The first thing to understand is the output, by default this will be redirected to a string, but you can specify any object that supports the IStream interface. When using this from VB you must understand the the strings in VB are always UTF-16 encoded, so if you want your file encoded in a diferent way you will have to go with the IStream interface. So if you have the output redirected to a string the folowing properties will be ignored (byteOrderMark and encoding).

I will be using the ADO Stream, to show you this properties.

OmitXMLDeclaration - Control if the Text Declaration is outputted or not.

ByteOrderMark - Control if the BOM is outputted or not.

encoding - Defines the output encoding.

Private Sub SaxReaderWriter()

Dim rdr As New MSXML2.SAXXMLReader

Dim wrt As New MSXML2.MXXMLWriter

Dim stm As ADODB.stream

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set rdr.contentHandler = wrt

wrt.omitXMLDeclaration = False

wrt.byteOrderMark = True

wrt.encoding = UTF-8

Set stm = New ADODB.stream

stm.Open

'You need to set the Stream Chartset, explain below

stm.Charset = UTF-8

wrt.output = stm

rdr.parseURL strPath & encUTF16.xml

wrt.Flush

stm.SaveToFile strPath & test.xml, adSaveCreateOverWrite

End Sub

On the SAXXMLReader you can obtain this properties by using the getProperty and you can insert them by using the putProperty.

ADO 2.6 Recordset
  1. save (adPersistXML)

ADO uses UTF-8 when saving a file using the adPersistXML.

Sub PersistADOtoXML()

Dim cnn As ADODB.Connection

Dim rst As ADODB.Recordset

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set cnn = New ADODB.Connection

cnn.Open Provider=Microsoft.Jet.OLEDB.4.0; & _

Data Source= & CurrentDb.Name & ;

Set rst = New ADODB.Recordset

rst.Open SELECT * FROM tblTest, cnn

If Len(Dir$(strPath & test.xml)) > 0 Then

Kill strPath & test.xml

End If

rst.Save strPath & test.xml, adPersistXML

End Sub

Stream
  1. LoadFromFile / SaveToFile

    LoadFromFile method loads the content of a file in the stream, you can load any encoding you want, but by default when you open a stream object it’s default encoding is UTF-16. so if you wish to load a file in another encoding you must change the charset of the stream.

    Sub ReadSaveFileInUTF8()

    Dim stm As ADODB.stream

    Dim strPath As String

    strPath = GetPath(CurrentDb.Name)

    Set stm = New ADODB.stream

    stm.Open

    stm.Charset = UTF-8

    stm.LoadFromFile strPath & encUTF8.xml

    stm.SaveToFile strPath & test.xml, adSaveCreateOverWrite

    End Sub

  2. Charset

The default value for the charset of a stream is UTF-16, and you can set this property to tell the stream that it’s receiving a that type of encoded text, I’ve used this in the previous sample to indicate that I was loading a UTF-8 encoded file.

One of the other properties that the documentation states is that this property can do the translation from one encoding form to another, well in some samples that I’ve made this can cause you serious troubles, and leave you with a useless file. For instance let’s say that you want to convert a UTF-16 file to UTF-8, you could use some code like this

Sub TranlateToUTF8()

'3 charset snippet

'ERROR Don't Use this code, jut to expose the error

Dim stm As ADODB.stream

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set stm = New ADODB.stream

stm.Open

stm.LoadFromFile strPath & encUTF16.xml

stm.Charset = UTF-8

stm.SaveToFile strPath & test.xml, adSaveCreateOverWrite

End Sub

Well this will really damage your file, besides de fact that you now habe two BOM’s on for UTF-8 0xEF BB BF and the one that allready was there for UTF-16 0xFF FE, and the file is still two bytes (UTF-16 encoded).

Mixing both the recordset object and the stream object, would by KB Article http://support.microsoft.com/support/kb/articles/Q259/5/55.ASP , lead to a problem, and that’s understandable explanation on that KB article, but with the version of ADO (2.60.6526.0) that I’ve I couldn’t reach the problem described on the text.

Sub PersistStreamToXML()

Dim cnn As ADODB.Connection

Dim rst As ADODB.Recordset

Dim stm As ADODB.stream

Dim strPath As String

strPath = GetPath(CurrentDb.Name)

Set cnn = New ADODB.Connection

cnn.Open Provider=Microsoft.Jet.OLEDB.4.0; & _

Data Source= & CurrentDb.Name & ;

Set rst = New ADODB.Recordset

rst.Open SELECT * FROM tblTest, cnn

Set stm = New ADODB.stream

rst.Save stm, adPersistXML

rst.Close

Set rst = Nothing

MsgBox stm.Charset

stm.SaveToFile strPath & test.xml, adSaveCreateOverWrite

Set stm = Nothing

End Sub

This output a perfect UTF-16 encoded file…

Final Comment

Well I could go forever with this part of the article, since there are so many objects and interfaces, that is impossible to cover them all in just one article, I've tried to talk about the more obvious ones, that at least I'm using. Also I made a deliberated choice of only describe the MSXML parser, other will behave differently, it's up to you to be advise to read their documentation, and hope no to be bitten by encoding problem. I'm aware that it was just a few words on the subject as the titles of this state.

RESOURCES:

From all of them I would like to point out Michael Kaplan Book, if you don't have it IMHO it's a very helpful resource, with the i18n subject

http://www.i18nwithvb.com/

http://www.amazon.com/exec/obidos/ASIN/0672319772/002-9388292-4866449

A good article XML International in XML Developer (October 2000), by Michael Kaplan.

http://www.xmldevelopernewsletter.com

I've also found the article by Chris Lovett in MSDN a good resource.

http://msdn.microsoft.com/library/en-us/dnxml/html/xmlencodings.asp

MSXML 3 SDK download it at

http://download.microsoft.com/download/xml/SDK/3.0/WIN98Me/EN-US/xmlsdk.exe

KB Article http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP

Fundamental for Specifications

http://www.unicode.org/

http://www.w3.org/


Rate this article on a scale of 1 to 10 (1 votes, average 5)

Your vote :  

<< SEOSystem.XML >>





Leave a comment for this article
Your name
Your email (optional)
Your comment
Optional: Upload an attachment
Enter the code shown:

 
 

    Email TopXML