Unicode string utility functions

xavier.batlle
VBA string functions such as Len(), Right(), InStr(), etc. don't work as expected with Unicode strings that contain code points above 65535, so I developed some string functions to manage this kind of string.

I know that Unicode characters above 65535 are rarely used, but I think it's important to know how to deal with them, just in case you need to.
Note:
Some operating systems and some fonts don't show some of these characters as expected!
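A quick way to see the problem (a hypothetical test for the Immediate window; the emoji is built from its two UTF-16 surrogate halves with ChrW because the VBA editor can't type it directly):

Code:
Sub DemoSupplementaryCharacter()
    Dim s As String
    ' U+1F600 (grinning face) is stored in a VBA string as a surrogate pair:
    ' the code units &HD83D followed by &HDE00.
    s = ChrW(&HD83D&) & ChrW(&HDE00&)
    Debug.Print Len(s)    ' prints 2, although it is a single visible character
    Debug.Print AscW(s)   ' prints -10179 (&HD83D as a signed Integer), not &H1F600
End Sub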

I've never used Unicode. What are composite characters?
 
I asked chatty to explain the problem and this is what chatty said:-

The problem highlighted in the image and accompanying text concerns how VBA (Visual Basic for Applications) handles Unicode strings, specifically those that contain characters with code points above 65,535 (which includes certain emojis, rare symbols, and some Asian characters).

Problem Explanation:
VBA Limitations with Unicode: Standard VBA string functions like Len(), Right(), InStr(), etc., are not designed to properly handle Unicode characters with code points above 65,535. These characters live in Unicode's supplementary planes (the Supplementary Multilingual Plane and beyond) and are stored in VBA's UTF-16 strings as surrogate pairs, i.e. two 16-bit code units per character.
Because VBA's built-in functions count and index those 16-bit code units rather than whole characters, Len() can report a length that is too large, and InStr() can return positions that fall in the middle of a character.

Incorrect Handling of Special Characters:
The issue arises because standard VBA functions treat each 16-bit code unit as a character, so these high code point characters can be split in half or counted as two characters. This can lead to bugs, especially if your code relies on precise string manipulation or if you're working with complex scripts or emoji.
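A concrete illustration of that splitting (again a hypothetical example, with the emoji assembled from its surrogate halves):

Code:
Sub DemoSplitSurrogatePair()
    Dim s As String, part As String
    s = "A" & ChrW(&HD83D&) & ChrW(&HDE00&) & "B"   ' "A", U+1F600, "B"
    Debug.Print Len(s)      ' 4 code units, but only 3 visible characters
    part = Left$(s, 2)      ' "A" plus only half of the emoji
    Debug.Print part        ' ends in an unpaired surrogate, usually shown as a replacement character
End Sub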

Solution:
To address this, the author of the text created custom Unicode string functions that correctly handle these characters.

These functions include:
LenUnicode(): Correctly calculates the length of a string, accounting for characters with code points above 65,535.

LeftUnicode(), RightUnicode(), MidUnicode(): These functions correctly extract portions of strings, taking into account high Unicode code points.

InstrUnicode(), InstrRevUnicode():
These functions correctly search for substrings within a string, recognizing characters that standard VBA functions might mishandle.

Key Points:
ContainsUnicodeAbove65K(): This function checks if the string contains any characters with code points above 65,535.

ContainsUnicodeVariationSelectors(): Checks for specific Unicode characters used to modify the appearance of preceding characters.
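I haven't looked at Xavier's actual code, so the following is only a rough sketch of the underlying idea, assuming well-formed UTF-16 strings and surrogate-pair detection; the function names here are placeholders of my own, not the author's:

Code:
' Length in Unicode code points rather than UTF-16 code units.
Public Function LenCodePoints(ByVal s As String) As Long
    Dim i As Long, w As Long, n As Long
    i = 1
    Do While i <= Len(s)
        w = AscW(Mid$(s, i, 1)) And &HFFFF&     ' AscW is signed; mask to 0-65535
        If w >= &HD800& And w <= &HDBFF& Then
            i = i + 2       ' high surrogate: this code point occupies two code units
        Else
            i = i + 1
        End If
        n = n + 1
    Loop
    LenCodePoints = n
End Function

' True if the string contains any code point above &HFFFF
' (i.e. any surrogate pair, assuming the string is well formed).
Public Function HasSupplementaryChars(ByVal s As String) As Boolean
    Dim i As Long, w As Long
    For i = 1 To Len(s)
        w = AscW(Mid$(s, i, 1)) And &HFFFF&
        If w >= &HD800& And w <= &HDBFF& Then
            HasSupplementaryChars = True
            Exit Function
        End If
    Next i
End Function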

Why This Matters:
Even though these high code point Unicode characters are rare, they're increasingly used in modern text processing (e.g., emojis, special symbols). Ensuring that your VBA applications can handle them properly is important for creating robust and future-proof software.

In summary,
the custom functions provided by the author are designed to work around the limitations of standard VBA string functions, ensuring that all Unicode characters are handled correctly, regardless of their code point.

Please note:-
Someone who understands this issue should read through chatty's notes and point out any errors.
 
I've never used Unicode. What are composite characters?
Perhaps I should have used the nomenclature "decomposed form of a character" instead of "composite character": the same visible character can be stored either as one precomposed code point or as a base character followed by combining marks.
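For example (purely illustrative):

Code:
Sub DemoDecomposedForm()
    Dim precomposed As String, decomposed As String
    precomposed = ChrW(&HE9&)           ' U+00E9: "e with acute" as one code point
    decomposed = "e" & ChrW(&H301&)     ' U+0065 followed by U+0301 combining acute accent
    Debug.Print Len(precomposed), Len(decomposed)                 ' 1 and 2
    Debug.Print StrComp(precomposed, decomposed, vbBinaryCompare) ' not 0: different code unit sequences
End Sub

Both usually render as the same accented "e", but they are different sequences of code units, which is why naive comparisons and length counts can surprise you.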

Some references (I can't post links) :
(screenshots of the referenced pages attached)
I will add that if you run into files that are UTF-n encoded (n = 8, 16, and maybe others I haven't run across), you will sometimes run into the type of characters being described. You CAN ask Notepad to save UTF-n files as ANSI text but it doesn't always do very well at it.

I first ran into UTF-8 with my genealogy database when Ancestry.COM changed their genealogy downloads (GEDCOM format) from ANSI to UTF-8 encoding. Didn't have Xavier's routines at the time - a few years ago - so I had to roll my own way of handling it. Solved it by turning ALL extended characters into something I would treat as a stand-alone non-printing control character and then my semantics parser could handle it.
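Not the approach described above, but for anyone hitting the same problem today: a common way to read a UTF-8 file into a proper VBA Unicode string is an ADODB.Stream (late bound here; the function name and usage are just an illustrative sketch):

Code:
' Read a UTF-8 encoded text file into a VBA string.
Public Function ReadUtf8File(ByVal filePath As String) As String
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")
    stm.Type = 2            ' adTypeText
    stm.Charset = "utf-8"
    stm.Open
    stm.LoadFromFile filePath
    ReadUtf8File = stm.ReadText
    stm.Close
End Function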
 
You CAN ask Notepad to save UTF-n files as ANSI text but it doesn't always do very well at it.
For context: A Unicode-encoded file may contain non-ASCII characters from several different ANSI codepages (for example both a Western European "é" and a Cyrillic "Ж"). It is then simply impossible to save that file using just one ANSI codepage.
Even if all non-ASCII characters of a file can be represented correctly in a single ANSI codepage, it is beyond the capabilities of simple algorithms to determine the correct codepage with absolute certainty.
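A small illustration of that loss (the result depends on your system's ANSI codepage, so treat this as a sketch): StrConv with vbFromUnicode converts to the system ANSI codepage, and characters that codepage cannot represent are replaced, typically with "?".

Code:
Sub DemoAnsiLoss()
    Dim s As String
    Dim b() As Byte
    s = ChrW(&HE9&) & ChrW(&H416&)    ' U+00E9 (e acute) and U+0416 (Cyrillic Zhe)
    b = StrConv(s, vbFromUnicode)     ' converted using the system ANSI codepage
    ' On a Windows-1252 system the accented e survives but the Cyrillic letter becomes "?":
    Debug.Print StrConv(b, vbUnicode)
End Sub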
 
