Two problems with String::formatted (one a buffer overflow!)

Discussion and support for general JUCE issues

Two problems with String::formatted (one a buffer overflow!)

Postby TomSwirly » Fri Apr 13, 2012 9:41 pm

I know that Jules doesn't like String::formatted - but I have an internationalized(*) application which has a lot of strings like: "Unable to open file %s with error code %s" and there really isn't another way to do it.

First, there's a buffer overflow. The buffer used is 256 bytes long - if the results of String::formatted are greater than 256 bytes, it simply writes into "random memory". This 256 character limitation isn't documented, and as you know, buffer overflows are very dangerous...

So far, I haven't hit this limitation - I found it while debugging the next problem - but since I use String::formatted to make fairly long error messages that include full file paths it is a matter of certainty that if enough people use my program, someone will get an error with a file with a very long path and I'll have something completely unpredictable happen.

My second issue is that String::formatted seemed to work fine on the Mac to print strings, but I got non UTF-8 ("garbage") characters generated on the PC.

After some debugging, I narrowed it down to the fact that Juce's String::formatted only seems to accept wide characters on the PC, and only narrow characters on the Mac!

The following code works fine, but requires different actual calls on Mac and on PC...
Code: Select all
String MAIL_SUBJECT("Support Request: %s");
String title = "Some title here";

const char* narrow = s.toUTF8().getAddress();
const wchar_t* wide = s.toWideCharPointer();

String res =
#if JUCE_WINDOWS
    String::formatted(MAIL_SUBJECT, wide);
#else
    String::formatted(MAIL_SUBJECT, narrow);
#endif
The code sample works fine on both Mac and PC. If I change the condition to !JUCE_WINDOWS, it works incorrectly on both Mac and PC, so I can't use just one of the two alternatives...

This doesn't seem right. Is there a better way to do this, or am I missing something obvious?

(* - admittedly, we haven't prepared a translation yet, but it's all set up to do so...)
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby jules » Fri Apr 13, 2012 10:04 pm

Damn, I hate that function.
User avatar
jules
Fearless Leader
 
Posts: 17189
Joined: Mon Sep 06, 2004 9:03 am
Location: London, UK

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Fri Apr 13, 2012 10:56 pm

:-)

I knew that. But there really isn't an alternative for internationalized applications. You need SOME sort of templated output function - it's not just that word order changes from language to language, but if you try to split your messages up into tiny substrings and then put them together, it's almost impossible from the translator to do it correctly.

The only alternative is to use some full-scale templating language like Clearsilver - but that's a really heavy hammer for a problem that printf and its numerous inbred cousins do quite well.

I don't know how to fix the wide/narrow character problem, but the simple solution to the buffer overflow is to create another version where the first argument is a length, and then deprecate the original one.

Confess - you love C++, with all its warts. This is one of them - you should learn to love its expediency and its history, and accept its gnarliness.
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby jfitzpat » Sat Apr 14, 2012 5:02 pm

Really, formatted (like sprintf) kind of sucks for localization too.

Sure, you can take "%d nuns dancing on the head of %d pins" and translate it to some other western languages, but the order of the items inserted is fixed. That makes for really convoluted grammar in some languages depending on the subject. You generally want something like tokens: "%1 nuns dancing on the head of %2 pins". That way if the language is most natural with something like "On pins of %2, %1 nuns gyrate...", you can do it.

juce::String already has all the members you'd need to put together a pretty spiffy localized parameter string class.
User avatar
jfitzpat
JUCE UberWeenie
 
Posts: 251
Joined: Tue Jan 10, 2012 6:29 am
Location: Glendale, California

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Sat Apr 14, 2012 5:19 pm

Yes, absolutely good point. You absolutely can't be sure that the order of terms is the same. I'm planning to have six languages - English, French, German, Indonesian, Spanish, Italian - because I know the first five fairly well and the last one is dead easy - and in these languages the order of nouns is basically the same - but Japanese would be very important and I'm fairly sure that its noun order can be different.

Hmm... makes me think here a bit. In particular, you only need to have the equivalent of %s - because for these messages, numbers are rare, and because you can just pre-format them as strings. So all you really need is %1, %2, %3 and nothing else.

Didn't I send Jules something like this about a year ago?! But I can't find it.

I might whip something out this afternoon...

(* - that is, me)
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Sat Apr 14, 2012 5:42 pm

Well, interesting - I ran into a limitation of juce::String that's preventing me from doing a really good job on this.

The issue is that there's no efficient way to set a single character in a string! operator[] returns a const juce_wchar and there seems to be no setter method - so building a string by adding one character at a time is potentially quadratic in time, even if you have a good maximum bound in advance as to the length of the string.

Such a setter (not necessarily operator[]) should be added to juce::String. For parsing purposes, you often need to do this...
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby TheVinn » Sat Apr 14, 2012 5:55 pm

TomSwirly wrote:Such a setter (not necessarily operator[]) should be added to juce::String. For parsing purposes, you often need to do this...


That's crazy talk... if the underlying juce::String is UTF8 or UTF16 encoded then there is no 1:1 mapping between logical characters and physical positions in the memory block used to store the string. Attempting to discover the physical index of a logical character would run in O(N) where N ~= logical index, and then actually changing the character would be either O(1) or O(N) where N ~= logical index depending on the difference between the original code point and the new code point.

However, a manly way to resolve this would be to provide a non-const operator[] ONLY for a juce::String which uses UTF32 (since there is a 1:1 mapping). I believe you can do this yourself by calling toUTF32(), removing the const, and doing the work using UTF32 code points. Then you could convert it back I suppose, and wrap this all up in a nice interface that hides the mess.
Open Source: LayerEffects, VFLib, SimpleDJ, DSP Filters, LuaBridge, JUCE, FreeType, TagLib
"This isn't a big project, it shouldn't take long." - Jules
User avatar
TheVinn
JUCE UberWeenie
 
Posts: 2975
Joined: Sat Aug 29, 2009 11:31 am
Location: Marina del Rey, California

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Sat Apr 14, 2012 6:47 pm

> if the underlying juce::String is UTF8 or UTF16 encoded

DOH! I should have realized that on my own, having done an awful lot with UTF-8 (ultra-quibble - note that there's a dash in the official name).

As an aside, I have zero understanding of why anyone would use UTF-16 - it seems to have the worst of all worlds, not being backward compatible with ASCII, not having a predictable character size, and being twice as long as UTF-8 for coding "plain old ASCII" strings.
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby TheVinn » Sat Apr 14, 2012 7:07 pm

TomSwirly wrote:I have zero understanding of why anyone would use UTF-16


It's a great choice if you want to call the Unicode version of the Win32 API functions.

In fact, it's the only choice.
Open Source: LayerEffects, VFLib, SimpleDJ, DSP Filters, LuaBridge, JUCE, FreeType, TagLib
"This isn't a big project, it shouldn't take long." - Jules
User avatar
TheVinn
JUCE UberWeenie
 
Posts: 2975
Joined: Sat Aug 29, 2009 11:31 am
Location: Marina del Rey, California

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Sat Apr 14, 2012 8:15 pm

:-( Quite so. Yes, I vaguely knew that, but that doesn't mean it's rational!

Perhaps I should have spoken slightly differently and said, "What was going through the mind of the people who invented UTF-16 is beyond me."
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby jfitzpat » Sat Apr 14, 2012 8:18 pm

TomSwirly wrote:but Japanese would be very important and I'm fairly sure that its noun order can be different.


Yes, you can see it for yourself with something like Google Translate. The # pins would be first in Japanese, and Hebrew too. German would be something like 10 Nonnen tanzen... in this case, but I've run into grammatical problems with technical phrases before.
User avatar
jfitzpat
JUCE UberWeenie
 
Posts: 251
Joined: Tue Jan 10, 2012 6:29 am
Location: Glendale, California

Re: Two problems with String::formatted (one a buffer overfl

Postby jules » Sat Apr 14, 2012 8:22 pm

What was going through the mind of the people who invented UTF-16 is beyond me.


If microsoft hadn't used it for their entire win32 API then probably nobody else would ever have bothered with it. I bet there's a MS employee somewhere who really regrets making that decision..
User avatar
jules
Fearless Leader
 
Posts: 17189
Joined: Mon Sep 06, 2004 9:03 am
Location: London, UK

Re: Two problems with String::formatted (one a buffer overfl

Postby dave96 » Sat Apr 14, 2012 8:37 pm

TomSwirly wrote:there's no efficient way to set a single character in a string!

Not sure about how efficient it is but doesn't String:: replaceSection do what you want?
dave96
JUCE UberWeenie
 
Posts: 447
Joined: Sat Dec 27, 2008 9:29 pm

Re: Two problems with String::formatted (one a buffer overfl

Postby TheVinn » Sat Apr 14, 2012 8:45 pm

dave96 wrote:doesn't String:: replaceSection do what you want?


It sure does but he's looking for super-linear performance (as would I). Look at replaceSection:

Code: Select all
   while (i < index) {
      //...
      ++insertPoint;
      ++i;
   }


Just like I predicted, it is O(N) with N ~= insert index. And the rest of the function has to do the dirty work of adding up the size of each of some of the remaining code points, then a bunch of memory shuffles.
Open Source: LayerEffects, VFLib, SimpleDJ, DSP Filters, LuaBridge, JUCE, FreeType, TagLib
"This isn't a big project, it shouldn't take long." - Jules
User avatar
TheVinn
JUCE UberWeenie
 
Posts: 2975
Joined: Sat Aug 29, 2009 11:31 am
Location: Marina del Rey, California

Re: Two problems with String::formatted (one a buffer overfl

Postby TomSwirly » Sat Apr 14, 2012 8:54 pm

TheVinn gets it exactly.

I frankly shouldn't worry for my own code, because my strings are a) short b) don't have too many segments.

But a lot of the naïve ways to build strings are quadratic in the number of pieces or the length of the string - something that bites you down the road when you run your code on an entire document and wonder why it never comes back...
User avatar
TomSwirly
JUCE UberWeenie
 
Posts: 569
Joined: Mon Feb 08, 2010 11:19 pm

Next

Return to General JUCE discussion

Who is online

Users browsing this forum: Google Feedfetcher and 2 guests