
Apple File Systems II

More to things than meets the 'i'?


Linus Torvalds' casual remarks about Apple's file system caused a bit of an uproar in the 'Mac community'. Considering he had to be talking about the egregious 'HFS' family of file systems, it's hardly surprising. But there can be more to things than meets the 'i', for it turns out that git, Linus' source code control system, trips over Apple's method of storing Unicode. Most of what's discussed refers back to an Apple 'technical note', TN1150, from 5 March 2004 describing the fundamentals of their file systems (http://developer.apple.com/technotes/tn/tn1150.html). The difficulties with Apple stem from the different Unicode decomposition functions in use.

I did some research on this point, since if we really are going to be compatible with MacOS X's crappy HFS+ system, we need to know what the decomposition algorithm actually is. Turns out, there are *two* of them. Kevin didn't know what he was talking about. In fact, different versions of Mac OS X use different normalization algorithms. Mac OS X [sic] 8.1 through 10.2.x used decompositions based on Unicode 2.1. Mac OS X 10.3 and later use decompositions based on Unicode 3.2.

As Apple's normalisation isn't static there can be compatibility issues moving hard drives around. 'Fun fun fun', says the poster.

The OS X developers were being cheasy [sic] with how they implemented their B-tree algorithms and took the cheap easy way out. So yeah, 'crappy' is the only word that can be used for what OS X perpetuated on the world. A quick Google search shows it causes problems all over the stack for many different programs beyond just git including limewire and gnutella, Slim, and others. http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html http://forums.slimdevices.com/showthread.php?t=40582

Where's the Beef?

In a section of their technical note TN1150 Apple explain a bit more.

Canonical Decomposition
Unicode allows some sequences of characters to be represented by multiple, equivalent forms. For example, the character 'é' can be represented as the single Unicode character u+00E9 (latin small letter e with acute) or as the two Unicode characters u+0065 and u+0301 (the letter 'e' plus a combining acute symbol).
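The two forms of 'é' are easy to see for yourself. What follows is a minimal sketch using Python's standard unicodedata module; the helper name code_points is made up for the illustration, and standard NFD/NFC is only a stand-in for whatever Apple's own tables do.

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as the single code point u+00E9
decomposed = "e\u0301"   # 'e' (u+0065) followed by combining acute (u+0301)

def code_points(s):
    """Hypothetical helper: list the code points of a string as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(precomposed))
print(code_points(decomposed))

# The two strings render the same but are different code point sequences.
print(precomposed == decomposed)

# Canonical decomposition (NFD) and composition (NFC) convert between them.
print(unicodedata.normalize("NFD", precomposed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

A program that compares file names without normalising them first will treat the two forms as different names.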

Here's where it gets REALLY interesting.

To reduce complexity in the B-tree key comparison routines (which have to compare Unicode strings) HFS Plus defines that Unicode strings will be stored in fully decomposed form with composing characters stored in canonical order. The other equivalent forms are illegal in HFS Plus strings. An implementation must convert these equivalent forms to the fully decomposed form before storing the string on disk.
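To see why this bites a tool like git, here is a toy model, not Apple's code and not git's. The class DecomposingVolume and both untracked helpers are hypothetical, and standard NFD is assumed as an approximation of Apple's decomposition.

```python
import unicodedata

class DecomposingVolume:
    """Toy stand-in for a volume that stores every name fully decomposed."""

    def __init__(self):
        self._names = set()

    def create(self, name):
        # Convert to decomposed form before 'storing on disk'.
        self._names.add(unicodedata.normalize("NFD", name))

    def listdir(self):
        return sorted(self._names)

def untracked(volume, tracked):
    """Naive check: compare the names exactly as the volume hands them back."""
    return [n for n in volume.listdir() if n not in tracked]

def untracked_normalized(volume, tracked):
    """Same check, but both sides are decomposed before comparing."""
    wanted = {unicodedata.normalize("NFD", n) for n in tracked}
    return [n for n in volume.listdir()
            if unicodedata.normalize("NFD", n) not in wanted]

vol = DecomposingVolume()
tracked = {"caf\u00e9.txt"}          # the name as the program recorded it
for name in tracked:
    vol.create(name)

print(untracked(vol, tracked))             # the file looks like a stranger
print(untracked_normalized(vol, tracked))  # nothing unexpected
```

The name the program wrote is not the name it reads back, so anything that remembers file names has to normalise on the way in and on the way out.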

Finally: a word on the 'Unicode decomposition table'.

The Unicode decomposition table contains a list of characters that are illegal as part of an HFS Plus string and the equivalent character(s) that must be used instead. Any character appearing in a column titled 'illegal' must be replaced by the character(s) in the column immediately to the right (titled 'replace with').
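The 'illegal' / 'replace with' lookup can be sketched as a plain dictionary. The two rows below are illustrative only: the 'é' row comes from the example quoted earlier, the 'À' row is the standard Unicode canonical decomposition and is merely assumed to appear in Apple's table. The real table is far longer, and the names ILLEGAL_TO_REPLACEMENT and to_hfs_plus_form are invented for this sketch.

```python
# Hypothetical two-row excerpt of an 'illegal' -> 'replace with' table.
ILLEGAL_TO_REPLACEMENT = {
    "\u00e9": "e\u0301",   # é -> e + combining acute
    "\u00c0": "A\u0300",   # À -> A + combining grave (assumed row)
}

def to_hfs_plus_form(name, table=ILLEGAL_TO_REPLACEMENT):
    """Replace every 'illegal' character with its 'replace with' sequence.

    Sketch only: it does not reorder combining characters into canonical
    order, which the technical note also requires.
    """
    return "".join(table.get(ch, ch) for ch in name)

print([hex(ord(c)) for c in to_hfs_plus_form("\u00e9")])
```

Characters without a row in the table pass through untouched, which is why the result depends entirely on which version of the table is in use.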

So basically you have to parse file names on the way in and the way out and your sorting gets screwed up. Great. But we also change how we sort, compose, decompose, and so forth. Because Unicode itself changes over time.

Note:
Mac OS versions 8.1 through 10.2.x used decompositions based on Unicode 2.1. Mac OS X version 10.3 and later use decompositions based on Unicode 3.2. Most of the characters whose decomposition changed are not used by any Mac encoding so they are unlikely to occur on an HFS Plus volume. The MacGreek encoding had the largest number of decomposition changes.

The Unicode decomposition table mentioned above indicates which decompositions were added, removed, or changed between Unicode 2.1 and Unicode 3.2.
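The principle that decomposition depends on the Unicode version can be tried out in Python, whose unicodedata module ships a frozen Unicode 3.2 database (unicodedata.ucd_3_2_0) next to the current one. This is only a sketch of the idea: Python has no Unicode 2.1 tables, so it can't reproduce the actual 2.1 to 3.2 change, and standard NFD is assumed rather than Apple's table. The function names are made up.

```python
import unicodedata

def nfd_by_version(text):
    """Decompose text with the Unicode 3.2 tables and with the current ones."""
    return {
        unicodedata.ucd_3_2_0.unidata_version:
            unicodedata.ucd_3_2_0.normalize("NFD", text),
        unicodedata.unidata_version:
            unicodedata.normalize("NFD", text),
    }

def decomposition_differs(text):
    """True if the two table versions disagree on the decomposed form."""
    return len(set(nfd_by_version(text).values())) > 1

# 'é' decomposes the same way under both versions.
print(decomposition_differs("\u00e9"))
```

A name for which decomposition_differs returns True would be stored one way by a system using the older tables and another way by a system using the newer ones, which is the 'moving hard drives around' problem in miniature.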

String Comparisons

String comparisons can be fun too with HFS file systems.

Case-Insensitive String Comparison Algorithm
In HFS Plus and case-insensitive HFSX strings must be compared in a case-insensitive fashion. The Unicode standard does not strictly define upper and lower case equivalence although it does suggest some equivalences. The HFS Plus string comparison algorithm (defined below) includes a concrete case equivalence definition. An implementation must use the equivalence expressed by this algorithm.

Furthermore Unicode requires that certain formatting characters be ignored (skipped over) during string comparisons. The algorithm and tables used for case equivalence also arrange to ignore these characters. An implementation must ignore the characters that are ignored by this algorithm.

Note:
Case-sensitive HFSX volumes do not ignore the Unicode ignorable characters. Those characters are significant for the purposes of name comparison on case-sensitive HFSX.
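The difference between the two comparison modes can be sketched in a few lines. This is not Apple's algorithm: str.lower() stands in for their case table, and the IGNORABLE set is an assumed, partial list of formatting characters chosen for the illustration. All names here are hypothetical.

```python
# Assumption: a small illustrative subset of ignorable formatting characters
# (zero width non-joiner, zero width joiner, zero width no-break space).
IGNORABLE = {"\u200c", "\u200d", "\ufeff"}

def insensitive_key(name):
    """Key for HFS Plus style comparison: fold case, skip ignorables."""
    return "".join(ch.lower() for ch in name if ch not in IGNORABLE)

def same_name(a, b, case_sensitive=False):
    """Compare two names the way each kind of volume would (roughly)."""
    if case_sensitive:
        # Case-sensitive HFSX: ignorable characters are significant.
        return a == b
    return insensitive_key(a) == insensitive_key(b)

print(same_name("ReadMe", "README"))
print(same_name("read\u200cme", "readme"))
print(same_name("read\u200cme", "readme", case_sensitive=True))
```

Two names that are one and the same file on an ordinary HFS Plus volume can be two separate files on a case-sensitive HFSX volume.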

git maintainers need special snippets of code to deal with Apple. They're thinking of controlling all this from a configuration file. The code, still to be written at the time of the KT posting, has to determine whether it's running on an HFS partition and, if so, take special 'precautions'. It can't simply check which OS is being run, as Apple have today opened the door to more reasonable alternatives such as ZFS. But since git is the brainchild of Linus himself, and he's still very much a part of the project, it's not surprising he reacts to Apple's typical 'go our own way' methods with less than unbridled enthusiasm.
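One way to test the partition rather than the OS is to ask the file system directly: create a file with a precomposed name and see what name comes back. The sketch below is only an illustration of that idea, not git's code; the function name filesystem_rewrites_names and the probe file name are invented.

```python
import os
import tempfile
import unicodedata

def filesystem_rewrites_names(directory):
    """Return True if the file system hands back a different form of a name.

    Creates a probe file whose name contains a precomposed 'é', lists the
    directory, and checks whether the stored name still matches exactly.
    """
    probe = "probe_\u00e9"                     # precomposed form
    path = os.path.join(directory, probe)
    with open(path, "w"):
        pass
    try:
        # Find the probe by normalised name, whatever form it was stored in.
        listed = [n for n in os.listdir(directory)
                  if unicodedata.normalize("NFC", n) == probe]
        return bool(listed) and probe not in listed
    finally:
        os.remove(path)

with tempfile.TemporaryDirectory() as scratch:
    print(filesystem_rewrites_names(scratch))
```

The result of a probe like this could then be recorded in the configuration file so the 'precautions' are only taken where they're needed.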