rsync and different UTF normalization in APFS vs HFS+

rsync and different UTF normalization in APFS vs HFS+

Ces VLC
Hi!

Some of my disks are HFS+ and others are APFS. I've been using rsync for years to sync some folders across all my disks. There were no problems until APFS entered the game.

Now, with filenames that contain international UTF characters, I often hit the problem of rsync deleting a file and then rewriting it, just because the UTF normalization is not the same on both disks. Other users have been reporting this (see here for example: https://superuser.com/questions/1513326/rsync-from-mac-os-to-synology-with-btrfs-having-issues-with-file-and-directories ).

For over a year I've been tolerating this because I considered it non-critical, but I feel I should fix it. However, I haven't found any posted solution that addresses this in a convenient and proper way.

People suggest using the --iconv flag, but... does this mean that you need to use different iconv settings depending on whether your transfer is APFS->HFS+ or HFS+->APFS? If so, it would be a bit clumsy, IMHO (first detect the disk's filesystem, then choose the proper flags).

Isn't there a more convenient way of dealing with this, one that doesn't require checking the disk's filesystem before invoking rsync?

Thanks!

César

Re: rsync and different UTF normalization in APFS vs HFS+

ryandesign2
Administrator
On Jul 3, 2020, at 04:53, Ces VLC wrote:

> Some of my disks are HFS+ and others are APFS. I've been using rsync for years to sync some folders across all my disks. There were no problems until APFS entered the game.
>
> Now, with filenames that contain international UTF characters, I often hit the problem of rsync deleting a file and then rewriting it, just because the UTF normalization is not the same on both disks. Other users have been reporting this (see here for example: https://superuser.com/questions/1513326/rsync-from-mac-os-to-synology-with-btrfs-having-issues-with-file-and-directories ).
>
> For over a year I've been tolerating this because I considered it non-critical, but I feel I should fix it. However, I haven't found any posted solution that addresses this in a convenient and proper way.
>
> People suggest using the --iconv flag, but... does this mean that you need to use different iconv settings depending on whether your transfer is APFS->HFS+ or HFS+->APFS? If so, it would be a bit clumsy, IMHO (first detect the disk's filesystem, then choose the proper flags).
>
> Isn't there a more convenient way of dealing with this, one that doesn't require checking the disk's filesystem before invoking rsync?

The issue I'm familiar with is that there can be several valid ways to represent certain strings of UTF-8 characters. (Characters made up of several marks can be written in composed or decomposed form.) The designers of HFS+ picked one of those representations as the "correct" one and normalize such strings to that form when writing filenames to disk. HFS+ was unusual in that regard. Most Linux filesystems did not normalize and instead accepted whatever bytes the program gave them. This could mean that a file created on Linux and moved to an HFS+ Mac ended up with a different sequence of bytes in its filename, even though the characters are the same. (Linux also has the problem that two or more distinct filenames can be created that are different representations of the same characters.) The problem should not happen when moving a file from an HFS+ Mac to a Linux machine, since the Linux filesystem will accept the byte sequence that HFS+ used.
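
To make that concrete, here is a quick, untested shell illustration of the two byte sequences for the same visible character; it assumes xxd is available and an iconv that knows the UTF-8-MAC encoding (GNU libiconv does, and I believe that is what macOS ships):

    printf '\303\205'  | xxd     # precomposed  U+00C5          -> c3 85
    printf 'A\314\212' | xxd     # decomposed   U+0041 U+030A   -> 41 cc 8a
    # UTF-8-MAC applies (roughly) the HFS+-style decomposition when converting:
    printf '\303\205' | iconv -f UTF-8 -t UTF-8-MAC | xxd       # -> 41 cc 8a

Both spellings display as Å, but a byte-for-byte comparison treats them as different names.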

APFS changes things again, so maybe you will now see some similar types of problems when using HFS+ and APFS together, but I couldn't tell you under what conditions or in what way it would manifest or what to do about it. APFS certainly seems more complicated, since the behavior can vary based on which OS version you used to create the APFS volume and whether the volume is case-sensitive or case-insensitive:

https://mjtsai.com/blog/2017/06/27/apfs-native-normalization/

Here's some info direct from Apple, though it is a "retired" document so maybe a newer version is available:

https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html


Re: rsync and different UTF normalization in APFS vs HFS+ (macports-users Digest, Vol 167, Issue 3)

Jim DeLaHunt-2

Ces VLC, Ryan:

On 2020-07-03 04:00, Ryan Schmidt wrote:
> On Jul 3, 2020, at 04:53, Ces VLC wrote:
>
>> …with filenames that contain international UTF characters, I often hit the problem of rsync deleting a file and then rewriting it, just because the UTF normalization is not the same on both disks….
>>
>> …People suggest using the --iconv flag, but... does this mean that you need to use different iconv settings depending on whether your transfer is APFS->HFS+ or HFS+->APFS? If so, it would be a bit clumsy, IMHO (first detect the disk's filesystem, then choose the proper flags).
>>
>> Isn't there a more convenient way of dealing with this, one that doesn't require checking the disk's filesystem before invoking rsync?
>
> The issue I'm familiar with is that there can be several valid ways to represent certain strings of UTF-8 characters….

I don't know nearly enough about rsync, so I hope Ces VLC finds a good answer and that I can use it too. I don't know nearly as much as Ryan about MacPorts, and I am grateful for all Ryan's work on MacPorts. However, I do know a bit about Unicode, and I have recently read up a bit on filenames in APFS, HFS+, and ext3/4 on Linux. Let me try to explain the filename differences which I suspect Ces is encountering. I will say something similar to Ryan, but with important differences in terminology related to Unicode. I may get details of the file systems wrong. And none of my examples are tested, so some of them may be incorrect.

Fundamental question: when is a filename {Na} on file system A the "same" as filename {Nb} on filesystem B? The answer is complex.

Fundamental fact: different filesystems store filenames as different data structures, with different semantics attached to the data. Comparing filename {Na} to {Nb} requires converting {Na} to the data structures used in filesystem B, and doing the appropriate kind of comparison.

  • HFS+ stores filenames as 16-bit code units with UTF-16BE semantics. The file system API receives filenames as an array of Unicode characters. It normalises the name to NFD(-ish) before writing. IIRC an HFS+ file system can be case-insensitive (more common) or case-sensitive.
  • APFS stores filenames as 8-bit code units with UTF-8 semantics, and also as a 22-bit hash. The file system API receives filenames as an array of Unicode characters. It does not normalise the name when writing; the filename's characters are preserved in the filesystem. It also computes the 22-bit hash from the filename. However, the filesystem can be configured to normalise the filename before using it to compute the hash. Thus the filesystem API can do normalisation-insensitive comparison of filenames, by comparing their hash values but not the filename code units.  See <https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf>, section "j_drec_hashed_key_t".
  • ext3/4 stores filenames as 8-bit code units with no semantics (except that byte values 0x00 and 0x2F '/' are special). The POSIX file system API receives filenames as 8-bit code units and writes them as is. The filename's bytes are preserved in the filesystem. Filename comparisons are 8-bit code unit to code unit, with no interpretation as Unicode and no Unicode normalisation. Thus comparisons are normalisation-sensitive.
  • I suspect (but haven't confirmed) that rsync transmits filenames as sequences of bytes, possibly converted to UTF-8 code units via --iconv, but without any normalisation. (A quick way to inspect the bytes a volume actually stored for a name is sketched just after this list.)
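
Here is that sketch (untested, like my other examples; the volume path is only a placeholder, so substitute one of your own disks):

    cd /Volumes/SomeAPFSVolume/test
    touch "$(printf 'A\314\212.txt')"    # a name containing a decomposed Å
    ls | xxd                             # the bytes the volume actually stored

On APFS I would expect the decomposed bytes 41 cc 8a back unchanged; HFS+ would also hand back a decomposed form, even if you had written the precomposed c3 85 variant, because it normalises on write.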

Terminology:

  • Unicode character: an abstract concept named by an integer value between 0 and about 1.1 million (0x10FFFF).
  • Code unit: a unit of storage for characters. Unicode defines 8-bit, 16-bit, and 32-bit code units. The 16-bit and 32-bit code units have variants which map to bytes in "big-endian" (BE) and "little-endian" (LE) forms.
  • UTF (Unicode Transformation Format): an algorithm for mapping between Unicode characters and code unit sequences of various lengths. UTF-8 maps between Unicode characters and sequences of 1-4 8-bit code units. UTF-16BE maps between Unicode characters and sequences of 1-2 16-bit big-endian code units. UTF-32BE maps between Unicode characters and single 32-bit big-endian code units. (A short demo of code units follows this list.)
  • Normalisation: an algorithm for taking arbitrary Unicode character sequences and removing some differences in representation, so that they are more useful for certain operations. One of these operations is comparison for equality. Background: Unicode provides multiple ways to represent the same user-perceived writing system unit. E.g. U+212B Angstrom Sign (Å), U+00C5 Latin Capital Letter A with Ring Above (Å), and U+0041 Latin Capital Letter A followed by U+030A Combining Ring Above (Å, i.e. A˚) are different for some purposes, but the same for other purposes, such as comparison after normalisation. See UAX #15 Unicode Normalization Forms <http://www.unicode.org/reports/tr15/>.
  • NFD: a normalisation algorithm which mostly decomposes compound characters: U+00C5 (Å) becomes U+0041 U+030A (Å, i.e. A˚).
  • Sensitive and insensitive: whether a difference between characters is significant or not when testing for "is the same". File systems can be case-sensitive, in which case Case.txt and cAsE.Txt are different; or they can be case-insensitive, in which case the two names are the same. Similarly, file systems can be normalisation-sensitive, in which case 5Å.svg and 5Å.svg are different, or they can be normalisation-insensitive, in which case they are the same.
  • Preserving and non-preserving: whether a difference between characters, present when names are written to a file system, is still present when the file names are read back out of the file system. DOS 8.3 FAT filesystems are case-insensitive and case-non-preserving: write "case.txt", and you get back "CASE.TXT". Similarly, file systems can be normalisation-preserving or normalisation-non-preserving. If you write 5Å.svg and 5Å.svg to HFS+, which is normalisation-insensitive and normalisation-non-preserving, you get back 5Å.svg. If you write them to APFS, which is normalisation-preserving though normalisation-insensitive, you get back the same 5Å.svg and 5Å.svg.
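
Here is the short code unit demo promised above (untested; it assumes xxd and an iconv that knows these encodings, as GNU libiconv does):

    printf '\303\205'  | iconv -f UTF-8 -t UTF-16BE | xxd   # 00 c5        : one 16-bit code unit
    printf '\303\205'  | iconv -f UTF-8 -t UTF-32BE | xxd   # 00 00 00 c5  : one 32-bit code unit
    printf 'A\314\212' | iconv -f UTF-8 -t UTF-16BE | xxd   # 00 41 03 0a  : decomposed form, two 16-bit code units

One character as the user sees it, several different code unit sequences depending on the UTF and the normalisation chosen.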

So, the challenge which Ces VLC is giving to rsync is: read a filename {Na} from APFS filesystem A as Unicode characters from A's API, convert the name to UTF-8 code units without touching normalisation, convert it to a name {Nb} on HFS+ filesystem B using B's API, and save the file on B. HFS+ on B normalises it to {Nb_norm}. Later, read filename {Na} from A again, convert it to UTF-8, convert it to name {Nb} on B; is this the "same" as the existing filename {Nb_norm} on B?
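
Sketched at the shell (untested, with a placeholder volume path), that round trip looks roughly like this:

    name="$(printf '\303\205.txt')"          # a precomposed name, as it might come from the APFS side
    touch "/Volumes/SomeHFSVolume/$name"     # HFS+ normalises the name when writing it
    ls /Volumes/SomeHFSVolume | xxd          # the stored name starts 41 cc 8a ..., not c3 85

A byte-for-byte comparison of the original name against the stored one now fails, which is why rsync treats it as a different file.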

There is an rsync FAQ entry which might be relevant: "rsync recopies the same files" <https://rsync.samba.org/FAQ.html#2>. It suggests fixing the problem by using --iconv to specify filename conversions. I haven't looked into rsync enough to know if it will solve the problem. The impression I get is that rsync will not "first detect the disk FS, then choose proper flags". I suspect you will have to do that when you invoke rsync, using your knowledge of the source and destination filesystems.
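
For reference, the commonly suggested invocation looks like the sketch below. This is untested here; it assumes an rsync 3.x built with iconv support on both ends (for example MacPorts rsync — Apple's bundled /usr/bin/rsync is an old 2.6.9 without --iconv), and the remote path is only a placeholder:

    # Sending from a Mac volume that stores decomposed (HFS+-style) names
    # to a destination that expects plain UTF-8, such as a Linux NAS:
    rsync -a --iconv=utf-8-mac,utf-8 ~/Documents/ user@nas:/volume1/Documents/

    # A dry run with itemized changes (-n plus -i) shows whether files would be
    # re-sent purely because of filename differences, without copying anything:
    rsync -ain --iconv=utf-8-mac,utf-8 ~/Documents/ user@nas:/volume1/Documents/

I don't know whether --iconv applies at all to a purely local disk-to-disk copy, which is Ces's case; it may only matter when a sender and a receiver process are talking over a connection.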

I'm sorry this is so wordy, and I hope you see the distinctions I'm trying to explain. And, I hope this helps you figure out a solution. Please let the list know what you find out.

Best regards,
     —Jim DeLaHunt, software engineer, Vancouver, Canada

Re: rsync and different UTF normalization in APFS vs HFS+ (macports-users Digest, Vol 167, Issue 3)

Ces VLC


On Sat, Jul 4, 2020 at 2:43 AM Jim DeLaHunt <[hidden email]> wrote:
>
> [...] I hope you see the distinctions I'm trying to explain. And, I hope this helps you figure out a solution. Please let the list know what you find out.
>

Thanks a lot, Ryan and Jim, for your messages and for the great information you provided. It's very complete, and, yes, Jim, what you described is the cause of the problem: rsync just transmits file names as verbatim raw sequences of bytes with no conversion at all.

IMHO, the correct fix shouldn't be manually converting the encodings yourself with the '--iconv' flag, but rather a flag for performing the comparison after normalization, which AFAIK doesn't exist (it wouldn't matter which normalization, as long as the same normalization is applied to all file names before comparing them and then discarded). What I mean is: what's the point of rsync treating as different two files whose names are identical when displayed in a terminal? Two identical text strings can be normalized in different ways (for example, with accents as separate combining code points or as precomposed code points), but they are the same text. So, if the text is the same, why treat them as different file names?

I don't understand why such a '--normalize-before-compare' flag doesn't exist (to repeat: no need to specify the normalization algorithm, just apply the same algorithm to all file names). It would fix all these problems in an elegant and clean way, and, BTW, this would be the behaviour everybody expects, unless I'm missing something here.
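
In the meantime, a rough manual check like the one below (untested; it assumes bash/zsh process substitution, an iconv that knows the UTF-8-MAC encoding, and placeholder volume paths) can at least separate real name differences from normalization-only ones within a single folder:

    comm -3 <(cd /Volumes/DiskA/folder && ls | iconv -f UTF-8 -t UTF-8-MAC | sort) \
            <(cd /Volumes/DiskB/folder && ls | iconv -f UTF-8 -t UTF-8-MAC | sort)

Names printed here differ by more than normalization; names that rsync re-copies but that do not show up here differ only in their normalization.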

Another surprising thing is that I don't see any serious alternatives to rsync. That has the good side that, with everybody using rsync, it is better tested and more reputable, but it also has the bad side that you're stuck when it doesn't support the feature you need...

Kind regards,

César



Re: rsync and different UTF normalization in APFS vs HFS+

Joshua Root-8
In reply to this post by Ces VLC
Ces VLC wrote:

> On Sat, Jul 4, 2020 at 2:43 AM Jim DeLaHunt <list+macports-users at jdlh.com> wrote:
>>
>> [...] I hope you see the distinctions I'm trying to explain. And, I hope
>> this helps you figure out a solution. Please let the list know what you
>> find out.
>>
>
> Thanks a lot, Ryan and Jim, for your messages and for the great information
> you provided. It's very complete, and, yes, Jim, what you described is the
> cause of the problem: rsync just transmits file names as verbatim raw
> sequences of bytes with no conversion at all.
>
> IMHO, the correct fix shouldn't be manually converting the encodings
> yourself with the '--iconv' flag, but rather a flag for performing the
> comparison after normalization, which AFAIK doesn't exist (it wouldn't
> matter which normalization, as long as the same normalization is applied
> to all file names before comparing them and then discarded). What I mean
> is: what's the point of rsync treating as different two files whose names
> are identical when displayed in a terminal? Two identical text strings
> can be normalized in different ways (for example, with accents as
> separate combining code points or as precomposed code points), but they
> are the same text. So, if the text is the same, why treat them as
> different file names?

Sounds like a perfectly valid feature request for the rsync project.

> I don't understand why such a '--normalize-before-compare' flag doesn't
> exist (to repeat: no need to specify the normalization algorithm, just
> apply the same algorithm to all file names). It would fix all these
> problems in an elegant and clean way, and, BTW, this would be the
> behaviour everybody expects, unless I'm missing something here.

It probably just didn't come up before APFS became widespread on macOS.
And still doesn't come up if all your filenames are ASCII.

This behaviour has the slight disadvantage of being technically
incorrect on normalization-sensitive filesystems. On your typical Linux
system, it's entirely possible to have two filenames that differ only in
normalization. And you know that if it's possible, then someone
somewhere has a workflow that depends on it.
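
For example (untested): in an empty directory on ext4 this creates two
distinct files, whereas on HFS+ or a typical APFS volume the second name
refers back to the same file as the first:

    touch "$(printf '\303\205.txt')"     # precomposed  Å.txt
    touch "$(printf 'A\314\212.txt')"    # decomposed   Å.txt
    ls | wc -l                           # 2 on ext4, 1 on a typical Mac volume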

It might make sense to have normalize-before-compare turned on by
default on Darwin, and off by default elsewhere, with a flag to enable
or disable as needed. As you say, it could sometimes be preferable
behaviour even on normalization-sensitive systems.

- Josh