[RFC] [PATCH] Windows Unicode Filename support

[RFC] [PATCH] Windows Unicode Filename support

Christian Seiler
Dear all,

Preface
=======

I'd like to contribute a patch to HDF5, and this appears to be the
appropriate place to send it. If I am mistaken, I'd appreciate a
pointer to where I should go instead. (It would be great if the website
had some prominent information about how to contribute to HDF5. Also, I
couldn't find one: does HDF5 have a version control repository, such as
Git or SVN?)


Problem description
===================

Windows has two different representations of filenames: 8-bit
fixed-width "ANSI" and 16-bit "Unicode" (effectively UTF-16). The 8-bit
representation depends on the locale settings of the computer: the lower
128 values correspond to ASCII, while the upper 128 values are defined
by the active code page. In Germany, for example, code page 1252 is
typically used (very similar, but not identical, to ISO-8859-1).

When using standardized C / POSIX functions as HDF5 does (open, fopen,
etc.), which accept 8-bit strings, they will always assume the local
8-bit encoding. The problem is that the local 8-bit encoding can never
represent all filenames that the operating system supports, because
a fixed 8-bit encoding has only 256 values and cannot cover all Unicode
characters. Some languages alone have far more characters than any
fixed 8-bit encoding can represent.

This in turn means that on Windows systems it is possible to have HDF5
fail to open a file if the file name (or the directory that contains it)
contains characters that are not representable in the local 8-bit
encoding of the system. For example, on a typical US Windows
installation it is not possible to use HDF5 to store files with names
that contain e.g. Japanese characters, even though the operating system
itself does support these.

To actually access all possible files, Microsoft offers alternatives to
the standard functions that accept UTF-16 filenames in the form of
wchar_t strings: _wopen() instead of open(), and _wfopen() instead of
fopen().


(For reference: other operating systems, such as Linux and Mac OS X,
always represent filenames as 8-bit strings; the operating system often
does not care about the precise encoding and leaves it up to the
software itself (though in practice this most likely will be UTF-8
nowadays), which means that the standard 8-bit APIs can always be used
to access any file on disk.)


An example consequence of this problem: in a GUI application, the user
chooses a file from a "File Open" dialog, the file name is converted
appropriately and passed to HDF5, and HDF5 then cannot load the file
(that the user chose in the same application) because its name (or that
of a directory containing it) contains characters that can't be
represented in the local code page.




Rejected solutions
==================

The most obvious solution would be to simply provide additional
functions in HDF5 that also accept wchar_t filenames on Windows systems.
However, HDF5 has a large number of functions that simply pass through
file names (or maybe even manipulate them a bit) and this would lead to
a huge duplication of existing code, which I don't believe is a good
idea for the long-term maintenance of HDF5.


An alternative suggestion (see e.g. [1]) would be to always assume on
Windows systems that the filename supplied is encoded in UTF-8 (which,
due to being variable-length, can represent all possible characters) and
convert it to UTF-16 before passing it to the wide functions (_wopen,
_wfopen) directly. This has the advantage that now all filenames can be
represented. However, this has the huge disadvantage that most software
does not expect HDF5 to accept UTF-8-encoded file names, and if a
program converts a string that it got from a "File Open" dialog into the
local 8-bit code page (as many programs would do now), any character in
the local code page beyond ASCII would cease to work (as UTF-8 encodes
them differently). For example, since the German umlauts Ä, Ö, Ü can be
represented in the local code page, file names with these characters can
actually be opened on Windows systems with HDF5 at the moment (when
using German locale settings, at least), and this change would break
existing programs if it were to be added to HDF5 itself unconditionally.
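
To make the incompatibility concrete, here is a minimal sketch of the
re-encoding that existing callers would suddenly have to perform. The
helper name latin1_to_utf8 is hypothetical and not part of HDF5 or the
patch, and it assumes ISO-8859-1 input as a stand-in for code page 1252
(the two differ in the range 0x80-0x9F, which this sketch does not
handle). It shows that "Ä" is the single byte 0xC4 in the local code
page but the two bytes 0xC3 0x84 in UTF-8:

```c
#include <stddef.h>

/* Hypothetical helper (not part of HDF5): re-encodes a NUL-terminated
 * ISO-8859-1 string as UTF-8. Returns the number of bytes written
 * (excluding the terminating NUL), or -1 if `cap` is too small. */
long latin1_to_utf8(const char *in, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    for (; *s; s++) {
        if (*s < 0x80) {
            /* ASCII is encoded identically in both encodings. */
            if (n + 1 >= cap) return -1;
            out[n++] = (char)*s;
        } else {
            /* 0x80..0xFF become a two-byte UTF-8 sequence. */
            if (n + 2 >= cap) return -1;
            out[n++] = (char)(0xC0 | (*s >> 6));
            out[n++] = (char)(0x80 | (*s & 0x3F));
        }
    }
    if (n >= cap) return -1;
    out[n] = '\0';
    return (long)n;
}
```

A program that passes the unconverted code-page bytes to a library that
unconditionally expects UTF-8 would hand it a malformed sequence.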



Proposed solution
=================

I'd like to propose the following solution instead. It is based on the
UTF-8 encoding idea, but keeps compatibility with existing software.

  - Default behavior: HDF5 behaves as it currently does and calls the
    standard "ANSI" open(), fopen(), etc. functions. It will hence
    continue to work with characters in the local code page.

  - Add a boolean to the file access property list that may be used to
    indicate that the file name is in UTF-8 on Windows systems (the
    boolean will be ignored on all other operating systems):

      H5Pset_windows_unicode_filenames(fapl, TRUE);

  - Update the filesystem drivers to check for this flag, and if it
    is set to actually do a conversion from UTF-8 to UTF-16 and then
    call the corresponding wide functions.
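
The conversion step in the last item can be sketched in portable C as
follows. This is only an illustration of what the UTF-8 to UTF-16
conversion does; the function name utf8_to_utf16 is hypothetical, and
on Windows one would more likely call MultiByteToWideChar() with
CP_UTF8 rather than hand-roll the decoder:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of HDF5 or the patch).
 * Converts the NUL-terminated UTF-8 string `in` into UTF-16 code units
 * in `out`, which has room for `cap` units including the terminator.
 * Returns the number of units written (excluding the terminator), or
 * -1 on malformed input or insufficient space. */
long utf8_to_utf16(const char *in, uint16_t *out, size_t cap)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;

    while (*s) {
        uint32_t cp;
        int extra;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;               /* invalid lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            /* Continuation bytes must look like 10xxxxxx; this also
             * catches a string that ends in the middle of a sequence. */
            if ((*s & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (*s & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (n + 1 >= cap) return -1;
            out[n++] = (uint16_t)cp;
        } else {
            /* Characters beyond the BMP need a surrogate pair. */
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= cap) return -1;
    out[n] = 0;
    return (long)n;
}
```

Rejecting malformed input instead of guessing means that a caller who
sets the flag but passes code-page bytes gets an error rather than a
silently wrong file name.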

The advantage is that current code doesn't break, but users who want to
properly support Windows can actually do so; they just need to ensure
that they encode their filenames in UTF-8 and set the flag. The other
main advantage is that the patch is not very invasive.

I've attached a patch (against 1.10.1) that implements this. The
following is currently supported:

  - Property list flag accessors:

         H5Pset_windows_unicode_filenames(fapl, value);
         H5Pget_windows_unicode_filenames(fapl, &value);

  - SEC2/Windows driver

  - Core driver

  - stdio driver
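
As a rough illustration of the kind of check a driver such as SEC2
could perform, the sketch below branches on a flag and only takes the
wide-character path on Windows. It is not taken from the attached
patch: the helper name open_file and its parameters are hypothetical,
and the flag is passed directly instead of being read from the FAPL.
MultiByteToWideChar(), _wopen() and _open() are the existing Windows
calls; that branch is untested here.

```c
#include <fcntl.h>
#ifdef _WIN32
#include <windows.h>
#include <io.h>
#include <stdlib.h>
#else
#include <unistd.h>
#endif

/* Hypothetical helper (not from the patch): opens `name` with the
 * given open() flags and creation mode. If `name_is_utf8` is nonzero
 * on Windows, the name is converted to UTF-16 and opened via _wopen();
 * everywhere else the flag is ignored. Returns a file descriptor or -1. */
int open_file(const char *name, int oflag, int mode, int name_is_utf8)
{
#ifdef _WIN32
    if (name_is_utf8) {
        int fd = -1;
        wchar_t *wname;
        /* First call: ask for the required length (includes the NUL). */
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      name, -1, NULL, 0);
        if (len <= 0)
            return -1;
        wname = malloc((size_t)len * sizeof *wname);
        if (wname == NULL)
            return -1;
        /* Second call: perform the actual conversion. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                name, -1, wname, len) > 0)
            fd = _wopen(wname, oflag, mode);
        free(wname);
        return fd;
    }
    /* Default: unchanged behavior, name is in the local code page. */
    return _open(name, oflag, mode);
#else
    (void)name_is_utf8; /* ignored on non-Windows systems */
    return open(name, oflag, mode);
#endif
}
```

Keeping the conversion next to the actual open call is what keeps the
change local to the drivers; the rest of the library continues to pass
plain char strings around.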

I've successfully tested this in the following configurations on a
Windows 10 system with German locale (using MinGW-w64/gcc 7.2.0 as the
compiler, 64-bit):

  - Flag not set, files with Umlauts, calling HDF5 with the file names
    encoded in the current codepage. (Compatibility check for existing
    software.)

  - Flag set, lots of different test cases (file names in pure ASCII,
    German Umlauts, Japanese characters, Hebrew characters, Arabic
    characters), calling HDF5 with the file names encoded in UTF-8
    and the flag set in the FAPL before calling the HDF5 functions.

I tested all three drivers (SEC2, Core, stdio) in both cases.

I also tested the patch on Linux (Debian 9, gcc 7.2.0, 64-bit x86) to
ensure that it doesn't harm non-Windows platforms.

What should work, but I haven't tested it:

  - The FAMILY driver, as that just passes through the FAPL to the
    underlying driver, and since UTF-8 is ASCII-compatible, any
    manipulation done in the driver should be safe as well.
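
The ASCII-compatibility argument can be illustrated with a small
sketch. The family driver expands a printf-style member name template;
every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so none of
them can be mistaken for '%' or any other ASCII character. The helper
name family_member_name below is hypothetical and only mimics that
expansion:

```c
#include <stdio.h>

/* Hypothetical helper (not HDF5 code): expands a printf-style member
 * template such as "name_%05u.h5" for member number `idx`.
 * Returns 0 on success, -1 on truncation or formatting error. */
int family_member_name(char *out, size_t cap, const char *tmpl, unsigned idx)
{
    int n = snprintf(out, cap, tmpl, idx);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

With a UTF-8 template containing Japanese characters, the non-ASCII
bytes are copied through unchanged and only the %05u part is expanded.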

What I believe doesn't make sense to implement:

  - The direct I/O driver. It appears to contain some Windows code, but
    the CMake build system will never build it on Windows, so I left
    that out. If that is wrong and the direct I/O driver should work on
    Windows, I'll be happy to update the patch.

What I didn't implement yet:

  - C++, Fortran and Java wrappers for the FAPL flag getters/setters

  - External File Lists (EFL) support (H5Defl.c)

  - HDF5 plugin libraries (H5PL.c)

  - Logging driver (H5FDlog.c)

  - Cache logging (H5Clog.c)



Feedback is appreciated, and it would be fantastic if this could be
included in a future version of HDF5. I would be willing to help out
with the missing pieces. I do think that those can be added
incrementally, and the current patch already improves the state of
affairs on Windows quite a bit.

For the avoidance of doubt: my employer agrees to license these changes
under the same license that HDF5 1.10.1 is licensed under.



Best regards,
Christian

References:
[1]
https://tschoonj.github.io/blog/2014/11/06/hdf5-on-windows-utf-8-filenames-support/

_______________________________________________
Hdf-forum is for HDF software users discussion.
[hidden email]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Attachment: windows_unicode_filenames.patch (25K)

Re: [RFC] [PATCH] Windows Unicode Filename support

Andrey Paramonov
Hello Christian!

Your contribution does solve a very important issue! Thank you, and
hopefully the patch is incorporated into upstream.

Best wishes,
Andrey Paramonov

On 20.12.2017 16:05, Christian Seiler wrote:

> [...]



Re: [RFC] [PATCH] Windows Unicode Filename support

Elena Pourmal
In reply to this post by Christian Seiler
Dear Christian,

Thank you for the patch and your comments.

On Dec 20, 2017, at 7:05 AM, Christian Seiler <[hidden email]> wrote:

> Dear all,
>
> Preface
> =======
>
> I'd like to contribute a patch to HDF5, and this appears to be the
> appropriate place to send it to. If I am mistaken, I'd appreciate a
> pointer where to go instead. (It would be great if the website could
> have some prominent information about how to contribute to HDF5. Also, I
> couldn't find it, does HDF5 have a version control repository, such as
> Git or SVN?)

You are absolutely correct that the link is impossible to find on the current support Website. I believe it was lost when the new https://www.hdfgroup.org site was created. 

On our new support portal (https://confluence.hdfgroup.org/display/support/HDF+Support+Portal), which we plan to launch any time now, please see the Community box and then click on Contributions to get to the Git repository and a description of the current contribution process. Please let us know if this is still hard to find.

HDF software can be obtained from the open Git repository at https://bitbucket.hdfgroup.org/projects.

Your patch indeed addresses a very important problem on Windows. We will review the patch and get back to you.

All,

Unicode support in HDF5 is long overdue and The HDF Group is well aware of it. 

As a first step we plan to switch the default string encoding from ASCII to UTF-8 in the next major release. 

In order to achieve full support for Unicode, we will need to enhance the HDF5 file format to store the encoding type when we store the strings that represent filenames (used by external links, VDSs, external storage, file drivers), opaque datatype tags, and compound datatype field names (I may be missing some other types of strings). We will need to make appropriate changes to the HDF5 source, tools, and testing code, and address backward/forward compatibility issues. As you can imagine, it is a pretty big job.

In the past we received several patches to support opening files on Windows, and were very reluctant to accept them because of the Unicode support issues described above. I guess we need to come up with some compromise :-) 

Thank you for your support and interest in HDF5! Happy Holidays to everyone!

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal
Client Management Director
The HDF Group
1800 So. Oak St., Suite 203,
Champaign, IL 61820
www.hdfgroup.org
(217)531-6112 (office)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

 


 

