* new-features.sgml: Add missing GB2312 and eucKR character sets.

* pathnames.sgml: Change "DOS devices" title to "Invalid filenames" and rephrase that section. Add section "Filenames with unusual (foreign) characters". Fix an emphasis.
* setup-net.sgml: Integrate setup-locale section.
* setup2.sgml: Add locale variables to section "Environment Variables". Add section "Internationalization".
2009-03-25 10:37:06 +00:00
parent 4747078502
commit f276aab75a
5 changed files with 352 additions and 11 deletions
--- a/winsup/doc/ChangeLog
+++ b/winsup/doc/ChangeLog
@@ -1,3 +1,14 @@
+2009-03-25  Corinna Vinschen  <corinna@vinschen.de>
+
+	* new-features.sgml: Add missing GB2312 and eucKR character sets.
+	* pathnames.sgml: Change "DOS devices" title to "Invalid filenames"
+	and rephrase that section.
+	Add section "Filenames with unusual (foreign) characters".
+	Fix an emphasis.
+	* setup-net.sgml: Integrate setup-locale section.
+	* setup2.sgml: Add locale variables to section "Environment Variables".
+	Add section "Internationalization".
+
 2009-03-24  Corinna Vinschen  <corinna@vinschen.de>

 	* new-features.sgml: Add section about changed (no)winsymlink default.
--- a/winsup/doc/new-features.sgml
+++ b/winsup/doc/new-features.sgml
@@ -195,8 +195,9 @@
  in 1-16, except 12, "UTF-8", Windows codepages "CPxxx", with xxx in
  (437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125,
  1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258), "JIS", "SJIS",
-  "eucJP", "Big5".  The leading language and territory part (en_US) is not
-  used by Cygwin yet, but is required for POSIX compatibility.
+  "GB2312", "eucJP", "eucKR", and "Big5".  The leading language and territory
+  part (en_US, for instance) is not used by Cygwin yet, but is required
+  for POSIX compatibility.

 - Allow multiple concurrent read locks per thread for pthread_rwlock_t.

--- a/winsup/doc/pathnames.sgml
+++ b/winsup/doc/pathnames.sgml
@@ -311,21 +311,25 @@ to be readable by the $USER user account itself.</para>

 </sect2>

-<sect2 id="pathnames-dosdevices"><title>DOS devices</title>
+<sect2 id="pathnames-dosdevices"><title>Invalid filenames</title>

 <para>Filenames invalid under Win32 are not necessarily invalid
-under Cygwin since release 1.7.0.  There are a couple of rules which
-apply to Windows filenames.  First of all, DOS device names like
+under Cygwin since release 1.7.0.  There are a few rules which
+apply to Windows filenames.  Most notably, DOS device names like
 <filename>AUX</filename>, <filename>COM1</filename>,
 <filename>LPT1</filename> or <filename>PRN</filename> (to name a few)
-cannot be used in a native Win32 application, even with an
-extension (<filename>prn.txt</filename>).  Cygwin can handle files with
-these names just fine.</para>
+cannot be used as a filename or extension in a native Win32 application.
+So filenames like <filename>prn.txt</filename> or <filename>foo.aux</filename>
+are invalid for native Win32 applications.</para>
+
+<para>This restriction doesn't apply to Cygwin applications.  Cygwin
+can create and access files with such names just fine.  Just don't try
+to use these files with native Win32 applications.</para>

 </sect2>

 <sect2 id="pathnames-specialchars">
-<title>Special characters in filenames</title>
+<title>Forbidden characters in filenames</title>

 <para>Win32 filenames can't contain trailing dots and spaces for backward
 compatibility.  When trying to create files with trailing dots or spaces,
@@ -346,6 +350,48 @@ are converted to special UNICODE characters in the range 0xf000 to 0xf0ff

 </sect2>

+<sect2 id="pathnames-unusual">
+<title>Filenames with unusual (foreign) characters</title>
+
+<para> Windows filesystems use the Unicode character set in the UTF-16
+encoding to store filename information.  If you don't use the UTF-8
+character set (see <xref linkend="setup-locale"></xref>) then there's a
+chance that a filename is using one or more characters which have no
+representation in the character set you're using.</para>
+
+<para>For instance, there are no Chinese characters in the ISO-8859-1
+character set.  So, converting a filename containing a Chinese character
+to ISO-8859-1 leaves you with a wrongly converted filename, for instance
+containing a question mark '?' as replacement for the unconvertible
+character.  When trying to access the file, Cygwin has to convert the
+filename back to UTF-16.  However, this doesn't result in the original
+filename because the question mark will not translate back to the original
+Chinese character, but to a simple question mark instead.  This in turn
+results in strange "File not found" messages.</para>
+
+<note><para>To avoid this scenario altogether, always use UTF-8 as the
+character set.</para></note>
+
+<para>If you don't want to or can't use UTF-8 as the character set for
+whatever reason, you will nevertheless be able to access the file.  How
+does that work?  When Cygwin converts the filename from UTF-16 to your
+character set, it recognizes characters which can't be converted.  If that
+occurs, Cygwin replaces the non-convertible character with a special
+character sequence.  The sequence starts with an ASCII SO character (hex
+code 0x0e, equivalent to Control-N), followed by the UTF-8 representation
+of the character.  The result is a filename containing some ugly looking
+characters.  While it doesn't <emphasis>look</emphasis> nice, it
+<emphasis>is</emphasis> nice, because Cygwin knows how to convert this
+filename back to UTF-16.  The filename will be converted using your
+usual character set.  However, when Cygwin recognizes an ASCII SO
+character, it skips over the ASCII SO and handles the following bytes as
+a UTF-8 character.  Thus, the filename is symmetrically converted back to
+UTF-16 and you can access the file.</para>
+
+<para>Again, by using UTF-8 you can avoid this problem entirely.</para>
+
+</sect2>
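The SO escape scheme described in the "Filenames with unusual (foreign) characters" section can be sketched in a few lines of Python. This is only an illustration of the idea, not Cygwin's actual code: the function names `to_charset`, `from_charset` and `utf8_length` are made up for this example, the sketch assumes a singlebyte target charset, and it does not handle a filename that already contains a literal SO byte.

```python
SO = 0x0E  # ASCII "shift out" (Control-N), used as the escape marker


def to_charset(name: str, charset: str) -> bytes:
    """Convert a filename to `charset`, escaping unconvertible characters."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # No representation in the charset: emit SO + the UTF-8 bytes.
            out.append(SO)
            out += ch.encode("utf-8")
    return bytes(out)


def utf8_length(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def from_charset(data: bytes, charset: str) -> str:
    """Convert back to text (assumes a singlebyte charset for simplicity)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == SO and i + 1 < len(data):
            # Skip the SO and treat the following bytes as one UTF-8 character.
            n = utf8_length(data[i + 1])
            out.append(data[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            out.append(data[i:i + 1].decode(charset))
            i += 1
    return "".join(out)


escaped = to_charset("\u6587.txt", "iso-8859-1")   # one Chinese character + ".txt"
restored = from_charset(escaped, "iso-8859-1")
```

Because the escaped bytes carry the full UTF-8 form of the character, the conversion is symmetric: `restored` equals the original name, which a plain '?' replacement could not achieve.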
+
 <sect2 id="pathnames-casesensitive">
 <title>Case sensitive filenames</title>

@@ -369,7 +415,7 @@ HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\obcaseinsensitive
 this registry value also on Windows NT4 and Windows 2000, which usually
 both don't know this registry key.  If you want case-sensitivity on these
 systems, create that registry value and set it to 0.  On these systems
-(and *only* on these systems) you don't have to reboot to bring it
+(and <emphasis role='bold'>only</emphasis> on these systems) you don't have to reboot to bring it
 into effect, rather stopping all Cygwin processes and then restarting them
 is sufficient.</para>

--- a/winsup/doc/setup-net.sgml
+++ b/winsup/doc/setup-net.sgml
@@ -254,6 +254,7 @@ Problems with Cygwin</ulink>.

 DOCTOOL-INSERT-setup-env
 DOCTOOL-INSERT-setup-maxmem
+DOCTOOL-INSERT-setup-locale
 DOCTOOL-INSERT-ntsec
 DOCTOOL-INSERT-setup-files
 </chapter>
--- a/winsup/doc/setup2.sgml
+++ b/winsup/doc/setup2.sgml
@@ -13,12 +13,21 @@ The <envar>CYGWIN</envar> variable is used to configure many global
 settings for the Cygwin runtime system.  Initially you can leave
 <envar>CYGWIN</envar> unset or set it to <literal>tty</literal> (e.g.
 to support job control with ^Z etc...) using a syntax like this in the
-DOS shell, before launching bash.  </para>
+DOS shell, before launching bash.</para>

 <screen>
 <prompt>C:\&gt;</prompt> <userinput>set CYGWIN=tty notitle glob</userinput>
 </screen>

+<para>
+Locale support is controlled by the <envar>LANG</envar> and
+<envar>LC_xxx</envar> environment variables.  You can set all of them
+but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
+<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
+to the POSIX standard.  The first one found rules.  For a more detailed
+description see <xref linkend="setup-locale"></xref>.
+</para>
+
 <para>
 The <envar>PATH</envar> environment variable is used by Cygwin
 applications as a list of directories to search for executable files
@@ -124,6 +133,279 @@ Run the program and it will output the maximum amount of allocatable memory.

 </sect1>

+<sect1 id="setup-locale"><title>Internationalization</title>
+
+<sect2 id="setup-locale-ov"><title>Overview</title>
+
+<para>
+Internationalization support is controlled by the <envar>LANG</envar> and
+<envar>LC_xxx</envar> environment variables.  You can set all of them
+but Cygwin itself only honors the variables <envar>LC_ALL</envar>,
+<envar>LC_CTYPE</envar>, and <envar>LANG</envar>, in this order, according
+to the POSIX standard.  The content of these variables should follow the
+POSIX standard for a locale specifier.  The correct form of a locale
+specifier is</para>
+
+<screen>
+  language[[_TERRITORY][.charset][@modifier]]
+</screen>
+
+<para>"language" is a lowercase two-character string per ISO 639-1,
+"TERRITORY" is an uppercase two-character string per ISO 3166, "charset" is
+one of a list of supported character sets, and the modifier doesn't matter
+here (though it might for some applications).  If you're interested in the
+exact description, you can find it in the online publication of the POSIX
+manual pages on the homepage of the
+<ulink url="http://www.opengroup.org/">Open Group</ulink>.</para>
+
+<para>Typical locale specifiers are</para>
+
+<screen>
+  "de_CH"          language = German, territory = Switzerland, default charset
+  "fr_FR.UTF-8"    language = French, territory = France, charset = UTF-8
+  "ko_KR.eucKR"    language = Korean, territory = South Korea, charset = eucKR
+</screen>
+
+<para>
+And let's not forget the default locale called "C" or "POSIX"
+which basically only supports plain ASCII code.  If the aforementioned
+environment variables are not set, or set to "C" or "POSIX", you get the
+default ASCII-only behaviour.
+</para>
+
+<para>
+Right now the language and territory content is not evaluated by Cygwin any
+further.  The only important part so far is the character set.  How does that
+work?
+</para>
+
+</sect2>
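The lookup order and the specifier syntax from the "Overview" section can be illustrated with a small Python sketch. It is not taken from Cygwin; `effective_locale`, `parse_locale` and the regular expression are hypothetical helpers written for this example, and the pattern is deliberately stricter than what real systems accept.

```python
import re

# language[[_TERRITORY][.charset][@modifier]]
LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]{2})"
    r"(?:_(?P<territory>[A-Z]{2}))?"
    r"(?:\.(?P<charset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def effective_locale(env):
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG, else "C"."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def parse_locale(spec):
    """Split a locale specifier into its four parts (missing parts are None)."""
    if spec in ("C", "POSIX"):
        return {"language": spec, "territory": None,
                "charset": None, "modifier": None}
    match = LOCALE_RE.match(spec)
    if match is None:
        raise ValueError("not a valid locale specifier: %r" % spec)
    return match.groupdict()


# LC_CTYPE takes precedence over LANG when LC_ALL is unset.
env = {"LANG": "de_CH", "LC_CTYPE": "nl_BE.UTF-8"}
parts = parse_locale(effective_locale(env))
```

With the environment above, `parts["charset"]` is "UTF-8", which is the only part of the specifier that matters to Cygwin according to the text.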
+
+<sect2 id="setup-locale-how"><title>How to set the locale</title>
+
+<itemizedlist mark="bullet">
+
+<listitem><para>
+The default locale is the "C" or "POSIX" locale.  In this locale, basically
+only ASCII characters are supported.  Even if one of the aforementioned
+environment variables is set to something else, it's the application's
+responsibility to call the function <function>setlocale</function>,
+typically like this</para>
+
+<screen>
+  setlocale (LC_ALL, "");
+</screen>
+
+<para>to switch to another locale according to the settings of the
+internationalization environment variables.
+</para></listitem>
+
+<listitem><para>
+Assume you set one of the aforementioned environment variables to some
+valid POSIX locale value other than "C" and "POSIX", and you run an
+application which calls <function>setlocale</function> as above.</para>
+
+<para>Assume further that you live in Japan.  So you might want to use
+the language code "ja" and the territory "JP", thus setting, say,
+<envar>LANG</envar> to "ja_JP".  You didn't set a character set, so
+what will Cygwin use now?  Easy!  It will use the default Windows ANSI
+codepage of your system, if it's supported by Cygwin.  Hopefully Cygwin
+supports all relevant default ANSI codepages...</para>
+
+<note><para>For a list of supported character sets, see
+<xref linkend="setup-locale-charsetlist"></xref>
+</para></note>
+</listitem>
+
+<listitem><para>
+You don't want to use the default Windows codepage as character set?
+In that case you have to specify the charset explicitly.  For instance,
+assume you're from Italy and don't want to use the default Windows codepage
+1252, but the more portable ISO-8859-15 character set.  What you can do is
+to set the <envar>LANG</envar> variable in the
+<filename>C:\cygwin\Cygwin.bat</filename> file which is the batch file
+to start a Cygwin session from the "Cygwin" desktop shortcut.</para>
+
+<screen>
+  @echo off
+
+  C:
+  chdir C:\cygwin\bin
+  set LANG=it_IT.ISO-8859-15
+  bash --login -i
+</screen>
+</listitem>
+
+<listitem><para>
+Most singlebyte or doublebyte charsets have a disadvantage.  Windows
+filesystems use the Unicode character set in the UTF-16 encoding to store
+filename information.  Not all characters from the Unicode character set
+are available in a singlebyte or doublebyte charset.  While Cygwin has a
+workaround to access files with unusual characters (see
+<xref linkend="pathnames-unusual"></xref>), a better workaround is to
+always use the UTF-8 character set.  Of the character sets supported by
+Cygwin, UTF-8 is the only one which can represent
+<emphasis>every</emphasis> Unicode character.</para>
+
+<screen>
+  set LANG=es_MX.UTF-8
+</screen>
+
+<para>For a description of the Unicode standard, see the homepage of the
+<ulink url="http://www.unicode.org/">Unicode Consortium</ulink>.
+</para></listitem>
+
+</itemizedlist>
+
+</sect2>
+
+<sect2 id="setup-locale-problems"><title>Potential Problems</title>
+
+<para>
+You can set the above internationalization variables not only in
+<filename>Cygwin.bat</filename> or in the Windows environment, but also
+in your Cygwin shell on the fly, even switch to yet another character
+set, and yet another.  In bash for instance:</para>
+
+<screen>
+  <prompt>bash$</prompt> export LC_CTYPE="nl_BE.UTF-8"
+</screen>
+
+<para>However, here's a problem.  At the start of the first Cygwin process
+in a session, the Windows environment has to be converted from UTF-16 to
+some singlebyte or multibyte charset.  If the internationalization environment
+variable hasn't been set <emphasis>before</emphasis> starting this process,
+Cygwin has to make an educated guess which charset to use to convert
+the environment itself.  The only reproducible way to do that in the absence
+of <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>, or <envar>LANG</envar>,
+is to use the current Windows ANSI codepage.</para>
+
+<para>As long as the environment only contains ASCII characters, this is
+no problem.  But if it contains non-ASCII characters and you're planning to
+use, say, UTF-8, the converted environment will contain byte sequences which
+are invalid in the UTF-8 charset.  This would be a problem especially in
+variables like <envar>PATH</envar>.</para>
+
+<note><para>Per POSIX, the name of an environment variable should only
+consist of valid ASCII characters, and only of uppercase letters, digits, and
+the underscore for maximum portability.</para></note>
+
+<para>Symbolic links are another problem when switching charsets on the fly.
+A symbolic link contains the filename of the target
+file the symlink points to.  When a symlink is created, the current
+character set is used to store the target filename.  If the target
+filename contains non-ASCII characters and you switch to another
+character set, the target filename of the symlink is now potentially
+an invalid character sequence in the new character set.  This behaviour
+is not different from the behaviour in other operating systems.  So,
+if you suddenly can't access a symlink anymore, maybe it's because you
+switched to another character set?
+</para>
+
+</sect2>
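The environment problem from the "Potential Problems" section is easy to reproduce outside of Cygwin. The following Python sketch is an illustration only: the path value is invented, CP1252 stands in for "the current Windows ANSI codepage", and `is_valid_utf8` is a helper defined here for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether a byte string is a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A hypothetical PATH entry with one non-ASCII character, converted with the
# ANSI codepage because no locale variable was set at process start.
ansi_path = "C:\\Users\\Jos\u00e9\\bin".encode("cp1252")

# The same bytes are later interpreted as UTF-8 after switching charsets.
path_ok = is_valid_utf8(ansi_path)

# A pure ASCII value is byte-identical in both charsets and stays valid.
ascii_ok = is_valid_utf8("C:\\cygwin\\bin".encode("cp1252"))
```

Here `path_ok` is false and `ascii_ok` is true, which matches the statement that only environments containing non-ASCII characters are affected. The symlink case works the same way, with the stored target filename in place of the environment value.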
+
+<sect2 id="setup-locale-missing"><title>What does not work?</title>
+
+<para>
+Except for <envar>LC_ALL</envar>, <envar>LC_CTYPE</envar>,
+and <envar>LANG</envar>, all other LC_xxx environment variables,
+<envar>LC_COLLATE</envar>, <envar>LC_MESSAGES</envar>,
+<envar>LC_MONETARY</envar>, <envar>LC_NUMERIC</envar>,
+and <envar>LC_TIME</envar>, are ignored right now.  This means, while Cygwin
+supports different character sets, it does <emphasis>not</emphasis> support
+real localization so far.  There's no support for locale-specific monetary
+symbols, for a decimal point other than '.', for native time formats, or
+for native language sorting orders.
+</para>
+
+<para>However, internationalization is a work in progress and we would be glad
+for coding help in this area.</para>
+
+</sect2>
+
+<sect2 id="setup-locale-charsetlist"><title>List of supported character sets</title>
+
+<para>Last but not least, here's the list of currently supported character
+sets.  The left-hand expression is the name of the charset, as you would use
+it in the internationalization environment variables as outlined above.
+</para>
+
+<para>The right-hand side is the number of the equivalent Windows
+codepage as well as the Windows name of the codepage.  They are only
+noted here for reference.  Don't try to use the bare codepage number or
+the Windows name of the codepage as charset in locale specifiers, unless
+they happen to be identical with the left-hand side.  Especially in case
+of the "CPxxx" style charsets, always use them with the leading "CP".</para>
+
+<para>This works:</para>
+
+<screen>
+  set LC_ALL=en_US.CP437
+</screen>
+
+<para>This does <emphasis>not</emphasis> work:</para>
+
+<screen>
+  set LC_ALL=en_US.437
+</screen>
+
+<para>You can find a full list of Windows codepages on the Microsoft MSDN page
+<ulink url="http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx">Code Page Identifiers</ulink>.</para>
+
+<screen>
+    Charset               Codepage
+
+    CP437                   437 (OEM United States)
+    CP720                   720 (DOS Arabic)
+    CP737                   737 (OEM Greek)
+    CP775                   775 (OEM Baltic)
+    CP850                   850 (OEM Latin 1, Western European)
+    CP852                   852 (OEM Latin 2, Central European)
+    CP855                   855 (OEM Cyrillic)
+    CP857                   857 (OEM Turkish)
+    CP858                   858 (OEM Latin 1 + Euro Symbol)
+    CP862                   862 (OEM Hebrew)
+    CP866                   866 (OEM Russian)
+    CP874                   874 (ANSI/OEM Thai)
+    CP1125                 1125 (OEM Ukraine)
+    CP1250                 1250 (ANSI Central European)
+    CP1251                 1251 (ANSI Cyrillic)
+    CP1252                 1252 (ANSI Latin 1, Western European)
+    CP1253                 1253 (ANSI Greek)
+    CP1254                 1254 (ANSI Turkish)
+    CP1255                 1255 (ANSI Hebrew)
+    CP1256                 1256 (ANSI Arabic)
+    CP1257                 1257 (ANSI Baltic)
+    CP1258                 1258 (ANSI/OEM Vietnamese)
+
+    ISO-8859-1            28591 (ISO-8859-1)
+    ISO-8859-2            28592 (ISO-8859-2)
+    ISO-8859-3            28593 (ISO-8859-3)
+    ISO-8859-4            28594 (ISO-8859-4)
+    ISO-8859-5            28595 (ISO-8859-5)
+    ISO-8859-6            28596 (ISO-8859-6)
+    ISO-8859-7            28597 (ISO-8859-7)
+    ISO-8859-8            28598 (ISO-8859-8)
+    ISO-8859-9            28599 (ISO-8859-9)
+    ISO-8859-10             -   (not available)
+    ISO-8859-11             -   (not available)
+    ISO-8859-13           28603 (ISO-8859-13)
+    ISO-8859-14             -   (not available)
+    ISO-8859-15           28605 (ISO-8859-15)
+    ISO-8859-16             -   (not available)
+
+    SJIS                    932 (ANSI/OEM Japanese)
+    GB2312                  936 (ANSI/OEM Simplified Chinese, GBK)
+    Big5                    950 (ANSI/OEM Traditional Chinese)
+    JIS                   50220 (ISO2022 Japanese w/o halfwidth Katakana)
+    eucJP                 51932 (EUC Japanese)
+    eucKR                 51949 (EUC Korean)
+
+    UTF-8                 65001 (UTF-8)
+</screen>
+
+</sect2>
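The "CP437 works, 437 does not" rule from the charset list can be expressed as a simple membership check. The Python sketch below builds the set of names from the left-hand column of the table; `charset_of` and `has_supported_charset` are hypothetical helpers, and the exact, case-sensitive comparison is an assumption of this example rather than documented Cygwin behaviour.

```python
CP_NUMBERS = (437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874,
              1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258)
ISO_PARTS = tuple(n for n in range(1, 17) if n != 12)  # 1-16, except 12
OTHER = ("SJIS", "GB2312", "Big5", "JIS", "eucJP", "eucKR", "UTF-8")

# Charset names as they appear in the left-hand column of the table.
SUPPORTED_CHARSETS = (
    {"CP%d" % n for n in CP_NUMBERS}
    | {"ISO-8859-%d" % n for n in ISO_PARTS}
    | set(OTHER)
)


def charset_of(spec):
    """Return the charset part of a locale specifier, or None if absent."""
    rest = spec.split("@", 1)[0]        # drop an optional @modifier
    if "." not in rest:
        return None
    return rest.split(".", 1)[1]


def has_supported_charset(spec):
    """True if the specifier names a charset from the table above."""
    return charset_of(spec) in SUPPORTED_CHARSETS


works = has_supported_charset("en_US.CP437")
fails = has_supported_charset("en_US.437")
```

`works` is true and `fails` is false, because the bare codepage number is not a charset name in the table.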
+
+</sect1>
+
 <sect1 id="setup-files"><title>Customizing bash</title>

 <para>