diff --git a/winsup/doc/ChangeLog b/winsup/doc/ChangeLog index a85e8168c..b49413d31 100644 --- a/winsup/doc/ChangeLog +++ b/winsup/doc/ChangeLog @@ -1,3 +1,14 @@ +2009-03-25 Corinna Vinschen + + * new-features.sgml: Add missing GB2312 and eucKR character sets. + * pathnames.sgml: Change "DOS devices" title to "Invalid filenames" + and rephrase that section. + Add section "Filenames with unusual (foreign) characters". + Fix an emphasis. + * setup-net.sgml: Integrate setup-locale section. + * setup2.sgml: Add locale variables to section "Environment Variables". + Add section "Internationalization". + 2009-03-24 Corinna Vinschen * new-features.sgml: Add section about chaged (no)winsymlink default. diff --git a/winsup/doc/new-features.sgml b/winsup/doc/new-features.sgml index cae0b492e..3889efa6d 100644 --- a/winsup/doc/new-features.sgml +++ b/winsup/doc/new-features.sgml @@ -195,8 +195,9 @@ in 1-16, except 12, "UTF-8", Windows codepages "CPxxx", with xxx in (437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125, 1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258), "JIS", "SJIS", - "eucJP", "Big5". The leading language and territory part (en_US) is not - used by Cygwin yet, but is required for POSIX compatibility. + "GB2312", "eucJP", "eucKR", and "Big5". The leading language and territory + part (en_US, for instance) is not used by Cygwin yet, but is required + for POSIX compatibility. - Allow multiple concurrent read locks per thread for pthread_rwlock_t. diff --git a/winsup/doc/pathnames.sgml b/winsup/doc/pathnames.sgml index 97706e99a..722c98b80 100644 --- a/winsup/doc/pathnames.sgml +++ b/winsup/doc/pathnames.sgml @@ -311,21 +311,25 @@ to be readable by the $USER user account itself. -DOS devices +Invalid filenames Filenames invalid under Win32 are not necessarily invalid -under Cygwin since release 1.7.0. There are a couple of rules which -apply to Windows filenames. First of all, DOS device names like +under Cygwin since release 1.7.0. 
There are a few rules which +apply to Windows filenames. Most notably, DOS device names like +AUX, COM1, LPT1 or PRN (to name a few) -cannot be used in a native Win32 application, even with an -extension (prn.txt). Cygwin can handle files with -these names just fine. +cannot be used as a filename or extension in a native Win32 application. +So filenames like prn.txt or foo.aux +are invalid filenames for native Win32 applications. + +This restriction doesn't apply to Cygwin applications. Cygwin +can create and access files with such names just fine. Just don't try +to use these files with native Win32 applications... -Special characters in filenames +Forbidden characters in filenames Win32 filenames can't contain trailing dots and spaces for backward compatibility. When trying to create files with trailing dots or spaces, @@ -346,6 +350,48 @@ are converted to special UNICODE characters in the range 0xf000 to 0xf0ff + +Filenames with unusual (foreign) characters + + Windows filesystems use the Unicode character set in the UTF-16 +encoding to store filename information. If you don't use the UTF-8 +character set (see ) then there's a +chance that a filename is using one or more characters which have no +representation in the character set you're using. + +For instance, there are no Chinese characters in the ISO-8859-1 +character set. So, converting a filename containing a Chinese character +to ISO-8859-1 leaves you with a wrongly converted filename, for instance +containing a question mark '?' as replacement for the unconvertible +character. When trying to access the file, Cygwin has to convert the +filename back to UTF-16. However, this doesn't result in the original +filename because the question mark will not translate back to the original +Chinese character, but to a simple question mark instead. This in turn +results in strange "File not found" messages. + +To avoid this scenario altogether, always use UTF-8 as the +character set. 
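The lossy round trip described above can be reproduced outside of Cygwin. The following Python sketch is only an illustration, not Cygwin's conversion code: a Python string stands in for the UTF-16 filename, the helper name round_trip and the sample filename are hypothetical, and Python's errors="replace" handler is assumed to behave like the question-mark substitution described in the text.

```python
def round_trip(filename: str, charset: str) -> str:
    """Convert a filename to `charset` and back again."""
    # Characters that do not exist in `charset` are replaced by '?'.
    encoded = filename.encode(charset, errors="replace")
    # Decoding cannot recover the original character from the '?'.
    return encoded.decode(charset)


# Hypothetical filename: two Chinese characters followed by ".txt".
# Escapes are used so the example does not depend on the source encoding.
original = "\u62a5\u544a.txt"

lossy = round_trip(original, "iso-8859-1")
intact = round_trip(original, "utf-8")

print(lossy)               # ??.txt
print(lossy == original)   # False
print(intact == original)  # True
```

A lookup using the lossy name would search for a file literally called "??.txt", which is why the text speaks of strange "File not found" messages.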
+ +If you don't want or can't use UTF-8 as character set for whatever +reason, you will nevertheless be able to access the file. How does that +work? When Cygwin converts the filename from UTF-16 to your character +set, it recognizes characters which can't be converted. If that occurs, +Cygwin replaces the non-convertible character with a special character +sequence. The sequence starts with an ASCII SO character (hex code +0x0e, equivalent Control-N), followed by the UTF-8 representation of the +character. The result is a filename containing some ugly looking +characters. While it doesn't look nice, it +is nice, because Cygwin knows how to convert this +filename back to UTF-16. The filename will be converted using your +usual character set. However, when Cygwin recognizes an ASCII SO +character, it skips over the ASCII SO and handles the following bytes as +a UTF-8 character. Thus, the filename is symmetrically converted back to +UTF-16 and you can access the file. + +Again, by using UTF-8 you can avoid this problem entirely. + + + Case sensitive filenames @@ -369,7 +415,7 @@ HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\obcaseinsensitive this registry value also on Windows NT4 and Windows 2000, which usually both don't know this registry key. If you want case-sensitivity on these systems, create that registry value and set it to 0. On these systems -(and *only* on these systems) you don't have to reboot to bring it +(and only on these systems) you don't have to reboot to bring it into effect, rather stopping all Cygwin processes and then restarting them is sufficient. diff --git a/winsup/doc/setup-net.sgml b/winsup/doc/setup-net.sgml index 165924d07..57c3fb185 100644 --- a/winsup/doc/setup-net.sgml +++ b/winsup/doc/setup-net.sgml @@ -254,6 +254,7 @@ Problems with Cygwin. 
DOCTOOL-INSERT-setup-env DOCTOOL-INSERT-setup-maxmem +DOCTOOL-INSERT-setup-locale DOCTOOL-INSERT-ntsec DOCTOOL-INSERT-setup-files diff --git a/winsup/doc/setup2.sgml b/winsup/doc/setup2.sgml index 4ae4d4fd3..20718b955 100644 --- a/winsup/doc/setup2.sgml +++ b/winsup/doc/setup2.sgml @@ -13,12 +13,21 @@ The CYGWIN variable is used to configure many global settings for the Cygwin runtime system. Initially you can leave CYGWIN unset or set it to tty (e.g. to support job control with ^Z etc...) using a syntax like this in the -DOS shell, before launching bash. +DOS shell, before launching bash. C:\> set CYGWIN=tty notitle glob + +Locale support is controlled by the LANG and +LC_xxx environment variables. You can set all of them +but Cygwin itself only honors the variables LC_ALL, +LC_CTYPE, and LANG, in this order, according +to the POSIX standard. The first one found rules. For a more detailed +description see . + + The PATH environment variable is used by Cygwin applications as a list of directories to search for executable files @@ -124,6 +133,279 @@ Run the program and it will output the maximum amount of allocatable memory. +Internationalization + +Overview + + +Internationalization support is controlled by the LANG and +LC_xxx environment variables. You can set all of them +but Cygwin itself only honors the variables LC_ALL, +LC_CTYPE, and LANG, in this order, according +to the POSIX standard. The content of these variables should follow the +POSIX standard for a locale specifier. The correct form of a locale +specifier is + + + language[[_TERRITORY][.charset][@modifier]] + + +"language" is a lowercase two character string per ISO 639-1, +"TERRITORY" is an uppercase two character string per ISO 3166, charset is +one of a list of supported character sets, and the modifier doesn't matter +here (though it might for some applications). 
If you're interested in the +exact description, you can find it in the online publication of the POSIX +manual pages on the homepage of the +Open Group. + +Typical locale specifiers are + + + "de_CH" language = German, territory = Switzerland, default charset + "fr_FR.UTF-8" language = French, territory = France, charset = UTF-8 + "ko_KR.eucKR" language = Korean, territory = South Korea, charset = eucKR + + + +And let's not forget the default locale called "C" or "POSIX" +which basically only supports plain ASCII code. If the aforementioned +environment variables are not set, or set to "C" or "POSIX", you get the +default ASCII-only behaviour. + + + +Right now the language and territory content is not evaluated by Cygwin any +further. The only important part so far is the character set. How does that +work? + + + + +How to set the locale + + + + +The default locale is the "C" or "POSIX" locale. In this locale, basically +only ASCII characters are supported. Even if one of the aforementioned +environment variables is set to something else, it's the application's +responsibility to call the function setlocale, +typically like this + + + setlocale (LC_ALL, ""); + + +to switch to another locale according to the settings of the +internationalization environment variables. + + + +Assume you set one of the aforementioned environment variables to some +valid POSIX locale value, other than "C" and "POSIX", and assume you +call an application which calls setlocale as above. + +Assume further you're living in Japan. So you might want to use +the language code "ja" and the territory "JP", thus setting, say, +LANG to "ja_JP". You didn't set a character set, so +what will Cygwin use now? Easy! It will use the default Windows ANSI +codepage of your system, if it's supported by Cygwin. Hopefully Cygwin +supports all relevant default ANSI codepages... 
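The lookup order and the specifier syntax described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Cygwin's implementation: the function names effective_locale and parse_locale are hypothetical, the environment is passed in as a plain dictionary, and an empty variable is treated as unset, as POSIX specifies.

```python
import re

# language[_TERRITORY][.charset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[^_.@]+)"
    r"(?:_(?P<territory>[^.@]+))?"
    r"(?:\.(?P<charset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def effective_locale(env):
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG, else "C"."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"


def parse_locale(spec):
    """Split a locale specifier into (language, territory, charset, modifier)."""
    match = _LOCALE_RE.match(spec)
    if match is None:
        raise ValueError("not a locale specifier: %r" % spec)
    return (
        match.group("language"),
        match.group("territory"),
        match.group("charset"),
        match.group("modifier"),
    )


# LC_CTYPE takes precedence over LANG because LC_ALL is not set.
env = {"LANG": "de_CH", "LC_CTYPE": "ko_KR.eucKR"}
spec = effective_locale(env)
print(spec)                # ko_KR.eucKR
print(parse_locale(spec))  # ('ko', 'KR', 'eucKR', None)
```

A charset of None in the result corresponds to the "ja_JP" case in the text, where the default Windows ANSI codepage would be used instead.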
+ +For a list of supported character sets, see + + + + + +You don't want to use the default Windows codepage as character set? +In that case you have to specify the charset explicitly. For instance, +assume you're from Italy and don't want to use the default Windows codepage +1252, but the more portable ISO-8859-15 character set. What you can do is +to set the LANG variable in the +C:\cygwin\Cygwin.bat file which is the batch file +to start a Cygwin session from the "Cygwin" desktop shortcut. + + + @echo off + + C: + chdir C:\cygwin\bin + set LANG=it_IT.ISO-8859-15 + bash --login -i + + + + +Most singlebyte or doublebyte charsets have a disadvantage. Windows +filesystems use the Unicode character set in the UTF-16 encoding to store filename information. Not all characters +from the Unicode character set are available in a singlebyte or doublebyte +charset. While Cygwin has a workaround to access files with unusual +characters (see ), a better +workaround is to always use the UTF-8 character set. UTF-8 is the only +multibyte character set which can represent every +Unicode character. + + + set LANG=es_MX.UTF-8 + + +For a description of the Unicode standard, see the homepage of the +Unicode Consortium. + + + + + + +Potential Problems + + +You can set the above internationalization variables not only in +Cygwin.bat or in the Windows environment, but also +in your Cygwin shell on the fly, even switch to yet another character +set, and yet another. In bash for instance: + + + bash$ export LC_CTYPE="nl_BE.UTF-8" + + +However, here's a problem. At the start of the first Cygwin process +in a session, the Windows environment has to be converted from UTF-16 to +some singlebyte or multibyte charset. If the internationalization environment +variable hasn't been set before starting this process, +Cygwin has to make an educated guess which charset to use to convert +the environment itself. 
The only reproducible way to do that in the absence +of LC_ALL, LC_CTYPE, or LANG, +is to use the current Windows ANSI codepage. + +As long as the environment only contains ASCII characters, this is +no problem. But if it contains non-ASCII characters, and you're planning to use, say, UTF-8, +the converted environment will contain byte sequences which are invalid in the UTF-8 charset. +This would be especially a problem in variables like PATH. + +Per POSIX, the name of an environment variable should only +consist of valid ASCII characters, and only of uppercase letters, digits, and +the underscore for maximum portability. + +And here's another problem when switching charsets on the fly. +Symbolic links. A symbolic link contains the filename of the target +file the symlink points to. When a symlink is created, the current +character set is used to store the target filename. If the target +filename contains non-ASCII characters and you switch to another +character set, the target filename of the symlink is now potentially +an invalid character sequence in the new character set. This behaviour +is not different from the behaviour in other Operating Systems. So, +if you suddenly can't access a symlink anymore, maybe it's because you +switched to another character set? + + + + +What does not work? + + +Except for LC_ALL, LC_CTYPE, +and LANG, all other LC_xxx environment variables, +LC_COLLATE, LC_MESSAGES, +LC_MONETARY, LC_NUMERIC, +and LC_TIME, are ignored right now. This means, while Cygwin +supports different character sets, it does not support +real localization so far. There's no support for locale-specific monetary +symbols, for a decimal point other than '.', no support for native time +formats, and no support for native language sorting orders. + + +However, internationalization is work in progress and we would be glad +for coding help in this area. + + + +List of supported character sets + +Last but not least, here's the list of currently supported character +sets. 
The left-hand expression is the name of the charset, as you would use +it in the internationalization environment variables as outlined above. + + +The right-hand side is the number of the equivalent Windows +codepage as well as the Windows name of the codepage. They are only +noted here for reference. Don't try to use the bare codepage number or +the Windows name of the codepage as charset in locale specifiers, unless +they happen to be identical with the left-hand side. Especially in case +of the "CPxxx" style charsets, always use them with the leading "CP". + +This works: + + + set LC_ALL=en_US.CP437 + + +This does not work: + + + set LC_ALL=en_US.437 + + +You can find a full list of Windows codepages on the Microsoft MSDN page +Code Page Identifiers. + + +
  Charset         Codepage

  CP437           437   (OEM United States)
  CP720           720   (DOS Arabic)
  CP737           737   (OEM Greek)
  CP775           775   (OEM Baltic)
  CP850           850   (OEM Latin 1, Western European)
  CP852           852   (OEM Latin 2, Central European)
  CP855           855   (OEM Cyrillic)
  CP857           857   (OEM Turkish)
  CP858           858   (OEM Latin 1 + Euro Symbol)
  CP862           862   (OEM Hebrew)
  CP866           866   (OEM Russian)
  CP874           874   (ANSI/OEM Thai)
  CP1125          1125  (OEM Ukraine)
  CP1250          1250  (ANSI Central European)
  CP1251          1251  (ANSI Cyrillic)
  CP1252          1252  (ANSI Latin 1, Western European)
  CP1253          1253  (ANSI Greek)
  CP1254          1254  (ANSI Turkish)
  CP1255          1255  (ANSI Hebrew)
  CP1256          1256  (ANSI Arabic)
  CP1257          1257  (ANSI Baltic)
  CP1258          1258  (ANSI/OEM Vietnamese)

  ISO-8859-1      28591 (ISO-8859-1)
  ISO-8859-2      28592 (ISO-8859-2)
  ISO-8859-3      28593 (ISO-8859-3)
  ISO-8859-4      28594 (ISO-8859-4)
  ISO-8859-5      28595 (ISO-8859-5)
  ISO-8859-6      28596 (ISO-8859-6)
  ISO-8859-7      28597 (ISO-8859-7)
  ISO-8859-8      28598 (ISO-8859-8)
  ISO-8859-9      28599 (ISO-8859-9)
  ISO-8859-10     -     (not available)
  ISO-8859-11     -     (not available)
  ISO-8859-13     28603 (ISO-8859-13)
  ISO-8859-14     -     (not available)
  ISO-8859-15     28605 (ISO-8859-15)
  ISO-8859-16     -     (not available)

  SJIS            932   (ANSI/OEM Japanese)
  GB2312          936   (ANSI/OEM Simplified Chinese, GBK)
  Big5            950   (ANSI/OEM Traditional Chinese)
  JIS             50220 (ISO2022 Japanese w/o halfwidth Katakana)
  eucJP           51932 (EUC Japanese)
  eucKR           51949 (EUC Korean)

  UTF-8           65001 (UTF-8)

+ + + + + + Customizing bash