Courier Unicode Library

This library implements several algorithms related to the Unicode Standard:

- Look up uppercase, lowercase, and titlecase equivalents of a unicode character.
- Implementation of grapheme and word breaking rules.
- Implementation of line breaking rules.
- Several ancillary functions, like looking up the unicode character that corresponds to some HTML 4.0 entity (such as &amp;, for example), and determining the normal width or a double-width status of a unicode character. Also, an adaptation of the iconv(3) API for this unicode library.
- Look up the Unicode script property.
- Look up the category property.

This library also implements C++ bindings for these algorithms.

Current status

The current release of the Courier Unicode library is based on the Unicode 8.0.0 standard.

Installation and usage

Download the current version of the library from http://www.courier-mta.org/download.html#unicode. After unpacking the tarball, run the configure script, which takes the usual options, followed by make, then make install.

To use the library, #include <courier-unicode.h> and link with -lcourier-unicode.

The C++ compiler must have C++11 support. The minimum usable version of gcc appears to be gcc 4.4 with the -std=c++0x flag. Current versions of gcc use C++11, or higher, by default and do not require extra flags. As with all C++ code, code that uses this library must be built with the same compiler and flags that were used to build the library itself.

The Courier Unicode library installs an autoconf macro to probe for C++11 support. In your configure.ac:

AX_COURIER_UNICODE_VERSION
AX_COURIER_UNICODE_CXXFLAGS
AC_SUBST(COURIER_UNICODE_CXXFLAGS)

Then, in Makefile.am:

AM_CXXFLAGS = @COURIER_UNICODE_CXXFLAGS@

The AX_COURIER_UNICODE_VERSION macro checks the minimum library version. AX_COURIER_UNICODE_CXXFLAGS sets COURIER_UNICODE_CXXFLAGS to the appropriate option for older gcc compilers that require an option to enable C++11 support.

The starting point for the library documentation is courier-unicode(7). Refer to the included manual pages, and the HTML version of the man pages, for more information.

Manual pages

C manual pages

courier-unicode(7)
Author: Sam Varshavchik

NAME
courier-unicode - Courier Unicode Library

SYNOPSIS
#include <courier-unicode.h>

DESCRIPTION
This library implements several algorithms related to the Unicode Standard.

This library uses iconv(3) to convert text in a given character set to unicode. Any character set listed by iconv --list can be specified for the corresponding character set parameter. Additionally, courier-unicode.h defines a special character string, unicode_x_imap_modutf7, that specifies the pseudo-character set for the modified-UTF7 encoding used in IMAP. This string can also be followed by a space and up to fifteen additional US-ASCII characters. The resulting character set also encodes these additional characters, in addition to unicode characters, with modified-UTF7.

The C++ compiler must have C++11 support. The minimum usable version of gcc appears to be gcc 4.4 with the -std=c++0x flag. Current versions of gcc use C++11, or higher, by default and do not require extra flags. Consult the packaging documentation for the Courier Unicode Library for information on any compiler flags that are needed to build software that links with this library.

SEE ALSO
unicode_convert(3), unicode_default_chset(3), unicode_html40ent_lookup(3), unicode_category_lookup(3), unicode_grapheme_break(3), unicode_line_break(3), unicode_script(3), unicode_word_break(3), unicode_uc(3), unicode::iconvert::convert(3), unicode::iconvert::convert_tocase(3), unicode::iconvert::fromu(3), unicode::iconvert::tou(3), unicode::tolower(3), unicode::linebreak(3), unicode::wordbreak(3).
unicode_convert(3)

NAME
unicode_u_ucs4_native, unicode_u_ucs2_native, unicode_convert_init, unicode_convert, unicode_convert_deinit, unicode_convert_tocbuf_init, unicode_convert_tou_init, unicode_convert_fromu_init, unicode_convert_uc, unicode_convert_tocbuf_toutf8_init, unicode_convert_tocbuf_fromutf8_init, unicode_convert_toutf8, unicode_convert_fromutf8, unicode_convert_tobuf, unicode_convert_tou_tobuf, unicode_convert_fromu_tobuf - unicode character set conversion

SYNOPSIS
#include <courier-unicode.h>

extern const char unicode_u_ucs4_native[];
extern const char unicode_u_ucs2_native[];

unicode_convert_handle_t unicode_convert_init(const char *src_chset, const char *dst_chset, int (*output_func)(const char *, size_t, void *), void *cb_arg);
int unicode_convert(unicode_convert_handle_t handle, const char *text, size_t cnt);
int unicode_convert_deinit(unicode_convert_handle_t handle, int *errptr);

unicode_convert_handle_t unicode_convert_tocbuf_init(const char *src_chset, const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_tocbuf_toutf8_init(const char *src_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_tocbuf_fromutf8_init(const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);

unicode_convert_handle_t unicode_convert_tou_init(const char *src_chset, char32_t **ucptr_ret, size_t *ucsize_ret, int nullterminate);
unicode_convert_handle_t unicode_convert_fromu_init(const char *dst_chset, char **cbufptr_ret, size_t *cbufsize_ret, int nullterminate);
int unicode_convert_uc(unicode_convert_handle_t handle, const char32_t *text, size_t cnt);

char *unicode_convert_toutf8(const char *text, const char *charset, int *error);
char *unicode_convert_fromutf8(const char *text, const char *charset, int *error);
char *unicode_convert_tobuf(const char *text, const char *charset, const char *dstcharset, int *error);
int unicode_convert_tou_tobuf(const char *text, size_t text_l, const char *charset, char32_t **uc, size_t *ucsize, int *error);
int unicode_convert_fromu_tobuf(const char32_t *utext, size_t utext_l, const char *charset, char **c, size_t *csize, int *error);

DESCRIPTION
unicode_u_ucs4_native[] contains the string UCS-4BE or UCS-4LE, matching the native char32_t endianness. unicode_u_ucs2_native[] contains the string UCS-2BE or UCS-2LE, matching the native char32_t endianness.

unicode_convert_init(), unicode_convert(), and unicode_convert_deinit() are an adaptation of the iconv(3) API that uses the same calling convention as the other algorithms in this unicode library, with some value-added features. These functions use iconv(3) to perform the actual character set conversion.

unicode_convert_init() returns a non-NULL handle for the requested conversion, or NULL if the requested conversion is not available. unicode_convert_init() takes a pointer to the output function that receives the converted character text. The output function receives a pointer to the converted character text, and the number of characters in the converted text. The output function gets called repeatedly, until it has received the entire converted text.

The character text to convert gets passed, repeatedly, to unicode_convert(). Each call to unicode_convert() results in the output function getting invoked, zero or more times, with each successive part of the converted text. Finally, unicode_convert_deinit() stops the conversion and deallocates the conversion handle.

It's possible that a call to unicode_convert_deinit() results in some additional calls to the output function, passing the remaining, final parts of the converted text, before unicode_convert_deinit() deallocates the handle and returns.

The output function should return 0 normally. A non-0 return indicates an error condition.
unicode_convert_deinit() returns non-zero if any previous invocation of the output function returned non-zero (this includes any invocations of the output function resulting from this call, or prior unicode_convert() calls), or 0 if all invocations of the output function returned 0.

If errptr is not NULL, *errptr gets set to non-zero if there were any conversion errors, that is, if there was any text that could not be converted to the destination character set.

unicode_convert() also returns non-zero if it calls the output function and the output function returns non-zero. However, the conversion handle remains allocated, so unicode_convert_deinit() must still be called to clean it up.

Collecting converted text into a buffer

Call unicode_convert_tocbuf_init() instead of unicode_convert_init(), then call unicode_convert() and unicode_convert_deinit() normally. The parameters to unicode_convert_tocbuf_init() specify the source and the destination character sets. unicode_convert_tocbuf_toutf8_init() is just an alias that specifies UTF-8 as the destination character set. unicode_convert_tocbuf_fromutf8_init() is just an alias that specifies UTF-8 as the source character set.

These functions supply an output function that collects the converted text into a malloc()ed buffer. If unicode_convert_deinit() returns 0, *cbufptr_ret gets initialized to a malloc()ed buffer, and the number of converted characters, which is the size of the malloc()ed buffer, gets placed into *cbufsize_ret. If the converted string is an empty string, *cbufsize_ret gets set to 0, but *cbufptr_ret still gets initialized (to a dummy malloc()ed buffer).

A non-zero nullterminate places a trailing \0 character after the converted string (this is included in *cbufsize_ret).

Converting between character sets and unicode

unicode_convert_tou_init() converts character text into a char32_t buffer. It works just like unicode_convert_tocbuf_init(), except that only the source character set gets specified and the output buffer is a char32_t buffer.
nullterminate terminates the converted unicode characters with a U+0000.

unicode_convert_fromu_init() converts char32_ts to the output character set, and also works like unicode_convert_tocbuf_init(). Additionally, in this case, unicode_convert_uc() works just like unicode_convert() except that the input sequence is a char32_t sequence, and the count parameter is the number of unicode characters.

One-shot conversions

unicode_convert_toutf8() converts the specified text in the specified character set into a UTF-8 string, returning a malloc()ed buffer. If error is not NULL, then even if unicode_convert_toutf8() returns a non-NULL value, *error gets set to a non-zero value if a character conversion error has occurred and some characters could not be converted.

unicode_convert_fromutf8() does a similar conversion from UTF-8 text to the specified character set.

unicode_convert_tobuf() does a similar conversion between two different character sets.

unicode_convert_tou_tobuf() calls unicode_convert_tou_init(), feeds the character string through unicode_convert(), then calls unicode_convert_deinit(). If this function returns 0, *uc and *ucsize are set to a malloc()ed buffer and its size, holding the unicode char array.

unicode_convert_fromu_tobuf() calls unicode_convert_fromu_init(), feeds the unicode array through unicode_convert_uc(), then calls unicode_convert_deinit(). If this function returns 0, *c and *csize are set to a malloc()ed buffer and its size, holding the char array.

SEE ALSO
courier-unicode(7), unicode_convert_tocase(3), unicode_default_chset(3).

unicode_default_chset(3)

NAME
unicode_default_chset, unicode_locale_chset - return the system character set name

SYNOPSIS
#include <courier-unicode.h>

const char *unicode_default_chset(void);
const char *unicode_locale_chset(void);

DESCRIPTION
unicode_default_chset() returns the name of the system environment character set (usually nl_langinfo(CODESET), or from some suitable environment variable).
unicode_locale_chset() returns the name of the current application locale's character set.

SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).

unicode_html40ent_lookup(3)

NAME
unicode_html40ent_lookup - look up unicode character for an HTML 4.0 entity

SYNOPSIS
#include <courier-unicode.h>

char32_t unicode_html40ent_lookup(const char *entity);

DESCRIPTION
unicode_html40ent_lookup() returns the unicode character represented by an HTML 4.0 entity. The entity is a string, such as quot, in which case unicode_html40ent_lookup() returns 34. Additionally, unicode_html40ent_lookup() parses a numerical entity given as #decimal or #xhex.

unicode_html40ent_lookup() returns 0 if the entity is not a known entity that represents a single unicode character.

SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).

unicode_category_lookup(3)

NAME
unicode_category_lookup, unicode_isalnum, unicode_isalpha, unicode_isblank, unicode_isdigit, unicode_isgraph, unicode_islower, unicode_ispunct, unicode_isspace, unicode_isupper - unicode character categorization

SYNOPSIS
#include <courier-unicode.h>

uint32_t unicode_category_lookup(char32_t c);
int unicode_isalnum(char32_t c);
int unicode_isalpha(char32_t c);
int unicode_isblank(char32_t c);
int unicode_isdigit(char32_t c);
int unicode_isgraph(char32_t c);
int unicode_islower(char32_t c);
int unicode_ispunct(char32_t c);
int unicode_isspace(char32_t c);
int unicode_isupper(char32_t c);

DESCRIPTION
unicode_category_lookup() looks up the unicode character's categorization. unicode_category_lookup() returns a 32-bit value. The value's UNICODE_CATEGORY_1 bits specify the first level of the unicode character's category, with the UNICODE_CATEGORY_2, UNICODE_CATEGORY_3, and UNICODE_CATEGORY_4 bits specifying the 2nd, 3rd, and 4th level, if given. A value of 0 for a level's set of bits indicates that no category is specified for this level, for this character; otherwise the possible values are defined in <courier-unicode.h>.
The remaining functions implement comparable equivalents of their non-unicode versions in the standard C library, as follows:

unicode_isalnum()
    Returns non-0 for all characters for which unicode_isalpha() or unicode_isdigit() returns non-0.
unicode_isalpha()
    Returns non-0 for all UNICODE_CATEGORY_1_LETTER.
unicode_isblank()
    Returns non-0 for TAB, and all UNICODE_CATEGORY_2_SPACE.
unicode_isdigit()
    Returns non-0 for all UNICODE_CATEGORY_1_NUMBER | UNICODE_CATEGORY_2_DIGIT, only (no third categories).
unicode_isgraph()
    Returns non-0 for all codepoints above SPACE which are not unicode_isspace().
unicode_islower()
    Returns non-0 for all unicode_isalpha() characters that are equal to the unicode_lc(3) of themselves.
unicode_ispunct()
    Returns non-0 for all UNICODE_CATEGORY_1_PUNCTUATION.
unicode_isspace()
    Returns non-0 for unicode_isblank() or for unicode characters with linebreaking properties of BK, CR, LF, NL, and SP.
unicode_isupper()
    Returns non-0 for all unicode_isalpha() characters that are equal to the unicode_uc(3) of themselves.

SEE ALSO
courier-unicode(7), unicode_convert_tocase(3).

unicode_grapheme_break(3)

NAME
unicode_grapheme_break - unicode grapheme cluster boundary rules

SYNOPSIS
#include <courier-unicode.h>

int unicode_grapheme_break(char32_t a, char32_t b);

DESCRIPTION
unicode_grapheme_break() returns non-zero if there is a grapheme break between the two unicode characters a and b.

SEE ALSO
TR-29, courier-unicode(7), unicode_convert_tocase(3), unicode_line_break(3), unicode_word_break(3).

unicode_script(3)

NAME
unicode_script - unicode script property

SYNOPSIS
#include <courier-unicode.h>

unicode_script_t unicode_script(char32_t ch);

DESCRIPTION
unicode_script() looks up the script property of the specified unicode character, and returns it. The unicode_script_t enumeration encodes possible unicode script values. unicode_script_unknown gets returned for a unicode character with an unknown script property.

SEE ALSO
TR-24, courier-unicode(7).
unicode_line_break(3)

NAME
unicode_lb_init, unicode_lb_set_opts, unicode_lb_next, unicode_lb_next_cnt, unicode_lb_end, unicode_lbc_init, unicode_lbc_set_opts, unicode_lbc_next, unicode_lbc_next_cnt, unicode_lbc_end - calculate mandatory or allowed line breaks

SYNOPSIS
#include <courier-unicode.h>

unicode_lb_info_t unicode_lb_init(int (*cb_func)(int, void *), void *cb_arg);
void unicode_lb_set_opts(unicode_lb_info_t lb, int opts);
int unicode_lb_next(unicode_lb_info_t lb, char32_t c);
int unicode_lb_next_cnt(unicode_lb_info_t lb, const char32_t *cptr, size_t cnt);
int unicode_lb_end(unicode_lb_info_t lb);

unicode_lbc_info_t unicode_lbc_init(int (*cb_func)(int, char32_t, void *), void *cb_arg);
void unicode_lbc_set_opts(unicode_lbc_info_t lb, int opts);
int unicode_lbc_next(unicode_lbc_info_t lb, char32_t c);
int unicode_lbc_next_cnt(unicode_lbc_info_t lb, const char32_t *cptr, size_t cnt);
int unicode_lbc_end(unicode_lbc_info_t lb);

DESCRIPTION
These functions implement the unicode line breaking algorithm. Invoke unicode_lb_init() to initialize the line breaking algorithm. The first parameter is a callback function. The second parameter is an opaque pointer. The callback function gets invoked with two parameters. The first parameter is one of three values: UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or UNICODE_LB_ALLOWED, as described below. The second parameter is the opaque pointer that was passed to unicode_lb_init(); the opaque pointer is not subject to any further interpretation by these functions.

unicode_lb_init() returns an opaque handle. Repeated invocations of unicode_lb_next(), passing the handle and one unicode character at a time, define a sequence of unicode characters over which the line breaking algorithm calculation takes place. unicode_lb_next_cnt() is a shortcut for invoking unicode_lb_next() repeatedly over an array cptr containing cnt unicode characters.

unicode_lb_end() denotes the end of the unicode character sequence.
After the call to unicode_lb_end() the line breaking unicode_lb_info_t handle is no longer valid.

Between the call to unicode_lb_init() and unicode_lb_end(), the callback function gets invoked exactly once for each unicode character given to unicode_lb_next() or unicode_lb_next_cnt(). Usually each call to unicode_lb_next() results in the callback function getting invoked immediately, but this is not guaranteed. It's possible that a call to unicode_lb_next() returns without invoking the callback function, and some subsequent call to unicode_lb_next() (or unicode_lb_end()) invokes the callback function more than once, to catch up. The contract is that before unicode_lb_end() returns, the callback function gets invoked exactly as many times as there are characters in the unicode sequence defined by the intervening calls to unicode_lb_next() and unicode_lb_next_cnt(), unless an error occurs.

Each call to the callback function reports the calculated line breaking status of the corresponding character in the unicode character sequence:

UNICODE_LB_MANDATORY
    A line break is MANDATORY before the corresponding character.
UNICODE_LB_NONE
    A line break is PROHIBITED before the corresponding character.
UNICODE_LB_ALLOWED
    A line break is OPTIONAL before the corresponding character.

The callback function should return 0. A non-zero value indicates to the line breaking algorithm that an error has occurred. unicode_lb_next() and unicode_lb_next_cnt() return zero either if they never invoked the callback function, or if each call to the callback function returned zero. A non-zero return from the callback function results in unicode_lb_next() and unicode_lb_next_cnt() immediately returning the same value.

unicode_lb_end() must be invoked to destroy the line breaking handle even if unicode_lb_next() and unicode_lb_next_cnt() returned an error indication. It's also possible that, under normal circumstances, unicode_lb_end() invokes the callback function one or more times.
The return value from unicode_lb_end() has the same meaning as from unicode_lb_next() and unicode_lb_next_cnt(); however, in all cases the line breaking handle is no longer valid after unicode_lb_end() returns.

Alternative callback function

unicode_lbc_init(), unicode_lbc_next(), unicode_lbc_next_cnt(), and unicode_lbc_end() are alternative functions that implement the same algorithm. The only difference is that the callback function receives an extra parameter: the unicode character value to which the line breaking status applies, passed through from the input unicode character sequence.

Options

unicode_lb_set_opts() and unicode_lbc_set_opts() enable non-default options for the line breaking algorithm. These functions must be called immediately after unicode_lb_init() or unicode_lbc_init(), and before any other function. opts is a bitmask that can contain the following values:

UNICODE_LB_OPT_PRBREAK
    Enables a modified LB24 rule. This prevents breaks between plus signs, as in "C++". This flag adds the following rules to the LB24 rule:

    PR x PR
    AL x PR
    ID x PR

UNICODE_LB_OPT_SYBREAK
    Tailored breaking rules for the / character. This prevents breaking after the / character (think URLs), and includes an exception to the x SY rule in LB13. This flag adds the following rules to the LB24 rule:

    SY x EX
    SY x AL
    SY x ID
    SP ÷ SY, which takes precedence over "x SY".

UNICODE_LB_OPT_DASHWJ
    This flag reclassifies U+2013 and U+2014 as class WJ, prohibiting breaks before and after the en dash and the em dash unicode characters.

SEE ALSO
courier-unicode(7), unicode::linebreak(3), TR-14.

unicode_word_break(3)

NAME
unicode_wb_init, unicode_wb_next, unicode_wb_next_cnt, unicode_wb_end, unicode_wbscan_init, unicode_wbscan_next, unicode_wbscan_end - calculate word breaks

SYNOPSIS
#include <courier-unicode.h>

unicode_wb_info_t unicode_wb_init(int (*cb_func)(int, void *), void *cb_arg);
int unicode_wb_next(unicode_wb_info_t wb, char32_t c);
int unicode_wb_next_cnt(unicode_wb_info_t wb, const char32_t *cptr, size_t cnt);
int unicode_wb_end(unicode_wb_info_t wb);

unicode_wbscan_info_t unicode_wbscan_init(void);
int unicode_wbscan_next(unicode_wbscan_info_t wbs, char32_t c);
size_t unicode_wbscan_end(unicode_wbscan_info_t wbs);

DESCRIPTION
These functions implement the unicode word breaking algorithm. Invoke unicode_wb_init() to initialize the word breaking algorithm. The first parameter is a callback function. The second parameter is an opaque pointer. The callback function gets invoked with two parameters. The second parameter is the opaque pointer that was given to unicode_wb_init(); the opaque pointer is not subject to any further interpretation by these functions.

unicode_wb_init() returns an opaque handle. Repeated invocations of unicode_wb_next(), passing the handle and one unicode character at a time, define a sequence of unicode characters over which the word breaking algorithm calculation takes place. unicode_wb_next_cnt() is a shortcut for invoking unicode_wb_next() repeatedly over an array cptr containing cnt unicode characters.

unicode_wb_end() denotes the end of the unicode character sequence. After the call to unicode_wb_end() the word breaking unicode_wb_info_t handle is no longer valid.
Between the call to unicode_wb_init() and unicode_wb_end(), the callback function gets invoked exactly once for each unicode character given to unicode_wb_next() or unicode_wb_next_cnt(). Usually each call to unicode_wb_next() results in the callback function getting invoked immediately, but this is not guaranteed. It's possible that a call to unicode_wb_next() returns without invoking the callback function, and some subsequent call to unicode_wb_next() (or unicode_wb_end()) invokes the callback function more than once, to catch up. The contract is that before unicode_wb_end() returns, the callback function gets invoked exactly as many times as there are characters in the unicode sequence defined by the intervening calls to unicode_wb_next() and unicode_wb_next_cnt(), unless an error occurs.

Each call to the callback function reports the calculated word breaking status of the corresponding character in the unicode character sequence. If the first parameter to the callback function is non-zero, a word break is permitted before the corresponding character. A zero value indicates that a word break is prohibited before the corresponding character.

The callback function should return 0. A non-zero value indicates to the word breaking algorithm that an error has occurred. unicode_wb_next() and unicode_wb_next_cnt() return zero either if they never invoked the callback function, or if each call to the callback function returned zero. A non-zero return from the callback function results in unicode_wb_next() and unicode_wb_next_cnt() immediately returning the same value.

unicode_wb_end() must be invoked to destroy the word breaking handle even if unicode_wb_next() and unicode_wb_next_cnt() returned an error indication. It's also possible that, under normal circumstances, unicode_wb_end() invokes the callback function one or more times.
The return value from unicode_wb_end() has the same meaning as from unicode_wb_next() and unicode_wb_next_cnt(); however, in all cases the word breaking handle is no longer valid after unicode_wb_end() returns.

Word scan

unicode_wbscan_init(), unicode_wbscan_next(), and unicode_wbscan_end() scan for the next word boundary in a unicode character sequence. unicode_wbscan_init() obtains a handle, then unicode_wbscan_next() gets repeatedly invoked to define the unicode character sequence. unicode_wbscan_end() deallocates the handle and returns the number of leading characters in the unicode character sequence up to the first word break.

A non-0 return value from unicode_wbscan_next() indicates that the word boundary is already known, and any further calls to unicode_wbscan_next() will be ignored. unicode_wbscan_end() must still be called, to obtain the unicode character count.

SEE ALSO
TR-29, courier-unicode(7), unicode::wordbreak(3), unicode_convert_tocase(3), unicode_line_break(3), unicode_grapheme_break(3).

unicode_uc(3)

NAME
unicode_uc, unicode_lc, unicode_tc, unicode_convert_tocase - unicode uppercase, lowercase, and titlecase character lookup

SYNOPSIS
#include <courier-unicode.h>

char32_t unicode_uc(char32_t c);
char32_t unicode_lc(char32_t c);
char32_t unicode_tc(char32_t c);
char *unicode_convert_tocase(const char *str, const char *charset, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));

DESCRIPTION
unicode_uc(), unicode_lc(), and unicode_tc() return the uppercase, lowercase, or the titlecase equivalent of the unicode character c. If this character does not have an uppercase, lowercase, or titlecase equivalent, these functions return c, the same character.

unicode_convert_tocase() takes the string str in the character set charset. first_char_func and char_func should each be unicode_uc, unicode_lc, or unicode_tc. unicode_convert_tocase() returns a malloc()ed buffer.
The first unicode character in str gets processed by first_char_func, and all other characters by char_func.

SEE ALSO
courier-unicode(7), unicode_convert(3), unicode_default_chset(3), unicode_html40ent_lookup(3), unicode_category_lookup(3), unicode_grapheme_break(3), unicode_word_break(3), unicode_line_break(3).

C++ manual pages

unicode::iconvert::convert(3)

NAME
unicode::iconvert::convert, unicode::ucs_4, unicode::ucs_2, unicode::utf_8, unicode::iso_8859_1 - unicode character set conversion

SYNOPSIS
#include <courier-unicode.h>

extern const char unicode::ucs_4[];
extern const char unicode::ucs_2[];
extern const char unicode::utf_8[];
extern const char unicode::iso_8859_1[];

std::string unicode::iconvert::convert(const std::string &text, const std::string &srccharset, const std::string &dstcharset);
std::string unicode::iconvert::convert(const std::string &text, const std::string &srccharset, const std::string &dstcharset, bool &errflag);
std::string unicode::iconvert::convert(const std::vector<char32_t> &text, const std::string &dstcharset);
std::string unicode::iconvert::convert(const std::vector<char32_t> &text, const std::string &dstcharset, bool &errflag);
bool unicode::iconvert::convert(const std::string &text, const std::string &charset, std::vector<char32_t> &uc);

DESCRIPTION
The overloaded unicode::iconvert::convert() functions:

- Convert a text string between two different character sets, returning the new string.
- Convert a vector of unicode characters (not null-terminated) to a character string in a supported character set.
- Initialize a vector of unicode characters, passed by reference, by converting a text string in a given character set to unicode.

These functions use iconv(3), and can use any character set that's supported by iconv(3). Use unicode::ucs_2 and unicode::ucs_4 to specify the 16-bit and the 32-bit unicode encodings in native byte order. Use unicode::utf_8 and unicode::iso_8859_1 to specify these two standard character sets.

The overloaded versions that take a reference to a bool set the flag to true if some characters could not be converted. The overloaded version that initializes a unicode vector returns the bool flag instead.

SEE ALSO
courier-unicode(7), unicode::iconvert::convert_tocase(3), unicode_convert(3), iconv(3).
unicode::iconvert::convert_tocase(3)

NAME
unicode::iconvert::convert_tocase - unicode uppercase, lowercase, and titlecase conversion

SYNOPSIS
#include <courier-unicode.h>

std::string unicode::iconvert::convert_tocase(const std::string &text, const std::string &charset, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));
std::string unicode::iconvert::convert_tocase(const std::string &text, const std::string &charset, bool &err, char32_t (*first_char_func)(char32_t), char32_t (*char_func)(char32_t));

DESCRIPTION
The overloaded unicode::iconvert::convert_tocase() function converts the characters in the text parameter, which is in the charset character set, to lowercase, uppercase, or titlecase.

text gets converted, internally, into unicode. first_char_func and char_func are each one of unicode_lc, unicode_uc, or unicode_tc. If the converted text string is not empty, first_char_func converts the first unicode character in the text string, and char_func converts any remaining characters. unicode_lc converts its character to lowercase, unicode_uc to uppercase, and unicode_tc to titlecase. Finally, the unicode string gets converted back to charset, and the result gets returned.

The optional err parameter gets set to true if an error was encountered converting the text string to or from unicode.

SEE ALSO
courier-unicode(7), unicode::iconvert::convert(3), unicode_convert(3), iconv(3).
unicode::iconvert::fromu(3)

NAME
unicode::iconvert::fromu - template for converting text sequence from unicode

SYNOPSIS
#include <courier-unicode.h>

output_iter_t unicode::iconvert::fromu::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, output_iter_t output_iter, bool &errflag);
void unicode::iconvert::fromu::convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, std::string &out_buf, bool &errflag);
std::pair<std::string, bool> unicode::iconvert::fromu::convert(const std::u32string &text, const std::string &charset);

DESCRIPTION
These template functions convert unicode characters to text in the given character set.

beg_iter and end_iter define an input sequence of char32_ts. They get converted to text in the charset character set. output_iter is an output iterator over chars, to which convert() writes the converted text in the specified character set. convert() returns the value of the output iterator after iterating over the converted character sequence.

errflag gets set to true if the unicode text could not be converted to the requested character set, or to false for a successful conversion.

An overloaded convert() puts the text string into a std::string, instead of using an output iterator.

Finally, a single std::u32string can specify the unicode character string, instead of a beginning and an ending iterator. This overload returns a std::pair with the converted text and the error flag.

SEE ALSO
courier-unicode(7), unicode::iconvert::convert(3), unicode_convert(3), iconv(3).
unicode::iconvert::tou(3)

NAME
unicode::iconvert::tou - template for converting text sequence to unicode

SYNOPSIS
#include <courier-unicode.h>

output_iter_t convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, bool &errflag, output_iter_t output_iter);
bool convert(input_iter_t beg_iter, input_iter_t end_iter, const std::string &charset, std::u32string &out_buf);
std::pair<std::u32string, bool> convert(const std::string &text, const std::string &charset);

DESCRIPTION
These template functions convert text in a given character set to unicode characters.

beg_iter and end_iter define an input sequence of chars in the charset character set. They get converted to unicode characters. output_iter is an output iterator over char32_ts, to which convert() writes the converted characters. convert() returns the value of the output iterator after iterating over the converted character sequence.

errflag, passed by reference, gets set to true if some character could not be converted to unicode from the specified character set, and to false if the conversion completed without errors.

An overloaded convert() puts the unicode character sequence into a std::u32string, instead of an output sequence, and returns the error flag.

Finally, a single std::string can specify the character string, instead of a beginning and an ending iterator. This overload returns a std::pair with the converted unicode text and the error flag.

SEE ALSO
courier-unicode(7), unicode::iconvert::convert(3), unicode_convert(3), iconv(3).
unicode::linebreak 3

unicode::linebreak_callback_base, unicode::linebreak_callback_save_buf, unicode::linebreakc_callback_base, unicode::linebreak_iter, unicode::linebreakc_iter: unicode line-breaking rules

    #include <courier-unicode.h>

    class linebreak : public unicode::linebreak_callback_base {

    public:

        using unicode::linebreak_callback_base::operator<<;
        using unicode::linebreak_callback_base::operator();

        int callback(int linebreak_code)
        {
            // ...
        }
    };

    char32_t c;
    std::u32string buf;

    linebreak compute_linebreak;

    compute_linebreak.set_opts(UNICODE_LB_OPT_SYBREAK);

    compute_linebreak << c;                     // one character at a time
    compute_linebreak(buf);                     // a whole container
    compute_linebreak(buf.begin(), buf.end());  // an iterator range
    compute_linebreak.finish();

    // ...

    unicode::linebreak_callback_save_buf linebreaks;

    std::list<int> lb=linebreaks.lb_buf;

    class linebreakc : public unicode::linebreakc_callback_base {

    public:

        using unicode::linebreakc_callback_base::operator<<;
        using unicode::linebreakc_callback_base::operator();

        int callback(int linebreak_code, char32_t ch)
        {
            // ...
        }
    };

    // ...

    std::u32string buf;

    typedef unicode::linebreak_iter<std::u32string::const_iterator> iter_t;

    iter_t beg_iter(buf.begin(), buf.end()), end_iter;

    beg_iter.set_opts(UNICODE_LB_OPT_SYBREAK);

    std::vector<int> linebreaks;

    std::copy(beg_iter, end_iter, std::back_insert_iterator<std::vector<int>>(linebreaks));

    // ...

    typedef unicode::linebreakc_iter<std::u32string::const_iterator> iter_t;

    iter_t beg_iter(buf.begin(), buf.end()), end_iter;

    beg_iter.set_opts(UNICODE_LB_OPT_SYBREAK);

    std::vector<std::pair<int, char32_t>> linebreaks;

    std::copy(beg_iter, end_iter, std::back_insert_iterator<std::vector<std::pair<int, char32_t>>>(linebreaks));

DESCRIPTION

unicode::linebreak_callback_base is a C++ binding for the unicode line-breaking rule implementation described in unicode_line_break 3.

Subclass unicode::linebreak_callback_base and implement the virtual callback() function inherited from it.
The callback() function receives the output values from the line-breaking algorithm: UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or UNICODE_LB_ALLOWED, one value for each unicode character. callback() should return 0. A non-zero return value reports an error and stops the line-breaking algorithm. See unicode_line_break 3 for more information.

The alternate unicode::linebreakc_callback_base interface uses a virtual callback() that receives two parameters: the line-break code value, and the corresponding unicode character.

The input unicode characters for the line-breaking algorithm are provided by the << operator, one unicode character at a time; or by the () operator, passing either a container, or a beginning and an ending iterator value for an input sequence of unicode characters. finish() indicates the end of the unicode character sequence. set_opts() sets line-breaking options (see unicode_lb_set_opts() for more information).

unicode::linebreak_callback_save_buf is a subclass that implements callback() by saving the line-break codes into a std::list.

The linebreak_iter template implements an input iterator over ints. The template parameter is an input iterator over unicode characters. The constructor's parameters are a beginning and an ending iterator value for a sequence of char32_t. This constructs the beginning iterator value for a sequence of ints consisting of line-break values (UNICODE_LB_MANDATORY, UNICODE_LB_NONE, or UNICODE_LB_ALLOWED) corresponding to each char32_t in the underlying sequence. The default constructor creates the ending iterator value for the sequence.

The iterator implements a set_opts() method that sets the options for the line-breaking algorithm.

The linebreakc_iter template implements a similar input iterator, with the difference that it iterates over std::pairs of a line-breaking value and the corresponding char32_t from the underlying input sequence.

SEE ALSO

courier-unicode 7, unicode_line_break 3.
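Once the line-break codes have been collected, for example into a std::vector<int> through linebreak_iter, they still have to be turned into lines. The following is a minimal sketch of one way to do that, a greedy wrap by character count. Everything named sketch_ or SKETCH_ is hypothetical: the enum stands in for the library's UNICODE_LB_* constants so that the block compiles on its own, and the sketch assumes that each code refers to the position just before its character, as unicode_line_break 3 describes it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-ins for UNICODE_LB_MANDATORY, UNICODE_LB_NONE and UNICODE_LB_ALLOWED.
// Real code takes those constants from <courier-unicode.h>; the values here
// are arbitrary and only need to be distinct.
enum sketch_lb { SKETCH_LB_MANDATORY, SKETCH_LB_NONE, SKETCH_LB_ALLOWED };

// Greedy wrap: codes[i] is assumed to describe a break before text[i].
// Lines are measured in characters, not display columns.
std::vector<std::u32string> sketch_wrap(const std::u32string &text,
                                        const std::vector<int> &codes,
                                        std::size_t width)
{
    std::vector<std::u32string> lines;
    std::size_t start = 0;         // first character of the current line
    std::size_t last_allowed = 0;  // latest optional break after start, 0 if none

    for (std::size_t i = 1; i < text.size() && i < codes.size(); ++i)
    {
        if (codes[i] == SKETCH_LB_MANDATORY)
        {
            // Always end the line here, whatever its length.
            lines.push_back(text.substr(start, i - start));
            start = i;
            last_allowed = 0;
            continue;
        }

        if (codes[i] == SKETCH_LB_ALLOWED)
            last_allowed = i;

        // The current line, text[start..i], holds i - start + 1 characters.
        // When it overflows, fall back to the latest optional break, if any.
        if (i - start + 1 > width && last_allowed > start)
        {
            lines.push_back(text.substr(start, last_allowed - start));
            start = last_allowed;
            last_allowed = 0;
        }
    }

    if (start < text.size())
        lines.push_back(text.substr(start));

    return lines;
}
```

A line with no break opportunity is left longer than width rather than split mid-word, and trailing spaces stay on the line they end. A caller that needs display columns rather than character counts would have to measure each character's width separately.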
unicode::tolower 3

unicode::tolower, unicode::toupper: unicode version of tolower 3 and toupper 3

    #include <courier-unicode.h>

    std::string unicode::tolower(const std::string &string);

    std::string unicode::tolower(const std::string &string, const std::string &charset);

    std::u32string unicode::tolower(const std::u32string &u);

    std::string unicode::toupper(const std::string &string);

    std::string unicode::toupper(const std::string &string, const std::string &charset);

    std::u32string unicode::toupper(const std::u32string &u);

DESCRIPTION

These functions convert the string parameter from charset, or from unicode_default_chset 3 when no charset is given, to unicode; replace each character with its unicode_lc 3 or unicode_uc 3 equivalent; then convert the result back to the same character set and return it. The overloads that take a const std::u32string & apply the same case conversion directly and return the converted unicode string.

SEE ALSO

courier-unicode 7.

unicode::wordbreak 3

unicode::wordbreak_callback_base, unicode::wordbreakscan: unicode word-breaking rules

    #include <courier-unicode.h>

    class wordbreak : public unicode::wordbreak_callback_base {

    public:

        using unicode::wordbreak_callback_base::operator<<;
        using unicode::wordbreak_callback_base::operator();

        int callback(bool flag)
        {
            // ...
        }
    };

    char32_t c;
    std::u32string buf;

    wordbreak compute_wordbreak;

    compute_wordbreak << c;                     // one character at a time
    compute_wordbreak(buf);                     // a whole container
    compute_wordbreak(buf.begin(), buf.end());  // an iterator range
    compute_wordbreak.finish();

    // ...

    unicode::wordbreakscan scan;

    scan << c;

    size_t nchars=scan.finish();

DESCRIPTION

unicode::wordbreak_callback_base is a C++ binding for the unicode word-breaking rule implementation described in unicode_word_break 3.

Subclass unicode::wordbreak_callback_base and implement the virtual callback() function inherited from it.
The callback() function receives the output values from the word-breaking algorithm, namely a bool indicating whether a word break exists before the unicode character in the underlying input sequence. callback() should return 0. A non-zero return value reports an error and stops the word-breaking algorithm. See unicode_word_break 3 for more information.

The input unicode characters for the word-breaking algorithm are provided by the << operator, one unicode character at a time; or by the () operator, passing either a container, or a beginning and an ending iterator value for an input sequence of unicode characters. finish() indicates the end of the unicode character sequence.

unicode::wordbreakscan is a C++ binding for the unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end() functions described in unicode_word_break 3. Its << operator feeds unicode characters to the scan, and finish() returns the number of characters before the first unicode word break. The << operator returns a bool indicating that the first word break has already been found, so further calls are not necessary.

SEE ALSO

courier-unicode 7, unicode_word_break 3.
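Because callback() receives only a flag per character, a subclass typically records the flags and pairs them with the text afterwards. The block below is a standalone sketch of that second step, under the assumption stated above that a true flag marks a break before the corresponding character. sketch_split_words is a hypothetical helper, not a library function, and the flags are supplied by the caller so that the block does not depend on the library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the library. flags[i] is the value that
// callback() reported for text[i]: true when a word break precedes it.
std::vector<std::u32string> sketch_split_words(const std::u32string &text,
                                               const std::vector<bool> &flags)
{
    std::vector<std::u32string> words;
    std::u32string current;

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        // Close the segment collected so far when a break precedes text[i].
        // A break reported before the very first character closes nothing.
        if (i < flags.size() && flags[i] && !current.empty())
        {
            words.push_back(current);
            current.clear();
        }
        current.push_back(text[i]);
    }

    if (!current.empty())
        words.push_back(current);

    return words;
}
```

Whitespace runs come out as segments of their own, since the word-breaking rules report breaks on both sides of them; a caller interested only in words can filter those segments out.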

COPYING The Courier Unicode Library is free software, distributed under the terms of the GPL, version 3: